In the starting guide (last post) we went over the statements that let us access a SQLite3 database from Python. Here we're going to go over some recipes: code examples that put those statements together to do something specific.
Loading a CSV into a SQLite database
First thing we need to do is get data into the database. We could type it all in, but more commonly we'll have CSV files that we can load in.
This example is using two context managers. The first one is for opening the database. Inside that, we're opening the CSV file. Keep in mind it's the indentation that determines when each context ends.
# Import the modules
import csv
import sqlite3
# Define the database file name
db_file = "data/data2.db"
# Use the with statement to create a context manager
with sqlite3.connect(db_file) as conn:
    # Create a cursor object to execute queries
    cursor = conn.cursor()
    # Create a table with the same columns as the CSV file
    cursor.execute(
        """CREATE TABLE IF NOT EXISTS people
           (name TEXT, age INTEGER, gender TEXT)"""
    )
    # Open the CSV file
    with open("data/data.csv", "r") as f:
        # Create a csv reader object
        reader = csv.reader(f)
        # Skip the header row
        next(reader)
        # Insert each row into the table
        for row in reader:
            cursor.execute("""INSERT INTO people
                              (name, age, gender)
                              VALUES (?, ?, ?)""",
                           row)
conn.close()
Another option, instead of looping over the reader in the for row in reader loop, is to convert the reader to a list with the list function and use executemany. The rest of the code stays the same; the for loop goes away since executemany handles the looping for us.
cursor.executemany("""INSERT INTO people
                      (name, age, gender)
                      VALUES (?, ?, ?)""",
                   list(reader))
Why one over the other? Looping over the reader ourselves gives us access to each row. We can then modify the values or even skip a row completely if needed.
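For example, here's a minimal sketch of cleaning rows as we insert them; the rules themselves (skipping rows with a blank name, converting age to an int) are just assumptions for illustration. It drops in as the for loop inside the with blocks above.
for row in reader:
    name, age, gender = row
    # Skip rows with a blank name (an assumed cleanup rule)
    if not name.strip():
        continue
    # Convert age to an int before inserting (also an assumption)
    cursor.execute("""INSERT INTO people
                      (name, age, gender)
                      VALUES (?, ?, ?)""",
                   (name.strip(), int(age), gender))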
The shortest method is to load the CSV file using Pandas and save it directly into the SQLite3 database. I'm sure it could be cut down to just a couple of lines. I'll give an example in a later section.
Using SQLite with Pandas
Pandas can load data from many different sources. SQLite3 is as easy as almost any other data source. Pandas is able to read any SQL database that SQLAlchemy supports.
Reading a SQLite3 DB into Pandas
We pass the SQL statement and the SQLite3 connection object to Pandas' read_sql_query method and let Pandas handle bringing the data into a dataframe.
Reading the data is fairly straightforward. A SQL SELECT statement can be crafted to pull in only the data you need for your work. In the example we're pulling in the entire table, but we could choose specific columns or filter the records down to a smaller selection.
# Import pandas and sqlite3 modules
import pandas as pd
import sqlite3
# Use the with statement to create a context manager
with sqlite3.connect("data/data.db") as conn:
    # Write a SQL query to select data from the database
    sql_query = "SELECT * FROM people"
    # Use pandas.read_sql_query() to load the data into a DataFrame
    df = pd.read_sql_query(sql_query, conn)
    # Print the DataFrame
    print(df)
Instead of using the execute method from SQLite3 to run the SQL statement, we're passing the query and the database connection directly to Pandas' read_sql_query, and it handles converting the result into a DataFrame.
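For instance, here's a quick sketch of narrowing the pull; the column list and the age filter are made-up examples, so adjust them to your own table.
# Pull only two columns and filter the records with a WHERE clause
sql_query = "SELECT name, age FROM people WHERE age >= ?"
df_adults = pd.read_sql_query(sql_query, conn, params=(21,))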
Save Pandas Dataframe to SQLite3
Saving data from Pandas can be done in a few ways, depending on what you need.
Saving the entire Dataframe to a SQLite3 database is pretty quick. This will overwrite the table in the SQLite3 database.
# As promised, here we're reading the CSV into a dataframe to save into the DB
df = pd.read_csv("data/data.csv")
conn = sqlite3.connect("example.db")
df.to_sql("my_table", conn, if_exists='replace', index=False)
conn.close()
The if_exists option defaults to fail, meaning if the table already exists it will raise a ValueError. The replace option will either create a new table or, surprise, replace the existing one. index=False tells Pandas not to write its index as a column.
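Here's a small sketch of what the default fail behavior looks like in practice, assuming the connection is still open and my_table was already created by the code above.
try:
    # Default behavior: raises ValueError because my_table already exists
    df.to_sql("my_table", conn, if_exists='fail', index=False)
except ValueError:
    print("my_table already exists; use 'replace' or 'append' instead")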
Append Pandas Dataframe to SQLite3
if_exists has another option, append. This will add any new records to the existing table.
# Open connection to the SQLite3 DB
conn = sqlite3.connect("example.db")
# Create a new Dataframe
new_people = pd.DataFrame({'name': ['Russ'], 'age': [55], 'gender': ['M']})
# Have Pandas send the data and append it through the connection
new_people.to_sql('people', conn, if_exists='append', index=False)
conn.close()
When doing an append, the columns must be the same between the dataframe and the table. The index=False setting must also match how the table was written in the first place.
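One way to keep them in line is to trim the dataframe down to the table's columns before appending. The extra notes column below is hypothetical, just there to show a column the table doesn't have.
# Dataframe carrying an extra column the people table doesn't have (hypothetical)
new_people = pd.DataFrame({'name': ['Russ'], 'age': [55],
                           'gender': ['M'], 'notes': ['ask about lunch']})
# Keep only the columns that exist in the table before appending
table_columns = ['name', 'age', 'gender']
new_people[table_columns].to_sql('people', conn, if_exists='append', index=False)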
Updating SQLite3 Database from Pandas
Updating existing records in a SQLite3 database is a little more work than completely replacing it, but not too much. There are many ways of doing this; I'll show you the way I might do it and you can adjust it to fit the work you're doing.
First, we should bring in the rowid when we bring in the data from SQLite3. This will be used to match the records in the dataframe to the records in the SQLite3 database.
import pandas as pd
import sqlite3
conn = sqlite3.connect('data/pandas.db')
sql_command = "SELECT rowid, * FROM people"
df = pd.read_sql(sql_command, conn)
The only difference from before is also including the rowid as part of the columns being pulled. Of course we could limit the columns we pull in, and maybe even add a WHERE clause to pull only certain records.
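A sketch of what that could look like; the age filter is just an example.
# Pull the rowid plus a couple of columns, filtered with a WHERE clause
sql_command = "SELECT rowid, name, age FROM people WHERE age > ?"
df = pd.read_sql(sql_command, conn, params=(40,))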
If you're thinking the index numbers of the dataframe could be used instead of the rowid, keep in mind that if records were deleted from the database, or if we reindex the dataframe, these will get out of sync. Pulling the rowid allows us to match each record in the dataframe to a record in the database.
# whatever modifications and updates needed to make this new dataframe
df_modified = df
for index, row in df_modified.iterrows():
    # Parameterized query keeps the values properly quoted and escaped;
    # int() converts numpy integers to plain Python ints for sqlite3
    conn.execute("""UPDATE people
                    SET age = ?
                    WHERE rowid = ?""",
                 (int(row['age']), int(row['rowid'])))
# Make sure these updates are committed
conn.commit()
The df_modified is the updated version of the df dataframe. Using the .iterrows() method we iterate over the modified dataframe, getting the index and the row, and each record in the dataframe has its age updated in the database. We're assuming only the age column was modified; we could add as many columns as needed to update the database properly. Note this doesn't check whether the values actually changed.
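Adding another column is just a matter of extending the SET clause. Treat gender here as a stand-in for whatever second column you modified.
for index, row in df_modified.iterrows():
    conn.execute("""UPDATE people
                    SET age = ?, gender = ?
                    WHERE rowid = ?""",
                 (int(row['age']), row['gender'], int(row['rowid'])))
conn.commit()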
Because we're looping over the entire dataframe this process may take some time. One tactic would be to limit the number of rows that you'll go over. For instance, if you know that you only changed the records in certain zip codes, filter (mask) the dataframe to only those zip codes.
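Here's a sketch of that masking idea; the zip_code column and the values are assumptions for illustration.
# Only loop over rows in the zip codes we know were changed (hypothetical column)
changed = df_modified[df_modified['zip_code'].isin(['49503', '49504'])]
for index, row in changed.iterrows():
    conn.execute("UPDATE people SET age = ? WHERE rowid = ?",
                 (int(row['age']), int(row['rowid'])))
conn.commit()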
Update by Deleting
Doing an update is good for smaller changes, but if there are a lot of columns or a large number of rows, we may want to take a slightly different tactic.
Let's say we did a lot of updating on our customer called Sears. We can delete all the records for that company.
delete_query = "DELETE FROM people WHERE company = 'Sears';"
conn.execute(delete_query)
conn.commit()
And now we can append the updated records to the database. We're assuming here that df_updated_Sears only contains the updated Sears records.
df_updated_Sears.to_sql('people', conn, if_exists='append', index=False)
conn.commit()
The advantage here is that it's a lot quicker than iterating over the entire dataframe, but the rowid will be new for each record. Also, if one of the fields is automatically set on creation, it would show today's date instead of when the record was first put in. Just some things to think about.
Conclusion
I know some are thinking we should just use SQLAlchemy, and SQLAlchemy can handle a lot more than what we've shown here. But there's more setup to deal with, so it's going to depend on how complicated your work is. What I'm showing here is how to do basic tasks without needing a lot of extra setup work.
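For comparison, here's a minimal sketch of the SQLAlchemy route, assuming the sqlalchemy package is installed; an engine takes the place of the sqlite3 connection object.
from sqlalchemy import create_engine
import pandas as pd

# Create an engine pointing at the same SQLite file
engine = create_engine("sqlite:///data/data.db")
df = pd.read_sql("SELECT * FROM people", engine)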