Python - SQLite3 with CSV and Pandas


In the starting guide (last post) we went over the statements that let us access a SQLite3 database from Python. Here we're going to go over some recipes: code examples that put those statements together to do something specific.

Loading a CSV into SQLite database

First thing we need to do is get data into the database. We could type it all in, but more commonly we'll have CSV files that we can load in.

This example uses two context managers. The first one opens the database connection; inside that, we open the CSV file. Keep in mind it's the indentation that determines when each context ends. One caveat: sqlite3's context manager commits (or rolls back) the transaction when the block exits, but it does not close the connection, which is why we still call conn.close() at the end.

# Import the modules
import csv
import sqlite3

# Define the database file name
db_file = "data\data2.db"

# Use the with statement to create a context manager
with sqlite3.connect(db_file) as conn:
    # Create a cursor object to execute queries
    cursor = conn.cursor()

    # Create a table with the same columns as the CSV file
    cursor.execute(
        """CREATE TABLE IF NOT EXISTS people 
        (name TEXT, age INTEGER, address TEXT)"""
    )

    # Open the CSV file
    with open("data\data.csv", "r") as f:   
        # Create a csv reader object
        reader = csv.reader(f)

        # Skip the header row
        next(reader)

        # Insert each row into the table
        for row in reader:
            cursor.execute("""INSERT INTO people 
                        (name, age, gender) 
                        VALUES (?, ?, ?)""", 
                        row)

conn.close()

Another option, instead of iterating with the for row in reader loop, is to convert the reader to a list with the list function and pass it to executemany. The rest of the code stays the same; the for loop is removed since executemany handles the looping for us.

        cursor.executemany("""INSERT INTO people 
                    (name, age, gender) 
                    VALUES (?, ?, ?)""", 
                    list(reader))

Why one over the other? Looping over the reader ourselves gives us access to each row, so we can modify the values or even skip a row completely if needed.
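For example, the loop above could clean each row on its way in. This is just a sketch, assuming the columns are name, age, and gender in that order:

        for row in reader:
            # Skip rows that are missing a name
            if not row[0].strip():
                continue

            # Normalize the gender value before inserting
            row[2] = row[2].upper()

            cursor.execute("""INSERT INTO people 
                        (name, age, gender) 
                        VALUES (?, ?, ?)""", 
                        row)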

The shortest method is to load the CSV file with Pandas and save it directly into the SQLite3 database. It can be cut down to just a couple of lines. I'll give an example in a later section.

Using SQLite with Pandas

Pandas can load data from many different sources, and SQLite3 is about as easy as any of them. Pandas can read any SQL database that SQLAlchemy supports.

Reading a SQLite3 DB into Pandas

We pass the SQL statement and the SQLite3 connection object to Pandas' read_sql_query method and let Pandas handle bringing the data into a dataframe.

Reading the data is fairly straightforward. A SQL SELECT statement can be crafted to pull in only the data you need for your work. In the example we're pulling in the entire people table, but we could choose specific columns or filter the records to a smaller selection.

# Import pandas and sqlite3 modules
import pandas as pd
import sqlite3

# Use the with statement to create a context manager
with sqlite3.connect("data/data.db") as conn:
    # Write a SQL query to select data from the database
    sql_query = "SELECT * FROM people"

    # Use pandas.read_sql_query() to load the data into a DataFrame
    df = pd.read_sql_query(sql_query, conn)

# Print the DataFrame
print(df)

Instead of using the execute method from SQLite3 to run the SQL statement, we pass the query and the database connection directly to Pandas' read_sql_query, and it handles converting the results into a DataFrame.
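For instance, a narrower query might pull only two columns and filter the rows with a bound parameter; the age threshold of 30 here is arbitrary:

import pandas as pd
import sqlite3

with sqlite3.connect("data/data.db") as conn:
    # Select only two columns and filter rows with a placeholder
    sql_query = "SELECT name, age FROM people WHERE age > ?"
    df = pd.read_sql_query(sql_query, conn, params=(30,))

print(df)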

Save Pandas Dataframe to SQLite3

Saving data from Pandas can be done in a few ways depending on what you need done.

Saving the entire Dataframe to a SQLite3 database is pretty quick. With if_exists='replace', this will drop and recreate the table in the SQLite3 database.

# as promised here we're reading the CSV into a dataframe 😀 to save into the DB
df = pd.read_csv("data\data.csv")

conn = sqlite3.connect("example.db")

df.to_sql("my_table", conn, if_exists='replace', index=False)

conn.close()

The if_exists option defaults to fail, meaning if the table already exists it will raise a ValueError. The replace option will either create a new table or, surprise, replace the existing one.

index=False tells Pandas not to write its index as a column.
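Here's a minimal sketch showing all three if_exists behaviors side by side (my_table is just an example name):

import pandas as pd
import sqlite3

df = pd.DataFrame({'name': ['Ann'], 'age': [30], 'gender': ['F']})

conn = sqlite3.connect("example.db")

df.to_sql("my_table", conn, if_exists='replace', index=False)  # drop and recreate
df.to_sql("my_table", conn, if_exists='append', index=False)   # add these rows again

try:
    df.to_sql("my_table", conn, index=False)  # default is if_exists='fail'
except ValueError as e:
    print(e)

conn.close()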

Append Pandas Dataframe to SQLite3

if_exists has another option, append. This adds the new records to the existing table.

# Open connection to the SQLite3 DB
conn = sqlite3.connect("example.db")

# Create a new Dataframe
new_people = pd.DataFrame({'name': ['Russ'], 'age': [55], 'gender': ['M']})

# Have Pandas send the data and append it through the connection
new_people.to_sql('people', conn, if_exists='append', index=False)

conn.close()

When doing an append, the columns must be the same between the dataframe and the table. The index=False setting must also match how the table was originally written.

Updating SQLite3 Database from Pandas

Updating existing records in a SQLite3 database is a little more work than completely replacing the table, but not too much.

There'll be many ways of doing this. I'll show you the way I might do it and you can adjust to fit the work you're doing.

First, we should bring in the rowid when we bring in the data from SQLite3. This will be used to match the records from the dataframe and the SQLite3 database.

import pandas as pd
import sqlite3

conn = sqlite3.connect('data/pandas.db')

sql_command = "SELECT rowid, * FROM people"

df = pd.read_sql(sql_command, conn)

The only difference from before is also including the rowid as part of the columns being pulled. Of course we could limit the columns we'll pull in and maybe even add a WHERE clause to pull only certain records.

If you're thinking the index numbers of the dataframe could be used instead of the rowid, keep in mind that if records were deleted from the database or if we reindex the dataframe, these will get out of sync. Pulling the rowid will allow us to match each record in the dataframe to a record in the Database.

# whatever modifications and updates are needed to make this new dataframe
df_modified = df

# Use a parameterized query so values are quoted safely
update_query = """UPDATE people 
                  SET age = ?
                  WHERE rowid = ?"""

for index, row in df_modified.iterrows():
    conn.execute(update_query, (row['age'], row['rowid']))

# Make sure these updates are committed
conn.commit()

The df_modified is the updated version of the df dataframe. Using the .iterrows() method we iterate over the modified dataframe, getting the index and the row, and each record has its age updated in the database through a parameterized query. We're assuming only the age column was modified; we could add as many columns as needed to the SET clause. Note that this doesn't check whether the values actually changed.
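For instance, updating both age and name in the same pass might look like this sketch:

update_query = """UPDATE people 
                  SET age = ?, name = ?
                  WHERE rowid = ?"""

for index, row in df_modified.iterrows():
    conn.execute(update_query, (row['age'], row['name'], row['rowid']))

conn.commit()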

Because we're looping over the entire dataframe, this process may take some time. One tactic is to limit the number of rows you iterate over. For instance, if you know you only changed the records in certain zip codes, filter (mask) the dataframe down to those zip codes.
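As a sketch, assuming a hypothetical zip_code column and a list of the zip codes you touched:

# zip_code is a hypothetical column; changed_zips is whatever you know you touched
changed_zips = ['90210', '10001']
df_subset = df_modified[df_modified['zip_code'].isin(changed_zips)]

for index, row in df_subset.iterrows():
    conn.execute("UPDATE people SET age = ? WHERE rowid = ?",
                 (row['age'], row['rowid']))

conn.commit()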

Update by Deleting

Doing an UPDATE is good for smaller changes, but if it's a lot of columns or a large number of rows, we may want to take a slightly different tactic.

Let's say we did a lot of updating on our customer called Sears. We can delete all the records for that company.

delete_query = "DELETE FROM people WHERE company = 'Sears';"

conn.execute(delete_query)

conn.commit()

And now we can append the updated records to the database. We're assuming here that df_updated_Sears only contains the updated Sears records.

df_updated_Sears.to_sql('people', conn, if_exists='append', index=False)

conn.commit()

The advantage here is that it's a lot quicker than iterating over the entire dataframe, but the rowid will be new for each record. Also, if one of the fields is set automatically on creation, it would now show today's date instead of when the record was first put in. Just some things to think about.

Conclusion

I know some of you are thinking we should just use SQLAlchemy, and SQLAlchemy can handle a lot more than what we've shown here. But it also requires more setup, so it depends on how complicated your work is. What I'm showing here is how to do the basic tasks without a lot of extra setup work.

Resources

Pandas - DataFrame.to_sql

Pandas - read_sql_query
