Python - OpenPyXL - Styling a Spreadsheet

Python - OpenPyXL - Styling a Spreadsheet

ยท

13 min read

The best way to learn anything is to actually try it. Let's see how OpenPyXL works and style a spreadsheet. The Dataset I've chosen is a CSV of Countries of the World (kaggle.com).

For this post I'm assuming you have a working knowledge of Python and Excel. Please read at the first post in this Series for an understanding of OpenPyXL.

OpenPyXL will need to be installed before we can use it.

pip install openpyxl

Structure of the code and post

Each section will be its own function called from inside the main function. To start we'll call the main function and pass in the CSV data file's name and the XLSX to be created's name. The workbook will be passed into each function. I'm hoping breaking the code down this way will make it easier to follow along.

# import the needed items
import csv
import openpyxl
from openpyxl.styles import Font
from openpyxl.styles import Alignment
from openpyxl.styles import Border, Side
from openpyxl.utils import get_column_letter
from openpyxl.styles import PatternFill
from openpyxl.worksheet.formula import ArrayFormula
from openpyxl.packaging.core import DocumentProperties

# This space is where all the functions will go ๐Ÿ˜Ž

def main(csv_file, xslx_file):

    workbook = load_csv(csv_file)

    # calling each function in turn
    clean_wb(workbook)
    style_header_row(workbook)
    country_column(workbook)
    format_numbers(workbook)
    ranking_sheet(workbook)
    summary_sheet(workbook)

    # save the OpenPyXL Workbook
    workbook.save(xslx_file)

if __name__ == "__main__":
    main("data/countries of the world.csv", 'data/Countries Report.xlsx')

Loading a CSV into OpenPyXL

As mentioned in an earlier post, OpenPyXL doesn't access CSV files directly. With the CSV module we can read the CSV and load the OpenPyXL workbook object with the data. Check out my [post(]tongere.hashnode.dev/openpyxl-working-with-..) on loading CSV files into OpenPyXL for full details.

def load_csv(csv_file):

    # Create a new OpenPyXL Workbook
    wb= openpyxl.Workbook()

    # Select the active sheet
    ws = wb.active

    # open the `csv_file`
    with open(csv_file) as f:

        # Read the CSV data
        reader = csv.reader(f, delimiter=',')

        # Load the data in the worksheet
        for row in reader:
            ws.append(row)

There are other options for loading a CSV and saving it as an Excel file. And you are welcome to change this function to use a different package. As long as it returns the OpenPyXL Worksheet object, the rest of the code will work just fine.

Workbook Maintenance

This section we're going to look at doing some maintenance on the workbook. We're not cleaning the data itself. Just changing some of the meta-data and trimming some of the white space.

By default, the properties of the Excel document will set the creator as 'openpyxl' and insert the creation and last modified dates but little else. By using a DocumentProperties object we can add some extra meta-data. This is completely optional depending on if you're using tools that can search multiple Excel files.

Some other items you can set using DocumentProperties::

  • subject
  • keywords
  • category
  • version

wb.active will be the default page that opens when Excel opens the file. Normally this is the first sheet.

sheet.title will let us change the title of the sheet (tab). The default title for the worksheet in this new workbook will be Sheet. That's not very informative, plus we'll be adding more sheets.

sheet.columns provides an iterable that we can loop over each column in the sheet. OpenPyXL will limit itself to only columns with data.

Using a pair of for loops we'll iterate over each column and then each cell to access the cell itself.

  • if cell.value: verifies that there is actually a value in the cell.
  • cell.value.strip() will remove any leading or following spaces.

Please Note: There is a chance that Excel marks cells as being used even if there are no values that we can see in them. This can happen when data is cleared out.

def clean_wb(wb):

    # Set up some Excel Properties
    workbook.properties = DocumentProperties(
                     creator = "Russ", 
                     title="Country List", 
                     lastModifiedBy="OpenPyXL")

    # select the active page
    sheet = wb.active

    # Provide a good name for the sheet
    sheet.title = "Country Data"

    # iterate over every column
    for column_cells in sheet.columns:

        # iterate over evey cell in the column
        for cell in column_cells:

            # if the cell has data
            if cell.value:

                # strip spaces at the start and end
                cell.value = cell.value.strip()

Header Row

For this section we're going to use two helper functions.

autofit a column's width

Setting the column width via code is done by setting the column's dimension to a width.

ws.column_dimensions['A'].width = '10'

The question is how to dynamically set it to the longest width? Excel doesn't have a formula or style setting the autofits the width of a column. What we can do is go through the data in the column and find the longest (max) length and set the width of the column to that.

Note: When we start working with formula's this won't quite work. The formula IS the value of the cell.

We'll be using get_column_letter(column_cells[0].column) to return the column letter of cell reference column_cell[0].

def autofit(ws):

    # Iterate over the columns in the worksheet
    for column_cells in ws.columns:

        # Determine the longest item in each column
        max_length = max([len(cell.value or "") for cell in column_cells])

        # I'm adding some extra since I like it a bit roomy
        max_length = (max_length + 2) * 1.2

        # set the width of the column to the max_length
        ws.column_dimensions[
            get_column_letter(column_cells[0].column)
        ].width = max_length

A little side detour. max()will return the maximum value in a list. The list being used is created by a list comprehension. I'm spreading it out to be a easier to read.

[
    len(cell.value or "")            # this is the value to go in the list
    for cell in column_cells      # loop over the column
]

Formatting a Header Cell

Styling is handled on a cell-by-cell basis. For a style to be applied to a range of cells we'll need to loop over the cells. To make this a bit easier I'm going to use a function that will apply all the formatting rules to whatever cell we pass in. This won't make it faster, but using a defined function allows us to reuse the code for other tabs.

There are a couple of ways to define a color. I think the easiest method is with aRGB using hexadecimal values. The colors are marked by setting the Red, Green and Blue values 00 (none) through FF (max). 000000 would be black and FFFFFF would be white. The a is the transparency of the color and be in the front. With cell styles this isn't relevant, so we'll just prepend 00 in front.

For the borders we'll define some Side objects with the border style and the color. These Side objects will be used later when we apply the border to the cells.

The PatternFill is the background pattern. There are a few options for the fill_type including solid, none, darkTrellis, and darkHorizontal. Here we want a solid color.

The Font object allows us to specify the font, size, color, etc. If working with multiple spreadsheet reports, an idea might be to keep a Python Module with fonts already setup to use in all your reports.

The purpose of alignment shouldn't be a surprise. This allows us to assign the alignment of the horizontal, vertical and if it'll wrap_text.

With border we can specify each side using the Sides we defined earlier.

def format_header_cell(cell):

    # to make things a little clearer these are the colors I'm using for styling
    blue = "000000FF"
    lightBlue = "0099CCFF"
    black = "00000000"
    white = "00FFFFFF"

    # Here we are defining the border style
    thin_border = Side(border_style="thin", color=blue)
    double_border = Side(border_style="double", color=lightBlue)

    # apply a fill to the cell
    cell.fill = PatternFill(start_color=blue, end_color=blue, fill_type="solid")

    # apply the font styles
    cell.font = Font(name="Tahoma", size=12, color=white, bold=True)

    # apply an alignment
    cell.alignment = Alignment(horizontal="center", vertical="center", wrap_text=True)

    # apply the border
    cell.border = Border(
        top=double_border, left=thin_border, right=thin_border, bottom=double_border
    )

Now that we've defined the helper functions let's get to work with styling the header row. Most functionality was actually handled in the helper functions.

We're selecting the active sheet (the first and only sheet). The for loop is iterating over the cells in the column_header_row (1) and applies the formatting. Passing the worksheet to autofit handles the column width.

Well, except for column H. That we're going to manually decrease by 60%, cutting it almost in half. Since we turned wrap_text on, it'll break the header into 2 lines. The heading was fairly long so to me it looks better wrapped. The report would have been fine. I'm making this change to make it more readable.

Also, this gives you another example of setting the column width. ๐Ÿ˜€

def style_header_row(wb):

    # Select the active page
    ws = wb.active

    # The column header is in the first row
    column_header_row = 1

    # Select the column headers
    for cell in ws[column_header_row]:
        format_header_cell(cell)

    # Use the user defined function to set the column width
    autofit(ws)

    # Adjust Column 'H' due to it's width
    ws.column_dimensions["H"].width = ws.column_dimensions["H"].width * 0.6

Styling the Country Column

Here we're styling the Country list in column A. For the most part everything here was already mentioned in the code for the header. In this function we won't be using a helper function, but adding one wouldn't be that hard if we wanted to use on other sheets or workbooks.

Something to point out is, since I already styled A1 I'm using a list slice to skip looping over the first cell in column A. ws[country_column][1:] this shows that the cell range can be sliced just like any other iterable.

def country_column(wb):

    # Select the active page
    ws = wb.active

    # Countries are in the first column
    country_column = "A"

    # To make things a little clearer these are the colors I'm using for styling
    babyBlue = "00CCFFFF"

    # Here we are defining the border style
    thick_border = Side(border_style="thick")

    # Iterating over the Country column skipping the header row
    for cell in ws[country_column][1:]:

       # Formatting each cell 
        cell.fill = PatternFill(
            start_color=babyBlue, end_color=babyBlue, fill_type="solid"
        )
        cell.alignment = Alignment(horizontal="right")
        cell.border = Border(right=thick_border)

Formatting the number data

The data from the CSV was pulled in completely as text. For the formulas to work we'll need numbers. We also will want to format the numbers to either have 2 decimals or no decimals.

Just as the styling we'll need to iterate over the cells to apply the proper formatting.

for col in ws['C:T']: we're looping over the columns C through T.

for cell in col[1:]: we're looping over the cells in the column and skipping the 1 row. That's the header row, which isn't a number.

By checking each cell to verify it has a value and it's a string we can limit errors. Because we need to replace the , that is in the CSV with a . and convert it to a float.

Now that we have numbers in the cells. We can apply the number format to the cell.

  • "#,##0" will be used if it's one of the columns, ['C', 'D', 'I']
  • "#,##0.00" will be used for all other columns

If you're curious where these formatting strings came from, it's Excel's custom number formatting code.

def format_numbers(wb):

    # Select the active page
    ws = wb.active

    # 2 types of numeric columns: no decimals and 2 decimals
    number_col_0_dec = ['C', 'D', 'I']

    # Iterate over the number columns
    for col in ws['C:T']:

        # Iterate over the cells in that column, skip the header row
        for cell in col[1:]:

            # Verify the cell has a value and is of instance `str`
            if cell.value and isinstance(cell.value, str):

                # replace the comma with a period
                cell.value = float(cell.value.replace(',','.'))

            # apply format
            cell.number_format = "#,##0" if col in number_col_0_dec else "#,##0.00"

Ranking Sheet

In this sheet we'll start using some formulas. Each cell will be address individually to place the content.

wb.create_sheet(title="Rankings") creates a new sheet with a title and inserts it at the end of the workbook. We could specify an index=1 to insert it in the first position. The sheet index count starts at 1. ๐Ÿ˜

The actions for the three header cells are basically the same.

  • Assign the text
  • Use the helper function from earlier to style the cell
  • Set a column width

For each row we're assigning the cell values the same type of pattern.

  • Column 1: Assign text
  • Column 2: Excel Formula to retrieve the Country name
  • Column 3: Excel Formula to retrieve the numeric value

The formulas are a straight Excel formula. How they work is outside this post, but there is a couple of resources at the end of the post that can help.

For column 3, a number format is being applied. That's the only styling we're doing here.

def ranking_sheet(wb):

    # add a summary sheet
    rankings_sheet = wb.create_sheet(title="Rankings")

    # Header Row
    rankings_sheet['A1'] = "Ranking"
    format_header_cell(rankings_sheet['A1'])
    rankings_sheet.column_dimensions['A'].width = 25

    format_header_cell(rankings_sheet['B1'])
    rankings_sheet['B1'] = "Country"
    rankings_sheet.column_dimensions['B'].width = 20

    format_header_cell(rankings_sheet['C1'])
    rankings_sheet['C1'] = "Values"
    rankings_sheet.column_dimensions['C'].width = 15

    # Largest Population
    rankings_sheet['A2'] = "Largest population"
    rankings_sheet['B2'] = "=INDEX('Country Data'!A2:A229, MATCH(MAX('Country Data'!C2:C228), 'Country Data'!C2:C228, 0))"
    rankings_sheet['C2'] = "=MAX('Country Data'!C2:C228)"
    rankings_sheet['C2'].number_format = "#,###"

    # Largest Population
    rankings_sheet['A3'] = "Smallest population"
    rankings_sheet['B3'] = "=INDEX('Country Data'!A2:A229, MATCH(MIN('Country Data'!C2:C228), 'Country Data'!C2:C228, 0))"
    rankings_sheet['C3'] = "=MIN('Country Data'!C2:C228)"
    rankings_sheet['C3'].number_format = "#,###"

    # Largest Area
    rankings_sheet['A4'] = "Largest area"
    rankings_sheet['B4'] = "=INDEX('Country Data'!A2:A229, MATCH(MAX('Country Data'!D2:D228), 'Country Data'!D2:D228, 0))"
    rankings_sheet['C4'] = "=MAX('Country Data'!D2:D228)"
    rankings_sheet['C4'].number_format = "#,###"

    # Smallest Area
    rankings_sheet['A5'] = "Smallest area"
    rankings_sheet['B5'] = "=INDEX('Country Data'!A2:A229, MATCH(MIN('Country Data'!D2:D228), 'Country Data'!D2:D228, 0))"
    rankings_sheet['C5'] = "=MIN('Country Data'!D2:D228)"
    rankings_sheet['C5'].number_format = "#,###"

Summary Sheet

The Ranking Sheet we went row by row. For the Summary Sheet we're going to go column by column.

Each column has 4 steps:

  • Assign text to first cell
  • Style the header cell using the helper function
  • Set the formula for the column
  • Set the width of the column

As always, we'll skip parts we've addressed earlier.

The Region List in column 'A' is using the UNIQUE Excel formula that will spill the list down the column. To enable the spilling, we need to use ArrayFormula. This is an OpenPyXL function that works with Dynamic lists in Excel.

Placing the formula as this, only will present the first item in this cell. We could manually adjust the formula will get it to spill into the next cells.

summary_sheet["A2"].value = "=_xlfn.UNIQUE(_xlfn.SORT('Country Data'!B2:B228))"

Using the ArrayFormula will handle this adjustment for us. We need to pass to the range that the formula will cover and the formula itself.

    summary_sheet["A2"] = ArrayFormula(
        "A2:A12", "=_xlfn.UNIQUE(_xlfn.SORT('Country Data'!B2:B228))"
    )

Something important you'll notice is the _xlfn. prefix in from the Excel Function used; UNIQUE and SORT. This is directing that this is an Excel Function instead of a user created function. Any function in the initial specification doesn't need to be prefixed with _xlfn. to work.

How do we know if it needs to have it prefixed? We can check (example next), if it returns False the prefix is required.

>>> from openpyxl.utils import FORMULAE
>>> "UNIQUE" in FORMULAE
False

When the formula is viewed in Excel, it'll be in each cell in the range. And will be surrounded by { }. This is a sign it's a dynamic list. Without these { } each cell will only contain the first element in the list. Go ahead and test it out to get a good idea of how it works.

Here's the function summary_sheet function:

def summary_sheet(wb):

    # add a summary sheet
    summary_sheet = wb.create_sheet(title="Summary")

    # Region header
    summary_sheet["A1"].value = "Region"
    format_header_cell(summary_sheet["A1"])

    # per OpenPyXL's documention, any formula that wasn't in the initial specification
    # must be prefixed with `_xlfn.`
    summary_sheet["A2"] = ArrayFormula(
        "A2:A12", "=_xlfn.UNIQUE(_xlfn.SORT('Country Data'!B2:B228))"
    )

    summary_sheet.column_dimensions["A"].width = 22

    # Count of Country
    summary_sheet["B1"].value = "# of Countries"
    format_header_cell(summary_sheet["B1"])

    for col in summary_sheet["B2:B12"]:
        for cell in col:
            cell.value = f"=COUNTIF('Country Data'!$B$2:$B$228, A{cell.row})"
            cell.number_format = "#,###"

    summary_sheet.column_dimensions["B"].width = 15

    # Population
    summary_sheet["C1"].value = "Population"
    format_header_cell(summary_sheet["C1"])

    for col in summary_sheet["C2:C12"]:
        for cell in col:
            cell.value = f"=SUMIF('Country Data'!$B$2:$B$228, A{cell.row}, 'Country Data'!$C$2:$C$228)"
            cell.number_format = "#,###"

    summary_sheet.column_dimensions["C"].width = 15

    # Area
    summary_sheet["D1"].value = "Area"
    format_header_cell(summary_sheet["D1"])

    for col in summary_sheet["D2:D12"]:
        for cell in col:
            cell.value = f"=SUMIF('Country Data'!$B$2:$B$228, A{cell.row}, 'Country Data'!$D$2:$CD$228)"
            cell.number_format = "#,###"

    summary_sheet.column_dimensions["D"].width = 15

Conclusion

We've covered a lot of material. If you have any questions or suggestions, please leave a comment.

I highly recommend going through and running the code. Add some extra calculations. Maybe even add some styling. Best way to learn is to practice.

Resources

Leila Gharani - YouTube: How to use Excel Index Match

Leila Gharani - YouTube: Basic Excel Formulas and Functions

OpenPyXL Docs - Formulas

RussEby Repo containing code and data files

kaggle.com

Did you find this article valuable?

Support Russ Eby by becoming a sponsor. Any amount is appreciated!