The best way to learn anything is to actually try it. Let's see how OpenPyXL works and style a spreadsheet. The Dataset I've chosen is a CSV of Countries of the World (kaggle.com).
For this post I'm assuming you have a working knowledge of Python and Excel. Please read at the first post in this Series for an understanding of OpenPyXL.
OpenPyXL will need to be installed before we can use it.
pip install openpyxl
Structure of the code and post
Each section will be its own function called from inside the main function. To start we'll call the main
function and pass in the CSV data file's name and the XLSX to be created's name. The workbook will be passed into each function. I'm hoping breaking the code down this way will make it easier to follow along.
# import the needed items
import csv
import openpyxl
from openpyxl.styles import Font
from openpyxl.styles import Alignment
from openpyxl.styles import Border, Side
from openpyxl.utils import get_column_letter
from openpyxl.styles import PatternFill
from openpyxl.worksheet.formula import ArrayFormula
from openpyxl.packaging.core import DocumentProperties
# This space is where all the functions will go ๐
def main(csv_file, xslx_file):
workbook = load_csv(csv_file)
# calling each function in turn
clean_wb(workbook)
style_header_row(workbook)
country_column(workbook)
format_numbers(workbook)
ranking_sheet(workbook)
summary_sheet(workbook)
# save the OpenPyXL Workbook
workbook.save(xslx_file)
if __name__ == "__main__":
main("data/countries of the world.csv", 'data/Countries Report.xlsx')
Loading a CSV into OpenPyXL
As mentioned in an earlier post, OpenPyXL doesn't access CSV files directly. With the CSV module we can read the CSV and load the OpenPyXL workbook object with the data. Check out my [post(]tongere.hashnode.dev/openpyxl-working-with-..) on loading CSV files into OpenPyXL for full details.
def load_csv(csv_file):
# Create a new OpenPyXL Workbook
wb= openpyxl.Workbook()
# Select the active sheet
ws = wb.active
# open the `csv_file`
with open(csv_file) as f:
# Read the CSV data
reader = csv.reader(f, delimiter=',')
# Load the data in the worksheet
for row in reader:
ws.append(row)
There are other options for loading a CSV and saving it as an Excel file. And you are welcome to change this function to use a different package. As long as it returns the OpenPyXL Worksheet object, the rest of the code will work just fine.
Workbook Maintenance
This section we're going to look at doing some maintenance on the workbook. We're not cleaning the data itself. Just changing some of the meta-data and trimming some of the white space.
By default, the properties of the Excel document will set the creator as 'openpyxl' and insert the creation and last modified dates but little else. By using a DocumentProperties
object we can add some extra meta-data. This is completely optional depending on if you're using tools that can search multiple Excel files.
Some other items you can set using DocumentProperties
::
- subject
- keywords
- category
- version
wb.active
will be the default page that opens when Excel opens the file. Normally this is the first sheet.
sheet.title
will let us change the title of the sheet (tab). The default title for the worksheet in this new workbook will be Sheet
. That's not very informative, plus we'll be adding more sheets.
sheet.columns
provides an iterable that we can loop over each column in the sheet. OpenPyXL will limit itself to only columns with data.
Using a pair of for
loops we'll iterate over each column and then each cell to access the cell itself.
if cell.value:
verifies that there is actually a value in the cell.cell.value.strip()
will remove any leading or following spaces.
Please Note: There is a chance that Excel marks cells as being used even if there are no values that we can see in them. This can happen when data is cleared out.
def clean_wb(wb):
# Set up some Excel Properties
workbook.properties = DocumentProperties(
creator = "Russ",
title="Country List",
lastModifiedBy="OpenPyXL")
# select the active page
sheet = wb.active
# Provide a good name for the sheet
sheet.title = "Country Data"
# iterate over every column
for column_cells in sheet.columns:
# iterate over evey cell in the column
for cell in column_cells:
# if the cell has data
if cell.value:
# strip spaces at the start and end
cell.value = cell.value.strip()
Header Row
For this section we're going to use two helper functions.
autofit a column's width
Setting the column width via code is done by setting the column's dimension to a width.
ws.column_dimensions['A'].width = '10'
The question is how to dynamically set it to the longest width? Excel doesn't have a formula or style setting the autofits the width of a column. What we can do is go through the data in the column and find the longest (max) length and set the width of the column to that.
Note: When we start working with formula's this won't quite work. The formula IS the value of the cell.
We'll be using get_column_letter(column_cells[0].column)
to return the column letter of cell reference column_cell[0]
.
def autofit(ws):
# Iterate over the columns in the worksheet
for column_cells in ws.columns:
# Determine the longest item in each column
max_length = max([len(cell.value or "") for cell in column_cells])
# I'm adding some extra since I like it a bit roomy
max_length = (max_length + 2) * 1.2
# set the width of the column to the max_length
ws.column_dimensions[
get_column_letter(column_cells[0].column)
].width = max_length
A little side detour. max()
will return the maximum value in a list. The list being used is created by a list comprehension. I'm spreading it out to be a easier to read.
[
len(cell.value or "") # this is the value to go in the list
for cell in column_cells # loop over the column
]
Formatting a Header Cell
Styling is handled on a cell-by-cell basis. For a style to be applied to a range of cells we'll need to loop over the cells. To make this a bit easier I'm going to use a function that will apply all the formatting rules to whatever cell we pass in. This won't make it faster, but using a defined function allows us to reuse the code for other tabs.
There are a couple of ways to define a color. I think the easiest method is with aRGB
using hexadecimal values. The colors are marked by setting the Red, Green and Blue values 00
(none) through FF
(max). 000000
would be black and FFFFFF
would be white. The a
is the transparency of the color and be in the front. With cell styles this isn't relevant, so we'll just prepend 00
in front.
For the borders we'll define some Side
objects with the border style and the color. These Side
objects will be used later when we apply the border to the cells.
The PatternFill
is the background pattern. There are a few options for the fill_type
including solid
, none
, darkTrellis
, and darkHorizontal
. Here we want a solid color.
The Font
object allows us to specify the font
, size
, color
, etc. If working with multiple spreadsheet reports, an idea might be to keep a Python Module with fonts already setup to use in all your reports.
The purpose of alignment
shouldn't be a surprise. This allows us to assign the alignment of the horizontal
, vertical
and if it'll wrap_text
.
With border we can specify each side using the Side
s we defined earlier.
def format_header_cell(cell):
# to make things a little clearer these are the colors I'm using for styling
blue = "000000FF"
lightBlue = "0099CCFF"
black = "00000000"
white = "00FFFFFF"
# Here we are defining the border style
thin_border = Side(border_style="thin", color=blue)
double_border = Side(border_style="double", color=lightBlue)
# apply a fill to the cell
cell.fill = PatternFill(start_color=blue, end_color=blue, fill_type="solid")
# apply the font styles
cell.font = Font(name="Tahoma", size=12, color=white, bold=True)
# apply an alignment
cell.alignment = Alignment(horizontal="center", vertical="center", wrap_text=True)
# apply the border
cell.border = Border(
top=double_border, left=thin_border, right=thin_border, bottom=double_border
)
Now that we've defined the helper functions let's get to work with styling the header row. Most functionality was actually handled in the helper functions.
We're selecting the active sheet (the first and only sheet). The for loop is iterating over the cells in the column_header_row
(1) and applies the formatting. Passing the worksheet to autofit
handles the column width.
Well, except for column H
. That we're going to manually decrease by 60%, cutting it almost in half. Since we turned wrap_text
on, it'll break the header into 2 lines. The heading was fairly long so to me it looks better wrapped. The report would have been fine. I'm making this change to make it more readable.
Also, this gives you another example of setting the column width. ๐
def style_header_row(wb):
# Select the active page
ws = wb.active
# The column header is in the first row
column_header_row = 1
# Select the column headers
for cell in ws[column_header_row]:
format_header_cell(cell)
# Use the user defined function to set the column width
autofit(ws)
# Adjust Column 'H' due to it's width
ws.column_dimensions["H"].width = ws.column_dimensions["H"].width * 0.6
Styling the Country Column
Here we're styling the Country list in column A
. For the most part everything here was already mentioned in the code for the header. In this function we won't be using a helper function, but adding one wouldn't be that hard if we wanted to use on other sheets or workbooks.
Something to point out is, since I already styled A1
I'm using a list slice to skip looping over the first cell in column A
. ws[country_column][1:]
this shows that the cell range can be sliced just like any other iterable.
def country_column(wb):
# Select the active page
ws = wb.active
# Countries are in the first column
country_column = "A"
# To make things a little clearer these are the colors I'm using for styling
babyBlue = "00CCFFFF"
# Here we are defining the border style
thick_border = Side(border_style="thick")
# Iterating over the Country column skipping the header row
for cell in ws[country_column][1:]:
# Formatting each cell
cell.fill = PatternFill(
start_color=babyBlue, end_color=babyBlue, fill_type="solid"
)
cell.alignment = Alignment(horizontal="right")
cell.border = Border(right=thick_border)
Formatting the number data
The data from the CSV was pulled in completely as text. For the formulas to work we'll need numbers. We also will want to format the numbers to either have 2 decimals or no decimals.
Just as the styling we'll need to iterate over the cells to apply the proper formatting.
for col in ws['C:T']:
we're looping over the columns C
through T
.
for cell in col[1:]:
we're looping over the cells in the column and skipping the 1 row. That's the header row, which isn't a number.
By checking each cell to verify it has a value and it's a string we can limit errors. Because we need to replace the ,
that is in the CSV with a .
and convert it to a float.
Now that we have numbers in the cells. We can apply the number format to the cell.
"#,##0"
will be used if it's one of the columns, ['C', 'D', 'I']"#,##0.00"
will be used for all other columns
If you're curious where these formatting strings came from, it's Excel's custom number formatting code.
def format_numbers(wb):
# Select the active page
ws = wb.active
# 2 types of numeric columns: no decimals and 2 decimals
number_col_0_dec = ['C', 'D', 'I']
# Iterate over the number columns
for col in ws['C:T']:
# Iterate over the cells in that column, skip the header row
for cell in col[1:]:
# Verify the cell has a value and is of instance `str`
if cell.value and isinstance(cell.value, str):
# replace the comma with a period
cell.value = float(cell.value.replace(',','.'))
# apply format
cell.number_format = "#,##0" if col in number_col_0_dec else "#,##0.00"
Ranking Sheet
In this sheet we'll start using some formulas. Each cell will be address individually to place the content.
wb.create_sheet(title="Rankings")
creates a new sheet with a title and inserts it at the end of the workbook. We could specify an index=1
to insert it in the first position. The sheet index count starts at 1. ๐
The actions for the three header cells are basically the same.
- Assign the text
- Use the helper function from earlier to style the cell
- Set a column width
For each row we're assigning the cell values the same type of pattern.
- Column 1: Assign text
- Column 2: Excel Formula to retrieve the Country name
- Column 3: Excel Formula to retrieve the numeric value
The formulas are a straight Excel formula. How they work is outside this post, but there is a couple of resources at the end of the post that can help.
For column 3, a number format is being applied. That's the only styling we're doing here.
def ranking_sheet(wb):
# add a summary sheet
rankings_sheet = wb.create_sheet(title="Rankings")
# Header Row
rankings_sheet['A1'] = "Ranking"
format_header_cell(rankings_sheet['A1'])
rankings_sheet.column_dimensions['A'].width = 25
format_header_cell(rankings_sheet['B1'])
rankings_sheet['B1'] = "Country"
rankings_sheet.column_dimensions['B'].width = 20
format_header_cell(rankings_sheet['C1'])
rankings_sheet['C1'] = "Values"
rankings_sheet.column_dimensions['C'].width = 15
# Largest Population
rankings_sheet['A2'] = "Largest population"
rankings_sheet['B2'] = "=INDEX('Country Data'!A2:A229, MATCH(MAX('Country Data'!C2:C228), 'Country Data'!C2:C228, 0))"
rankings_sheet['C2'] = "=MAX('Country Data'!C2:C228)"
rankings_sheet['C2'].number_format = "#,###"
# Largest Population
rankings_sheet['A3'] = "Smallest population"
rankings_sheet['B3'] = "=INDEX('Country Data'!A2:A229, MATCH(MIN('Country Data'!C2:C228), 'Country Data'!C2:C228, 0))"
rankings_sheet['C3'] = "=MIN('Country Data'!C2:C228)"
rankings_sheet['C3'].number_format = "#,###"
# Largest Area
rankings_sheet['A4'] = "Largest area"
rankings_sheet['B4'] = "=INDEX('Country Data'!A2:A229, MATCH(MAX('Country Data'!D2:D228), 'Country Data'!D2:D228, 0))"
rankings_sheet['C4'] = "=MAX('Country Data'!D2:D228)"
rankings_sheet['C4'].number_format = "#,###"
# Smallest Area
rankings_sheet['A5'] = "Smallest area"
rankings_sheet['B5'] = "=INDEX('Country Data'!A2:A229, MATCH(MIN('Country Data'!D2:D228), 'Country Data'!D2:D228, 0))"
rankings_sheet['C5'] = "=MIN('Country Data'!D2:D228)"
rankings_sheet['C5'].number_format = "#,###"
Summary Sheet
The Ranking Sheet we went row by row. For the Summary Sheet we're going to go column by column.
Each column has 4 steps:
- Assign text to first cell
- Style the header cell using the helper function
- Set the formula for the column
- Set the width of the column
As always, we'll skip parts we've addressed earlier.
The Region List in column 'A' is using the UNIQUE
Excel formula that will spill the list down the column. To enable the spilling, we need to use ArrayFormula
. This is an OpenPyXL function that works with Dynamic lists in Excel.
Placing the formula as this, only will present the first item in this cell. We could manually adjust the formula will get it to spill into the next cells.
summary_sheet["A2"].value = "=_xlfn.UNIQUE(_xlfn.SORT('Country Data'!B2:B228))"
Using the ArrayFormula
will handle this adjustment for us. We need to pass to the range that the formula will cover and the formula itself.
summary_sheet["A2"] = ArrayFormula(
"A2:A12", "=_xlfn.UNIQUE(_xlfn.SORT('Country Data'!B2:B228))"
)
Something important you'll notice is the _xlfn.
prefix in from the Excel Function used; UNIQUE
and SORT
. This is directing that this is an Excel Function instead of a user created function. Any function in the initial specification doesn't need to be prefixed with _xlfn.
to work.
How do we know if it needs to have it prefixed? We can check (example next), if it returns False
the prefix is required.
>>> from openpyxl.utils import FORMULAE
>>> "UNIQUE" in FORMULAE
False
When the formula is viewed in Excel, it'll be in each cell in the range. And will be surrounded by { }
. This is a sign it's a dynamic list. Without these { }
each cell will only contain the first element in the list. Go ahead and test it out to get a good idea of how it works.
Here's the function summary_sheet
function:
def summary_sheet(wb):
# add a summary sheet
summary_sheet = wb.create_sheet(title="Summary")
# Region header
summary_sheet["A1"].value = "Region"
format_header_cell(summary_sheet["A1"])
# per OpenPyXL's documention, any formula that wasn't in the initial specification
# must be prefixed with `_xlfn.`
summary_sheet["A2"] = ArrayFormula(
"A2:A12", "=_xlfn.UNIQUE(_xlfn.SORT('Country Data'!B2:B228))"
)
summary_sheet.column_dimensions["A"].width = 22
# Count of Country
summary_sheet["B1"].value = "# of Countries"
format_header_cell(summary_sheet["B1"])
for col in summary_sheet["B2:B12"]:
for cell in col:
cell.value = f"=COUNTIF('Country Data'!$B$2:$B$228, A{cell.row})"
cell.number_format = "#,###"
summary_sheet.column_dimensions["B"].width = 15
# Population
summary_sheet["C1"].value = "Population"
format_header_cell(summary_sheet["C1"])
for col in summary_sheet["C2:C12"]:
for cell in col:
cell.value = f"=SUMIF('Country Data'!$B$2:$B$228, A{cell.row}, 'Country Data'!$C$2:$C$228)"
cell.number_format = "#,###"
summary_sheet.column_dimensions["C"].width = 15
# Area
summary_sheet["D1"].value = "Area"
format_header_cell(summary_sheet["D1"])
for col in summary_sheet["D2:D12"]:
for cell in col:
cell.value = f"=SUMIF('Country Data'!$B$2:$B$228, A{cell.row}, 'Country Data'!$D$2:$CD$228)"
cell.number_format = "#,###"
summary_sheet.column_dimensions["D"].width = 15
Conclusion
We've covered a lot of material. If you have any questions or suggestions, please leave a comment.
I highly recommend going through and running the code. Add some extra calculations. Maybe even add some styling. Best way to learn is to practice.
Resources
Leila Gharani - YouTube: How to use Excel Index Match
Leila Gharani - YouTube: Basic Excel Formulas and Functions