Parsing PDFs in Python with Tika

A few months ago, one of my friends asked me if I could help him extract some data from a collection of PDFs. The PDFs contained records of his financial transactions over a period of years and he wanted to analyze them. Unfortunately, Excel and plain text versions of the files were no longer available, so the PDFs were his only option.

I reviewed a few Python-based PDF parsers and decided to try Tika, which is a port of Apache Tika.  Tika parsed the PDFs quickly and accurately. I extracted the data my friend needed and sent it to him in CSV format so he could analyze it with the program of his choice. Tika was so fast and easy to use that I really enjoyed the experience. I enjoyed it so much I decided to write a blog post about parsing PDFs with Tika.

tika

California Budget PDFs

To demonstrate parsing PDFs with Tika, I knew I’d need some PDFs. I was thinking about which ones to use and remembered a blog post I’d read on scraping budget data from a government website. Governments also provide data in PDF format, so I decided it would be helpful to demonstrate how to parse data from PDFs available on a government website. This way, with these two blog posts, you have examples of acquiring government data, even if it’s embedded in HTML or PDFs. The three PDFs we’ll parse in this post are:

2015-16 State of California Enacted Budget Summary Charts
2014-15 State of California Enacted Budget Summary Charts
2013-14 State of California Enacted Budget Summary Charts

ca_budget

Each of these PDFs contains several tables that summarize total revenues and expenditures, general fund revenues and expenditures, expenditures by agency, and revenue sources. For this post, let’s extract the data on expenditures by agency and revenue sources. In the 2015-16 Budget PDF, the titles for these two tables are:

2015-16 Total State Expenditures by Agency

expenditures

2015-16 Revenue Sources

revenues

To follow along with the rest of this tutorial you’ll need to download the three PDFs and ensure you’ve installed Tika. You can download the three PDFs here:

http://www.ebudget.ca.gov/2015-16/pdf/Enacted/BudgetSummary/SummaryCharts.pdf
http://www.ebudget.ca.gov/2014-15/pdf/Enacted/BudgetSummary/SummaryCharts.pdf
http://www.ebudget.ca.gov/2013-14/pdf/Enacted/BudgetSummary/SummaryCharts.pdf

You can install Tika by running the following command in a Terminal window:

pip install --user tika

IPython

Before we dive into parsing all of the PDFs, let’s use one of the PDFs, 2015-16CABudgetSummaryCharts.pdf, to become familiar with Tika and its output. We can use IPython to explore Tika’s output interactively:

ipython

from tika import parser

parsedPDF = parser.from_file("2015-16CABudgetSummaryCharts.pdf")

You can type the name of the variable, a period, and then hit tab to view a list of all of the methods available to you:

parsedPDF.

ipython1

There are many options related to keys and values, so it appears the variable contains a dictionary. Let’s view the dictionary’s keys:

parsedPDF.viewkeys()

parsedPDF.keys()

The dictionary’s keys are metadata and content. Let’s take a look at the values associated with these keys:

parsedPDF["metadata"]

The value associated with the key “metadata” is another dictionary. As you’d expect based on the name of the key, its key-value pairs provide metadata about the parsed PDF.

ipython2

Now let’s take a look at the value associated with “content”.

parsedPDF["content"]

The value associated with the key “content” is a string. As you’d expect, the string contains the PDF’s text content.

ipython3

Now that we know the types of objects and values Tika provides to us, let’s write a Python script to parse all three of the PDFs. The script will iterate over the PDF files in a folder and, for each one, parse the text from the file, select the lines of text associated with the expenditures by agency and revenue sources tables, convert each of these selected lines of text into a Pandas DataFrame, display the DataFrame, and create and save a horizontal bar plot of the totals column for the expenditures and revenues. So, after you run this script, you’ll have six new plots, one for revenues and one for expenditures for each of the three PDF files, in the folder in which you ran the script.

Python Script

To parse the three PDFs, create a new Python script named parse_pdfs_with_tika.py and add the following lines of code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
import glob
import os
import re
import sys
import pandas as pd
import matplotlib
matplotlib.use('AGG')
import matplotlib.pyplot as plt
pd.options.display.mpl_style = 'default'

from tika import parser

input_path = sys.argv[1]

def create_df(pdf_content, content_pattern, line_pattern, column_headings):
    """Create a Pandas DataFrame from lines of text in a PDF.

    Arguments:
    pdf_content -- all of the text Tika parses from the PDF
    content_pattern -- a pattern that identifies the set of lines
    that will become rows in the DataFrame
    line_pattern -- a pattern that separates the agency name or revenue source
    from the dollar values in the line
    column_headings -- the list of column headings for the DataFrame
    """
    list_of_line_items = []
    # Grab all of the lines of text that match the pattern in content_pattern
    content_match = re.search(content_pattern, pdf_content, re.DOTALL)
    # group(1): only keep the lines between the parentheses in the pattern
    content_match = content_match.group(1)
    # Split on newlines to create a sequence of strings
    content_match = content_match.split('\n')
    # Iterate over each line
    for item in content_match:
        # Create a list to hold the values in the line we want to retain
        line_items = []
        # Use line_pattern to separate the agency name or revenue source
        # from the dollar values in the line
        line_match = re.search(line_pattern, item, re.I)
        # Grab the agency name or revenue source, strip whitespace, and remove commas
        # group(1): the value inside the first set of parentheses in line_pattern
        agency = line_match.group(1).strip().replace(',', '')
        # Grab the dollar values, strip whitespace, replace dashes with 0.0, and remove $s and commas
        # group(2): the value inside the second set of parentheses in line_pattern
        values_string = line_match.group(2).strip().\
        replace('- ', '0.0 ').replace('$', '').replace(',', '')
        # Split on whitespace and convert to float to create a sequence of floating-point numbers
        values = map(float, values_string.split())
        # Append the agency name or revenue source into line_items
        line_items.append(agency)
        # Extend the floating-point numbers into line_items so line_items remains one list
        line_items.extend(values)
        # Append line_item's values into list_of_line_items to generate a list of lists;
        # all of the lines that will become rows in the DataFrame
        list_of_line_items.append(line_items)
    # Convert the list of lists into a Pandas DataFrame and specify the column headings
    df = pd.DataFrame(list_of_line_items, columns=column_headings)
    return df

def create_plot(df, column_to_sort, x_val, y_val, type_of_plot, plot_size, the_title):
    """Create a plot from data in a Pandas DataFrame.

    Arguments:
    df -- A Pandas DataFrame
    column_to_sort -- The column of values to sort
    x_val -- The variable displayed on the x-axis
    y_val -- The variable displayed on the y-axis
    type_of_plot -- A string that specifies the type of plot to create
    plot_size -- A list of 2 numbers that specifies the plot's size
    the_title -- A string to serve as the plot's title
    """
    # Create a figure and an axis for the plot
    fig, ax = plt.subplots()
    # Sort the values in the column_to_sort column in the DataFrame
    df = df.sort_values(by=column_to_sort)
    # Create a plot with x_val on the x-axis and y_val on the y-axis
    # type_of_plot specifies the type of plot to create, plot_size
    # specifies the size of the plot, and the_title specifies the title
    df.plot(ax=ax, x=x_val, y=y_val, kind=type_of_plot, figsize=plot_size, title=the_title)
    # Adjust the plot's parameters so everything fits in the figure area
    plt.tight_layout()
    # Create a PNG filename based on the plot's title, replace spaces with underscores
    pngfile = the_title.replace(' ', '_') + '.png'
    # Save the plot in the current folder
    plt.savefig(pngfile)

# In the Expenditures table, grab all of the lines between Totals and General Government
expenditures_pattern = r'Totals\n+(Legislative, Judicial, Executive.*?)\nGeneral Government:'

# In the Revenues table, grab all of the lines between 2015-16 and either Subtotal or Total
revenues_pattern = r'\d{4}-\d{2}\n(Personal Income Tax.*?)\n +[Subtotal|Total]'

# For the expenditures, grab the agency name in the first set of parentheses
# and grab the dollar values in the second set of parentheses
expense_pattern = r'(K-12 Education|[a-z,& -]+)([$,0-9 -]+)'

# For the revenues, grab the revenue source in the first set of parentheses
# and grab the dollar values in the second set of parentheses
revenue_pattern = r'([a-z, ]+)([$,0-9 -]+)'

# Column headings for the Expenditures DataFrames
expense_columns = ['Agency', 'General', 'Special', 'Bond', 'Totals']

# Column headings for the Revenues DataFrames
revenue_columns = ['Source', 'General', 'Special', 'Total', 'Change']

# Iterate over all PDF files in the folder and process each one in turn
for input_file in glob.glob(os.path.join(input_path, '*.pdf')):
    # Grab the PDF's file name
    filename = os.path.basename(input_file)
    print filename
    # Remove .pdf from the filename so we can use it as the name of the plot and PNG
    plotname = filename.strip('.pdf')

    # Use Tika to parse the PDF
    parsedPDF = parser.from_file(input_file)
    # Extract the text content from the parsed PDF
    pdf = parsedPDF["content"]
    # Convert double newlines into single newlines
    pdf = pdf.replace('\n\n', '\n')

    # Create a Pandas DataFrame from the lines of text in the Expenditures table in the PDF
    expense_df = create_df(pdf, expenditures_pattern, expense_pattern, expense_columns)
    # Create a Pandas DataFrame from the lines of text in the Revenues table in the PDF
    revenue_df = create_df(pdf, revenues_pattern, revenue_pattern, revenue_columns)
    print expense_df
    print revenue_df

    # Print the total expenditures and total revenues in the budget to the screen
    print "Total Expenditures: {}".format(expense_df["Totals"].sum())
    print "Total Revenues: {}\n".format(revenue_df["Total"].sum())

    # Create and save a horizontal bar plot based on the data in the Expenditures table
    create_plot(expense_df, "Totals", ["Agency"], ["Totals"], 'barh', [20,10], \
    plotname+"Expenditures")
    # Create and save a horizontal bar plot based on the data in the Revenues table
    create_plot(revenue_df, "Total", ["Source"], ["Total"], 'barh', [20,10], \
    plotname+"Revenues")

Save this code in a file named parse_pdfs_with_tika.py in the same folder as the one containing the three CA Budget PDFs. Then you can run the script on the command line with the following command:

./parse_pdfs_with_tika.py .

I added docstrings to the two functions, create_df and create_plot, and comments above nearly every line of code in an effort to make the code as self-explanatory as possible. I created the two functions to avoid duplicating code because we perform these operations twice for each file, once for revenues and once for expenditures. We use a for loop to iterate over the PDFs and for each one we extract the lines of text we care about, convert the text into a Pandas DataFrame, display some of the DataFrame’s information, and save plots of the total values in the revenues and expenditures tables.

Results

Terminal Output
(1 of 3 pairs of DataFrames)

terminal_output

PNG File: Expenditures by Agency 2015-16
(1 of 6 PNG Files)

2015-16CABudgetSummaryChartsExpenditures

In this post I’ve tried to convey that Tika is a great resource for parsing PDFs by demonstrating how you can use it to parse budget data from PDF documents provided by a government agency. As my friend’s experience illustrates, there may be other situations in which you need to extract data from PDFs. With Tika, PDFs become another rich source of data for your analysis.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s