Parsing PDFs in Python with Tika

A few months ago, one of my friends asked me if I could help him extract some data from a collection of PDFs. The PDFs contained records of his financial transactions over a period of years and he wanted to analyze them. Unfortunately, Excel and plain text versions of the files were no longer available, so the PDFs were his only option.

I reviewed a few Python-based PDF parsers and decided to try tika-python, a Python library that wraps Apache Tika. Tika parsed the PDFs quickly and accurately. I extracted the data my friend needed and sent it to him in CSV format so he could analyze it with the program of his choice. Tika was so fast and easy to use that I decided to write a blog post about parsing PDFs with it.


California Budget PDFs

To demonstrate parsing PDFs with Tika, I knew I’d need some PDFs. While thinking about which ones to use, I remembered a blog post I’d read on scraping budget data from a government website. Governments also publish data in PDF format, so I decided it would be helpful to demonstrate how to parse PDFs available on a government website. This way, between the two blog posts, you have examples of acquiring government data whether it’s embedded in HTML or in PDFs. The three PDFs we’ll parse in this post are:

2015-16 State of California Enacted Budget Summary Charts
2014-15 State of California Enacted Budget Summary Charts
2013-14 State of California Enacted Budget Summary Charts


Each of these PDFs contains several tables that summarize total revenues and expenditures, general fund revenues and expenditures, expenditures by agency, and revenue sources. For this post, let’s extract the data on expenditures by agency and revenue sources. In the 2015-16 Budget PDF, the titles for these two tables are:

2015-16 Total State Expenditures by Agency


2015-16 Revenue Sources


To follow along with the rest of this tutorial you’ll need to download the three PDFs and ensure you’ve installed Tika. You can download the three PDFs here:

http://www.ebudget.ca.gov/2015-16/pdf/Enacted/BudgetSummary/SummaryCharts.pdf
http://www.ebudget.ca.gov/2014-15/pdf/Enacted/BudgetSummary/SummaryCharts.pdf
http://www.ebudget.ca.gov/2013-14/pdf/Enacted/BudgetSummary/SummaryCharts.pdf

You can install Tika by running the following command in a Terminal window:

pip install --user tika

IPython

Before we dive into parsing all of the PDFs, let’s use one of them to become familiar with Tika and its output. Since all three files download with the same name, SummaryCharts.pdf, rename each one to include its budget year (for example, 2015-16CABudgetSummaryCharts.pdf, the file we’ll explore here) so they can sit together in one folder. We can use IPython to explore Tika’s output interactively:

ipython

from tika import parser

parsedPDF = parser.from_file("2015-16CABudgetSummaryCharts.pdf")

You can type the name of the variable, a period, and then hit tab to view a list of all of the methods available to you:

parsedPDF.


There are many options related to keys and values, so it appears the variable contains a dictionary. Let’s view the dictionary’s keys:

parsedPDF.viewkeys()

parsedPDF.keys()

The dictionary’s keys are metadata and content. Let’s take a look at the values associated with these keys:

parsedPDF["metadata"]

The value associated with the key “metadata” is another dictionary. As you’d expect based on the name of the key, its key-value pairs provide metadata about the parsed PDF.

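The metadata keys vary from PDF to PDF and across Tika versions, so it’s worth listing them before grabbing individual fields. For example, something like the following should show the available keys and then a common field (the exact key names here are illustrative):

parsedPDF["metadata"].keys()

parsedPDF["metadata"]["Content-Type"]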

Now let’s take a look at the value associated with “content”.

parsedPDF["content"]

The value associated with the key “content” is a string. As you’d expect, the string contains the PDF’s text content.

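The content string for these budget PDFs is long, so while you’re working out parsing patterns it can be handy to look at just a slice of it at a time, for example the first 500 characters:

parsedPDF["content"][:500]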

Now that we know the types of objects and values Tika provides to us, let’s write a Python script to parse all three of the PDFs. The script will iterate over the PDF files in a folder and, for each one, parse the text from the file, select the lines of text associated with the expenditures by agency and revenue sources tables, convert each of these selected sets of lines into a Pandas DataFrame, display the DataFrame, and create and save a horizontal bar plot of the totals column for the expenditures and revenues. After you run the script, you’ll have six new plots in the folder in which you ran it: one for revenues and one for expenditures for each of the three PDF files.

Python Script

To parse the three PDFs, create a new Python script named parse_pdfs_with_tika.py and add the following lines of code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
import glob
import os
import re
import sys
import pandas as pd
import matplotlib
matplotlib.use('AGG')
import matplotlib.pyplot as plt
pd.options.display.mpl_style = 'default'

from tika import parser

input_path = sys.argv[1]

def create_df(pdf_content, content_pattern, line_pattern, column_headings):
    """Create a Pandas DataFrame from lines of text in a PDF.

    Arguments:
    pdf_content -- all of the text Tika parses from the PDF
    content_pattern -- a pattern that identifies the set of lines
    that will become rows in the DataFrame
    line_pattern -- a pattern that separates the agency name or revenue source
    from the dollar values in the line
    column_headings -- the list of column headings for the DataFrame
    """
    list_of_line_items = []
    # Grab all of the lines of text that match the pattern in content_pattern
    content_match = re.search(content_pattern, pdf_content, re.DOTALL)
    # group(1): only keep the lines between the parentheses in the pattern
    content_match = content_match.group(1)
    # Split on newlines to create a sequence of strings
    content_match = content_match.split('\n')
    # Iterate over each line
    for item in content_match:
        # Create a list to hold the values in the line we want to retain
        line_items = []
        # Use line_pattern to separate the agency name or revenue source
        # from the dollar values in the line
        line_match = re.search(line_pattern, item, re.I)
        # Grab the agency name or revenue source, strip whitespace, and remove commas
        # group(1): the value inside the first set of parentheses in line_pattern
        agency = line_match.group(1).strip().replace(',', '')
        # Grab the dollar values, strip whitespace, replace dashes with 0.0, and remove $s and commas
        # group(2): the value inside the second set of parentheses in line_pattern
        values_string = line_match.group(2).strip().\
        replace('- ', '0.0 ').replace('$', '').replace(',', '')
        # Split on whitespace and convert to float to create a sequence of floating-point numbers
        values = map(float, values_string.split())
        # Append the agency name or revenue source into line_items
        line_items.append(agency)
        # Extend the floating-point numbers into line_items so line_items remains one list
        line_items.extend(values)
        # Append line_item's values into list_of_line_items to generate a list of lists;
        # all of the lines that will become rows in the DataFrame
        list_of_line_items.append(line_items)
    # Convert the list of lists into a Pandas DataFrame and specify the column headings
    df = pd.DataFrame(list_of_line_items, columns=column_headings)
    return df

def create_plot(df, column_to_sort, x_val, y_val, type_of_plot, plot_size, the_title):
    """Create a plot from data in a Pandas DataFrame.

    Arguments:
    df -- A Pandas DataFrame
    column_to_sort -- The column of values to sort
    x_val -- The variable displayed on the x-axis
    y_val -- The variable displayed on the y-axis
    type_of_plot -- A string that specifies the type of plot to create
    plot_size -- A list of 2 numbers that specifies the plot's size
    the_title -- A string to serve as the plot's title
    """
    # Create a figure and an axis for the plot
    fig, ax = plt.subplots()
    # Sort the values in the column_to_sort column in the DataFrame
    df = df.sort_values(by=column_to_sort)
    # Create a plot with x_val on the x-axis and y_val on the y-axis
    # type_of_plot specifies the type of plot to create, plot_size
    # specifies the size of the plot, and the_title specifies the title
    df.plot(ax=ax, x=x_val, y=y_val, kind=type_of_plot, figsize=plot_size, title=the_title)
    # Adjust the plot's parameters so everything fits in the figure area
    plt.tight_layout()
    # Create a PNG filename based on the plot's title, replace spaces with underscores
    pngfile = the_title.replace(' ', '_') + '.png'
    # Save the plot in the current folder
    plt.savefig(pngfile)

# In the Expenditures table, grab all of the lines between Totals and General Government
expenditures_pattern = r'Totals\n+(Legislative, Judicial, Executive.*?)\nGeneral Government:'

# In the Revenues table, grab all of the lines between 2015-16 and either Subtotal or Total
revenues_pattern = r'\d{4}-\d{2}\n(Personal Income Tax.*?)\n +(?:Subtotal|Total)'

# For the expenditures, grab the agency name in the first set of parentheses
# and grab the dollar values in the second set of parentheses
expense_pattern = r'(K-12 Education|[a-z,& -]+)([$,0-9 -]+)'

# For the revenues, grab the revenue source in the first set of parentheses
# and grab the dollar values in the second set of parentheses
revenue_pattern = r'([a-z, ]+)([$,0-9 -]+)'

# Column headings for the Expenditures DataFrames
expense_columns = ['Agency', 'General', 'Special', 'Bond', 'Totals']

# Column headings for the Revenues DataFrames
revenue_columns = ['Source', 'General', 'Special', 'Total', 'Change']

# Iterate over all PDF files in the folder and process each one in turn
for input_file in glob.glob(os.path.join(input_path, '*.pdf')):
    # Grab the PDF's file name
    filename = os.path.basename(input_file)
    print filename
    # Remove the .pdf extension so we can use the file name as the name of the plot and PNG
    plotname = os.path.splitext(filename)[0]

    # Use Tika to parse the PDF
    parsedPDF = parser.from_file(input_file)
    # Extract the text content from the parsed PDF
    pdf = parsedPDF["content"]
    # Convert double newlines into single newlines
    pdf = pdf.replace('\n\n', '\n')

    # Create a Pandas DataFrame from the lines of text in the Expenditures table in the PDF
    expense_df = create_df(pdf, expenditures_pattern, expense_pattern, expense_columns)
    # Create a Pandas DataFrame from the lines of text in the Revenues table in the PDF
    revenue_df = create_df(pdf, revenues_pattern, revenue_pattern, revenue_columns)
    print expense_df
    print revenue_df

    # Print the total expenditures and total revenues in the budget to the screen
    print "Total Expenditures: {}".format(expense_df["Totals"].sum())
    print "Total Revenues: {}\n".format(revenue_df["Total"].sum())

    # Create and save a horizontal bar plot based on the data in the Expenditures table
    create_plot(expense_df, "Totals", ["Agency"], ["Totals"], 'barh', [20,10], \
    plotname+"Expenditures")
    # Create and save a horizontal bar plot based on the data in the Revenues table
    create_plot(revenue_df, "Total", ["Source"], ["Total"], 'barh', [20,10], \
    plotname+"Revenues")

Save this code in a file named parse_pdfs_with_tika.py in the same folder as the three CA Budget PDFs. Then make the script executable and run it from the command line, passing the folder that contains the PDFs (here, the current folder) as its argument:

chmod +x parse_pdfs_with_tika.py
./parse_pdfs_with_tika.py .

I added docstrings to the two functions, create_df and create_plot, and comments above nearly every line of code in an effort to make the code as self-explanatory as possible. I created the two functions to avoid duplicating code because we perform these operations twice for each file, once for revenues and once for expenditures. We use a for loop to iterate over the PDFs and for each one we extract the lines of text we care about, convert the text into a Pandas DataFrame, display some of the DataFrame’s information, and save plots of the total values in the revenues and expenditures tables.
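To see how the line-level patterns work, here’s a quick illustration of applying expense_pattern to a single expenditures line (the agency name mirrors the PDF’s format, but the dollar values are made up for this example):

import re
expense_pattern = r'(K-12 Education|[a-z,& -]+)([$,0-9 -]+)'
sample_line = 'Health and Human Services 31,021 13,570 - $44,591'
line_match = re.search(expense_pattern, sample_line, re.I)
print line_match.group(1)   # 'Health and Human Services ' -- the agency name, with a trailing space
print line_match.group(2)   # '31,021 13,570 - $44,591' -- the dollar values

group(1) captures the agency name and group(2) captures the dollar values, which create_df then strips, cleans, and converts to floats.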

Results

Terminal Output
(1 of 3 pairs of DataFrames)


PNG File: Expenditures by Agency 2015-16
(1 of 6 PNG Files)


In this post I’ve tried to convey that Tika is a great resource for parsing PDFs by demonstrating how you can use it to parse budget data from PDF documents provided by a government agency. As my friend’s experience illustrates, there may be other situations in which you need to extract data from PDFs. With Tika, PDFs become another rich source of data for your analysis.


Foundations for Analytics with Python: From Non-programmer to Hacker

I’m excited to share that O’Reilly Media is about to publish my new book, Foundations for Analytics with Python: From Non-programmer to Hacker. The book is geared toward people who have no prior programming experience but deal with data every day and are interested in learning how to scale and automate their work.

Foundations for Analytics with Python by Clinton Brownley, PhD

I did not have a background in programming. I learned it on the job because I recognized it would enable me to automate repetitive actions and accomplish tasks that would have been time-consuming or impossible with my skill set at the time. I read countless books, online tutorials, and blog posts in those first few weeks and months as I attempted to get my first program for work to do something useful for me. It’s difficult to fully describe how exhilarating and empowering it was when I finally got the program to work correctly. Needless to say, I was hooked, and I haven’t looked back.

I wrote the book with a few objectives in mind:

  • Be accessible to ambitious non-programmers
  • Be practical, so you can immediately see how you can use the code at work
  • Teach fundamental programming concepts and techniques and also provide alternative, shorter code that performs the same actions
  • Make the learning curve as short and shallow as possible so you can enjoy the fruits of your labor as quickly as possible

The book’s features reflect these objectives:

  • Each section focuses on one specific task, so you can learn how to accomplish that task without distractions
  • Each section is a complete, self-contained program, so you don’t have to remember to combine a bunch of code snippets to make them work
  • In the CSV and Excel chapters, each section of code has two versions, a base Python version and a Pandas version. The base Python version teaches you fundamental concepts and techniques. The Pandas version shortens and simplifies the code you need to write to accomplish the task
  • Uses the Anaconda Python 3 distribution, which bundles the newest version of Python with some of the most popular add-in packages
  • Includes screen shots of the input files, Python code, command line, and output files
  • Covers common data formats, including plain text, CSV, and Excel files, and databases
  • Covers common data processing tasks, including filtering for specific rows, selecting specific columns, and calculating summary statistics
  • Includes chapters on data analysis, plotting and graphing, and automation
  • Presents three real-world applications that illustrate how you can combine and extend techniques from earlier chapters to accomplish important data processing tasks
  • Provides both Windows and Mac commands and screen shots

To give you a feel for the book, let me provide a few sections of code from the book and the table of contents. The first section of code comes from the CSV chapter, the second from the Excel chapter, and the third from the Database chapter. The brief comments after each section of code are for this blog post; they are not in the book. If you want to see what other topics are included in the book, please see the table of contents at the bottom of this post.

Example Section #1: CSV Files

Reading and Writing a CSV File

Version #1: Base Python

#!/usr/bin/env python3
import csv
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

with open(input_file, 'r', newline='') as csv_in_file:
    with open(output_file, 'w', newline='') as csv_out_file:
        filereader = csv.reader(csv_in_file, delimiter=',')
        filewriter = csv.writer(csv_out_file, delimiter=',')
        for row_list in filereader:
            filewriter.writerow(row_list)

Version #1 demonstrates how to read a CSV input file with base Python’s standard csv module and write the contents to a CSV output file. In the book, I explain every line of code. This first example gives you the ability to transfer all of your data to an output file. The subsequent examples in the chapter show you how to select specific data to write to the output file and how to process multiple CSV files.
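To give you a sense of what “selecting specific data” looks like, here is a minimal sketch, not taken from the book, that keeps only the rows where the value in a hypothetical third column is greater than a threshold:

#!/usr/bin/env python3
import csv
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

with open(input_file, 'r', newline='') as csv_in_file:
    with open(output_file, 'w', newline='') as csv_out_file:
        filereader = csv.reader(csv_in_file, delimiter=',')
        filewriter = csv.writer(csv_out_file, delimiter=',')
        # Write the header row to the output file unchanged
        header = next(filereader)
        filewriter.writerow(header)
        for row_list in filereader:
            # Keep the row if the value in the hypothetical third column (index 2) exceeds 100.0
            if float(row_list[2]) > 100.0:
                filewriter.writerow(row_list)

You run it the same way as Version #1, with the input and output file names as command line arguments.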

Version #2: Pandas Add-in Module

#!/usr/bin/env python3
import sys
import pandas as pd

input_file = sys.argv[1]
output_file = sys.argv[2]

data_frame = pd.read_csv(input_file)
print(data_frame)
data_frame.to_csv(output_file, index=False)

Version #2 demonstrates how to accomplish the same task with Pandas. As you can see, you simply use read_csv to read the input file and to_csv to write to the output file.

Example Section #2: Excel Files

Reading and Writing an Excel Worksheet

Version #1: xlrd and xlwt Add-in Modules

#!/usr/bin/env python3
import sys
from xlrd import open_workbook
from xlwt import Workbook

input_file = sys.argv[1]
output_file = sys.argv[2]

output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('output_worksheet_name')

with open_workbook(input_file) as workbook:
    worksheet = workbook.sheet_by_name('input_worksheet_name')
    for row_index in range(worksheet.nrows):
        for column_index in range(worksheet.ncols):
            output_worksheet.write(row_index, column_index, \
                worksheet.cell_value(row_index, column_index))
output_workbook.save(output_file)

Version #1 demonstrates how to read and write an Excel worksheet with base Python and the xlrd and xlwt add-in modules. Again, this first example gives you the ability to transfer all of the data on one worksheet to an output file. The subsequent examples in the chapter show you how to select specific data to write to the output file, how to process multiple worksheets, and how to process multiple workbooks.
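As a rough illustration of the “process multiple worksheets” direction (this sketch is mine, not the book’s), you can loop over every worksheet in a workbook with xlrd like this:

#!/usr/bin/env python3
import sys
from xlrd import open_workbook

input_file = sys.argv[1]

with open_workbook(input_file) as workbook:
    # Print each worksheet's name and its number of rows and columns
    for worksheet in workbook.sheets():
        print(worksheet.name, worksheet.nrows, worksheet.ncols)

From there you can apply the same cell-by-cell reading and writing shown in Version #1 to each worksheet.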

Version #2: Pandas Add-in Module

#!/usr/bin/env python3
import pandas as pd
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

data_frame = pd.read_excel(input_file, sheetname='input_worksheet_name')
writer = pd.ExcelWriter(output_file)
data_frame.to_excel(writer, sheet_name='output_worksheet_name', index=False)
writer.save()

Version #2 demonstrates how to accomplish the same task with Pandas. Again, you simply use read_excel to read the input worksheet and to_excel to write to the output worksheet.
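Pandas can also read every worksheet at once. With the version of Pandas used here, passing sheetname=None (sheet_name=None in newer versions) should return a dictionary that maps worksheet names to DataFrames, for example:

data_frames = pd.read_excel(input_file, sheetname=None)
for worksheet_name, data_frame in data_frames.items():
    print(worksheet_name, data_frame.shape)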

Example Section #3: Databases

Query a table and write results to a file

#!/usr/bin/env python
import csv
import MySQLdb
import sys

output_file = sys.argv[1]

con = MySQLdb.connect(host='localhost', port=3306, db='my_suppliers', user='my_username', passwd='my_password')
c = con.cursor()

filewriter = csv.writer(open(output_file, 'wb'), delimiter=',')
header = ['Supplier Name','Invoice Number','Part Number','Cost','Purchase Date']
filewriter.writerow(header)

c.execute("""SELECT * FROM Suppliers WHERE Cost > 700.0;""")
rows = c.fetchall()
for row in rows:
    filewriter.writerow(row)

This example demonstrates how to connect to a database, query a table, and write the resulting data to a CSV output file. Other examples in the chapter explain how to load data into a database table from a file and update records in a table based on data in a file.
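As a hedged sketch of the “load data into a database table from a file” direction (the table layout and column order here are assumptions, not the book’s code), you can read the CSV with the csv module and insert the rows with the cursor’s executemany method:

#!/usr/bin/env python
import csv
import MySQLdb
import sys

input_file = sys.argv[1]

con = MySQLdb.connect(host='localhost', port=3306, db='my_suppliers', user='my_username', passwd='my_password')
c = con.cursor()

# Read the header row and the data rows from the input CSV file
with open(input_file, 'r') as csv_in_file:
    filereader = csv.reader(csv_in_file)
    header = next(filereader)
    data = [row for row in filereader]

# Insert the data rows, assuming the Suppliers table has five columns
# in the same order as the columns in the CSV file
c.executemany("""INSERT INTO Suppliers VALUES (%s, %s, %s, %s, %s);""", data)
con.commit()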

I hope these examples give you a feel for the book. If you want to see what other topics are included in the book, please see the table of contents shown below. Foundations for Analytics with Python is scheduled to be available in May 2016. Please keep an eye out for it, and if you know other people who may be interested please point them to this blog post and the Amazon link.  Thank you : )


TABLE OF CONTENTS

CHAPTER
Introduction
Why Read This Book/Why Learn These Skills
Who Is This Book For
Why Windows
Why Python
Install Anaconda Python
Text Editors
Download Book Materials
Base Python and Pandas
Overview of Chapters

CHAPTER
Python Basics
How To Create a Python Script
How To Run a Python Script
Numbers
Strings
Regular Expressions/Pattern Matching
Dates
Lists
Tuples
Dictionaries
Control Flow
Functions
Exceptions
Reading a Text File
Reading Multiple Text Files with Glob
Writing to a Text File
Writing to a Comma Separated Values “CSV” File
Print Statements

CHAPTER
Comma Separated Values “CSV” Text Files
Reading and Writing a CSV File (String Manipulation)
Reading and Writing a CSV File (Standard csv Module)
Filtering for Rows
    Value in Row Meets a Condition
    Value in Row is in a Set of Interest
    Value in Row Matches a Pattern (Regular Expression)
Selecting Columns
    Index Values
    Column Headings
Reading Multiple CSV Files
    Count Number of Files and Rows and Columns in Each File
    Concatenate Data From Multiple Files
    Sum and Average a Set of Values Per File
Selecting Contiguous Rows
Adding a Header Row

CHAPTER
Microsoft Excel Files
Introspecting an Excel Workbook
Reading a Single Worksheet
    Formatting Dates
    Filtering for Rows
        Value in Row Meets a Condition
        Value in Row is in a Set of Interest
        Value in Row Matches a Pattern (Regular Expression)
    Selecting Columns
        Index Values
        Column Headings
Reading All Worksheets
    Filtering for Rows from All Worksheets
    Selecting Columns from All Worksheets
Reading a Subset of Worksheets
    Filtering for Rows from Subset of Worksheets
    Selecting Columns from Subset of Worksheets
Reading Multiple Workbooks
    Count Number of Workbooks and Rows and Columns in Each Workbook
    Concatenate Data from Multiple Workbooks
    Sum and Average a Set of Values Per Worksheet Per Workbook

CHAPTER
Databases
Python’s Standard sqlite3 Module
    Create a Database
    Create a Database Table
    Insert Hand-written Data into a Database Table
    Query a Database Table
    Insert Data from a CSV File into a Database Table
    Update Records in a Database Table with Data from a CSV File
MySQL Database
    Create a Database
    Create a Database Table
    Insert Data from a CSV File into a Database Table
    Query a Database Table and Write Output to a CSV File
    Update Records in a Database Table with Data from a CSV File

CHAPTER
Applications
Find a Set of Items in a Large Collection of Excel and CSV Files
Parse a CSV File and Calculate a Statistic for Any Number of Categories
Parse a Text File and Calculate a Statistic for Any Number of Categories

CHAPTER
Graphing and Plotting
matplotlib
pandas
ggplot
seaborn

CHAPTER
Data Analysis
Descriptive statistics
Regression
Classification

CHAPTER
Automation
Windows: scheduled tasks
Mac: cron jobs

CHAPTER
Conclusion
Where To Go From Here
    Additional Built-Ins/Standard Modules
    Additional Add-In Modules
    Data Structures
How To Go From Here

APPENDIX
Downloads
Python
xlrd
mysqlclient/MySQL-python/MySQLdb
MySQL

Intro to Julia: Filtering Rows with R, Python, and Julia

In one of my earlier posts I introduced the Julia programming language by comparing how you can read and write CSV files in R, Python, and Julia. In this post I’d like to build on that comparison by describing how you can filter for specific rows in a data set in each language based on a filtering condition, set of interest, and pattern (i.e. regular expression). We’ll be using the same wine quality data set we used in the earlier post, which is available here: wine quality

We’ll use the same three row filters in all of the examples so it’s easy to confirm that the output is consistent across all three languages:

  1. The first row filter specifies a condition that restricts the output to rows where the value in the quality column is greater than 7.
  2. The second row filter specifies a set of interest that restricts the output to rows where the value in the quality column is either 7 or 8.
  3. The third row filter specifies a pattern that restricts the output to rows where the value in the quality column contains an 8.

R

To begin, let’s see how you can filter for specific rows in R. First we’ll cover how to filter for rows with base R, and then we’ll describe how to accomplish the same task with the data.table and dplyr packages, which are popular packages for managing data in R.

Base R

The following script illustrates how to read data into a data frame, filter for specific rows based on a filter condition, set of interest, and pattern, and write the output of interest to an output file. All three filtering sections show two different ways to filter the rows, first using row indexing and then using the subset function.

Copy and paste the following code into a text file and then save the file as filter_rows.R

#!/usr/bin/env Rscript
# For more information, visit: https://cbrownley.wordpress.com/

#Collect the command line arguments into a variable called args
args <- commandArgs(trailingOnly = TRUE)
# Assign the first command line argument to a variable called input_file
input_file <- args[1]
# Assign the second command line argument to a variable called output_file
output_file <- args[2]

# Use R’s read.csv function to read the data into a variable called wine
# read.csv expects a CSV file with a header row, so
# sep = ',' and header = TRUE are default values
# stringsAsFactors = FALSE means don’t convert character vectors into factors
wine <- read.csv(input_file, sep = ',', header = TRUE, stringsAsFactors = FALSE)

# Row Value Meets Specific Condition
wine_condition <- wine[which(wine$quality > 7), ]
# Using subset function
wine_condition <- subset(wine, quality > 7)
#print(wine_condition)

# Row Value In Set of Interest
set_of_interest <- c(7, 8)
wine_set <- wine[which(wine$quality %in% set_of_interest), ]
# Using subset function
wine_set <- subset(wine, quality %in% set_of_interest)
#print(wine_set)

# Row Value Matches Specific Pattern
pattern <- '^8$'
wine_pattern <- wine[grep(pattern, wine$quality, ignore.case=TRUE, perl=TRUE), ]
# Using subset function
#wine_pattern <- subset(wine, quality==pattern)
print(wine_pattern)

# Use R’s write.csv function to write the data in the variable wine to the output file
write.csv(wine_pattern, file = output_file, row.names = FALSE)

The section that filters for rows based on a condition only includes one condition, i.e. wine$quality > 7, but you can add more conditions with ANDs, &, and ORs, |. For example, to filter for rows where quality > 7 AND alcohol > 13.0 you can use:
wine_condition <- wine[which(wine$quality > 7 & wine$alcohol > 13.0), ]

Similarly, to filter for rows where quality < 4 OR alcohol > 13.0 you can use:
wine_condition <- wine[which(wine$quality < 4 | wine$alcohol > 13.0), ]

If instead you want to exclude a selection of rows you can negate the which function with a dash, -, like this:
wine_condition <- wine[-which(wine$quality > 7), ]

The section that filters for rows based on a set of interest uses the which function and the %in% binary operator to select rows where the value in the quality column is one of the two values in the set of interest. It’s convenient to assign the values of interest to a variable and then use the variable in the filtering condition so that if the values of interest change you only have to make one change where the values are assigned to the variable.

The section that filters for rows based on a pattern uses the Unix-inspired grep command to select rows where the pattern appears somewhere in the value in the quality column. The ^ metacharacter indicates that the 8 appears at the beginning of the value and the $ metacharacter indicates that the 8 appears at the end of the value, so enclosing the 8 between both ensures that grep looks for rows where 8 is the only value in the quality column. The ignore.case argument isn’t necessary in this case since we’re looking for a number, but I included it to show you that it’s available and where to put it if you need it. Similarly, the ^ and $ metacharacters and the perl argument aren’t strictly necessary either since we’re searching for a simple number, but I included them to demonstrate how you can use a regular expression and the perl argument to search for a specific pattern.

Now run the following two commands in a Terminal window to make the script executable and to run the script:

chmod +x filter_rows.R
./filter_rows.R winequality-red.csv output/output_R.csv

When you run these commands you’ll see the following output printed to your Terminal screen. In addition, the set of rows matching the pattern in the final filtering section have been written to a CSV file in the output folder inside your current folder.


R package: data.table

Now that we know how to filter for specific rows in base R, let’s discuss how to filter for rows with the data.table package. The following script illustrates how to do so. Copy and paste the following code into a text file and then save the file as filter_rows_data_table.R

#!/usr/bin/env Rscript
require(data.table)

args <- commandArgs(trailingOnly = TRUE)
input_file <- args[1]
output_file <- args[2]

wine <- fread(input_file)

# Row Value Meets Specific Condition
wine_condition <- wine[quality > 7]

# Row Value In Set of Interest
set_of_interest <- c(7, 8)
wine_set <- wine[quality %in% set_of_interest]

# Row Value Matches Specific Pattern
pattern <- '8'
wine_pattern <- wine[quality==pattern]
print(wine_pattern)

write.csv(wine_pattern, file = output_file, row.names = FALSE)

The filtering syntax is very similar to the syntax we used in the base R versions, so you can look in the base R section above for some explanations about the syntax. Now make the script executable and then run the script:

chmod +x filter_rows_data_table.R
./filter_rows_data_table.R winequality-red.csv output/output_R_data_table.csv

When you run these commands you’ll see the same output as you saw with base R printed to your Terminal screen and you’ll have written another CSV file in the output folder.

R package: dplyr

Now let’s see how to filter for rows with the dplyr package. Copy and paste the following code into a text file and then save the file as filter_rows_dplyr.R

#!/usr/bin/env Rscript
require(data.table)
require(dplyr)

args <- commandArgs(trailingOnly = TRUE)
input_file <- args[1]
output_file <- args[2]

wine <- fread(input_file)
wine <- tbl_df(wine)

# Row Value Meets Specific Condition
wine_condition <- wine %>% filter(quality > 7)

# Row Value In Set of Interest
set_of_interest <- c(7, 8)
wine_set <- wine %>% filter(quality %in% set_of_interest)

# Row Value Matches Specific Pattern
pattern <- '8'
wine_pattern <- wine %>% filter(quality==pattern)
print(wine_pattern)

write.csv(wine_pattern, file = output_file, row.names = FALSE)

All three filtering sections use the %>% operator, pulled into dplyr from the magrittr package, and dplyr’s own filter function. The %>% operator is pronounced “then,” as in “do this, then do that.” It takes the result of the operation on the left-hand side of the operator and passes it as the first argument to the operation on the right-hand side of the operator.

In all three filtering sections we’re simply using it to pass the data set into the filter function. Chaining these two operations doesn’t gain us much – it’s simply to demonstrate how you can use the %>% operator to chain operations together to make your code easier to read and understand. As you’d guess, the filter function filters for rows in the data set with a value that meets the filtering criterion or criteria. Now make the script executable and then run the script:

chmod +x filter_rows_dplyr.R
./filter_rows_dplyr.R winequality-red.csv output/output_R_dplyr.csv

As before, when you run these commands you’ll see the same output as you saw with base R and the data.table package printed to your Terminal screen and you will have written another CSV file in the output folder.

PYTHON

Now that we know how to filter for specific rows in R, let’s discuss how to filter for rows in Python. First we’ll cover how to filter for rows with base Python, and then we’ll describe how to accomplish the same tasks with Pandas, which is a popular package for managing data in Python.

Base Python

The following script illustrates how to process a CSV file line by line, filter for specific rows based on a filter condition, set of interest, and pattern, and write the results to an output file.

Copy and paste the following code into a text file and then save the file as filter_rows.py

#!/usr/bin/env python
# For more information, visit: https://cbrownley.wordpress.com/
# Import Python's built-in csv and sys modules, which have functions
# for processing CSV files and command line arguments, respectively
import csv
import re
import sys

# Assign the first command line argument to a variable called input_file
input_file = sys.argv[1]
# Assign the second command line argument to a variable called output_file
output_file = sys.argv[2]

header_row = True

# Open the input file for reading and close automatically at end
with open(input_file, 'rU') as csv_in_file:
    # Open the output file for writing and close automatically at end
    with open(output_file, 'wb') as csv_out_file:
        # Create a file reader object for reading all of the input data
        filereader = csv.reader(csv_in_file)
        # Create a file writer object for writing to the output file
        filewriter = csv.writer(csv_out_file)
        # Use a for loop to process the rows in the input file one-by-one
        for row in filereader:
            # Process the header row separately from the data rows
            # Print it to the screen, write it to the output file, and then
            # indicate that you're finished with the header row
            if header_row == True:
                print row
                filewriter.writerow(row)
                header_row = False
            # Process the data rows according to three filtering conditions
            else:
                # Row Value Meets Specific Condition
                #if int(row[11]) > 7:
                    #print row
                    #filewriter.writerow(row)

                # Row Value In Set of Interest
                #set_of_interest = [7, 8]
                #if int(row[11]) in set_of_interest:
                    #print row
                    #filewriter.writerow(row)

                # Row Value Matches Specific Pattern
                pattern = re.compile(r'(?P<my_pattern>8)', re.I)
                result = pattern.search(row[11])
                if result == None:
                    pass
                else:
                    print row
                    filewriter.writerow(row)

We process the header row separately from the data rows because we don’t want to test the header row against the filtering conditions. The built-in csv module reads each row from the input file as a list, a.k.a. array, so we use list indexing, row[11], to access the values in the quality column, which is the twelfth column in the data set (in Python, the first array index is 0). The first two filtering sections are fairly straightforward, i.e. in the first section, the integer version of the value in the quality column is > 7 and in the second section it is one of the values in the set of interest. These two sections are currently commented out with # symbols, but you can uncomment the sections one-at-a-time to see how the output changes.

The third section uses the re module to create a regular expression, search for the pattern in the quality column, and print and write the row when the value in the quality column matches the pattern. The re.I argument makes the pattern case-insensitive. As we said in the R section, we don’t need the argument in this case, but it’s helpful to know where to include it if you need it.
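If you’d like to see what the pattern matches before running the full script, you can test it interactively in a Python session (the values below are just examples):

import re
pattern = re.compile(r'(?P<my_pattern>8)', re.I)
print pattern.search('8')      # matches, because the value contains an 8
print pattern.search('18')     # also matches, because an 8 appears somewhere in the value
print pattern.search('7')      # prints None, because there is no 8 in the value

Now make the script executable and then run the script: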

chmod +x filter_rows.py
./filter_rows.py winequality-red.csv output/output_Python.csv

When you run these commands you’ll see the following output printed to your Terminal screen. In addition, the header row and the set of rows matching the pattern in the final filtering section have been written to a CSV file in the output folder.


Python package: Pandas

Now let’s see how to filter for rows with Pandas. Copy and paste the following code into a text file and then save the file as filter_rows_pandas.py

#!/usr/bin/env python
import sys
import string
import pandas as pd

input_file = sys.argv[1]
output_file = sys.argv[2]

data_frame = pd.read_csv(input_file)

# Row Value Meets Specific Condition
data_frame_value_meets_condition = data_frame[data_frame['quality'].astype(int) > 7]

# Row Value In Set of Interest
set_of_interest = [7, 8]
data_frame_value_in_set = data_frame[data_frame['quality'].isin(set_of_interest)]

# Row Value Matches Specific Pattern
data_frame_value_matches_pattern = data_frame[data_frame['quality'].astype(str).str.contains("8")]
print(data_frame_value_matches_pattern)

data_frame_value_matches_pattern.to_csv(output_file, index=False)

In the first filtering section we select the quality column, convert the values into integers, and then test whether they’re greater than 7. In the second section we use the isin function to test whether the value in the quality column is one of the values in the set of interest. Finally, in the third section, we use the contains function to test whether the value in the quality column contains 8. There are also startswith and endswith functions in case you need to test whether the value starts with or ends with a specific pattern.
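For example, if you wanted to use the startswith function instead of contains, the third filtering section could be written like this:

data_frame_value_matches_pattern = data_frame[data_frame['quality'].astype(str).str.startswith("8")]

For the single-digit quality values in this data set the result is the same as with contains; the difference only matters when the pattern could appear anywhere inside a longer value.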

Pandas also has a convenient .ix function that you can use to filter for specific rows and columns at the same time. Here’s how you could modify the first filtering section to use the .ix function: data_frame_value_meets_condition = data_frame.ix[data_frame.quality.astype(int) > 7, :]

You can select the column by typing data_frame.column. Like R, you need to separate the rows and columns sections with a comma, and you use a colon to indicate that you want to select all of the rows or columns (in this case we want to select all of the columns). Now run the following two commands to make the script executable and to run the script:

chmod +x filter_rows_pandas.py
./filter_rows_pandas.py winequality-red.csv output/output_Python_Pandas.csv

When you run these commands you’ll see similar output as you saw with base Python printed to your Terminal screen, although it will be formatted differently. In addition, the header row and the set of rows matching the pattern in the final filtering section have been written to a CSV file in the output folder.

JULIA

Now that we know how to filter for specific rows in Python, let’s discuss how to filter for rows in Julia. First we’ll cover how to filter for rows with base Julia, and then we’ll describe how to accomplish the same tasks with DataFrames, which is a popular package for managing data in Julia.

Base Julia

The following script illustrates how to read a CSV file line by line, filter for specific rows based on a filter condition, set of interest, and pattern, and write the output of interest to an output file.

Copy and paste this code into a text file and then save the file as filter_rows.jl

#!/usr/bin/env julia
# For more information, visit: https://cbrownley.wordpress.com/

# Assign the first command line argument to a variable called input_file
input_file = ARGS[1]
# Assign the second command line argument to a variable called output_file
output_file = ARGS[2]

# Open the output file for writing
out_file = open(output_file, "w")

header_row = true
# Open the input file for reading and close automatically at end
open(input_file, "r") do in_file
    # Use a for loop to process the rows in the input file one-by-one
    for row in eachline(in_file)
        if header_row == true
            print(row)
            write(out_file, row)
            global header_row = false
        else
            row_array = map(float, split(strip(row), ","))
            #println(row_array)

            # Row Value Meets Specific Condition
            if row_array[12] > 7.0
                #print(join(row_array, ",") * "\n")
                #write(out_file, join(row_array, ",") * "\n")
            end

            # Row Value In Set of Interest
            set_of_interest = Set(7.0, 8.0)
            if in(row_array[12], set_of_interest)
                #print(join(row_array, ",") * "\n")
                #write(out_file, join(row_array, ",") * "\n")
            end

            # Row Value Matches Specific Pattern
            pattern = r"8$"
            if ismatch(pattern, row)
                print(row)
                write(out_file, row)
            end

        # Close the if-else statement
        end
    # Close the for loop
    end
# Close the input file handle
end
# Close the output file handle
close(out_file)

Let’s explain some of the syntax in this script that’s different from R and Python. For example, the open(…) do statement creates an anonymous function with its own scope, so when we initially define the variable header_row above the open(…) do statement and then assign a new value to the variable inside the open(…) do statement we have to precede the variable name with the keyword global.

It’s helpful to keep in mind that for, while, try, and let blocks also default to local scopes, but they do inherit from a parent scope like the one created by the open(…) do statement. Therefore, if we initially define the variable header_row right beneath the open(…) do statement, then the for loop will inherit the variable from the parent scope and we won’t need to precede the variable name with the keyword global. That is, the following alternative syntax would work too:

open(input_file, "r") do in_file
    header_row = true
    # Use a for loop to process the rows in the input file one-by-one
    for row in eachline(in_file)
        if header_row == true
            print(row)
            write(out_file, row)
            header_row = false

Like base Python without the csv module, Julia reads each row from the file in as a string, so we use the strip function to remove the trailing newline character, then the split function to split the string on commas and convert it into an array, and finally we map the float function to each of the elements in the array to convert all of the values to floating-point numbers.

The println function adds a newline character to the end of the line before printing the line to the screen, whereas the print function does not; it prints the line as-is.

The first two row filtering sections use the join function and the string concatenation symbol, *, to create the row of output that will be printed to the screen and written to the output file. In this case, the join function places commas between each of the elements in the array and converts it to a string. Then we add a newline character to the end of the string with the * concatenation symbol.

We test whether the value in the quality column (in Julia, the first array index is 1) is in the set of interest with the in function. Similarly, we test whether the pattern appears in the row using the ismatch function. You’ll notice that we leave the row as a string, i.e. we don’t convert it into an array, to use the ismatch function since ismatch looks for the pattern in a string, not an array.

Now run the following two commands to make the script executable and to run the script:

chmod +x filter_rows.jl
./filter_rows.jl winequality-red.csv output/output_Julia.csv

When you run these commands you’ll see the following printed to your Terminal screen. In addition, the header row and the set of rows matching the pattern in the final filtering section have been written to a CSV file in the output folder.


Julia packages: DataFrames and DataFramesMeta

Now that we know how to filter for specific rows in base Julia, let’s discuss how to filter for rows with DataFrames, a popular package for managing tabular data in Julia. The following script illustrates how to do so. Copy and paste this code into a text file and then save the file as filter_rows_data_frames.jl

#!/usr/bin/env julia
using DataFrames
using DataFramesMeta

input_file = ARGS[1]
output_file = ARGS[2]

data_frame = readtable(input_file, separator = ',')

# Row Value Meets Specific Condition
data_frame_value_meets_condition = data_frame[data_frame[:quality] .> 7, :]
#data_frame_value_meets_condition = data_frame[(data_frame[:quality] .== 7) | (data_frame[:quality] .== 8), :]
#println(data_frame_value_meets_condition)

# Row Value In Set of Interest
set_of_interest = Set(7, 8)
data_frame_value_in_set = data_frame[findin(data_frame[:quality], set_of_interest), :]
#println(data_frame_value_in_set)

# Row Value Matches Specific Pattern
pattern = r"8"
data_frame_value_matches_pattern = data_frame[[ismatch(pattern, string(value)) for value in data_frame[:quality]], :]
#data_frame_value_matches_pattern = @where(data_frame, [ismatch(pattern, string(value)) for value in :quality])
println(data_frame_value_matches_pattern)

writetable(output_file, data_frame_value_matches_pattern)

The first filtering section demonstrates how you can use one condition or multiple conditions to filter for specific rows. In Julia, you precede the comparison operator with a period, for example .==, to do element-wise comparisons. To use multiple conditions you wrap each one in parentheses and combine them with ANDs, &, or ORs, |. The colon has the same meaning that it does in R and Python. In this case, we’re using it to retain all of the columns.

We use the findin function to determine which rows have the value 7 or 8 in the quality column in order to retain these rows.

The third filtering section demonstrates two slightly different ways to filter for rows based on a pattern. Both methods use row indexing, the ismatch function, and array comprehensions to look for the pattern in each of the elements in the quality column/array. The only real difference between the two methods is that in the first method we have to specify data_frame[…] twice, whereas the second method uses the @where meta-command from the DataFramesMeta package to enable us to refer to data_frame once and then refer to the quality column with :quality instead of the slightly more cumbersome data_frame[:quality].

Now run the following two commands to make the script executable and to run the script:

chmod +x filter_rows_data_frames.jl
./filter_rows_data_frames.jl winequality-red.csv output/output_Julia_DataFrames.csv

When you run these commands you’ll see similar output as you saw with base Julia printed to your Terminal screen, although it will be formatted differently. In addition, the header row and the set of rows matching the pattern in the final filtering section have been written to a CSV file in the output folder.

As you can see, when it comes to filtering for specific rows, the differences in syntax between Python and Julia are very slight. For example, Python’s “if value in set_of_interest” statements are “in(value, set_of_interest)” statements in Julia, and Python’s pattern matching “pattern.search()” statements are “ismatch()” statements in Julia. On the other hand, one difference to keep in mind is that for loops in Julia default to local scope so if you’ve defined a variable outside of a for loop and you need to use it inside the for loop, then you need to precede the variable name with the word global.

Now that we know how to read and write data in a CSV-formatted input file and filter for specific rows with R, Python, and Julia, the next step is to figure out how to filter for specific columns in these languages. Then we can move on to processing lots of files in a directory and also dealing with Excel files. We’ll cover these topics in future posts.

I’d like to thank the Julia users group, especially Nils Gudat and David Gold, for helping me figure out how to use the findin and ismatch functions to filter for specific rows while using the DataFrames package.

All Scripts and Output Files

Simplifying scikit-learn Predictive Modeling with skll APIs and Configuration Files

One blog post I really enjoyed reading last year is yhat’s post, Predicting customer churn with scikit-learn. In the post, the author demonstrates how to estimate, cross-validate, and measure the performance of three predictive models for classification in Python with the scikit-learn package. At the time, I was more familiar with how to do predictive modeling in R with the caret package, as described in the well-written book Applied Predictive Modeling. However, since I also knew some Python and was interested in learning how to use it for predictive modeling, yhat’s post was exactly what I needed to get started.

More recently, I read and followed along with the examples in Jeroen Janssens’s excellent book, Data Science at the Command Line. In Chapter 9, the author demonstrates how to use another Python package, the SciKit-Learn Laboratory (a.k.a. skll) package, to implement predictive modeling with scikit-learn from the command line. The skll package is very helpful because it provides an API interface to scikit-learn that simplifies the code you need to implement your predictive models. It also enables you to specify your modeling parameters in a configuration file so you can run your models from the command line.

Since I’d become familiar with how to implement predictive modeling with scikit-learn, I really enjoyed learning to use skll’s API and configuration file interfaces. To see how skll’s interfaces simplify the process of using the scikit-learn package to implement predictive modeling, let’s estimate, cross-validate, and measure the performance of four predictive models with scikit-learn and then perform similar operations with skll’s interfaces.

DATA SET

The data set we’ll use in this post comes from the publicly available wine quality data sets, which are available here. There is a file for red wines and a file for white wines. The data set used in this post is these two files concatenated together (with only one header row). The dependent variable in the original research took on integer values representing wine quality, so the data set lends itself to regression tasks. However, in this post we’ll demonstrate how to perform classification tasks by classifying wines as either red or white. You can download the data and prepare the final data set by entering the following commands in a Terminal window:

# Download two CSV files, winequality-red.csv and winequality-white.csv
# from the UCI Machine Learning Repository
# and save the files as wine-red.csv and wine-white.csv
parallel "curl -sL http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-{}.csv > wine-{}.csv" ::: red white

# Check that you now have wine-red.csv and wine-white.csv
# in your current working directory
find . -maxdepth 1 -name "wine*"

# For each file, convert uppercase to lowercase,
# convert semicolons to commas,
# convert spaces to underscores,
# remove double quotes, and
# add a column that says 'red' or 'white' depending on the file
for T in red white; do < wine-$T.csv tr '[A-Z]; ' '[a-z],_' | tr -d \" | sed "s/$/,${T}/" > wine-${T}-clean.csv; done

# Review the first five lines in each file to ensure the changes are correct
head -n 5 wine-{red,white}-clean.csv | fold

# Concatenate the red and white wine files into one file, wine-both-clean.csv
# Retain only one column heading
# Name the last column 'type' since it contains the words 'red' and 'white'
head -n 1 wine-red-clean.csv > wine-both-clean.csv; grep -v quality wine-red-clean.csv >> wine-both-clean.csv; grep -v quality wine-white-clean.csv >> wine-both-clean.csv; sed -i -e "1s/,red/,type/" wine-both-clean.csv

# Review the row counts for red (1,599) and white (4,898) wines
parallel --tag "grep -c {} wine-both-clean.csv" ::: red white

Now that we have our data set, wine-both-clean.csv, let’s move on to estimating our predictive models with scikit-learn.

SCIKIT-LEARN

In order to understand the ways in which skll simplifies the syntax needed to perform predictive modeling with the scikit-learn package, let’s first estimate, cross-validate, and measure the performance of four predictive models with scikit-learn. To do so, copy and paste the following code into a text editor and save the file as classify_wine_scikit_learn.py:

#!/usr/bin/env python
import sys
import time
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import KFold
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression as LogR
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

input_file = sys.argv[1]

# Read the wine quality data into a Pandas data frame
wine_data_frame = pd.read_csv(input_file)

# Create a red wine binary variable named 'y' for classification
wine_type = wine_data_frame['type']
y = np.where(wine_type == 'red', 1., 0.)

# Specify X, the matrix of predictor variables
features = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
        'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
        'ph', 'sulphates', 'alcohol']
wine_features = wine_data_frame[features]
X = wine_features.as_matrix().astype(np.float)
print "\nFeature space holds %d observations and %d features" % (X.shape)
print "\nAccuracy if you predict all 0s (baseline or benchmark): %0.3f" % (accuracy_score(y, [0 for value in y.tolist()]))
print "\n*****************************************\n"

# Center and scale the predictor variables
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Specify a function for k-fold cross-validation
def run_cv(X, y, clf, **kwargs):
    kfold = KFold(len(y), n_folds=10, shuffle=True, random_state=123456789)
    print str(clf)
    y_pred = y.copy()
    from time import time
    t0 = time()
    fold = 0
    for train_index, test_index in kfold:
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        clf.fit(X_train, y_train)
        y_pred[test_index] = clf.predict(X_test)
        fold += 1
        print "Finished fold:", str(fold)
    print "Cross-validation took %0.2f seconds." % (time() - t0)
    return y_pred

# K NEAREST NEIGHBORS
# Estimate predicted values for k-nearest-neighbors classification model with cross-validation
y_pred_knn = run_cv(X, y, KNN(n_neighbors=6))

# Calculate performance metrics for k-nearest-neighbors classification model
print "\nK-nearest-neighbors (accuracy): " + "%.3f" % accuracy_score(y, y_pred_knn)
print "\nKNN confusion matrix: "
print(confusion_matrix(y_pred_knn, y))
target_names = ['White (0)', 'Red (1)']
print "\nKNN classification report: "
print(classification_report(y, y_pred_knn, target_names=target_names))
print "\n*****************************************\n"

# LOGISTIC REGRESSION
# Estimate predicted values for logistic/logit classification model with cross-validation
y_pred_logr = run_cv(X, y, LogR(random_state=123456789))

# Calculate performance metrics for logistic/logit classification model
print "\nLogistic (accuracy): " + "%.3f" % accuracy_score(y, y_pred_logr)
print "\nLogistic confusion matrix: "
print(confusion_matrix(y_pred_logr, y))
target_names = ['White (0)', 'Red (1)']
print "\nLogistic classification report: "
print(classification_report(y, y_pred_logr, target_names=target_names))
print "\n*****************************************\n"

# RANDOM FOREST
# Estimate predicted values for random forest classification model with cross-validation
y_pred_rf = run_cv(X, y, RFC(n_estimators=500, random_state=123456789))

# Calculate performance metrics for random forest classification model
print "\nRandom forest (accuracy): " + "%.3f" % accuracy_score(y, y_pred_rf)
print "\nRF confusion matrix: "
print(confusion_matrix(y_pred_rf, y))
target_names = ['White (0)', 'Red (1)']
print "\nRF classification report: "
print(classification_report(y, y_pred_rf, target_names=target_names))
print "\n*****************************************\n"

# SUPPORT VECTOR MACHINES
# Specify grid parameters for the support vector machines classifier (SVC)
# Reference: http://scikit-learn.org/0.11/tutorial/statistical_inference/model_selection.html#grid-search
C_values = [0.01, 0.1, 1.0, 10.0, 100.0]
gamma_values = [0.01, 0.1, 1.0, 10.0, 100.0]
param_grid = dict(kernel=['rbf'], gamma=gamma_values, C=C_values)
svc_gscv = GridSearchCV(SVC(cache_size=1000, class_weight='auto', random_state=123456789), param_grid=param_grid, cv=3, scoring='accuracy', n_jobs=-1)

# Estimate predicted values for support vector machines classification model with cross-validation
y_pred_svc = run_cv(X, y, svc_gscv)

# Calculate performance metrics for support vector machines classification model
print "\nSupport vector machines (accuracy): " + "%.3f" % accuracy_score(y, y_pred_svc)
print "\nSVM confusion matrix: "
print(confusion_matrix(y_pred_svc, y))
target_names = ['White (0)', 'Red (1)']
print "\nSVM classification report: "
print(classification_report(y, y_pred_svc, target_names=target_names))

This code is similar to the code discussed in yhat’s post, Predicting customer churn with scikit-learn, so instead of describing the lines of code in detail I’ll summarize the sections of code. At the top, we import all of the functions we’re going to use to manage and transform the data, estimate the models, perform k-fold cross-validation, and evaluate the performance of the models.

Next, we read the data into a Pandas data frame and create a new binary variable named ‘y’ that equals 1.0 for red wines and 0.0 for white wines. Next, we specify which variables are going to serve as the independent, explanatory, predictor variables and transform them into a matrix. At this point, we print the number of rows and columns in the analysis data set and print the baseline, benchmark accuracy we achieve by naïvely predicting all of the observations to be zeros (i.e. the majority of the observations are zeros). Hopefully, the performance of our predictive models will surpass this benchmark performance!
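
As a quick sanity check on that benchmark, the class balance explains it directly: the combined data set contains roughly 4,898 white wines out of 6,497 total, so predicting every wine to be white (i.e. all zeros) yields an accuracy of about 4,898 / 6,497 ≈ 0.75.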

Next, we center and scale the predictor variables so they all have a mean of zero and a standard deviation of one. These transformations help ensure that the learning algorithms and predictive performance of the statistical models aren’t influenced by the predictor variables’ units of measurement.
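
To see what this does concretely, here's a tiny toy example (the values are arbitrary and just for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])
scaled = StandardScaler().fit_transform(toy)
# After scaling, each column has a mean of zero and a standard deviation of one
print scaled.mean(axis=0)   # [ 0.  0.]
print scaled.std(axis=0)    # [ 1.  1.]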

Next, we define a function for performing k-fold cross-validation. The function specifies a particular random state to be consistent with skll’s default random state. It also includes a few print statements so you can see the folds being processed and see how long the cross-validation takes for different models. You can find an example of stratified k-fold cross-validation, a related procedure that attempts to balance the percentage of observations from each class in each fold, in Bugra’s blog post, An Introduction to Supervised Learning via Scikit Learn. This post also demonstrates how to produce plots of training and test errors, confusion matrices, and variable importance scores.
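
For reference, here is a minimal sketch of what the stratified variant looks like with the older scikit-learn cross_validation API used in this post (it isn't part of the script above, just an illustration):

from sklearn.cross_validation import StratifiedKFold

# StratifiedKFold splits on the class labels themselves, so each fold keeps
# roughly the same proportion of red (1) and white (0) wines
stratified_kfold = StratifiedKFold(y, n_folds=10)
for train_index, test_index in stratified_kfold:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # fit and predict here, just as in run_cv above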

Finally, we run cross-validation with four predictive models, (1) k-nearest neighbors, (2) logistic regression, (3) random forest, and (4) support vector machines and report on their performance with an accuracy score, a confusion matrix, and a classification report. The confusion matrix shows the number of correctly classified observations along the main diagonal and the classification report includes information on precision, recall, f1-score, and support for each category.
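
To make the layout of the confusion matrix concrete, here is a tiny toy example (not wine data) using scikit-learn's documented argument order, confusion_matrix(y_true, y_pred), where each row corresponds to a true class and each column to a predicted class:

from sklearn.metrics import confusion_matrix, classification_report

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]
print confusion_matrix(y_true, y_pred)
# [[2 1]
#  [0 3]]
print classification_report(y_true, y_pred, target_names=['White (0)', 'Red (1)'])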

One final point is that the support vector machine section demonstrates how to use grid search to select optimal parameters for the model. To be consistent with skll’s grid search defaults, I instructed grid search to use 3-fold cross-validation, a particular random state, and accuracy to determine the optimal values for C and gamma. This 3-fold cross-validation takes place within the 10-fold cross-validation taking place to measure the model’s predictive performance.

Make the script executable by typing the following on the command line and then hitting Enter:
chmod +x classify_wine_scikit_learn.py

Run the script by typing the following on the command line and then hitting Enter:
./classify_wine_scikit_learn.py wine-both-clean.csv

When you run the script you should see the following output printed to your Terminal window:

scikit-learn script output

As you can see, you can achieve an accuracy score of 0.75 by naïvely predicting all of the observations to be zero. Since the rest of the output runs off of the screen, here are the accuracy scores and processing times for the four predictive models:

Logistic Regression
Accuracy: 0.994
Processing time: 0.18 seconds

K-Nearest Neighbors
Accuracy: 0.994
Processing time: 0.50 seconds

Random Forest
Accuracy: 0.995
Processing time: 36.44 seconds

Support Vector Machines
Accuracy: 0.996
Processing time: 225.42 seconds

The output shows that, in this example, all of the models achieve similar accuracy scores. It also shows that logistic regression completes 3 times faster than k-nearest-neighbors and 1,252 times faster than support vector machines.

Now that we’ve seen how to estimate, cross-validate, and measure the performance of four predictive models in a Python script with scikit-learn, let’s take a look at how to do so with skll’s APIs to scikit-learn.

SKLL API

If you want to follow along with the next two sections you’ll need to install skll. Here are instructions for doing so: skll. There can be compatibility issues between scikit-learn and skll, depending on the versions you have installed. To follow along with this post, make sure you have scikit-learn==0.15.2 and skll==1.0.1. You can check which versions you have by typing the following on the command line and then hitting Enter: pip freeze. Once you’ve installed skll, copy and paste the following code into a text editor and save the file as classify_wine_skll.py:

#!/usr/bin/env python
import sys
import time
from skll.data.readers import Reader
from skll.learner import Learner

input_file = sys.argv[1]

file_reader = Reader.for_path(input_file, label_col='type')

training_data = file_reader.read()

number_of_folds = 10

def average_accuracy(fold_results):
    import numpy as np
    accuracy = []
    for fold_index in range(number_of_folds):
        accuracy.append(fold_results[fold_index][1])
    return np.mean(accuracy)

print "\nLogistic:"
logistic_learner = Learner('LogisticRegression', probability=False, feature_scaling=u'both')
t0 = time.time()
fold_result_list, grid_search_scores = logistic_learner.cross_validate(training_data, stratified=True, \
cv_folds=number_of_folds, grid_search=False, shuffle=True)
print "%d-fold cross-validation took %0.2f seconds" % (number_of_folds, time.time() - t0)
print "Results for each fold:"
print fold_result_list
print "Grid search scores, if used:"
print grid_search_scores
print "Average accuracy: %.3f" % average_accuracy(fold_result_list)

print "\nKNN:"
knn_learner = Learner('KNeighborsClassifier', probability=False, feature_scaling=u'both')
t0 = time.time()
fold_result_list, grid_search_scores = knn_learner.cross_validate(training_data, stratified=True, \
cv_folds=number_of_folds, grid_search=False, shuffle=True)
print "%d-fold cross-validation took %0.2f seconds" % (number_of_folds, time.time() - t0)
print "Results for each fold:"
print fold_result_list
print "Grid search scores, if used:"
print grid_search_scores
print "Average accuracy: %.3f" % average_accuracy(fold_result_list)

print "\nRF:"
rf_learner = Learner('RandomForestClassifier', probability=False, feature_scaling=u'both')
t0 = time.time()
fold_result_list, grid_search_scores = rf_learner.cross_validate(training_data, stratified=True, \
cv_folds=number_of_folds, grid_search=False, shuffle=True)
print "%d-fold cross-validation took %0.2f seconds" % (number_of_folds, time.time() - t0)
print "Results for each fold:"
print fold_result_list
print "Grid search scores, if used:"
print grid_search_scores
print "Average accuracy: %.3f" % average_accuracy(fold_result_list)

print "\nSVC:"
svc_learner = Learner('SVC', probability=False, feature_scaling=u'both')
t0 = time.time()
fold_result_list, grid_search_scores = svc_learner.cross_validate(training_data, stratified=True, \
cv_folds=number_of_folds, grid_search=True, grid_search_folds=3, grid_objective=u'f1_score_micro', \
param_grid=[{'C':[0.01, 0.1, 1.0, 10.0, 100.0]}, {'gamma':[0.01, 0.1, 1.0, 10.0, 100.0]}], shuffle=True)
print "Grid search %d-fold cross-validation took %0.2f seconds" % (number_of_folds, time.time() - t0)
print "Results for each fold:"
print fold_result_list
print "Grid search scores, if used:"
print grid_search_scores
print "Average accuracy: %.3f" % average_accuracy(fold_result_list)

One of the first things you’ll notice is how much less code we need to run the same cross-validation exercise we ran in the previous section. At the top, we import skll’s Reader and Learner objects, which we’ll use to read in the training data and to specify the statistical learning models we want to cross-validate, respectively.

You’ll notice that when we use the Reader object to create a file reader object we specify the name of the column that’s to be used as the dependent, ‘y’ variable. In this example, we specify that the label column is ‘type’ because it contains the class labels. We also create a variable for the number of cross-validation folds so the code is a little more flexible and you can experiment with 5-fold, 10-fold, or 20-fold cross-validation more easily.

The next block of code defines a function to calculate the average accuracy of the model based on cross-validation. We create this simple helper function because, as described in the API’s documentation, the output of cross-validation is a pair of lists. The first list contains the confusion matrix, overall accuracy, per-label precision, recall, and F-measures, and model parameters for each fold. The second list contains the grid search scores, if any, for each fold. This helper function collects all of the accuracy scores from the cross-validation output and returns the average accuracy across all of the folds.
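
For example, based on the indexing the average_accuracy helper uses (i.e. assuming each fold's overall accuracy sits at index 1 of that fold's results), you could pull out a single fold's accuracy like this:

# Each element of fold_result_list describes one cross-validation fold
first_fold_results = fold_result_list[0]
print "Accuracy for the first fold: %.3f" % first_fold_results[1]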

The next four blocks of code initialize the four statistical learning models, cross-validate the models, and report their results. For example, the first learner is logistic regression. We’re not interested in calculating probabilities, so we set probability equal to False. To be consistent with the centering and scaling we did to the predictor variables in base scikit-learn, we set feature scaling to both.

Next, we perform cross-validation with the model. In this case, we set stratified equal to True to use stratified k-fold cross-validation so each fold contains nearly the same number of red and white wines. Since we set number_of_folds to 10 and cv_folds equals number_of_folds, we’re performing 10-fold cross-validation. For logistic regression, we’re not employing grid search, and we set shuffle equal to True so the observations are shuffled before they’re split into folds for cross-validation. Some of these options are the defaults, and some of them are changed to be consistent with the scikit-learn code we used above.

The k-nearest neighbors and random forest sections are identical to the logistic regression section, except for employing different predictive models. To be consistent with the scikit-learn code we used above, the support vector machines section contains some additional code to employ grid search for optimal C and gamma values.

Finally, we print how long cross-validation takes to complete, all of the results for each of the folds, the grid search scores (if any), and the average accuracy across all of the folds.

Make the script executable by typing the following on the command line and then hitting Enter:
chmod +x classify_wine_skll.py

Run the script by typing the following on the command line and then hitting Enter:
./classify_wine_skll.py wine-both-clean.csv

When you run the script you should see the following output printed to your Terminal window:

skll script output

Since the output runs off of the screen, here are the accuracy scores and processing times for the four predictive models:

Logistic Regression
Accuracy: 0.994
Processing time: 0.85 seconds

K-Nearest Neighbors
Accuracy: 0.993
Processing time: 1.22 seconds

Random Forest
Accuracy: 0.994
Processing time: 0.88 seconds

Support Vector Machines
Accuracy: 0.996
Processing time: 57.60 seconds

Once again the output shows that, in this example, all of the models achieve similar accuracy scores. In this case, logistic regression completes 1.4 times faster than k-nearest-neighbors and 68 times faster than support vector machines.

Now that we’ve seen how to estimate, cross-validate, and measure the performance of four predictive models in a Python script with skll’s APIs, let’s take a look at how to do so on the command line with skll’s configuration file.

SKLL CONFIGURATION FILE

As I mentioned at the top of this post, Jeroen Janssens demonstrates how to use skll’s configuration file set-up in his book, Data Science at the Command Line. You can also read skll’s own tutorial for using the configuration file set-up here.

skll’s configuration file set-up requires an input file to be formatted slightly differently than ours is now, so we need to modify our input file. Specifically, we need (1) an additional row index column with unique numbers for each row, (2) the wine type column to contain the floating-point numbers 1.0 and 0.0 instead of the strings red and white, and (3) the file to contain only the binary ‘type’ column and the predictor variables we want to use (i.e. it shouldn’t contain additional variables we don’t intend to use). Since our input file contains the ‘quality’ column and we don’t want to use it as a predictor, we need to remove it from the file.

You can create the modified version of the input file by entering the following one-line, piped command in a Terminal window:

< wine-both-clean.csv nl -s, -w4 -n rz -v0 | sed 's/0000,/id,/' | sed 's/red/1\./' | sed 's/white/0\./' | cut -d, -f1-12,14 > wine-both-clean-ids.csv

The nl command adds line numbers to each of the rows. -s, says add a comma after the line number (since we’re using a CSV file). -w4 says make the line numbers four digits wide (we know the input file contains 6,498 rows, so four digits wide will be sufficient). -n rz says insert line numbers according to the rz format, which is right-justified with leading zeros. -v0 says start the line numbering at 0, or in this case, 0000.

Next, the first sed command changes the line number in the first line from 0000, to the column heading id,. The second sed command changes the word red into ‘1.’, and the third sed command changes the word white into ‘0.’. The cut command separates the file into columns based on the comma delimiter and selects, or keeps, columns one to twelve and column fourteen (thereby removing the ‘quality’ variable in column thirteen). The result of these operations is redirected to a new output file called wine-both-clean-ids.csv. We’ll use this new file as our input file.
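
If you'd rather stay in Python, here is a hedged sketch of the same file preparation using Pandas (this is just an alternative to the shell pipeline above, not the command used in this post; note that the ids it produces are plain integers rather than the zero-padded ids produced by nl):

#!/usr/bin/env python
# A sketch of the same input-file preparation in Pandas
import numpy as np
import pandas as pd

wine = pd.read_csv('wine-both-clean.csv')

# Convert the 'type' strings into the floating-point numbers 1.0 (red) and 0.0 (white)
wine['type'] = np.where(wine['type'] == 'red', 1., 0.)

# Drop the 'quality' column since we don't want to use it as a predictor
wine = wine.drop('quality', axis=1)

# Write the file with a unique row index column named 'id'
wine.to_csv('wine-both-clean-ids.csv', index=True, index_label='id')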

Now that our input file is ready to be processed, let’s create our skll configuration file. To do so, copy and paste the following code into a text editor and save the file as classify_wine_skll.cfg (Note that the file extension is .cfg instead of .py):

[General]
    experiment_name = Wine
    task = cross_validate

[Input]
    train_directory = .
    featuresets = [["wine-both-clean-ids.csv"]]
    learners = ["RandomForestClassifier", "SVC", "KNeighborsClassifier", "LogisticRegression"]
    label_col = type
    id_col = id
    shuffle = True
    feature_scaling = both

[Tuning]
    objective = accuracy
    grid_search = True
    param_grids = [[], [{'C': [0.01, 0.1, 1.0, 10.0, 100.0],'gamma': [0.01, 0.1, 1.0, 10.0, 100.0]}], [], []]

[Output]
    log = results
    results = results
    predictions = results

skll’s configuration file contains four sections, General, Input, Tuning, and Output. You can learn more about each of these sections here. In the General section, we specify a name for the experiment. All of the output file names will start with the word we use. We also specify the task we want to perform, which is cross_validate.

In the Input section, we specify the train_directory, the folder where the configuration file can find the training data set. The period, ‘.’, is shorthand for the current folder. If you save the input file in a different folder, then you’ll have to supply the folder name (e.g. training_data or my_input_files). featuresets contains the name of the input file(s) that contain your features, i.e. explanatory variables. In this case they’re all in one file but, as skll’s tutorial demonstrates, they can be spread across multiple input files. learners is a list of the learners we want to use. label_col indicates which column contains the class labels. id_col indicates which column contains the unique row numbers. Once again, to be consistent with the preceding examples, shuffle equals True and feature_scaling equals both.

In the Tuning section, we specify we want to use accuracy as our objective. Since we want to use grid search for one of our models, we set grid_search to True. When grid search is True, you have to supply a list of lists to param_grids, one list for each model. A list can be empty if you don’t want to perform grid search for the model, but you still need a set of square brackets for each model. If you do want to perform grid search for a model, then you supply a dictionary of the parameters and values you want to search over and optimize. Since we’re performing grid search for the support vector machines model we supply ‘C’ and ‘gamma’ as two keys and lists of values to search over as the values associated with each key.

Finally, in the Output section, we specify the name of the folder we want all of the output to be saved in. In this case, all of the logs, results, and predictions will be saved in a folder named ‘results’ inside our current folder.

Run this configuration file (a.k.a. experiment) by typing the following on the command line and then hitting Enter:
run_experiment classify_wine_skll.cfg

After you hit Enter, you’ll see the following output printed to your command prompt window after all four of the models have completed:

Loading ./wine-both-clean-ids.csv… done
Loading ./wine-both-clean-ids.csv… done
Loading ./wine-both-clean-ids.csv… done
Loading ./wine-both-clean-ids.csv… done

In addition, several output files have been written in the ‘results’ folder inside your current folder. You can cd and/or ls into the results folder to view the output. skll creates four files for each model: a log file, a predictions file, a results file, and a results file formatted as JSON. skll also creates a summary file that contains details on each of the models and each of the cross-validation folds.

One of the measures we’ve been using to compare the models is average accuracy. This value is available in the summary file. Jeroen Janssens demonstrates how to access and print this value using some helpful command line tools in his book, but you can also print it out with basic Unix commands. To view the average accuracy for each of the models, type the following command on the command line and then hit Enter (assuming you named the output folder ‘results’ and it’s inside your current folder):

grep average results/Wine_summary.tsv | cut -f1,13 | awk -F'\t' '{ print $2 ": " $1 }'

The grep command filters for rows in the tab-delimited summary file that contain the word average. The cut command separates the columns in the file based on tabs, which we didn’t have to specify because it’s the default in cut. Then we select the first column (i.e. the average accuracy score) and the thirteenth column (i.e. the name of the classifier). Finally, the awk command re-arranges the two pieces of information so that what’s printed to the screen is the name of the classifier, followed by a colon and a space, and then the average accuracy score. When you run this command you’ll see the following output printed to the Terminal window:

skll configuration file output

Once again, all of the models have similar accuracy scores.

As we’ve seen, skll’s APIs and configuration file set-up encapsulate and simplify a lot of the basic scikit-learn code you need to read input data, transform variables, and estimate, cross-validate, and measure the performance of predictive models. Now that you’re familiar with skll’s general interfaces and syntax, try modifying the code to use your own input data, estimate different models, measure performance with a different metric, or perform other tasks, like predict or evaluate instead of cross_validate. Also be sure to check out the additional resources noted throughout this post for supplementary explanations and examples. Have a great time experimenting with skll’s interfaces!

Intro to Julia: Reading and Writing CSV Files with R, Python, and Julia

Last year I read yhat’s blog post, Neural networks and a dive into Julia, which provides an engaging introduction to Julia, a high-level, high-performance programming language for technical computing.

One aspect of the language I found intriguing was its aim to be as fast as C, as easy to use as Python, and as easy for statistics as R. I enjoyed seeing that Julia’s syntax is similar to Python’s, that it has several graphing packages, including a ggplot2-inspired package called Gadfly, and that it has several structured data, statistics, and machine learning packages, including DataFrames for dealing with tabular data and StatsBase and MLBase, which provide tools for statistics and machine learning operations.

There are lots of great resources for learning Julia. There are introductory books, like “Getting Started with Julia Programming,” by Ivo Balbaert, and “The Julia Express,” by Bogumił Kamiński. There are online tutorials, like Programming in Julia, Julia by Example, Learn Julia in Y minutes, and Learn Julia the Hard Way. There are also video tutorials, including two “Introduction to Julia” videos by David Sanders at SciPy 2014 and a set of ten Julia video tutorials recorded at MIT in 2013.

Since I’ve been using Python and R to analyze data, and Julia aspires to make the best features of these languages available in one place, I decided to try Julia to see if it would be worthwhile to incorporate it into my toolbox. One of the first things I wanted to learn was the new Julia syntax I’d need to use to perform the operations I’ve been carrying out in Python and R. Some of the most common operations I perform are reading text and delimited input files and writing results to output files. Since these are very common operations, let’s discuss how to perform these operations in R, Python, and Julia. In a later post we can discuss different ways to filter for specific rows and columns in these languages.

To begin, let’s create a folder to work in and name it “workspace”. Next, let’s download a publicly-available data set, e.g. wine-quality, into the folder. Let’s also create another folder called “output” inside the workspace folder where we can save the output files. At this point, we have the following set up:

folder_structure

R
Now that we have our workspace and an input file, let’s create R, Python, and Julia scripts to read the input data and write it to an output file. To create the R script, open a text editor and enter the following code:

#!/usr/bin/env Rscript
# For more information, visit: cbrownley.wordpress.com

#Collect the command line arguments into a variable called args
args <- commandArgs(trailingOnly = TRUE)
# Assign the first command line argument to a variable called input_file
input_file <- args[1]
# Assign the second command line argument to a variable called output_file
output_file <- args[2]

# Use R’s read.csv function to read the data into a variable called wine
# read.csv expects a CSV file with a header row, so
# sep = ',' and header = TRUE are default values
# stringsAsFactors = FALSE means don’t convert character vectors into factors
wine <- read.csv(input_file, sep = ',', header = TRUE, stringsAsFactors = FALSE)

# Use R’s write.csv function to write the data in the variable wine to the output file
# row.names = FALSE means don’t write an extra column of row names
# to the output file; we only want the original data columns
write.csv(wine, file = output_file, row.names = FALSE)

read_csv_R

Once you’ve pasted this code into the file, save the file as “read_csv.R” in the workspace folder and close the file. You can run this script by typing the following two commands on the command line, hitting Enter after each one:
chmod +x read_csv.R
./read_csv.R winequality-red.csv output/output_R.csv

When you run the script you won’t see any output printed to the screen, but the input data was written to a file called output_R.csv in the output folder.

A popular R package for reading and managing data is the data.table package. To use the data.table package instead of base R in the script, all you would need to do is add one require statement and edit the line that reads the contents of the input file into a variable:

#!/usr/bin/env Rscript
require(data.table)

args <- commandArgs(trailingOnly = TRUE)
input_file <- args[1]
output_file <- args[2]

wine <- fread(input_file)

write.csv(wine, file = output_file, row.names = FALSE)

To use this script instead of the first version, all you would need to do is save the file, e.g. as “read_csv_data_table.R”, run the same chmod command on this file, and then substitute this R script in the last command shown above:
./read_csv_data_table.R winequality-red.csv output/output_R_data_table.csv

Python
Now let’s create a Python script to perform the same operations. To create the Python script, open a text editor and enter the following code:

#!/usr/bin/env python
# For more information, visit: cbrownley.wordpress.com

# Import Python's built-in csv and sys modules, which have functions
# for processing CSV files and command line arguments, respectively
import csv
import sys

# Assign the first command line argument to a variable called input_file
input_file = sys.argv[1]
# Assign the second command line argument to a variable called output_file
output_file = sys.argv[2]

# Open the input file for reading and close automatically at end
with open(input_file, 'rU') as csv_in_file:
    # Open the output file for writing and close automatically at end
    with open(output_file, 'wb') as csv_out_file:
        # Create a file reader object for reading all of the input data
        filereader = csv.reader(csv_in_file)
        # Create a file writer object for writing to the output file
        filewriter = csv.writer(csv_out_file)
        # Use a for loop to process the rows in the input file one-by-one
        for row in filereader:
            # Write the row of data to the output file
            filewriter.writerow(row)

read_csv_Python

Once you’ve pasted this code into the file, save the file as “read_csv.py” and close the file. You can run this script by typing the following two commands on the command line, hitting Enter after each one:
chmod +x read_csv.py
./read_csv.py winequality-red.csv output/output_Python.csv

When you run the script you won’t see any output printed to the screen, but the input data was written to a file called output_Python.csv in the output folder.

A popular Python package for reading and managing tabular data is Pandas. Pandas provides many helpful functions, a couple of which simplify the syntax needed to read and write CSV files. For example, to perform the same reading and writing operations we performed above, the Pandas syntax is:

#!/usr/bin/env python
import sys
import pandas as pd

input_file = sys.argv[1]
output_file = sys.argv[2]

data_frame = pd.read_csv(input_file)
data_frame.to_csv(output_file, index=False)

To use this script instead of the first version, all you would need to do is save the file, e.g. as “read_csv_pandas.py”, run the same chmod command on this file, and then substitute this Python script in the last command shown above:
./read_csv_pandas.py winequality-red.csv output/output_Python_Pandas.csv

Julia
Now let’s create a Julia script to perform the same operations. To create the Julia script, open a text editor and enter the following code:

#!/usr/bin/env julia
# For more information, visit: cbrownley.wordpress.com

# Assign the first command line argument to a variable called input_file
input_file = ARGS[1]
# Assign the second command line argument to a variable called output_file
output_file = ARGS[2]

# Open the output file for writing
out_file = open(output_file, "w")
# Open the input file for reading and close automatically at end
open(input_file, "r") do in_file
    # Use a for loop to process the rows in the input file one-by-one
    for line in eachline(in_file)
        # Write the row of data to the output file
        write(out_file, line)
    # Close the for loop
    end
# Close the input file handle
end
# Close the output file handle
close(out_file)

read_csv_Julia

Once you’ve pasted this code into the file, save the file as “read_csv.jl” and close the file. You can run this script by typing the following two commands on the command line, hitting Enter after each one:
chmod +x read_csv.jl
./read_csv.jl winequality-red.csv output/output_Julia.csv

When you run the script you won’t see any output printed to the screen, but the input data was written to a file called output_Julia.csv in the output folder.

A popular Julia package for reading and managing tabular data, especially when the data may contain NAs, is DataFrames. DataFrames provides many helpful functions, a couple of which simplify the syntax needed to read and write CSV files. For example, to perform the same reading and writing operations we performed above, the DataFrames syntax is:

#!/usr/bin/env julia
using DataFrames

input_file = ARGS[1]
output_file = ARGS[2]

data_frame = readtable(input_file, separator = ',')
writetable(output_file, data_frame)

To use this script instead of the first version, all you would need to do is save the file, e.g. as “read_csv_data_frames.jl”, run the same chmod command on this file, and then substitute this Julia script in the last command shown above:
./read_csv_data_frames.jl winequality-red.csv output/output_Julia_DataFrames.csv

folder_structure_all_files

As you can see, when it comes to reading, processing, and writing CSV files, the differences in syntax between Python and Julia are very slight. For example, Python’s “with open()” statements are “open() do … end” statements in Julia, and for loops in Julia drop the colon required in Python and instead require the end keyword. These differences are so minor that I’ve found it very easy to pick up Julia syntax and transition back and forth between Python and Julia.

Now that we know how to read and write all of the data in a CSV-formatted input file with R, Python, and Julia, the next step is to figure out how to filter for specific rows and columns in these languages. Then we can move on to processing lots of files in a directory and also dealing with Excel files. We’ll cover these topics in future posts.

Getting into Class with Python, Flowers, and Dictionaries

A few weeks ago, I received an email inviting me to sign up for a programming-based training class. I clicked on the link to sign up and was directed to a registration page. The registration page included the usual fields, e.g. name and email, and it also required you to submit your code for two programming questions. I enjoyed having to write two snippets of Python code to sign up for the class, so this post is about these two small scripts.

Both questions were based on the ubiquitous Iris data set, which includes data on the sepal length, sepal width, petal length, and petal width for three species of flower, i.e. versicolor, setosa, and virginica. The file contains a header row and fifty rows of data for each species of flower:

iris_data_set
The first question asked you to count the number of rows associated with each type of flower. The second question asked you to calculate the average petal length for each type of flower. Let’s discuss these two questions in turn.

Question #1: Count the number of rows associated with each type of flower

Let’s take a look at the Python code I submitted to answer this question, and then I’ll explain the code:

#!/usr/bin/env python
from string import strip
import sys
input_file = sys.argv[1]
all_species = {}
with open(input_file, 'r') as file_handle:
    header = file_handle.readline()
    for row in file_handle:
        row_list = row.strip().split(',')
        species = row_list[4]
        if species not in all_species:
            all_species[species] = 1
        else:
            all_species[species] += 1

for species, count in all_species.items():
    print "%s: %d" % (species, count)

Whenever I’m asked to group or bin data into categories I consider using a Python dictionary (a.k.a. hash or associative array). Grouping or binning data in a dictionary is helpful because you don’t have to specify the number of categories ahead of time (i.e. new categories are added as keys in the dictionary as they’re encountered in the data) and you’re guaranteed to have distinct categories (i.e. dictionary keys must be unique).

Line one is the shebang line that instructs Unix and Mac computers where to find the Python program that will be used to run this code. Line two imports the strip function from Python’s built-in string module so we can remove the newline character from the end of each row of input. Line three imports the built-in sys module so we can use it to pass arguments from the command line into the script.

Line four uses the sys module to capture the first argument on the command line after the name of my script and assign the value to a variable called input_file. Since I named my script species_count.py and the input file is called iris.csv, the complete input for the command line is: python species_count.py iris.csv

Line five uses curly braces to create an empty dictionary called all_species.

Line six uses a with statement to open the input file for reading, ‘r’, and creates a file handle for reading the data in the file. Line seven uses the readline function to read the first row of data in the input file, i.e. the header row, into a variable called header. I need to remove the header row because I only need the data rows to answer the question.

Line eight is a for loop that loops over the data rows, one row at a time; therefore, lines nine to fourteen occur for every row in the input file (You can also tell that this is the case because lines nine to fourteen are indented beneath line eight). When each data row enters line nine it enters as a string, so first the strip function removes the newline character from the end of the string and then the split function converts the string into a list (a.k.a. array) by splitting the string on commas, which are the column delimiters. The resulting list is assigned to a variable called row_list.

Line ten uses list indexing to capture the fifth element in the list, i.e. the type of flower, and assign the value to a variable called species. Remember, index values start at zero, i.e. 0, 1, 2, 3, 4, …, so index number four refers to the fifth element in the list.

Line eleven is an if-else statement that tests whether the value in the variable species, i.e. the type of flower, is not already a key in the dictionary called all_species. If it is not, then line twelve adds the value as a key in the dictionary and assigns the key a value of one. If the value in the variable species is already a key in the dictionary, then line twelve is skipped and line fourteen adds one to the value associated with the key.

Let’s discuss a specific example to make sure the operations in lines eleven to fourteen are clear. Let’s say the first two data rows in the input file are for setosa flowers. When the first data row is processed, there are no keys in the dictionary so line eleven will be true and line twelve will add “setosa” as a key in the all_species dictionary and set the associated value to one. When the second data row, which is also a setosa row, is processed, “setosa” is already a key in the dictionary so line twelve is skipped and line fourteen adds one to the value associated with the key “setosa”.

Finally, lines sixteen and seventeen are a for loop and print statement for looping over the dictionary’s keys and values and printing them to the screen. The for loop uses the dictionary’s items function to unpack the dictionary’s keys and values into the variables species and count. Then line seventeen uses string formatting to format the value in species, i.e. the type of flower, as a string, %s, and the value in count as an integer, %d.

The following is the result of running this script on the input file:

species_count
As you can see, there are 50 data rows for each type of flower.
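
As an aside, Python’s built-in collections module provides a Counter class that encapsulates this same counting pattern; here is a minimal alternative sketch (not the code I submitted):

#!/usr/bin/env python
# An alternative species count using collections.Counter
import csv
import sys
from collections import Counter

input_file = sys.argv[1]
species_counts = Counter()

with open(input_file, 'r') as file_handle:
    filereader = csv.reader(file_handle)
    header = next(filereader)                 # skip the header row
    for row_list in filereader:
        species_counts[row_list[4]] += 1      # the species is the fifth column

for species, count in species_counts.items():
    print "%s: %d" % (species, count)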

Question #2: Calculate the average petal length for each type of flower

Let’s take a look at the Python code I submitted to answer this question, and then I’ll explain the code:

#!/usr/bin/env python
from string import strip
import sys
input_file = sys.argv[1]
all_species = {}
with open(input_file, 'r') as file_handle:
    header = file_handle.readline()
    for row in file_handle:
        row_list = row.strip().split(',')
        petal_length = row_list[2]
        species = row_list[4]
        if species not in all_species:
            all_species[species] = [float(petal_length), 1]
        else:
            all_species[species][0] += float(petal_length)
            all_species[species][1] += 1

for species, values in all_species.items():
    print '%s: %0.2f' % (species, values[0]/values[1])

Lines one to nine are identical to the lines of code we’ve already discussed, so I’ll focus on the new lines of code I used to calculate the average petal length for each type of flower.

Lines ten and eleven capture two specific elements from each data row once the row has been converted into a list variable. Line ten captures the third element in the list, i.e. the petal length, and line eleven captures the fifth element, i.e. the type of flower.

Line twelve is identical to line eleven in the previous script we discussed. It is an if-else statement that tests whether the value in the variable species, i.e. the type of flower, is not already a key in the dictionary called all_species. If it is not, then line thirteen adds the value as a key in the dictionary and assigns a list with two elements as the key’s associated value. The first element in the list is the value in the variable petal_length, converted to a floating-point number with the float function, and the second element is the number one.

If the value in the variable species is already a key in the dictionary, then line thirteen is skipped and lines fifteen and sixteen add the floating-point version of the value in the variable petal_length to the first element in the list, [0], associated with the key and add one to the second element in the list, [1], associated with the key, respectively.

Lines twelve to sixteen ensure that after all of the data rows are processed the dictionary contains three keys, i.e. for the three species of flowers in the input file, and each key’s value is a list with two elements. The first element in the list contains the sum of all of the petal lengths for that type of flower and the second element in the list contains the number of rows associated with that type of flower. Next, we’ll use these two values to calculate the average petal length for each type of flower.

Lines eighteen and nineteen are a for loop and print statement for looping over the dictionary’s keys and values and printing them to the screen. The for loop uses the dictionary’s items function to unpack the dictionary’s keys and values into the variables species and values. Line nineteen uses string formatting to format the two values passed into the print statement from the parentheses. The first value is the name of the species, formatted as a string, %s. The second value is the average petal length for the species, which is the result of dividing the two elements in each list and formatting the result as a floating-point number rounded to two decimal places, %0.2f.

The following is the result of running this script on the input file:

iris_avg_petal_length
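
As an aside, the same accumulate-then-divide pattern can also be written with collections.defaultdict; here is a minimal alternative sketch (again, not the code I submitted):

#!/usr/bin/env python
# An alternative average petal length calculation using collections.defaultdict
import csv
import sys
from collections import defaultdict

input_file = sys.argv[1]
petal_sums = defaultdict(float)   # sum of petal lengths for each species
row_counts = defaultdict(int)     # number of rows for each species

with open(input_file, 'r') as file_handle:
    filereader = csv.reader(file_handle)
    header = next(filereader)                 # skip the header row
    for row_list in filereader:
        species = row_list[4]
        petal_sums[species] += float(row_list[2])
        row_counts[species] += 1

for species in petal_sums:
    print '%s: %0.2f' % (species, petal_sums[species] / row_counts[species])
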
It was a lot of fun creating these two scripts to register for the class. Despite being short, they illustrate how you can use a Python dictionary to group or bin data into unique categories and how to calculate basic statistics. If you’re new to Python or the dictionary data structure, try downloading the Iris data set or another text or CSV file from the Internet and re-creating these scripts to practice organizing data into a dictionary. Once you get the hang of it, you’ll have another powerful data structure you can use to organize and analyze data. Have fun!

Too Many Files? Feed Them to Python’s Voracious Glob Module

In my previous post I mentioned that a coworker had recently emailed me a folder full of over two hundred Excel files and asked me to extract some relevant data from each file. I noted how undertaking that task manually would have been time-consuming and error-prone and described how exhilarating it was to accomplish the task quickly by writing some Python code.

I didn’t show how to process multiple files with Python in that post because it is easier to understand the code for processing multiple files once you’re familiar with code for processing one file. So that’s why in that post I demonstrated how to read and write a single CSV file with Python. With that knowledge under our belts, we’re now prepared to understand Python code for processing multiple CSV files.

One good way to learn to code in Python is to create small datasets on your laptop and then write a Python script to process or manipulate them in some way, so that’s what we’ll do here. The following example demonstrates one way to read multiple CSV files (with similarly formatted data), concatenate the data from the files, and write the results to an output file.

One assumption I make in this example is that you’ve already visited http://www.python.org/ and downloaded and installed the version of Python that is compatible with your computer’s operating system. Another assumption is that all of the input files are located in the same folder. Also, unlike in my previous post, the script in this example can handle commas embedded in column values because it imports Python’s built-in csv module, which makes it easier to handle numbers with embedded commas, e.g. $1,563.25.
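
To illustrate why that matters, here is a small sketch (the column names and values are made up for this example) showing that csv.reader keeps a quoted value like "$1,563.25" in one column instead of splitting it on the embedded comma:

import csv
from StringIO import StringIO

# One data row whose second column contains an embedded comma
sample = StringIO('Customer Name,Sale Amount\nJohn Smith,"$1,563.25"\n')

for row in csv.reader(sample):
    print row
# ['Customer Name', 'Sale Amount']
# ['John Smith', '$1,563.25']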

Ok, in order to process multiple CSV files we need to create multiple CSV files. Open Microsoft Excel and add the following data:

Figure I: 1st Input File – sales_january2014.csv

Now open the ‘Save As’ dialog box. In the location box, navigate to your Desktop so the file will be saved on your Desktop. In the format box, select ‘Comma Separated Values (.csv)’ so that Excel saves the file in the CSV format. Finally, in the ‘Save As’ or ‘File Name’ box, type “sales_january2014”. Click ‘Save’.

Ok, that’s one input file. Now let’s create a second input file. Open a new Excel workbook and add the following data:

Figure II: 2nd Input File – sales_february2014.csv

Now open the ‘Save As’ dialog box. In the location box, navigate to your Desktop so the file will be saved on your Desktop. In the format box, select ‘Comma Separated Values (.csv)’ so that Excel saves the file in the CSV format. Finally, in the ‘Save As’ or ‘File Name’ box, type “sales_february2014”. Click ‘Save’. Ok, now we have two CSV input files, one for January and one for February. We’ll stick with two input files in this example to keep it simple, but please keep in mind that the code in this example can handle many more files; that is, it will scale well.

Now that we have two CSV files to work with, let’s create a Python script to read the files and write their contents to an output file. Open your favorite text editor (e.g. Notepad) and add the following lines of code:

#!/usr/bin/python
import csv
import glob
import os
import sys

input_path = sys.argv[1]
output_file = sys.argv[2]

filewriter = csv.writer(open(output_file, 'wb'))
file_counter = 0
for input_file in glob.glob(os.path.join(input_path, '*.csv')):
    with open(input_file, 'rU') as csv_file:
        filereader = csv.reader(csv_file)
        if file_counter < 1:
            for row in filereader:
                filewriter.writerow(row)
        else:
            header = next(filereader, None)
            for row in filereader:
                filewriter.writerow(row)
    file_counter += 1

The first line is the shebang line that instructs Unix and Mac computers where to find the Python program that will be used to run this code. The next four lines import additional built-in Python modules so that we can use their methods and functions. You can read more about these and other built-in modules at: http://docs.python.org/2/library/index.html.

The sixth line uses argv from the sys module to grab the first piece of information after the script name on the command line, the path to and name of the input folder, and assigns it to the variable input_path. Similarly, the seventh line grabs the second piece of information after the script name, the path to and name of the output file, and assigns it to the variable output_file.

The eighth line opens the output file in write ‘w’ mode and uses the csv module to create a writer object, filewriter, for writing to the output file. The ‘b’ enables a distinction between binary and text files for systems that differentiate between them; for systems that do not, the ‘b’ has no effect. The ninth line creates a variable, file_counter, to store the count of the number of files processed and initializes it to zero.

The tenth line creates a list of the input files to be processed and also starts a “for” loop for looping through each of the input files. There is a lot going on in this one line, so let’s talk about how it works. os.path.join joins the two components between its parentheses. input_path is the path to the folder that contains the input files and ‘*.csv’ represents any file name that ends in ‘.csv’.

glob.glob expands the asterisk ‘*’, a Unix Shell wildcard character, in ‘*.csv’ into the actual file names. Together, glob.glob and os.path.join create a list of our two input files, e.g. [‘C:\Users\Clinton\Desktop\sales_january2014.csv’, ‘C:\Users\Clinton\Desktop\sales_february2014.csv’]. Finally, the “for” loop syntax executes the lines of code beneath this line for each of the input files in this list.
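
To make the combination of os.path.join and glob.glob more concrete, here is a minimal sketch you can run on its own (the folder path shown is just a hypothetical example):

import glob
import os

input_path = 'C:\\Users\\Clinton\\Desktop'

# os.path.join builds the search pattern, e.g. 'C:\\Users\\Clinton\\Desktop\\*.csv'
pattern = os.path.join(input_path, '*.csv')

# glob.glob expands the wildcard into a list of matching file names, e.g.
# ['C:\\Users\\Clinton\\Desktop\\sales_february2014.csv',
#  'C:\\Users\\Clinton\\Desktop\\sales_january2014.csv']
print glob.glob(pattern)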

The eleventh line uses a “with” statement to open each input file in read ‘r’ mode. The ‘U’ mode helps recognize newlines in case your version of Python is built without universal newlines. The twelfth line uses the csv module to create a reader object, filereader, for reading each input file.

The thirteenth line creates an “if-else” statement for distinguishing between the first input file and all subsequent input files. The first time through the “for” loop file_counter equals zero, which is less than one, so the “if” block is executed. The code in the “if” block writes every row of data in the first input file, including the header row, to the output file.

At the bottom of the “for” loop, after processing the first input file, we add one to file_counter. Therefore, the second time through the “for” loop file_counter is not less than one, so the “else” block is executed. The code in the “else” block uses the built-in next() function to read the first row, i.e. the header row, of the second and subsequent input files into the variable, header, so that it is not written to the output file. The remaining code in the “else” block writes the remaining rows in the input file, the rows of data beneath the header row, to the output file.

Now that we understand what the code is supposed to do, let’s save this file as a Python script and use it to process our two input files. To save the file as a Python script, open the ‘Save As’ dialog box. In the location box, navigate to your Desktop so the file will be saved on your Desktop. In the format box, select ‘All Files’ so that the dialog box doesn’t select a specific file type. Finally, in the ‘Save As’ or ‘File Name’ box, type process_many_csv_files.py. Click ‘Save’. Now you have a Python script you can use to process multiple CSV files.

Figure III: Python Script – process_many_csv_files.py

To use process_many_csv_files.py to read and write the contents of our two input files, open a Command Prompt (Windows) or Terminal (Mac) window. When the window opens the prompt will be in a particular folder, also known as a directory (e.g. “C:\Users\Clinton\Documents”). The next step is to navigate to the Desktop, where we saved the Python script.

To move between folders, you can use the ‘cd’ command, a Unix command which stands for change directory. To move up and out of the ‘Documents’ folder into the ‘Clinton’ folder, type the following and then hit Enter:

cd ..

That is, the letters ‘cd’ together followed by one space followed by two periods. The two periods ‘..’ stand for up one level. At this point, the prompt should look like “C:\Users\Clinton”. Now, to move down into a specific folder you use the same ‘cd’ command followed by the name of the folder you want to move into. Since the ‘Desktop’ folder resides in the ‘Clinton’ folder, you can move down into the ‘Desktop’ folder by typing the following and then hitting Enter:

cd Desktop

At this point, the prompt should look like “C:\Users\Clinton\Desktop” in the Command Prompt and we are exactly where we need to be since this is where we saved the Python script and two CSV input files. The next step is to run the Python script.

To run the Python script, type one of the following commands on the command line, depending on your operating system, and then hit Enter:

Windows:
python process_many_csv_files.py . sales_summary.csv

That is, type python, followed by a single space, followed by process_many_csv_files.py, followed by a single space, followed by a single period, followed by a single space, followed by sales_summary.csv, and then hit Enter. The single period refers to the current directory, your Desktop (i.e. the folder that contains your two input files).

Mac:
chmod +x process_many_csv_files.py
./process_many_csv_files.py . sales_summary.csv

That is, type chmod, followed by a single space, followed by +x, followed by a single space, followed by process_many_csv_files.py, and then hit Enter. This command makes the Python script executable. Then type ./process_many_csv_files.py, followed by a single space, followed by a single period, followed by a single space, followed by sales_summary.csv, and then hit Enter:

After you hit Enter, you should not see any new output in the Command Prompt or Terminal window. However, if you minimize all of your open windows and look at your Desktop there should be a new CSV file called sales_summary.csv. Open the file. The contents should look like:

Figure IV: Output File – sales_summary.csv

As you can see, a single header row and the six rows of data from the two input files were successfully written to the output file, sales_summary.csv. Often, this procedure of concatenating multiple input files into a single output file is all you need to do to begin your analysis. However, sometimes you may not need all of the rows or columns in the output file. Or you may need to modify the data or perform a calculation before writing it to the output file. In many cases, you would only need to make slight modifications to the code discussed above to alter the data written to your output file.
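
For example, here is a hedged sketch of one such slight modification: it keeps the header row from the first input file and then only writes data rows whose second column, which I'm assuming here holds the sale amount, exceeds 1,000 (the column position and threshold are assumptions for illustration only):

#!/usr/bin/python
# A sketch of a slight modification to process_many_csv_files.py
import csv
import glob
import os
import sys

input_path = sys.argv[1]
output_file = sys.argv[2]

filewriter = csv.writer(open(output_file, 'wb'))
file_counter = 0
for input_file in glob.glob(os.path.join(input_path, '*.csv')):
    with open(input_file, 'rU') as csv_file:
        filereader = csv.reader(csv_file)
        header = next(filereader, None)
        # Only write the header row once, from the first input file
        if file_counter < 1:
            filewriter.writerow(header)
        for row in filereader:
            # Strip the dollar sign and embedded commas before converting to a number
            sale_amount = float(row[1].replace('$', '').replace(',', ''))
            if sale_amount > 1000.0:
                filewriter.writerow(row)
    file_counter += 1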

In this example there were only two input files, but the code generalizes to basically as many input files as your computer can handle. So if you need to concatenate the data in a few dozen, hundred, or thousand CSV files the code is basically re-useable as-is. This ability to automate and scale repetitive procedures on files and data is one of the great advantages of learning to code (even a little bit).

Now you know that it only takes a few lines of code to automate a process that, given a larger number of files, would be time-consuming and error-prone or even infeasible. Being able to quickly accomplish a task that would be difficult or impossible to do manually is empowering. Add in the benefit of eliminating manual “copy/paste” errors (once you debug your code), and the new capability is really exciting. Having read this post, I hope you’re now more familiar with Python and eager to begin using it. If you have any questions, please reply to this post.