Lives On The [Fault] Line: A Geospatial Analysis of the San Andreas Fault in Python

Have you watched the action-packed movie, San Andreas? In the movie, California’s San Andreas Fault triggers a devastating, magnitude nine earthquake, the largest in the state’s history. Rather than critique the movie’s thrilling scenes, in this post, I’d like to explore answers to the question, how many people live within ¼, ½, ¾, and 1 mile of the fault, to demonstrate conducting a geospatial analysis in Python.

To start, let’s set up a dedicated analysis environment and download the input data, including shapefiles for California’s census tracts and the San Andreas Fault, as well as 2016 population data for the census tracts.

Project Environment

To create a dedicated analysis environment, let’s create a new folder for our files and an isolated Python environment with conda or virtualenv. The following commands create a new folder named san_andreas and activate an isolated Python 2.7 environment named geo that contains packages we’ll need, such as pandas, matplotlib, and geopandas.

mkdir san_andreas
cd san_andreas

conda create -n geo python=2.7 pandas matplotlib gdal geopandas ipython jupyter notebook

source activate geo

Now that we’ve set up our analysis environment, let’s download the shapefiles and population data we’ll need for our analysis.

Data

Download Quaternary Faults Shapefile

The geographic data for the San Andreas Fault are available within a shapefile available at the U.S. Geological Survey: https://earthquake.usgs.gov/hazards/qfaults/ The shapefile contains information on many large faults and associated folds in the United States, so we’ll have to extract the specific records associated with the San Andreas fault. For now, let’s download and unzip the USGS’s file with the following two commands:

curl 'https://earthquake.usgs.gov/static/lfs/nshm/qfaults/qfaults.zip' -o qfaults.zip
unzip qfaults.zip -d qfaults

Citation: U.S. Geological Survey and California Geological Survey, 2006, Quaternary fault and fold database for the United States, accessed Jan 4, 2018, from USGS web site: https://earthquake.usgs.gov/hazards/qfaults/

qfaults

Download California Census Tracts Shapefile

The geographic data for California’s census tracts are available within a shapefile available at the U.S. Census Bureau: https://www2.census.gov/geo/tiger/GENZ2016/shp/ The Census Bureau provides cartographic boundary files, simplified representations of geographic areas, for various geographies, such as state, county, census tract, legislative district, school district, and block group. We’ll conduct our analysis at the census tract level so our measurements occur over relatively small geographic areas, but we’ll present our results at the county level since people are more familiar with California’s counties. Let’s download and unzip the Census Bureau’s file with the following two commands:

curl 'https://www2.census.gov/geo/tiger/GENZ2016/shp/cb_2016_06_tract_500k.zip' -o cb_2016_06_tract_500k.zip
unzip cb_2016_06_tract_500k.zip -d cb_2016_06_tract_500k

The filename describes the data in the file. It’s a cartographic boundary (cb_) file from 2016 (2016_) for the State of California (06_) at the census tract level (tract_) at a resolution level of 1:500,000 (500k).

california_census_tracts

Download California Census Tracts’ Populations

The 2016 population data for California’s census tracts are available from the U.S. Census Bureau’s American Community Survey. You can use the American FactFinder’s Guided Search (https://factfinder.census.gov/faces/nav/jsf/pages/guided_search.xhtml) to download the data, or you can use the following command:

curl 'https://api.census.gov/data/2016/acs/acs5?get=B01003_001E&for=tract:*&in=state:06' -o cb_2016_06_tract_B01003.json

The API call describes the data we’re requesting. We’re requesting 2016 total population estimates (B01003_001E) for California’s (state:06) census tracts (tract:*) from the American Community Survey’s 2012-2016 5-year Estimates (acs5). You can view additional API call examples for the ACS’s 5-year estimates at: https://api.census.gov/data/2016/acs/acs5/examples.html

california_census_tracts_population

Geospatial Analysis

We’re finally ready to begin our analysis and estimate how many people live within ¼, ½, ¾, and 1 mile of the San Andreas Fault! To begin, let’s open IPython or a Jupyter Notebook and import the packages we’ll need:

ipython

Let’s import geopandas, pandas, and matplotlib. We’ll need geopandas to read and write spatial data, manage data projections (i.e. mapping coordinates to locations on Earth), and to merge, manipulate, and aggregate spatial data. We’ll need pandas to read the population data and to select, merge, and manage multiple data files. We’ll need matplotlib to create plots of the data and geometries.

from geopandas import read_file
import pandas as pd
import matplotlib.pyplot as plt

Process San Andreas Fault Shapefile

Now we can import, select, and clean the data associated with the San Andreas Fault. We’ll use geopandas’ read_file function to read the shapefile. The file contains data for several large faults and folds in the United States, so let’s search for “san andreas” in the faultname column to filter for the data associated with the San Andreas Fault. The file also includes more columns than we need, so let’s select and rename the columns we want to retain. Finally, let’s use geopandas’ to_crs function to project the data to EPSG:3310, California Albers, which is appropriate for displaying and calculating distances in California.

qfaults = read_file('/Users/clinton/Downloads/qfaults/qfaults.shp')
san_andreas = qfaults.loc[qfaults['faultname'].str.contains('san andreas', case=False), :]
san_andreas_columns_to_keep = ['fault_id', 'section_id', 'faultname', 'sectionnam', 'geometry']
san_andreas = san_andreas[san_andreas_columns_to_keep]
san_andreas.columns = ['fault_id', 'section_id', 'fault_name', 'section_name', 'geometry']
san_andreas = san_andreas.to_crs('+init=epsg:3310')

san_andreas_fault

Create San Andreas Fault Buffers

We’re going to create buffers of varying distances around the San Andreas linestring to calculate the amount of overlap between each buffer and census tract. We’ll use this amount of overlap to estimate the portion of the population in each census tract that’s within a specific distance of the fault. Since we’re going to create several buffers, let’s write a function to create the buffers.

Inside the function, create_mp_buffer, we use geopandas’ buffer method to create a buffer around the San Andreas linestring that’s a specific number of meters away from the linestring’s coordinates. Once we’ve created this new set of geometries, we use geopandas’ unary_union method to combine them into a single multipolygon.

def create_mp_buffer(geo, meters):
    segments_with_buffers = geo.buffer(meters)
    multi_polygon = segments_with_buffers.unary_union
    return multi_polygon

Now that we have a function to create buffers around the San Andreas Fault, let’s use it to create buffers that are ¼, ½, ¾, and 1 mile away from the fault. The function uses meters instead of miles, so the numbers in the functions are the respective distances in meters.

quarter_mile = create_mp_buffer(san_andreas, 402.336)
half_mile = create_mp_buffer(san_andreas, 804.672)
three_quarter_mile = create_mp_buffer(san_andreas, 1207.008)
one_mile = create_mp_buffer(san_andreas, 1609.34)

san_andreas_buffers_in_san_mateo_and_san_bernardino_counties

Process California Census Tracts Shapefile

Now that we’ve processed the fault data, let’s turn our attention to the California census tracts data. The processing is similar to the fault data. We read the shapefile, rename the columns, convert the county and tract IDs to integers, and project the data to EPSG:3310, California Albers. We need to convert the county and tract IDs to a specific data type to facilitate the merge between these data and the population data. Finally, we need to project these data to California Albers because all of our geographic data need to be in the same projection to ensure our geometric manipulations, set operations, and distance calculations are correct for our area of interest, California.

ca_tracts = read_file('/Users/clinton/Downloads/cb_2016_06_tract_500k/cb_2016_06_tract_500k.shp')
ca_tracts.columns = ['state_id', 'county_id', 'tract_id', 'aff_geo_id', 'geo_id', 'tract_id_float', 'lsad', 'land_area', 'water_area', 'geometry']
ca_tracts['county_id'] = ca_tracts.county_id.astype(int)
ca_tracts['tract_id'] = ca_tracts.tract_id.astype(int)
ca_tracts = ca_tracts.to_crs('+init=epsg:3310')

Process California Census Tracts’ Populations

Now we can turn our attention to the population data. Let’s use pandas to read the data into a data frame, skipping the first row and selecting the population, county ID, and tract ID columns. Finally, let’s convert the population data into floating-point numbers and the county and tract IDs into integers to facilitate calculations and data frame merges, respectively.

ca_tracts_population = pd.read_json('/Users/clinton/Downloads/cb_2016_06_tract_500k/cb_2016_06_tract_B01003.json')
ca_tracts_population = ca_tracts_population.iloc[1:,[0,2,3]]
ca_tracts_population.columns = ['population_2016', 'county_id', 'tract_id']
ca_tracts_population['population_2016'] = ca_tracts_population.population_2016.astype(float)
ca_tracts_population['county_id'] = ca_tracts_population.county_id.astype(int)
ca_tracts_population['tract_id'] = ca_tracts_population.tract_id.astype(int)

Merge California Census Tracts and Populations

Now that we have a GeoDataFrame with California’s census tracts and a separate DataFrame with the census tracts’ 2016 population values, let’s merge the two data frames so all of the data are in one GeoDataFrame. Since there are similar tract ID numbers for different counties, e.g. county 1 tract 1 and county 2 tract 1, we need to merge the data frames on both county ID and tract ID.

ca_tracts_merged = ca_tracts.merge(ca_tracts_population, on=['county_id', 'tract_id'])

Calculate Populations In The ¼, ½, ¾, and 1 Mile Buffers

Now that we have our California census tracts data and our four San Andreas Fault buffers, let’s calculate, for each buffer region, how much of the buffer area overlaps with each census tract area. Then we can multiply the amount of area overlap by the census tract population to estimate the number of people in the census tract who live within that distance of the San Andreas Fault.

This calculation assumes the population is evenly distributed across the census tract, which isn’t necessarily true, so the result is only an approximation. At the same time, we’re using census tracts instead of counties for this calculation because, since their geographic areas are smaller, the error in this assumption shouldn’t be as great as it would be with counties.

The following for loop iterates over the four fault buffers (i.e. ¼, ½, ¾, and 1 mile from the fault) and, for each one, calculates the area of intersection between the buffer and each census tract, divides the intersection area by the census tract area to calculate the fraction of the census tract area contained in the intersection, and then multiplies this decimal number by the census tract population to estimate the number of people who live within the specified distance from the fault. The code also adds all of these calculated geometries and values as columns in a new GeoDataFrame named merged.

overlap_mps = [quarter_mile, half_mile, three_quarter_mile, one_mile]
overlap_mps_str = ['quarter_mile', 'half_mile', 'three_quarter_mile', 'one_mile']

for idx, mp in enumerate(overlap_mps):
    overlap = ca_tracts_merged['geometry'].intersection(mp)
    overlap.name = overlap_mps_str[idx]
    if idx == 0:
        merged = ca_tracts_merged.join(overlap)
        merged['tract_area'] = merged.geometry.area
    else:
        merged = merged.join(overlap)
    merged[overlap_mps_str[idx]+'_buffer_area'] = [geo.area for geo in merged[overlap_mps_str[idx]]]
    merged[overlap_mps_str[idx]+'_pct_overlap'] = merged[overlap_mps_str[idx]+'_buffer_area'] /     merged['tract_area']
    merged[overlap_mps_str[idx]+'_affected_pop'] = [round(val) for val in     merged[overlap_mps_str[idx]+'_pct_overlap'] * merged['population_2016']]

Up to this point, we’ve been focused on the census tracts so we haven’t concerned ourselves with having easy-to-read county names. However, since people are more familiar with counties than census tracts, let’s map the county IDs to county names so we can present the results at the county level. Let’s extract the county ID from the geo_id and then create a new column named county that contains the county name, mapped from a dictionary that associates county IDs with county names. The comment line shows where we need to create the dictionary, but I’m going to provide the dictionary at the bottom of this post because it’s long and may be distracting here.

merged['county_id'] = merged['geo_id'].str.slice(2,5)
# CREATE county_mapping HERE
merged['county'] = merged['county_id'].apply(lambda id: county_mapping[id])

Aggregate Data To County Level

Now that we have a column of county names, we can use geopandas’ dissolve function to aggregate the data from the census tract level to the county level. We’ll use the sum function to sum the population values for each of the distances from the fault within each county.

counties = merged.dissolve(by='county', aggfunc='sum')

Results

We’re finally in a position to explore answers to the question that prompted this analysis, namely, how many people live within ¼, ½, ¾, and 1 mile from the San Andreas Fault! First, let’s review the state-wide results. The results suggest that approximately 120,000 people live within ¼ mile, 209,000 people live within ½ mile, 300,000 people live within ¾ mile, and 389,000 people live within one mile of the fault.

counties.loc[counties['one_mile_affected_pop'] > 0.0, ['quarter_mile_affected_pop', 'half_mile_affected_pop', 'three_quarter_mile_affected_pop', 'one_mile_affected_pop']].sum()

population_within_distances_of_san_andreas_fault

Next, let’s review the results by county, for counties where the approximate number of people living within one mile of the fault is greater than 1,000. The results suggest the four counties with the most people living close to the fault are San Mateo, San Bernardino, Los Angeles, and Riverside, with the close populations numbering in the tens of thousands. The remaining counties with close populations over 1,000 include Santa Cruz, Santa Clara, Kern, San Benito, San Luis Obispo, Sonoma, Marin, and Monterey.

counties.loc[counties['one_mile_affected_pop'] > 1000.0, ['quarter_mile_affected_pop', 'half_mile_affected_pop', 'three_quarter_mile_affected_pop', 'one_mile_affected_pop']].sort_values(by=['one_mile_affected_pop'], ascending=False)

populations_within_distances_of_san_andreas_fault_by_county

Conclusion

This post explored the question of how many people live within ¼, ½, ¾, and 1 mile from the San Andreas Fault to demonstrate how to use geopandas to conduct a geospatial analysis in Python. The post is meant to illustrate the functionality you can use to explore interesting geospatial questions, rather than provide robust answers to this specific question. There are many other applications for this type of analysis, e.g. exploring the number of people or houses near a coastline, a roadway or transit line, or a utility line. I hope this post has piqued your interest in conducting your own geospatial analysis. If you do have an example to share, please share it because I enjoy reading about others’ projects. Thank you for reading!

San Andreas Fault

By IkluftOwn work, GFDL, Link

county_mapping = {
'001': 'Alameda',
'003': 'Alpine',
'005': 'Amador',
'007': 'Butte',
'009': 'Calaveras',
'011': 'Colusa',
'013': 'Contra Costa',
'015': 'Del Norte',
'017': 'El Dorado',
'019': 'Fresno',
'021': 'Glenn',
'023': 'Humboldt',
'025': 'Imperial',
'027': 'Inyo',
'029': 'Kern',
'031': 'Kings',
'033': 'Lake',
'035': 'Lassen',
'037': 'Los Angeles',
'039': 'Madera',
'041': 'Marin',
'043': 'Mariposa',
'045': 'Mendocino',
'047': 'Merced',
'049': 'Modoc',
'051': 'Mono',
'053': 'Monterey',
'055': 'Napa',
'057': 'Nevada',
'059': 'Orange',
'061': 'Placer',
'063': 'Plumas',
'065': 'Riverside',
'067': 'Sacramento',
'069': 'San Benito',
'071': 'San Bernardino',
'073': 'San Diego',
'075': 'San Francisco',
'077': 'San Joaquin',
'079': 'San Luis Obispo',
'081': 'San Mateo',
'083': 'Santa Barbara',
'085': 'Santa Clara',
'087': 'Santa Cruz',
'089': 'Shasta',
'091': 'Sierra',
'093': 'Siskiyou',
'095': 'Solano',
'097': 'Sonoma',
'099': 'Stanislaus',
'101': 'Sutter',
'103': 'Tehama',
'105': 'Trinity',
'107': 'Tulare',
'109': 'Tuolumne',
'111': 'Ventura',
'113': 'Yolo',
'115': 'Yuba'
}

Advertisements

Data Wrangling in Julia based on dplyr Flights Tutorials

A couple of my favorite tutorials for wrangling data in R with dplyr are Hadley Wickham’s dplyr package vignette and Kevin Markham’s dplyr tutorial. I enjoy the tutorials because they concisely illustrate how to use a small set of verb-based functions to carry out common data wrangling tasks.

I tend to use Python to wrangle data, but I’m exploring the Julia programming language so I thought creating a similar dplyr-based tutorial in Julia would be a fun way to examine Julia’s capabilities and syntax. Julia has several packages that make it easier to deal with tabular data, including DataFrames and DataFramesMeta.

The DataFrames package provides functions for reading and writing, split-apply-combining, reshaping, joining, sorting, querying, and grouping tabular data. The DataFramesMeta package provides a set of macros that are similar to dplyr’s verb-based functions in that they offer a more convenient, readable syntax for munging data and chaining together multiple operations.

Data

For this tutorial, let’s following along with Kevin’s tutorial and use the hflights dataset. You can obtain the dataset from R with the following commands or simply download it here: hflights.csv

install.packages("hflights")
library(hflights)
write.csv(hflights, "hflights.csv")

Load packages and example dataset

To begin, let’s start the Julia REPL, load the DataFrames and DataFramesMeta packages, and load and inspect the hflights dataset:

using DataFrames
using DataFramesMeta

hflights = readtable("/Users/clinton/Documents/Julia/hflights.csv");
size(hflights)
names(hflights)
head(hflights)
describe(hflights)

hflights1b

The semicolon on the end of the readtable command prevents it from printing the dataset to the screen. The size command returns the number of rows and columns in the dataset. You can specify you only want the number of rows with size(hflights, 1) or columns with size(hflights, 2). This dataset contains 227,496 rows and 21 columns. The names command lists the column headings. By default, the head command prints the header row and six data rows. You can specify the number of data rows to display by adding a second argument, e.g. head(hflights, 10). The describe command prints summary statistics for each column.

@where: Keep rows matching criteria

AND: All of the conditions must be true for the returned rows

# Julia DataFrames approach to view all flights on January 1
hflights[.&(hflights[:Month] .== 1, hflights[:DayofMonth] .== 1), :]

# DataFramesMeta approach
@where(hflights, :Month .== 1, :DayofMonth .== 1)

hflights2

Julia’s DataFrames’ row filtering syntax is similar to R’s syntax. To specify multiple AND conditions, use “.&()” and place the filtering conditions, separated by commas, between the parentheses. Like dplyr’s filter function, DataFramesMeta’s @where macro simplifies the syntax and makes the command easier to read.

OR: One of the conditions must be true for the returned rows

# Julia DataFrames approach to view all flights where either AA or UA is the carrier
hflights[.|(hflights[:UniqueCarrier] .== "AA", hflights[:UniqueCarrier] .== "UA"), :]

# DataFramesMeta approach
@where(hflights, .|(:UniqueCarrier .== "AA", :UniqueCarrier .== "UA"))

hflights3

To specify multiple OR conditions, use “.|()” and place the filtering conditions between the parentheses. Again, the DataFramesMeta approach is more concise.

SET: The values in a column are in a set of interest

# Julia DataFrames approach to view all flights where the carrier is in Set(["AA", "UA"])
carriers_set = Set(["AA", "UA"])
hflights[findin(hflights[:UniqueCarrier], carriers_set), :]

# DataFramesMeta approach
@where(hflights, findin(:UniqueCarrier, carriers_set))

hflights4

To filter for rows where the values in a particular column are in a specific set of interest, create a Set with the values you’re interested in and then specify the column and your set of interest in the findin function.

PATTERN / REGULAR EXPRESSION: The values in a column match a pattern

# Julia DataFrames approach to view all flights where the carrier matches the regular expression r"AA|UA"
carriers_pattern = r"AA|UA"
hflights[[ismatch(carriers_pattern, String(carrier)) for carrier in hflights[:UniqueCarrier]], :]

# DataFramesMeta approach
@where(hflights, [ismatch(carriers_pattern, String(carrier)) for carrier in :UniqueCarrier])

hflights5

To filter for rows where the values in a particular column match a pattern, create a regular expression and then use it in the ismatch function in an array comprehension.

@select: Pick columns by name

# Julia DataFrames approach to selecting columns
hflights[:, [:DepTime, :ArrTime, :FlightNum]]

# DataFramesMeta approach
@select(hflights, :DepTime, :ArrTime, :FlightNum)

Julia’s DataFrames’ syntax for selecting columns is similar to R’s syntax. Like dplyr’s select function, DataFramesMeta’s @select macro simplifies the syntax and makes the command easier to read.

# Julia DataFrames approach to selecting columns
# first three columns
hflights[:, 1:3]
# pattern / regular expression
heading_pattern = r"Taxi|Delay"
hflights[:, [ismatch(heading_pattern, String(name)) for name in names(hflights)]]
# startswith
hflights[:, filter(name -> startswith(String(name), "Arr"), names(hflights))]
# endswith
hflights[:, filter(name -> endswith(String(name), "Delay"), names(hflights))]
# contains
hflights[:, filter(name -> contains(String(name), "Month"), names(hflights))]

# AND conditions
hflights[:, filter(name -> startswith(String(name), "Arr") && endswith(String(name), "Delay"), names(hflights))]
# OR conditions
hflights[:, filter(name -> startswith(String(name), "Arr") || contains(String(name), "Cancel"), names(hflights))]

hflights6

# DataFramesMeta approach
# first three columns
@select(hflights, 1:3)
# pattern / regular expression
heading_pattern = r"Taxi|Delay"
@select(hflights, [ismatch(heading_pattern, String(name)) for name in names(hflights)])
# startswith
@select(hflights, filter(name -> startswith(String(name), "Arr"), names(hflights)))
# endswith
@select(hflights, filter(name -> endswith(String(name), "Delay"), names(hflights)))
# contains
@select(hflights, filter(name -> contains(String(name), "Month"), names(hflights)))

# AND conditions
@select(hflights, filter(name -> startswith(String(name), "Arr") && endswith(String(name), "Delay"), names(hflights)))
# OR conditions
@select(hflights, filter(name -> startswith(String(name), "Arr") || contains(String(name), "Cancel"), names(hflights)))

hflights7

# Kevin Markham's multiple select conditions example
# select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))
# Julia Version of Kevin's Example
# Taxi or Delay in column heading
mask = [ismatch(r"Taxi|Delay", String(name)) for name in names(hflights)]
# Also include first three columns, i.e. Year, Month, DayofMonth
mask[1:3] = true
@select(hflights, mask)

These examples show you can select columns by position and name, and you can combine multiple conditions with AND, “&&”, or OR, “||”. Similar to filtering rows, you can select specific columns based on a pattern by using the ismatch function in an array comprehension. You can also use contains, startswith, and endswith in the filter function to select columns that contain, start with, or end with a specific text pattern.

“Chaining” or “Pipelining”

In R, dplyr provides, via the magrittr package, the %>% operator, which enables you to chain together multiple commands into a single data transformation pipeline in a very readable fashion. In Julia, the DataFramesMeta package provides the @linq macro and |> symbol to enable similar functionality. Alternatively, you can load the Lazy package and use an @> begin end block to chain together multiple commands.

# Chaining commands with DataFrameMeta’s @linq macro
@linq hflights[find(.!isna.(hflights[:,:DepDelay])), :] |>
@where(:DepDelay .> 60) |>
@select(:UniqueCarrier, :DepDelay)

# Chaining commands with Lazy’s @> begin end block
using Lazy
@> begin
hflights[find(.!isna.(hflights[:,:DepDelay])), :]
@where(:DepDelay .> 60)
@select(:UniqueCarrier, :DepDelay)
end

hflights8

These two blocks of code produce the same result, a DataFrame containing carrier names and departure delays for which the departure delay is greater than 60. In each chain, the first expression is the input DataFrame, e.g. hflights. In these examples, I use the find and !isna. functions to start with a DataFrame that doesn’t contain NA values in the DepDelay column because the commands fail when NAs are present. I prefer the @linq macro version over the @> begin end version because it’s so similar to the dplyr-magrittr syntax, but both versions are more succinct and readable than their non-chained versions. The screen shot shows how to assign the pipeline results to variables.

@orderby: Reorder rows

Both DataFrames and DataFramesMeta provide functions for sorting rows in a DataFrame by values in one or more columns. In the first pair of examples, we want to select the UniqueCarrier and DepDelay columns and then sort the results by the values in the DepDelay column in descending order. The last example shows how to sort by multiple columns with the @orderby macro.

# Julia DataFrames approach to sorting
sort(hflights[find(.!isna.(hflights[:,:DepDelay])), [:UniqueCarrier, :DepDelay]], cols=[order(:DepDelay, rev=true)])

# DataFramesMeta approach (add a minus sign before the column symbol for descending)
@linq hflights[find(.!isna.(hflights[:,:DepDelay])), :] |>
@select(:UniqueCarrier, :DepDelay) |>
@orderby(-:DepDelay)

# Sort hflights dataset by Month, descending, and then by DepDelay, ascending
@linq hflights |>
@orderby(-:Month, :DepDelay)

hflights9

DataFrames provides the sort and sort! functions for ordering rows in a DataFrame. sort! orders the rows, inplace. The DataFrames user guide provides additional examples of ordering rows, in ascending and descending order, based on multiple columns, as well as applying functions to columns, e.g. uppercase, before using the column for sorting.

DataFramesMeta provides the @orderby macro for ordering rows in a DataFrame. Specify multiple column names in the @orderby macro to sort the rows by multiple columns. Use a minus sign before a column name to sort in descending order.

@transform: Add new variables

Creating new variables in Julia DataFrames is similar to creating new variables in Python and R. You specify a new column name in square brackets after the name of the DataFrame and assign it a collection of values, sometimes based on values in other columns. DataFramesMeta’s @transform macro simplifies the syntax and makes the transformation more readable.

# Julia DataFrames approach to creating new variable
hflights[:Speed] = hflights[:Distance] ./ hflights[:AirTime] .* 60
hflights[:, [:Distance, :AirTime, :Speed]]

# Delete the variable so we can recreate it with DataFramesMeta approach
delete!(hflights, :Speed)

# DataFramesMeta approach
@linq hflights |>
@select(:Distance, :AirTime) |>
@transform(Speed = :Distance ./ :AirTime .* 60) |>
@select(:Distance, :AirTime, :Speed)

# Save the new column in the original DataFrame
hflights = @linq hflights |>
@transform(Speed = :Distance ./ :AirTime .* 60)

hflights10

The first code block illustrates how to create a new column in a DataFrame and assign it values based on values in other columns. The second code block shows you can use delete! to delete a column. The third example demonstrates the DataFramesMeta approach to creating a new column using the @transform macro. The last example shows how to save a new column in an existing DataFrame using the @transform macro by assigning the result of the transformation to the existing DataFrame.

@by: Reduce variables to values (Grouping and Summarizing)

dplyr provides group_by and summarise functions for grouping and summarising data. DataFrames and DataFramesMeta also support the split-apply-combine strategy with the by function and the @by macro, respectively. Here Julia versions of Kevin’s summarise examples.

# Julia DataFrames approach to grouping and summarizing
by(hflights[complete_cases(hflights[[Symbol(name) for name in names(hflights)]]), :],
:Dest,
df -> DataFrame(meanArrDelay = mean(df[:ArrDelay])))

# DataFramesMeta approach
@linq hflights[complete_cases(hflights[[Symbol(name) for name in names(hflights)]]), :] |>
@by(:Dest, meanArrDelay = mean(:ArrDelay))

hflights11

DataFrames and DataFramesMeta don’t have dplyr’s summarise_each function, but it’s easy to apply different functions to multiple columns inside the @by macro.

@linq hflights |>
@by(:UniqueCarrier,
meanCancelled = mean(:Cancelled), meanDiverted = mean(:Diverted))

@linq hflights[complete_cases(hflights[[Symbol(name) for name in names(hflights)]]), :] |>
@by(:UniqueCarrier,
minArrDelay = minimum(:ArrDelay), maxArrDelay = maximum(:ArrDelay),
minDepDelay = minimum(:DepDelay), maxDepDelay = maximum(:DepDelay))

hflights12

DataFrames and DataFramesMeta also don’t have dplyr’s n and n_distinct functions, but you can count the number of rows in a group with size(df, 1) or nrow(df), and you can count the number of distinct values in a group with countmap.

# Group by Month and DayofMonth, count the number of flights, and sort descending
# Count the number of rows with size(df, 1)
sort(by(hflights, [:Month,:DayofMonth], df -> DataFrame(flight_count = size(df, 1))), cols=[order(:flight_count, rev=true)])

# Group by Month and DayofMonth, count the number of flights, and sort descending
# Count the number of rows with nrow(df)
sort(by(hflights, [:Month,:DayofMonth], df -> DataFrame(flight_count = nrow(df))), cols=[order(:flight_count, rev=true)])

# Split grouping and sorting into two separate operations
g = by(hflights, [:Month,:DayofMonth], df -> DataFrame(flight_count = nrow(df)))
sort(g, cols=[order(:flight_count, rev=true)])

# For each destination, count the total number of flights and the number of distinct planes
by(hflights[find(.!isna.(hflights[:,:TailNum])),:], :Dest) do df
DataFrame(flight_count = size(df,1), plane_count = length(keys(countmap(df[:,:TailNum]))))
end

hflights13

While these examples reproduce the results in Kevin’s dplyr tutorial, they’re definitely not as succinct and readable as the dplyr versions. Grouping by multiple columns, summarizing with counts and distinct counts, and gracefully chaining these operations are areas where DataFrames and DataFramesMeta can improve.

Other useful convenience functions

Randomly sampling a fixed number or fraction of rows from a DataFrame can be a helpful operation. dplyr offers the sample_n and sample_frac functions to perform these operations. In Julia, StatsBase provides the sample function, which you can repurpose to achieve similar results.


using StatsBase
# randomly sample a fixed number of rows
hflights[sample(1:nrow(hflights), 5), :]
hflights[sample(1:size(hflights,1), 5), :]

# randomly sample a fraction of rows
hflights[sample(1:nrow(hflights), ceil(Int,0.0001*nrow(hflights))), :]
hflights[sample(1:size(hflights,1), ceil(Int,0.0001*size(hflights,1))), :]

hflights14

Randomly sampling a fixed number of rows is fairly straightforward. You use the sample function to randomly select a fixed number of rows, in this case five, from the DataFrame. Randomly sampling a fraction of rows is slightly more complicated because, since the sample function takes an integer for the number of rows to return, you need to use the ceil function to convert the fraction of rows, in this case 0.0001*nrow(hflights), into an integer.

Conclusion

In R, dplyr sets a high bar for wrangling data well with succinct, readable code. In Julia, DataFrames and DataFramesMeta provide many useful functions and macros that produce similar results; however, some of the syntax isn’t as concise and clear as it is with dplyr, e.g. selecting columns in different ways and chaining together grouping and summarizing operations. These are areas where Julia’s packages can improve.

I enjoyed becoming more familiar with Julia by reproducing much of Kevin’s dplyr tutorial. It was also informative to see differences in functionality and readability between dplyr and Julia’s packages. I hope you enjoyed this tutorial and find it to be a useful reference for wrangling data in Julia.

Parsing PDFs in Python with Tika

A few months ago, one of my friends asked me if I could help him extract some data from a collection of PDFs. The PDFs contained records of his financial transactions over a period of years and he wanted to analyze them. Unfortunately, Excel and plain text versions of the files were no longer available, so the PDFs were his only option.

I reviewed a few Python-based PDF parsers and decided to try Tika, which is a port of Apache Tika.  Tika parsed the PDFs quickly and accurately. I extracted the data my friend needed and sent it to him in CSV format so he could analyze it with the program of his choice. Tika was so fast and easy to use that I really enjoyed the experience. I enjoyed it so much I decided to write a blog post about parsing PDFs with Tika.

tika

California Budget PDFs

To demonstrate parsing PDFs with Tika, I knew I’d need some PDFs. I was thinking about which ones to use and remembered a blog post I’d read on scraping budget data from a government website. Governments also provide data in PDF format, so I decided it would be helpful to demonstrate how to parse data from PDFs available on a government website. This way, with these two blog posts, you have examples of acquiring government data, even if it’s embedded in HTML or PDFs. The three PDFs we’ll parse in this post are:

2015-16 State of California Enacted Budget Summary Charts
2014-15 State of California Enacted Budget Summary Charts
2013-14 State of California Enacted Budget Summary Charts

ca_budget

Each of these PDFs contains several tables that summarize total revenues and expenditures, general fund revenues and expenditures, expenditures by agency, and revenue sources. For this post, let’s extract the data on expenditures by agency and revenue sources. In the 2015-16 Budget PDF, the titles for these two tables are:

2015-16 Total State Expenditures by Agency

expenditures

2015-16 Revenue Sources

revenues

To follow along with the rest of this tutorial you’ll need to download the three PDFs and ensure you’ve installed Tika. You can download the three PDFs here:

http://www.ebudget.ca.gov/2015-16/pdf/Enacted/BudgetSummary/SummaryCharts.pdf
http://www.ebudget.ca.gov/2014-15/pdf/Enacted/BudgetSummary/SummaryCharts.pdf
http://www.ebudget.ca.gov/2013-14/pdf/Enacted/BudgetSummary/SummaryCharts.pdf

You can install Tika by running the following command in a Terminal window:

pip install --user tika

IPython

Before we dive into parsing all of the PDFs, let’s use one of the PDFs, 2015-16CABudgetSummaryCharts.pdf, to become familiar with Tika and its output. We can use IPython to explore Tika’s output interactively:

ipython

from tika import parser

parsedPDF = parser.from_file("2015-16CABudgetSummaryCharts.pdf")

You can type the name of the variable, a period, and then hit tab to view a list of all of the methods available to you:

parsedPDF.

ipython1

There are many options related to keys and values, so it appears the variable contains a dictionary. Let’s view the dictionary’s keys:

parsedPDF.viewkeys()

parsedPDF.keys()

The dictionary’s keys are metadata and content. Let’s take a look at the values associated with these keys:

parsedPDF["metadata"]

The value associated with the key “metadata” is another dictionary. As you’d expect based on the name of the key, its key-value pairs provide metadata about the parsed PDF.

ipython2

Now let’s take a look at the value associated with “content”.

parsedPDF["content"]

The value associated with the key “content” is a string. As you’d expect, the string contains the PDF’s text content.

ipython3

Now that we know the types of objects and values Tika provides to us, let’s write a Python script to parse all three of the PDFs. The script will iterate over the PDF files in a folder and, for each one, parse the text from the file, select the lines of text associated with the expenditures by agency and revenue sources tables, convert each of these selected lines of text into a Pandas DataFrame, display the DataFrame, and create and save a horizontal bar plot of the totals column for the expenditures and revenues. So, after you run this script, you’ll have six new plots, one for revenues and one for expenditures for each of the three PDF files, in the folder in which you ran the script.

Python Script

To parse the three PDFs, create a new Python script named parse_pdfs_with_tika.py and add the following lines of code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
import glob
import os
import re
import sys
import pandas as pd
import matplotlib
matplotlib.use('AGG')
import matplotlib.pyplot as plt
pd.options.display.mpl_style = 'default'

from tika import parser

input_path = sys.argv[1]

def create_df(pdf_content, content_pattern, line_pattern, column_headings):
    """Create a Pandas DataFrame from lines of text in a PDF.

    Arguments:
    pdf_content -- all of the text Tika parses from the PDF
    content_pattern -- a pattern that identifies the set of lines
    that will become rows in the DataFrame
    line_pattern -- a pattern that separates the agency name or revenue source
    from the dollar values in the line
    column_headings -- the list of column headings for the DataFrame
    """
    list_of_line_items = []
    # Grab all of the lines of text that match the pattern in content_pattern
    content_match = re.search(content_pattern, pdf_content, re.DOTALL)
    # group(1): only keep the lines between the parentheses in the pattern
    content_match = content_match.group(1)
    # Split on newlines to create a sequence of strings
    content_match = content_match.split('\n')
    # Iterate over each line
    for item in content_match:
        # Create a list to hold the values in the line we want to retain
        line_items = []
        # Use line_pattern to separate the agency name or revenue source
        # from the dollar values in the line
        line_match = re.search(line_pattern, item, re.I)
        # Grab the agency name or revenue source, strip whitespace, and remove commas
        # group(1): the value inside the first set of parentheses in line_pattern
        agency = line_match.group(1).strip().replace(',', '')
        # Grab the dollar values, strip whitespace, replace dashes with 0.0, and remove $s and commas
        # group(2): the value inside the second set of parentheses in line_pattern
        values_string = line_match.group(2).strip().\
        replace('- ', '0.0 ').replace('$', '').replace(',', '')
        # Split on whitespace and convert to float to create a sequence of floating-point numbers
        values = map(float, values_string.split())
        # Append the agency name or revenue source into line_items
        line_items.append(agency)
        # Extend the floating-point numbers into line_items so line_items remains one list
        line_items.extend(values)
        # Append line_item's values into list_of_line_items to generate a list of lists;
        # all of the lines that will become rows in the DataFrame
        list_of_line_items.append(line_items)
    # Convert the list of lists into a Pandas DataFrame and specify the column headings
    df = pd.DataFrame(list_of_line_items, columns=column_headings)
    return df

def create_plot(df, column_to_sort, x_val, y_val, type_of_plot, plot_size, the_title):
    """Create a plot from data in a Pandas DataFrame.

    Arguments:
    df -- A Pandas DataFrame
    column_to_sort -- The column of values to sort
    x_val -- The variable displayed on the x-axis
    y_val -- The variable displayed on the y-axis
    type_of_plot -- A string that specifies the type of plot to create
    plot_size -- A list of 2 numbers that specifies the plot's size
    the_title -- A string to serve as the plot's title
    """
    # Create a figure and an axis for the plot
    fig, ax = plt.subplots()
    # Sort the values in the column_to_sort column in the DataFrame
    df = df.sort_values(by=column_to_sort)
    # Create a plot with x_val on the x-axis and y_val on the y-axis
    # type_of_plot specifies the type of plot to create, plot_size
    # specifies the size of the plot, and the_title specifies the title
    df.plot(ax=ax, x=x_val, y=y_val, kind=type_of_plot, figsize=plot_size, title=the_title)
    # Adjust the plot's parameters so everything fits in the figure area
    plt.tight_layout()
    # Create a PNG filename based on the plot's title, replace spaces with underscores
    pngfile = the_title.replace(' ', '_') + '.png'
    # Save the plot in the current folder
    plt.savefig(pngfile)

# In the Expenditures table, grab all of the lines between Totals and General Government
expenditures_pattern = r'Totals\n+(Legislative, Judicial, Executive.*?)\nGeneral Government:'

# In the Revenues table, grab all of the lines between 2015-16 and either Subtotal or Total
revenues_pattern = r'\d{4}-\d{2}\n(Personal Income Tax.*?)\n +[Subtotal|Total]'

# For the expenditures, grab the agency name in the first set of parentheses
# and grab the dollar values in the second set of parentheses
expense_pattern = r'(K-12 Education|[a-z,& -]+)([$,0-9 -]+)'

# For the revenues, grab the revenue source in the first set of parentheses
# and grab the dollar values in the second set of parentheses
revenue_pattern = r'([a-z, ]+)([$,0-9 -]+)'

# Column headings for the Expenditures DataFrames
expense_columns = ['Agency', 'General', 'Special', 'Bond', 'Totals']

# Column headings for the Revenues DataFrames
revenue_columns = ['Source', 'General', 'Special', 'Total', 'Change']

# Iterate over all PDF files in the folder and process each one in turn
for input_file in glob.glob(os.path.join(input_path, '*.pdf')):
    # Grab the PDF's file name
    filename = os.path.basename(input_file)
    print filename
    # Remove .pdf from the filename so we can use it as the name of the plot and PNG
    plotname = filename.strip('.pdf')

    # Use Tika to parse the PDF
    parsedPDF = parser.from_file(input_file)
    # Extract the text content from the parsed PDF
    pdf = parsedPDF["content"]
    # Convert double newlines into single newlines
    pdf = pdf.replace('\n\n', '\n')

    # Create a Pandas DataFrame from the lines of text in the Expenditures table in the PDF
    expense_df = create_df(pdf, expenditures_pattern, expense_pattern, expense_columns)
    # Create a Pandas DataFrame from the lines of text in the Revenues table in the PDF
    revenue_df = create_df(pdf, revenues_pattern, revenue_pattern, revenue_columns)
    print expense_df
    print revenue_df

    # Print the total expenditures and total revenues in the budget to the screen
    print "Total Expenditures: {}".format(expense_df["Totals"].sum())
    print "Total Revenues: {}\n".format(revenue_df["Total"].sum())

    # Create and save a horizontal bar plot based on the data in the Expenditures table
    create_plot(expense_df, "Totals", ["Agency"], ["Totals"], 'barh', [20,10], \
    plotname+"Expenditures")
    # Create and save a horizontal bar plot based on the data in the Revenues table
    create_plot(revenue_df, "Total", ["Source"], ["Total"], 'barh', [20,10], \
    plotname+"Revenues")

Save this code in a file named parse_pdfs_with_tika.py in the same folder as the one containing the three CA Budget PDFs. Then you can run the script on the command line with the following command:

./parse_pdfs_with_tika.py .

I added docstrings to the two functions, create_df and create_plot, and comments above nearly every line of code in an effort to make the code as self-explanatory as possible. I created the two functions to avoid duplicating code because we perform these operations twice for each file, once for revenues and once for expenditures. We use a for loop to iterate over the PDFs and for each one we extract the lines of text we care about, convert the text into a Pandas DataFrame, display some of the DataFrame’s information, and save plots of the total values in the revenues and expenditures tables.

Results

Terminal Output
(1 of 3 pairs of DataFrames)

terminal_output

PNG File: Expenditures by Agency 2015-16
(1 of 6 PNG Files)

2015-16CABudgetSummaryChartsExpenditures

In this post I’ve tried to convey that Tika is a great resource for parsing PDFs by demonstrating how you can use it to parse budget data from PDF documents provided by a government agency. As my friend’s experience illustrates, there may be other situations in which you need to extract data from PDFs. With Tika, PDFs become another rich source of data for your analysis.

Pandashells: Data Science with Python on the Command Line

I often find myself using a variety of unix commands, perl / sed / awk one-liners, and snippets of Python code to combine, clean, analyze, and visualize data. Switching between the command line tools and Python breaks up my workflow because I have to step away from the command line to run the Python code in the interpreter or a script.

That’s why, when I learned about Pandashells last year I got excited because it’s a set of tools for using Python, Pandas, and other members of the Python data stack on the command line. Since Pandashells is a bash API to Pandas, Statsmodels, Seaborn, and other libraries, it’s easy to integrate the work you’d do with these Python packages into your command line workflow.

pandashells_overview

Pandashells has a range of tools that enable you to accomplish many common data processing, analysis, and visualization tasks. The main tool is p.df, which loads your tabular data into a Pandas dataframe so you can use your favorite Pandas commands right on the command line. In addition, p.merge enables you to merge files. p.linspace and p.rand enable you to create linearly-spaced and random numbers. p.regress and p.regplot, p.plot, p.hist, p.facet_grid, and p.cdf enable you to perform multivariate linear regression and create a collection of standard plots.

A nice feature of Pandashells is that it comes with several example datasets, including election, electoral_college, sealevel, and tips, and its GitHub page presents several well-commented examples that help you get familiar with the syntax. The examples show you how to chain multiple Pandashells commands and combine them with other command line tools like head.

pandashells_example_data

Ever since I started using Pandashells, I’ve enjoyed being able to integrate Python code into my command line workflow. I’ve used it to merge datasets, parse and re-format dates, filter for specific rows and columns, create new columns and dummy variables, explore and summarize the data, create graphs and plots, and prepare the data for predictive modeling.

In this post I’d like to demonstrate how to use Pandashells to accomplish a variety of common data processing, analysis, and visualization tasks on the command line. First, I’ll present the commands according to the tasks you’re going to accomplish so the commands are short and you can skip to the tasks you’re interested in. Then, at the end of this post, I’ll provide an example of chaining several commands to prepare a dataset for predictive modeling.

 

INSTALL PANDASHELLS

To follow along with this post, you will need to install Pandashells, which you can do with one of the following commands, documented on its GitHub page:

pip install pandashells

pip install pandashells[console]

pip install pandashells[full]

conda install -c https://conda.anaconda.org/robdmc pandashells

 

 

DATA

The dataset we’ll use in these examples is the familiar customer churn dataset. You can download a copy of the dataset here: churn.csv

pandashells_churn_dataset

 

TUTORIAL

DELETE AN APOSTROPHE

To begin, one of the column headings in the file, Int’l Plan, contains an apostrophe. Pandashells has trouble with apostrophes in column headings because you enclose Pandashells commands in apostrophes, so let’s delete it.

The first sed command prints the result to the screen so you can confirm that it’s going to delete the apostrophe in the column heading. The second sed command uses the -i flag to actually make the change to the file in-place.

sed -e "s/'//g" churn.csv | head

sed -i -e "s/'//g" churn.csv

 

VIEW FIRST FEW ROWS OF THE DATAFRAME

Now that the file is ready, let’s use Pandashells to read the data into a Pandas dataframe and take a look at the header row and the first five data rows.

cat churn.csv | p.df 'df.head()'

pandashells_head

 

VIEW NUMBER OF ROWS AND COLUMNS

How many rows and columns does the dataset have? The shape command returns the number of rows and columns as a tuple. However, the p.df tool wants to return output as a dataframe and it has trouble converting the tuple into a dataframe, so assign the row value into a new column called rows, assign the column value into a new column called columns, and view the numbers in the first row using head(1). The “-o table” means you want to output the results in table format.

cat churn.csv | p.df 'df["rows"], df["columns"] = df.shape' 'df[["rows", "columns"]]' 'df.head(1)' -o table

pandashells_rows_columns

 

CHANGE COLUMN HEADINGS

The column headings contain a mix of uppercase and lowercase letters and the headings that are two words contain a space between the words. Let’s standardize the column headings by changing all of them to uppercase and converting any spaces to underscores.

cat churn.csv | p.df 'df.rename(columns={"State":"STATE", "Account Length":"ACCOUNT_LENGTH", "Area Code":"AREA_CODE", "Phone":"PHONE", "Intl Plan":"INTL_PLAN", "VMail Plan":"VMAIL_PLAN", "VMail Message":"VMAIL_MESSAGE", "Day Mins":"DAY_MINS", "Day Calls":"DAY_CALLS", "Day Charge":"DAY_CHARGE", "Eve Mins":"EVE_MINS", "Eve Calls":"EVE_CALLS", "Eve Charge":"EVE_CHARGE", "Night Mins":"NIGHT_MINS", "Night Calls":"NIGHT_CALLS", "Night Charge":"NIGHT_CHARGE", "Intl Mins":"INTL_MINS", "Intl Calls":"INTL_CALLS", "Intl Charge":"INTL_CHARGE", "CustServ Calls":"CUSTSERV_CALLS", "Churn?":"CHURN?"})' 'df.head()'

 

REMOVE ROWS THAT CONTAIN NaNs

This dataset doesn’t contain NaNs, but here are examples of how to eliminate rows that contain NaNs. The first command eliminates rows that have NaNs in any columns. The second command ensures there aren’t NaNs in specific columns.

cat churn.csv | p.df 'df[df.notnull()]' 'df.head()'

cat churn.csv | p.df 'df[df["Churn?"].notnull()]' 'df[df["Account Length"].notnull()]' 'df.head()'

 

KEEP ROWS WHERE VALUES MEET CONDITIONS

You often want to filter a dataset for rows with values that meet specific conditions. The first command filters for rows where the Account Length is greater than 145. The second command filters for rows where the International Charge is less than 2 and the Day Charge is greater than 45.

cat churn.csv | p.df 'df[df["Account Length"] > 145]' 'df.head()'

cat churn.csv | p.df 'df[(df["Intl Charge"] < 2.0) & (df["Day Charge"] > 45.0)]' 'df.head()'

 

KEEP ROWS WHERE VALUES ARE / ARE NOT IN A SET

In some cases you want to filter for rows where the values in a column are or are not in a specific set. The first command filters for rows where the value in the International Plan column is “yes”. The second command uses a tilde ‘~’ to negate the expression and filter for rows where the value in the column is NOT “yes”. I’ve found this second syntax useful in situations where a column contains a small set of invalid values. Then you can use the second command to eliminate rows that have these invalid values in the column.

cat churn.csv | p.df 'df[df["Intl Plan"].isin(["yes"])]' 'df.head()'

cat churn.csv | p.df 'df[~df["Intl Plan"].isin(["yes"])]' 'df.head()'

 

KEEP ROWS WHERE VALUES MATCH A PATTERN

In some cases you want to filter for rows where the values in a column match a specific pattern. You can filter for rows matching a pattern using startswith, endswith, and contains to specify where to look for the pattern. The first command filters for rows where the first letter in the State column is a capital K. The second command finds rows where the text in the State column contains a capital K.

cat churn.csv | p.df 'df[df["State"].str.startswith("K")]' 'df.head()'

cat churn.csv | p.df 'df[df["State"].str.contains("K")]' 'df.head()'

 

KEEP SPECIFIC COLUMNS

Sometimes a dataset contains more columns than you need. You can specify which columns to retain by specifying them as a list. The following command restricts the output to nine specific columns.

cat churn.csv | p.df 'df[["Account Length", "Intl Plan", "VMail Plan", "Day Charge", "Eve Charge", "Night Charge", "Intl Charge", "CustServ Calls", "Churn?"]]' 'df.head()'

 

CREATE NEW VARIABLES / COLUMNS

One common operation is creating new columns. You can create a new column by writing an expression on the right hand side of the equals sign that generates the values for the column and then assigning the values to the new column you specify on the left hand side of the equals sign.

The first command uses the existing “Churn?” column to create a new column called “churn”. The values in the “Churn?” column are True. and False., so the expression uses NumPy’s “where” function to convert the True.s and False.s into 1s and 0s, respectively. The second command creates a new column called “total_calls” that is the sum of the values in the day, evening, night, and international calls columns. Similarly, the third command creates a new column called “total_charges” that is the sum of the values in the day, evening, night, and international charges columns.

cat churn.csv | p.df 'df["churn"] = np.where(df["Churn?"] == "True.", 1, 0)' 'df.head()'

cat churn.csv | p.df 'df["total_calls"] = df["Day Calls"] + df["Eve Calls"] + df["Night Calls"] + df["Intl Calls"]' 'df.head()'

cat churn.csv | p.df 'df["total_charges"] = df["Day Charge"] + df["Eve Charge"] + df["Night Charge"] + df["Intl Charge"]' 'df.head()'

 

CREATE CATEGORICAL VARIABLE FROM VALUES IN ANOTHER COLUMN

One fun operation is creating a column for a new categorical variable that’s based on values in another column. You can do so using a list comprehension and if-else logic. For example, the following command uses the values in the “State” column to create a new categorical variable called “us_regions” that categorizes the states into the Census Bureau’s four designated regions: Northeast, Midwest, South, and West. As another example, I’ve used this type of command create a categorical variable of failure types based on keywords / substrings in another column containing verbose failure descriptions.

cat churn.csv | p.df 'df["us_regions"] = ["Northeast" if ("CT" in str(state).upper() or "ME" in state or "MA" in state or "NH" in state or "RI" in state or "VT" in state or "NJ" in state or "NY" in state or "PA" in state) else "Midwest" if ("IL" in state or "IN" in state or "MI" in state or "OH" in state or "WI" in state or "IA" in state or "KS" in state or "MN" in state or "MO" in state or "NE" in state or "ND" in state or "SD" in state) else "South" if ("DE" in state or "FL" in state or "GA" in state or "MD" in state or "NC" in state or "SC" in state or "VA" in state or "DC" in state or "WV" in state or "AL" in state or "KY" in state or "MS" in state or "TN" in state or "AR" in state or "LA" in state or "OK" in state or "TX" in state) else "West" for state in df["State"]]' 'df.head()'

 

CREATE INDICATOR / DUMMY VARIABLES

Sometimes you want to convert a categorical variable into a set of indicator / dummy variables and add them to the existing dataframe. You can use Pandas’ get_dummies() function to create dummy variables and the concat() function to add them as new columns to the existing dataframe. For example, the following command uses our previous code to create a categorical variable called “us_regions” and then uses the get_dummies() and concat() functions to create four new indicator variables based on the values in “us_regions” and add them to the existing dataframe.

cat churn.csv | p.df 'df["us_regions"] = ["Northeast" if ("CT" in str(state).upper() or "ME" in state or "MA" in state or "NH" in state or "RI" in state or "VT" in state or "NJ" in state or "NY" in state or "PA" in state) else "Midwest" if ("IL" in state or "IN" in state or "MI" in state or "OH" in state or "WI" in state or "IA" in state or "KS" in state or "MN" in state or "MO" in state or "NE" in state or "ND" in state or "SD" in state) else "South" if ("DE" in state or "FL" in state or "GA" in state or "MD" in state or "NC" in state or "SC" in state or "VA" in state or "DC" in state or "WV" in state or "AL" in state or "KY" in state or "MS" in state or "TN" in state or "AR" in state or "LA" in state or "OK" in state or "TX" in state) else "West" for state in df["State"]]' 'pd.concat([df, pd.get_dummies(df.us_regions)], axis=1)' 'df.head()'

 

ENSURE SPECIFIC DATE FORMAT (YYYY-MM-DD HH:MM:SS)

This dataset doesn’t contain dates, but here’s an example of ensuring that the dates in a column have a specific format. I added a newline after the word “contains” to make the command easier to read, but you wouldn’t include the newline when you use the command.

cat my_data.csv | p.df 'df[df["date_column"].str.contains
("[0-9]{1,4}-[0-9]{1,2}-[0-9]{1,2} [0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2}")]'

 

RESTRICT TO SPECIFIC DATETIME RANGE

Here is an example of restricting your dataset to a specific date range. The following command ensures the values in the “date_column” column are more recent than “2009-12-31 23:59:59”. That is, it eliminates data from before 2010.

cat my_data.csv | p.df 'df[df["date_column"] > "2009-12-31 23:59:59"]'

 

CALCULATE NUMBER OF DAYS BETWEEN TWO DATES

Here is an example of calculating the number of days between two dates. The new column is called “diff_in_days”. The values in the column are the number of days between the values in two columns called “recent_date_column” and “older_date_column”. To calculate the difference, I use the strptime() function inside two list comprehensions to convert the text values in the two date columns into datetime objects. Next, I use the zip function to pair up the two datetime objects in each row. Finally, I use the expression “str(i-j).split()[0] if “days” in str(i-j) else 1″ to subtract one date, i, from the other, j, convert the result of the subtraction into a string, and split the string on whitespace and extract the number portion if it contains the word “days” otherwise assign the value 1. For example, if the result of the subtraction is “10 days” I want the new column “diff_in_days” to contain the number 10. I added several newlines to make the command easier to read, but you wouldn’t include the newlines when you use the command.

cat my_data.csv | p.df 'df["diff_in_days"] =
[str(i-j).split()[0] if "days" in str(i-j) else 1
for i, j in zip(
[datetime.datetime.strptime(recent_date, "%Y-%m-%d %H:%M:%S")
for recent_date in df.recent_date_column],
[datetime.datetime.strptime(older_date, "%Y-%m-%d %H:%M:%S")
for older_date in df.older_date_column])]'

pandashells_dates_csv

pandashells_dates_output

 

UNIQUE VALUES IN A COLUMN

Now we can return to analyzing our churn dataframe. The following command enables you to view the unique values in the “Churn?” column. The “-o table” option means you want to display the output in table format, as opposed to csv or another format, and the “index” option means you want to display the titles for the rows in the output.

cat churn.csv | p.df 'sorted(df["Churn?"].unique())' -o table index

 

VALUE COUNTS FOR UNIQUE VALUES IN A COLUMN

The following command enables you to view the unique values in the “VMail Plan” column, as well as the number of times each of the values appears in the dataset. As with the previous command, it’s helpful to display the output in table format and to display the titles for the rows in the output.

cat churn.csv | p.df 'df["VMail Plan"].value_counts()' -o table index

 

DESCRIPTIVE STATISTICS FOR A COLUMN

The following commands demonstrate how to compute descriptive statistics for categorical and numeric columns. The statistics for a categorical variable are the count of observations, the number of unique values in the column, the top / most frequently occurring value, and the frequency of the top occurring value. The statistics for a numeric variable are count of observations, mean, standard deviation, minimum, 25th percentile, 50th percentile / median, 75th percentile, and maximum. You can use “.T” to transpose the output, as shown in the second command.

cat churn.csv | p.df 'df[["Churn?"]].describe()' -o table index

cat churn.csv | p.df 'df[["Intl Charge"]].describe().T' -o table index

 

CROSSTABS

The following command shows how to create a crosstab table. The values in the “Churn?” column are the rows in the table, and the values in the “Intl Plan” column are the columns in the table. By default, the values in the table are counts of observations in each category, but you can specify additional data and an aggregation function to calculate different statistics for each category.

cat churn.csv | p.df 'pd.crosstab(df["Churn?"], df["Intl Plan"])' -o table index

 

GROUP BY

Sometimes you want to view separate statistics for different categories in a categorical variable, e.g. separate descriptive statistics for men and women. The following two commands show you how to use the groupby() function to group the data by the values in the “Churn?” column and calculate descriptive statistics for the two groups. The first command calculates descriptive statistics for a categorical variable, “Intl Plan”, separately for those who churned, “True.”, and those who did not churn, “False.”. Similarly, the second command calculates descriptive statistics for a numeric variable, “Intl Charge”.

cat churn.csv | p.df 'df.groupby("Churn?")[["Intl Plan"]].describe().unstack("Churn?")' -o table index

cat churn.csv | p.df 'df.groupby("Churn?")[["Intl Charge"]].describe().unstack("Churn?")' -o table index

 

PIVOT TABLES

The following two commands illustrate how to create pivot tables. Both commands display statistics about the values in the “Intl Charge” column, grouped by two categorical variables, “Churn?” and “Intl Plan”. The “Churn?” values are the rows in the output table and the “Intl Plan” values are the columns in the output table. The first command displays the “count” of the “Intl Charge” values in each of the categories, and the second command displays the “mean” of the “Intl Charge” values in each of the categories.

cat churn.csv | p.df 'df.pivot_table(values=["Intl Charge"], index=["Churn?"], columns=["Intl Plan"], aggfunc="count")' -o table index

cat churn.csv | p.df 'df.pivot_table(values=["Intl Charge"], index=["Churn?"], columns=["Intl Plan"], aggfunc="mean")' -o table index

 

BAR CHART

The following command combines the p.df tool with the p.hist tool to display a histogram of the values in the “churn” column. The command uses NumPy’s where() function to ensure the values in the new column are numeric. The “-o csv” option means the column should be outputted in csv format. I added a newline before the “–savefig” argument so it’s separate from the rest of the command, but you can remove the newline and include it in the command if you want to save the figure to a file called bar_chart.png in a folder called plots.

cat churn.csv | p.df 'df["churn"] = np.where(df["Churn?"] == "True.", 1, 0)' 'df["churn"]' -o csv | p.hist --ylabel 'Count' --xlabel 'Churn? (0: No; 1: Yes)' --title 'Bar Chart of Dependent Variable: Churn?' --theme 'darkgrid' --palette 'muted'
--savefig 'plots/bar_chart.png'

pandashells_bar_chart

 

FACET GRID

The following command shows how you can use facet grid to create separate plots based on a categorical variable. The –col “Intl Plan” argument indicates you want to create separate plots for the categories in the “Intl Plan” column. The –args “churn” argument indicates you want to display the “churn” data in the plots. The –map pl.hist argument indicates you want to display histograms of the “churn” data. Again, I added a newline before the “–savefig” argument.

cat churn.csv | p.df 'df["churn"] = np.where(df["Churn?"] == "True.", 1, 0)' 'df[["churn", "Intl Plan"]]' -o csv | p.facet_grid --col "Intl Plan" --args "churn" --map pl.hist
--savefig 'plots/bar_chart_facet.png'

pandashells_bar_chart_facet

 

WRITE DATA TO A FILE

A very common operation is to write your cleaned data to a new output file once you’re finished processing it. The following two commands show different ways to write to an output file. The first command uses Pandas’ to_csv() function to write the data to a file called “dataset_cleaned.csv”. I include the index=False argument so it doesn’t write an additional column of row index values to the output file. The second command uses the “-o csv” option to output the data in CSV format and the greater than sign to redirect the output into the output file.

cat churn.csv | p.df 'df[["Account Length", "Intl Plan", "VMail Plan", "Day Charge", "Eve Charge", "Night Charge", "Intl Charge", "CustServ Calls", "Churn?"]]' 'df.to_csv("dataset_cleaned.csv", index=False)'

cat churn.csv | p.df 'df[["Account Length", "Intl Plan", "VMail Plan", "Day Charge", "Eve Charge", "Night Charge", "Intl Charge", "CustServ Calls", "Churn?"]]' -o csv > dataset_clean.csv

 

PUTTING IT ALL TOGETHER

I’ve demonstrated a variety of ways to clean, analyze, and visualize your data on the command line with Pandashells. In the preceding examples, I refrained from chaining several commands together into a long workflow so we could focus on short snippets of code that accomplish individual tasks.

In this last example, I present an extended workflow that prepares our churn dataset for predictive modeling. I present the workflow twice. The first time I add newlines after each of the commands so each one and the entire workflow is easier to read. The second time I present the commands as you’d actually write them so it’s easy to copy and paste them into a Terminal window.

The first command renames the column headings to uppercase and replaces spaces with underscores. The second command deletes any rows containing NaN values.

The next three commands use NumPy’s where() function to create three new numeric variables based on columns that contain text values. The next two commands create two new variables that are the sum of the day, evening, night, and international calls and charges columns, respectively. The next command creates a categorical variable that categorizes the values in the States column into the Census Bureau’s four designated regions. The next command uses Pandas’ get_dummies() and concat() functions to create indicator variables for the four regions and add them to the existing dataframe.

The penultimate command selects a subset of the columns in the dataframe. The final command writes the eleven selected columns to an output file called “churn_cleaned.csv”. Since I didn’t add the argument index=False, the file also contains an additional column of row index values. The argument index_label=”ID” gives that column a heading.

NEWLINES AFTER EACH COMMAND

cat churn.csv | p.df
'df.rename(columns={"State":"STATE", "Account Length":"ACCOUNT_LENGTH", "Area Code":"AREA_CODE", "Phone":"PHONE", "Intl Plan":"INTL_PLAN", "VMail Plan":"VMAIL_PLAN", "VMail Message":"VMAIL_MESSAGE", "Day Mins":"DAY_MINS", "Day Calls":"DAY_CALLS", "Day Charge":"DAY_CHARGE", "Eve Mins":"EVE_MINS", "Eve Calls":"EVE_CALLS", "Eve Charge":"EVE_CHARGE", "Night Mins":"NIGHT_MINS", "Night Calls":"NIGHT_CALLS", "Night Charge":"NIGHT_CHARGE", "Intl Mins":"INTL_MINS", "Intl Calls":"INTL_CALLS", "Intl Charge":"INTL_CHARGE", "CustServ Calls":"CUSTSERV_CALLS", "Churn?":"CHURN?"})'
'df[df.notnull()]'
'df["CHURN"] = np.where(df["CHURN?"] == "True.", 1, 0)'
'df["INT_PLAN"] = np.where(df["INTL_PLAN"] == "yes", 1, 0)'
'df["VM_PLAN"] = np.where(df["VMAIL_PLAN"] == "yes", 1, 0)'
'df["TOTAL_CALLS"] = df["DAY_CALLS"] + df["EVE_CALLS"] + df["NIGHT_CALLS"] + df["INTL_CALLS"]'
'df["TOTAL_CHARGES"] = df["DAY_CHARGE"] + df["EVE_CHARGE"] + df["NIGHT_CHARGE"] + df["INTL_CHARGE"]'
'df["USA_REGIONS"] = ["NORTHEAST" if ("CT" in str(state).upper() or "ME" in state or "MA" in state or "NH" in state or "RI" in state or "VT" in state or "NJ" in state or "NY" in state or "PA" in state) else "MIDWEST" if ("IL" in state or "IN" in state or "MI" in state or "OH" in state or "WI" in state or "IA" in state or "KS" in state or "MN" in state or "MO" in state or "NE" in state or "ND" in state or "SD" in state) else "SOUTH" if ("DE" in state or "FL" in state or "GA" in state or "MD" in state or "NC" in state or "SC" in state or "VA" in state or "DC" in state or "WV" in state or "AL" in state or "KY" in state or "MS" in state or "TN" in state or "AR" in state or "LA" in state or "OK" in state or "TX" in state) else "WEST" for state in df["STATE"]]'
'pd.concat([df, pd.get_dummies(df.USA_REGIONS)], axis=1)'
'df[["CHURN", "ACCOUNT_LENGTH", "INT_PLAN", "VM_PLAN", "TOTAL_CALLS", "TOTAL_CHARGES", "CUSTSERV_CALLS", "NORTHEAST", "MIDWEST", "SOUTH", "WEST"]]'
'df.to_csv("churn_cleaned.csv", index_label="ID")'

 

ACTUAL COMMANDS (READY FOR COPY AND PASTE)

cat churn.csv | p.df 'df.rename(columns={"State":"STATE", "Account Length":"ACCOUNT_LENGTH", "Area Code":"AREA_CODE", "Phone":"PHONE", "Intl Plan":"INTL_PLAN", "VMail Plan":"VMAIL_PLAN", "VMail Message":"VMAIL_MESSAGE", "Day Mins":"DAY_MINS", "Day Calls":"DAY_CALLS", "Day Charge":"DAY_CHARGE", "Eve Mins":"EVE_MINS", "Eve Calls":"EVE_CALLS", "Eve Charge":"EVE_CHARGE", "Night Mins":"NIGHT_MINS", "Night Calls":"NIGHT_CALLS", "Night Charge":"NIGHT_CHARGE", "Intl Mins":"INTL_MINS", "Intl Calls":"INTL_CALLS", "Intl Charge":"INTL_CHARGE", "CustServ Calls":"CUSTSERV_CALLS", "Churn?":"CHURN?"})' 'df[df.notnull()]' 'df["CHURN"] = np.where(df["CHURN?"] == "True.", 1, 0)' 'df["INT_PLAN"] = np.where(df["INTL_PLAN"] == "yes", 1, 0)' 'df["VM_PLAN"] = np.where(df["VMAIL_PLAN"] == "yes", 1, 0)' 'df["TOTAL_CALLS"] = df["DAY_CALLS"] + df["EVE_CALLS"] + df["NIGHT_CALLS"] + df["INTL_CALLS"]' 'df["TOTAL_CHARGES"] = df["DAY_CHARGE"] + df["EVE_CHARGE"] + df["NIGHT_CHARGE"] + df["INTL_CHARGE"]' 'df["USA_REGIONS"] = ["NORTHEAST" if ("CT" in str(state).upper() or "ME" in state or "MA" in state or "NH" in state or "RI" in state or "VT" in state or "NJ" in state or "NY" in state or "PA" in state) else "MIDWEST" if ("IL" in state or "IN" in state or "MI" in state or "OH" in state or "WI" in state or "IA" in state or "KS" in state or "MN" in state or "MO" in state or "NE" in state or "ND" in state or "SD" in state) else "SOUTH" if ("DE" in state or "FL" in state or "GA" in state or "MD" in state or "NC" in state or "SC" in state or "VA" in state or "DC" in state or "WV" in state or "AL" in state or "KY" in state or "MS" in state or "TN" in state or "AR" in state or "LA" in state or "OK" in state or "TX" in state) else "WEST" for state in df["STATE"]]' 'pd.concat([df, pd.get_dummies(df.USA_REGIONS)], axis=1)' 'df[["CHURN", "ACCOUNT_LENGTH", "INT_PLAN", "VM_PLAN", "TOTAL_CALLS", "TOTAL_CHARGES", "CUSTSERV_CALLS", "NORTHEAST", "MIDWEST", "SOUTH", "WEST"]]' 'df.to_csv("churn_cleaned.csv", index_label="ID")'

pandashells_workflow

With Pandashells, we were able to quickly read the raw data into a Pandas dataframe, clean the data, create new variables, filter for specific rows and columns, and write the cleaned data to a new output file without leaving the command line. Now, if we were so inclined, we could write a skll configuration file and run a collection of predictive models on the data from the command line. Conveniently, if our workflow involves additional command line operations or tools it’s easy to combine them with the code we’ve presented because Pandashells was designed to integrate well with existing command line tools.

I hope this post has given you ideas on how to use Pandashells to integrate Python, Pandas, and other data stack commands into your existing command line workflows. I’ve enjoyed using Pandashells for data merging and cleaning, quick ad-hoc analysis, and analysis and workflow proto-typing. I still use Python scripts when they’re more appropriate for the project, but it’s been a lot of fun performing common analysis tasks on the command line with Pandashells.

Simplifying scikit-learn Predictive Modeling with skll APIs and Configuration Files

One blog post I really enjoyed reading last year is yhat’s post, Predicting customer churn with scikit-learn. In the post, the author demonstrates how to estimate, cross-validate, and measure the performance of three predictive models for classification in Python with the scikit-learn package. At the time, I was more familiar with how to do predictive modeling in R with the caret package, as described in the well-written book Applied Predictive Modeling. However, since I also knew some Python and was interested in learning how to use it for predictive modeling, yhat’s post was exactly what I needed to get started.

More recently, I read and followed along with the examples in Jereon Janssens’s excellent book, Data Science at the Command Line. In Chapter 9, the author demonstrates how to use another Python package, the SciKit-Learn Laboratory (a.k.a. skll) package, to implement predictive modeling with scikit-learn from the command line. The skll package is very helpful because it provides an API interface to scikit-learn that simplifies the code you need to implement your predictive models. Additionally, it also enables you to specify your modeling parameters in a configuration file so you can run your models from the command line.

Since I’d become familiar with how to implement predictive modeling with scikit-learn, I really enjoyed learning to use skll’s API and configuration file interfaces. To see how skll’s interfaces simplify the process of using the scikit-learn package to implement predictive modeling, let’s estimate, cross-validate, and measure the performance of four predictive models with scikit-learn and then perform similar operations with skll’s interfaces.

DATA SET

The data set we’ll use in this post comes from the publicly available wine quality data sets, which are available here. There is a file for red wines and a file for white wines. The data set used in this post is these two files concatenated together (with only one header row). The dependent variable in the original research took on integer values representing wine quality, so the data set lends itself to regression tasks. However, in this post we’ll demonstrate how to perform classification tasks by classifying wines as either red or white. You can download the data and prepare the final data set by entering the following commands in a Terminal window:

# Download two CSV files, winequality-red.csv and winequality-white.csv
# from the UCI Machine Learning Repository
# and save the files as wine-red.csv and wine-white.csv
parallel "curl -sL http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-{}.csv > wine-{}.csv" ::: red white

# Check that you now have wine-red.csv and wine-white.csv
# in your current working directory
find . -maxdepth 1 -name "wine*"

# For each file, convert uppercase to lowercase,
# convert semicolons to commas,
# convert spaces to underscores,
# remove double quotes, and
# add a column that says 'red' or 'white' depending on the file
for T in red white; do < wine-$T.csv tr '[A-Z]; ' '[a-z],_' | tr -d \" | sed "s/$/,${T}/" > wine-${T}-clean.csv; done

# Review the first five lines in each file to ensure the changes are correct
head -n 5 wine-{red,white}-clean.csv | fold

# Concatenate the red and white wine files into one file, wine-both-clean.csv
# Retain only one column heading
# Name the last column 'type' since it contains the words 'red' and 'white'
head -n 1 wine-red-clean.csv > wine-both-clean.csv; grep -v quality wine-red-clean.csv >> wine-both-clean.csv; grep -v quality wine-white-clean.csv >> wine-both-clean.csv; sed -i -e "1s/,red/,type/" wine-both-clean.csv

# Review the row counts for red (1,599) and white (4,898) wines
parallel --tag "grep -c {} wine-both-clean.csv" ::: red white

Now that we have our data set, wine-both-clean.csv, let’s move on to estimating our predictive models with scikit-learn.

SCIKIT-LEARN

In order to understand the ways in which skll simplifies the syntax needed to perform predictive modeling with the scikit-learn package, let’s first estimate, cross-validate, and measure the performance of four predictive models with scikit-learn. To do so, copy and paste the following code into a text editor and save the file as classify_wine_scikit_learn.py:

#!/usr/bin/env python
import sys
import time
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import KFold
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression as LogR
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

input_file = sys.argv[1]

# Read the wine quality data into a Pandas data frame
wine_data_frame = pd.read_csv(input_file)

# Create a red wine binary variable named 'y' for classification
wine_type = wine_data_frame['type']
y = np.where(wine_type == 'red', 1., 0.)

# Specify X, the matrix of predictor variables
features = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
        'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
        'ph', 'sulphates', 'alcohol']
wine_features = wine_data_frame[features]
X = wine_features.as_matrix().astype(np.float)
print "\nFeature space holds %d observations and %d features" % (X.shape)
print "\nAccuracy if you predict all 0s (baseline or benchmark): %0.3f" % (accuracy_score(y, [0 for value in y.tolist()]))
print "\n*****************************************\n"

# Center and scale the predictor variables
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Specify a function for k-fold cross-validation
def run_cv(X, y, clf, **kwargs):
    kfold = KFold(len(y), n_folds=10, shuffle=True, random_state=123456789)
    print str(clf)
    y_pred = y.copy()
    from time import time
    t0 = time()
    fold = 0
    for train_index, test_index in kfold:
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        clf.fit(X_train, y_train)
        y_pred[test_index] = clf.predict(X_test)
        fold += 1
        print "Finished fold:", str(fold)
    print "Cross-validation took %0.2f seconds." % (time() - t0)
    return y_pred

# K NEAREST NEIGHBORS
# Estimate predicted values for k-nearest-neighbors classification model with cross-validation
y_pred_knn = run_cv(X, y, KNN(n_neighbors=6))

# Calculate performance metrics for k-nearest-neighbors classification model
print "\nK-nearest-neighbors (accuracy): " + "%.3f" % accuracy_score(y, y_pred_knn)
print "\nKNN confusion matrix: "
print(confusion_matrix(y_pred_knn, y))
target_names = ['White (0)', 'Red (1)']
print "\nKNN classification report: "
print(classification_report(y, y_pred_knn, target_names=target_names))
print "\n*****************************************\n"

# LOGISTIC REGRESSION
# Estimate predicted values for logistic/logit classification model with cross-validation
y_pred_logr = run_cv(X, y, LogR(random_state=123456789))

# Calculate performance metrics for logistic/logit classification model
print "\nLogistic (accuracy): " + "%.3f" % accuracy_score(y, y_pred_logr)
print "\nLogistic confusion matrix: "
print(confusion_matrix(y_pred_logr, y))
target_names = ['White (0)', 'Red (1)']
print "\nLogistic classification report: "
print(classification_report(y, y_pred_logr, target_names=target_names))
print "\n*****************************************\n"

# RANDOM FOREST
# Estimate predicted values for random forest classification model with cross-validation
y_pred_rf = run_cv(X, y, RFC(n_estimators=500, random_state=123456789))

# Calculate performance metrics for random forest classification model
print "\nRandom forest (accuracy): " + "%.3f" % accuracy_score(y, y_pred_rf)
print "\nRF confusion matrix: "
print(confusion_matrix(y_pred_rf, y))
target_names = ['White (0)', 'Red (1)']
print "\nRF classification report: "
print(classification_report(y, y_pred_rf, target_names=target_names))
print "\n*****************************************\n"

# SUPPORT VECTOR MACHINES
# Specify grid parameters for the support vector machines classifier (SVC)
# Reference: http://scikit-learn.org/0.11/tutorial/statistical_inference/model_selection.html#grid-search
C_values = [0.01, 0.1, 1.0, 10.0, 100.0]
gamma_values = [0.01, 0.1, 1.0, 10.0, 100.0]
param_grid = dict(kernel=['rbf'], gamma=gamma_values, C=C_values)
svc_gscv = GridSearchCV(SVC(cache_size=1000, class_weight='auto', random_state=123456789), param_grid=param_grid, cv=3, scoring='accuracy', n_jobs=-1)

# Estimate predicted values for support vector machines classification model with cross-validation
y_pred_svc = run_cv(X, y, svc_gscv)

# Calculate performance metrics for support vector machines classification model
print "\nSupport vector machines (accuracy): " + "%.3f" % accuracy_score(y, y_pred_svc)
print "\nSVM confusion matrix: "
print(confusion_matrix(y_pred_svc, y))
target_names = ['White (0)', 'Red (1)']
print "\nSVM classification report: "
print(classification_report(y, y_pred_svc, target_names=target_names))

This code is similar to the code discussed in yhat’s post, Predicting customer churn with scikit-learn, so instead of describing the lines of code in detail I’ll summarize the sections of code. At the top, we import all of the functions we’re going to use to manage and transform the data, estimate the models, perform k-fold cross-validation, and evaluate the performance of the models.

Next, we read the data into a Pandas data frame and create a new binary variable named ‘y’ that equals 1.0 for red wines and 0.0 for white wines. Next, we specify which variables are going to serve as the independent, explanatory, predictor variables and transform them into a matrix. At this point, we print the number of rows and columns in the analysis data set and print the baseline, benchmark accuracy we achieve by naïvely predicting all of the observations to be zeros (i.e. the majority of the observations are zeros). Hopefully, the performance of our predictive models will surpass this benchmark performance!

Next, we center and scale the predictor variables so they all have a mean of zero and a standard deviation of one. These transformations help ensure that the learning algorithms and predictive performance of the statistical models aren’t influenced by the predictor variables’ units of measurement.

Next, we define a function for performing k-fold cross-validation. The function specifies a particular random state to be consistent with skll’s default random state. It also includes a few print statements so you can see the folds being processed and see how long the cross-validation takes for different models. You can find an example of stratified k-fold cross-validation, a related procedure that attempts to balance the percentage of observations from each class in each fold, in Bugra’s blog post, An Introduction to Supervised Learning via Scikit Learn. This post also demonstrates how to produce plots of training and test errors, confusion matrices, and variable importance scores.

Finally, we run cross-validation with four predictive models, (1) k-nearest neighbors, (2) logistic regression, (3) random forest, and (4) support vector machines and report on their performance with an accuracy score, a confusion matrix, and a classification report. The confusion matrix shows the number of correctly classified observations along the main diagonal and the classification report includes information on precision, recall, f1-score, and support for each category.

One final point is that the support vector machine section demonstrates how to use grid search to select optimal parameters for the model. To be consistent with skll’s grid search defaults, I instructed grid search to use 3-fold cross-validation, a particular random state, and accuracy to determine the optimal values for C and gamma. This 3-fold cross-validation takes place within the 10-fold cross-validation taking place to measure the model’s predictive performance.

Make the script executable by typing the following on the command line and then hitting Enter:
chmod +x classify_wine_scikit_learn.py

Run the script by typing the following on the command line and then hitting Enter:
./classify_wine_scikit_learn.py wine-both-clean.csv

When you run the script you should see the following output printed to your Terminal window:

scikit-learn script output

As you can see, you can achieve an accuracy score of 0.75 by naïvely predicting all of the observations to be zero. Since the rest of the output runs off of the screen, here are the accuracy scores and processing times for the four predictive models:

Logistic Regression
Accuracy: 0.994
Processing time: 0.18 seconds

K-Nearest Neighbors
Accuracy: 0.994
Processing time: 0.50 seconds

Random Forest
Accuracy: 0.995
Processing time: 36.44 seconds

Support Vector Machines
Accuracy: 0.996
Processing time: 225.42 seconds

The output shows that, in this example, all of the models achieve similar accuracy scores. It also shows that logistic regression completes 3 times faster than k-nearest-neighbors and 1,252 times faster than support vector machines.

Now that we’ve seen how to estimate, cross-validate, and measure the performance of four predictive models in a Python script with scikit-learn, let’s take a look at how to do so with skll’s APIs to scikit-learn.

SKLL API

If you want to follow along with the next two sections you’ll need to install skll. Here are instructions for doing so: skll There can be compatibility issues between scikit-learn and skll, depending on the versions you have installed. To follow along with this post, make sure you have scikit-learn==0.15.2 and skll==1.0.1. You can check which versions you have by typing the following on the command line and then hitting Enter: pip freeze. Once you’ve installed skll, copy and paste the following code into a text editor and save the file as classify_wine_skll.py:

#!/usr/bin/env python
import sys
import time
from skll.data.readers import Reader
from skll.learner import Learner

input_file = sys.argv[1]

file_reader = Reader.for_path(input_file, label_col='type')

training_data = file_reader.read()

number_of_folds = 10

def average_accuracy(fold_results):
    import numpy as np
    accuracy = []
    for fold_index in range(number_of_folds):
        accuracy.append(fold_result_list[fold_index][1])
    return np.mean(accuracy)

print "\nLogistic:"
logistic_learner = Learner('LogisticRegression', probability=False, feature_scaling=u'both')
t0 = time.time()
fold_result_list, grid_search_scores = logistic_learner.cross_validate(training_data, stratified=True, \
cv_folds=number_of_folds, grid_search=False, shuffle=True)
print "%d-fold cross-validation took %0.2f seconds" % (number_of_folds, time.time() - t0)
print "Results for each fold:"
print fold_result_list
print "Grid search scores, if used:"
print grid_search_scores
print "Average accuracy: %.3f" % average_accuracy(fold_result_list)

print "\nKNN:"
knn_learner = Learner('KNeighborsClassifier', probability=False, feature_scaling=u'both')
t0 = time.time()
fold_result_list, grid_search_scores = knn_learner.cross_validate(training_data, stratified=True, \
cv_folds=number_of_folds, grid_search=False, shuffle=True)
print "%d-fold cross-validation took %0.2f seconds" % (number_of_folds, time.time() - t0)
print "Results for each fold:"
print fold_result_list
print "Grid search scores, if used:"
print grid_search_scores
print "Average accuracy: %.3f" % average_accuracy(fold_result_list)

print "\nRF:"
rf_learner = Learner('RandomForestClassifier', probability=False, feature_scaling=u'both')
t0 = time.time()
fold_result_list, grid_search_scores = logistic_learner.cross_validate(training_data, stratified=True, \
cv_folds=number_of_folds, grid_search=False, shuffle=True)
print "%d-fold cross-validation took %0.2f seconds" % (number_of_folds, time.time() - t0)
print "Results for each fold:"
print fold_result_list
print "Grid search scores, if used:"
print grid_search_scores
print "Average accuracy: %.3f" % average_accuracy(fold_result_list)

print "\nSVC:"
svc_learner = Learner('SVC', probability=False, feature_scaling=u'both')
t0 = time.time()
fold_result_list, grid_search_scores = svc_learner.cross_validate(training_data, stratified=True, \
cv_folds=number_of_folds, grid_search=True, grid_search_folds=3, grid_objective=u'f1_score_micro', \
param_grid=[{'C':[0.01, 0.1, 1.0, 10.0, 100.0]}, {'gamma':[0.01, 0.1, 1.0, 10.0, 100.0]}], shuffle=True)
print "Grid search %d-fold cross-validation took %0.2f seconds" % (number_of_folds, time.time() - t0)
print "Results for each fold:"
print fold_result_list
print "Grid search scores, if used:"
print grid_search_scores
print "Average accuracy: %.3f" % average_accuracy(fold_result_list)

One of the first things you’ll notice is how much less code we need to run the same cross-validation exercise we ran in the previous section. At the top, we import skll’s Reader and Learner objects, which we’ll use to read in the training data and to specify the statistical learning models we want to cross-validate, respectively.

You’ll notice that when we use the Reader object to create a file reader object we specify the name of the column that’s to be used as the dependent, ‘y’ variable. In this example, we specify that the label column is ‘type’ because it contains the class labels. We also create a variable for the number of cross-validation folds so the code is a little more flexible and you can experiment with 5-fold, 10-fold, or 20-fold cross-validation more easily.

The next block of code defines a function to calculate the average accuracy of the model based on cross-validation. We create this simple helper function because, as described in the API’s documentation, the output of cross-validation is a pair of lists. The first list contains the confusion matrix, overall accuracy, per-label precision, recall, and F-measures, and model parameters for each fold. The second list contains the grid search scores, if any, for each fold. This helper function collects all of the accuracy scores from the cross-validation output and returns the average accuracy across all of the folds.

The next four blocks of code initialize the four statistical learning models, cross-validate the models, and report their results. For example, the first learner is logistic regression. We’re not interested in calculating probabilities, so we set probability equal to False. To be consistent with the centering and scaling we did to the predictor variables in base scikit-learn, we set feature scaling to both.

Next, we perform cross-validation with the model. In this case, we set stratified equal to True to use stratified k-fold cross-validation so each fold contains nearly the same number of red and white wines. Since we set number_of_folds to 10 and cv_folds equals number_of_folds, we’re performing 10-fold cross-validation. For logistic regression, we’re not employing grid search, and we set shuffle equal to True so the observations are shuffled before they’re split into folds for cross-validation. Some of these options are the defaults, and some of them are changed to be consistent with the scikit-learn code we used above.

The k-nearest neighbors and random forest sections are identical to the logistic regression section, except for employing different predictive models. To be consistent with the scikit-learn code we used above, the support vector machines section contains some additional code to employ grid search for optimal C and gamma values.

Finally, we print how long cross-validation takes to complete, all of the results for each of the folds, the grid search scores (if any), and the average accuracy across all of the folds.

Make the script executable by typing the following on the command line and then hitting Enter:
chmod +x classify_wine_skll.py

Run the script by typing the following on the command line and then hitting Enter:
./classify_wine_skll.py wine-both-clean.csv

When you run the script you should see the following output printed to your Terminal window:

skll script output

Since the output runs off of the screen, here are the accuracy scores and processing times for the four predictive models:

Logistic Regression
Accuracy: 0.994
Processing time: 0.85 seconds

K-Nearest Neighbors
Accuracy: 0.993
Processing time: 1.22 seconds

Random Forest
Accuracy: 0.994
Processing time: 0.88 seconds

Support Vector Machines
Accuracy: 0.996
Processing time: 57.60 seconds

Once again the output shows that, in this example, all of the models achieve similar accuracy scores. In this case, logistic regression completes 1.4 times faster than k-nearest-neighbors and 68 times faster than support vector machines.

Now that we’ve seen how to estimate, cross-validate, and measure the performance of four predictive models in a Python script with skll’s APIs, let’s take a look at how to do so on the command line with skll’s configuration file.

SKLL CONFIGURATION FILE

As I mentioned at the top of this post, Jereon Janssens demonstrates how to use skll’s configuration file set-up in his book, Data Science at the Command Line. You can also read skll’s own tutorial for using the configuration file set-up here.

skll’s configuration file set-up requires an input file to be formatted slightly differently than the way ours is now, so we need to modify our input file. Specifically, we need (1) an additional row index column with unique numbers for each row, (2) the wine type column to contain the floating-point numbers 1.0 and 0.0 instead of the strings red and white, and (3) the file should only contain the binary ‘type’ column and the predictor variables we want to use (i.e. it shouldn’t contain additional variables we don’t intend to use). Since our input file contains the ‘quality’ column and we don’t want to use it as a predictor we need to remove it from the file.

You can create the modified version of the input file by entering the following one-line, piped command in a Terminal window:

< wine-both-clean.csv nl -s, -w4 -n rz -v0 | sed 's/0000,/id,/' | sed 's/red/1\./' | sed 's/white/0\./' | cut -d, -f1-12,14 > wine-both-clean-ids.csv

The nl command adds line numbers to each of the rows. -s, says add a comma after the line number (since we’re using a CSV file). -w4 says make the line numbers four digits wide (we know the input file contains 6,498 rows, so four digits wide will be sufficient). -n rz says insert line numbers according to the rz format, which is right-justified with leading zeros. -v0 says start the line numbering at 0, or in this case, 0000.

Next, the first sed command changes the line number in the first line from 0000, to the column heading id,. The second sed command changes the word red into ‘1.’, and the third sed command changes the word white into ‘0.’. The cut command separates the file into columns based on the comma delimiter and selects, or keeps, columns one to twelve and column fourteen (thereby removing the ‘quality’ variable in column thirteen). The result of these operations is redirected to a new output file called wine-both-clean-ids.csv. We’ll use this new file as our input file.

Now that our input file is ready to be processed, let’s create our skll configuration file. To do so, copy and paste the following code into a text editor and save the file as classify_wine_skll.cfg (Note that the file extension is .cfg instead of .py):

[General]
    experiment_name = Wine
    task = cross_validate

[Input]
    train_directory = .
    featuresets = [["wine-both-clean-ids.csv"]]
    learners = ["RandomForestClassifier", "SVC", "KNeighborsClassifier", "LogisticRegression"]
    label_col = type
    id_col = id
    shuffle = True
    feature_scaling = both

[Tuning]
    objective = accuracy
    grid_search = True
    param_grids = [[], [{'C': [0.01, 0.1, 1.0, 10.0, 100.0],'gamma': [0.01, 0.1, 1.0, 10.0, 100.0]}], [], []]

[Output]
    log = results
    results = results
    predictions = results

skll’s configuration file contains four sections, General, Input, Tuning, and Output. You can learn more about each of these sections here. In the General section, we specify a name for the experiment. All of the output file names will start with the word we use. We also specify the task we want to perform, which is cross_validate.

In the Input section, we specify the train_directory, the folder where the configuration file can find the training data set. The period, ‘.’, is shorthand for the current folder. If you save the input file in a different folder, then you’ll have to supply the folder name (e.g. training_data or my_input_files). featuresets contains the name of the input file(s) that contain your features, i.e. explanatory variables. In this case they’re all in one file but, as skll’s tutorial demonstrates, they can be spread across multiple input files. learners is a list of the learners we want to use. label_col indicates which column contains the class labels. id_col indicates which column contains the unique row numbers. Once again, to be consistent with the preceding examples, shuffle equals True and feature_scaling equals both.

In the Tuning section, we specify we want to use accuracy as our objective. Since we want to use grid search for one of our models, we set grid_search to True. When grid search is True, you have to supply a list of lists to param_grids, one list for each model. A list can be empty if you don’t want to perform grid search for the model, but you still need a set of square brackets for each model. If you do want to perform grid search for a model, then you supply a dictionary of the parameters and values you want to search over and optimize. Since we’re performing grid search for the support vector machines model we supply ‘C’ and ‘gamma’ as two keys and lists of values to search over as the values associated with each key.

Finally, in the Output section, we specify the name of the folder we want all of the output to be saved in. In this case, all of the logs, results, and predictions will be saved in a folder named ‘results’ inside our current folder.

Run this configuration file (a.k.a. experiment) by typing the following on the command line and then hitting Enter:
run_experiment classify_wine_skll.cfg

After you hit Enter, you’ll see the following output printed to your command prompt window after all four of the models have completed:

Loading ./wine-both-clean-ids.csv… done
Loading ./wine-both-clean-ids.csv… done
Loading ./wine-both-clean-ids.csv… done
Loading ./wine-both-clean-ids.csv… done

In addition, several output files have been written in the ‘results’ folder inside your current folder. You can cd and/or ls into the results folder to view the output. skll creates four files for each model: a log file, a predictions file, a results file, and a results file formatted as JSON. skll also creates a summary file that contains details on each of the models and each of the cross-validation folds.

One of the measures we’ve been using to compare the models is average accuracy. This value is available in the summary file. Jereon Janssens demonstrates how to access and print this value using some helpful command line tools in his book, but you can also print it out with basic Unix commands. To view the average accuracy for each of the models, type the following command on the command line and then hit Enter (assuming you named the output folder ‘results’ and it’s inside your current folder):

grep average results/Wine_summary.tsv | cut -f1,13 | awk -F\t '{ print $2 ": " $1 }'

The grep command filters for rows in the tab-delimited summary file that contain the word average. The cut command separates the columns in the file based on tabs, which we didn’t have to specify because it’s the default in cut. Then we select the first column (i.e. the average accuracy score) and the thirteenth column (i.e. the name of the classifier). Finally, the awk command re-arranges the two pieces of information so that what’s printed to the screen is the name of the classifier, followed by a colon and a space, and then the average accuracy score. When you run this command you’ll see the following output printed to the Terminal window:

skll configuration file output

Once again, all of the models have similar accuracy scores.

As we’ve seen, skll’s APIs and configuration file set-up encapsulate and simplify a lot of the basic scikit-learn code you need to read input data, transform variables, and estimate, cross-validate, and measure the performance of predictive models. Now that you’re familiar with skll’s general interfaces and syntax, try modifying the code to use your own input data, estimate different models, measure performance with a different metric, or perform other tasks, like predict or evaluate instead of cross_validate. Also be sure to check out the additional resources noted throughout this post for supplementary explanations and examples. Have a great time experimenting with skll’s interfaces!

Getting into Class with Python, Flowers, and Dictionaries

A few weeks ago, I received an email inviting me to sign up for a programming-based training class. I clicked on the link to sign up and was directed to a registration page. The registration page included the usual fields, e.g. name and email, and it also required you to submit your code for two programming questions. I enjoyed having to write two snippets of Python code to sign up for the class, so this post is about these two small scripts.

Both questions were based on the ubiquitous Iris data set, which includes data on the sepal length, sepal width, petal length, and petal width for three species of flower, i.e. versicolor, setosa, and virginica. The file contains a header row and fifty rows of data for each species of flower:

iris_data_set
The first question asked you to count the number of rows associated with each type of flower. The second question asked you to calculate the average petal length for each type of flower. Let’s discuss these two questions in turn.

Question #1: Count the number of rows associated with each type of flower

Let’s take a look at the Python code I submitted to answer this question, and then I’ll explain the code:

#!/usr/bin/env python
from string import strip
import sys
input_file = sys.argv[1]
all_species = {}
with open(input_file, 'r') as file_handle:
    header = file_handle.readline()
    for row in file_handle:
        row_list = row.strip().split(',')
        species = row_list[4]
        if species not in all_species:
            all_species[species] = 1
        else:
            all_species[species] += 1

for species, count in all_species.items():
    print "%s: %d" % (species, count)

Whenever I’m asked to group or bin data into categories I consider using a Python dictionary (a.k.a. hash or associative array). Grouping or binning data in a dictionary is helpful because you don’t have to specify the number of categories ahead of time (i.e. new categories are added as keys in the dictionary as they’re encountered in the data) and you’re guaranteed to have distinct categories (i.e. dictionary keys must be unique).

Line one is the shebang line that instructs Unix and Mac computers where to find the Python program that will be used to run this code. Line two imports the strip function from Python’s built-in string module so we can remove the newline character from the end of each row of input. Line three imports the built-in sys module so we can use it to pass arguments from the command line into the script.

Line four uses the sys module to capture the first argument on the command line after the name of my script and assign the value to a variable called input_file. Since I named my script species_count.py and the input file is called iris.csv, the complete input for the command line is: python species_count.py iris.csv

Line five uses curly braces to create an empty dictionary called all_species.

Line six uses a with statement to open the input file for reading, ‘r’, and creates a file handle for reading the data in the file. Line seven uses the readline function to read the first row of data in the input file, i.e. the header row, into a variable called header. I need to remove the header row because I only need the data rows to answer the question.

Line eight is a for loop that loops over the data rows, one row at a time; therefore, lines nine to fourteen occur for every row in the input file (You can also tell that this is the case because lines nine to fourteen are indented beneath line eight). When each data row enters line nine it enters as a string, so first the strip function removes the newline character from the end of the string and then the split function converts the string into a list (a.k.a. array) by splitting the string on commas, which are the column delimiters. The resulting list is assigned to a variable called row_list.

Line ten uses list indexing to capture the fifth element in the list, i.e. the type of flower, and assign the value to a variable called species. Remember, index values start at zero, i.e. 0, 1, 2, 3, 4, …, so index number four refers to the fifth element in the list.

Line eleven is an if-else statement that tests whether the value in the variable species, i.e. the type of flower, is not already a key in the dictionary called all_species. If it is not, then line twelve adds the value as a key in the dictionary and assigns the key a value of one. If the value in the variable species is already a key in the dictionary, then line twelve is skipped and line fourteen adds one to the value associated with the key.

Let’s discuss a specific example to make sure the operations in lines eleven to fourteen are clear. Let’s say the first two data rows in the input file are for setosa flowers. When the first data row is processed, there are no keys in the dictionary so line eleven will be true and line twelve will add “setosa” as a key in the all_species dictionary and set the associated value to one. When the second data row, which is also a setosa row, is processed, “setosa” is already a key in the dictionary so line twelve is skipped and line fourteen adds one to the value associated with the key “setosa”.

Finally, lines sixteen and seventeen are a for loop and print statement for looping over the dictionary’s keys and values and printing them to the screen. The for loop uses the dictionary’s items function to unpack the dictionary’s keys and values into the variables species and count. Then line sixteen uses string formatting to format the value in species, i.e. the type of flower, as a string, %s, and the value in count as an integer, %d.

The following is the result of running this script on the input file:

species_count
As you can see, there are 50 data rows for each type of flower.

Question #2: Calculate the average petal length for each type of flower

Let’s take a look at the Python code I submitted to answer this question, and then I’ll explain the code:

#!/usr/bin/env python
from string import strip
import sys
input_file = sys.argv[1]
all_species = {}
with open(input_file, 'r') as file_handle:
    header = file_handle.readline()
    for row in file_handle:
        row_list = row.strip().split(',')
        petal_length = row_list[2]
        species = row_list[4]
        if species not in all_species:
            all_species[species] = [float(petal_length), 1]
        else:
            all_species[species][0] += float(petal_length)
            all_species[species][1] += 1

for species, values in all_species.items():
    print '%s: %0.2f' % (species, values[0]/values[1])

Lines one to nine are identical to the lines of code we’ve already discussed, so I’ll focus on the new lines of code I used to calculate the average petal length for each type of flower.

Lines ten and eleven capture two specific elements from each data row once the row has been converted into a list variable. Line ten captures the third element in the list, i.e. the petal length, and line eleven captures the fifth element, i.e. the type of flower.

Line twelve is identical to line eleven in the previous script we discussed. It is an if-else statement that tests whether the value in the variable species, i.e. the type of flower, is not already a key in the dictionary called all_species. If it is not, then line thirteen adds the value as a key in the dictionary and assigns a list with two elements as the key’s associated value. The first element in the list is the value in the variable petal_length, converted to a floating-point number with the float function, and the second element is the number one.

If the value in the variable species is already a key in the dictionary, then line thirteen is skipped and lines fifteen and sixteen add the floating-point version of the value in the variable petal-length to the first element in the list, [0], associated with the key and add one to the second element in the list, [1], associated with the key, respectively.

Lines twelve to sixteen ensure that after all of the data rows are processed the dictionary contains three keys, i.e. for the three species of flowers in the input file, and each key’s value is a list with two elements. The first element in the list contains the sum of all of the petal lengths for that type of flower and the second element in the list contains the number of rows associated with that type of flower. Next, we’ll use these two values to calculate the average petal length for each type of flower.

Lines eighteen and nineteen are a for loop and print statement for looping over the dictionary’s keys and values and printing them to the screen. The for loop uses the dictionary’s items function to unpack the dictionary’s keys and values into the variables species and values. Line nineteen uses string formatting to format the two values passed into the print statement from the parentheses. The first value is the name of the species, formatted as a string, %s. The second value is the average petal length for the species, which is the result of dividing the two elements in each list and formatting the result as a floating-point number rounded to two decimal places, %0.2f.

The following is the result of running this script on the input file:

iris_avg_petal_length
It was a lot of fun creating these two scripts to register for the class. Despite being short, they illustrate how you can use a Python dictionary to group or bin data into unique categories and how to calculate basic statistics. If you’re new to Python or the dictionary data structure, try downloading the Iris data set or another text or CSV file from the Internet and re-creating these scripts to practice organizing data into a dictionary. Once you get the hang of it, you’ll have another powerful data structure you can use to organize and analyze data. Have fun!

To Read or Not to Read? Hopefully the Former!

It has been far too long since I posted to this blog.  It’s time to let you know what I’ve been working on over the past few months.  I’ve been writing a book on a topic that is near and dear to my heart.

The book is titled Multi-objective Decision Analysis: Managing Trade-offs and Uncertainty.  It is an applied, concise book that explains how to conduct multi-objective decision analyses using spreadsheets.

The book is scheduled to be published by Business Expert Press in 2013.  For a little more information about my forthcoming book, please read the abstract shown below:

“Whether managing strategy, operations, or products, making the best decision in a complex uncertain business environment is challenging.  One of the major difficulties facing decision makers is that they often have multiple, competing objectives, which means trade-offs will need to be made.  To further complicate matters, uncertainty in the business environment makes it hard to explicitly understand how different objectives will impact potential outcomes.  Fortunately, these problems can be solved with a structured framework for multi-objective decision analysis that measures trade-offs among objectives and incorporates uncertainties and risk preferences.

This book is designed to help decision makers by providing such an analysis framework implemented as a simple spreadsheet tool.  This framework helps structure the decision making process by identifying what information is needed in order to make the decision, defining how that information should be combined to make the decision and, finally, providing quantifiable evidence to clearly communicate and justify the final decision.

The process itself involves minimal overhead and is perfect for busy professionals who need a simple, structured process for making, tracking, and communicating decisions.  With this process, decision making is made more efficient by focusing only on information and factors that are well-defined, measureable, and relevant to the decision at hand.  The clear characterization of the decision required by the framework ensures that a decision can be traced and is consistent with the intended objectives and organizational values.  Using this structured decision-making framework, anyone can effectively and consistently make better decisions to gain a competitive and strategic advantage.”

Look for my forthcoming book, Multi-objective Decision Analysis, on the bookshelves in 2013!

Source: favim.com