Getting into Class with Python, Flowers, and Dictionaries

A few weeks ago, I received an email inviting me to sign up for a programming-based training class. I clicked on the link to sign up and was directed to a registration page. The registration page included the usual fields, e.g. name and email, and it also required you to submit your code for two programming questions. I enjoyed having to write two snippets of Python code to sign up for the class, so this post is about these two small scripts.

Both questions were based on the ubiquitous Iris data set, which includes data on the sepal length, sepal width, petal length, and petal width for three species of flower, i.e. versicolor, setosa, and virginica. The file contains a header row and fifty rows of data for each species of flower:

[Image: preview of the Iris data set file]

The first question asked you to count the number of rows associated with each type of flower. The second question asked you to calculate the average petal length for each type of flower. Let’s discuss these two questions in turn.

Question #1: Count the number of rows associated with each type of flower

Let’s take a look at the Python code I submitted to answer this question, and then I’ll explain the code:

#!/usr/bin/env python
from string import strip
import sys
input_file = sys.argv[1]
all_species = {}
with open(input_file, 'r') as file_handle:
    header = file_handle.readline()
    for row in file_handle:
        row_list = row.strip().split(',')
        species = row_list[4]
        if species not in all_species:
            all_species[species] = 1
        else:
            all_species[species] += 1

for species, count in all_species.items():
    print "%s: %d" % (species, count)

Whenever I’m asked to group or bin data into categories I consider using a Python dictionary (a.k.a. hash or associative array). Grouping or binning data in a dictionary is helpful because you don’t have to specify the number of categories ahead of time (i.e. new categories are added as keys in the dictionary as they’re encountered in the data) and you’re guaranteed to have distinct categories (i.e. dictionary keys must be unique).

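Line one is the shebang line that tells Unix and Mac computers where to find the Python interpreter that will run this code. Line two imports the strip function from Python’s built-in string module, which can remove the newline character from the end of each row of input. Strictly speaking, this import is not needed, because line nine calls the strip method that every string already has, and the string module only provides a strip function in Python 2 (these scripts use Python 2 syntax throughout, including the print statement). Line three imports the built-in sys module so we can use it to pass arguments from the command line into the script.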

Line four uses the sys module to capture the first argument on the command line after the name of my script and assign the value to a variable called input_file. Since I named my script species_count.py and the input file is called iris.csv, the complete input for the command line is: python species_count.py iris.csv
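
To make the role of sys.argv concrete, here is a minimal sketch that simulates the command line above with a plain list. The helper name get_input_file and the usage message are my own illustration, not part of the submitted script, and the sketch uses print() so it runs under Python 2 and Python 3:

```python
def get_input_file(argv):
    """Return the first command-line argument after the script name."""
    # argv[0] is the script name, so the input file is expected at argv[1].
    if len(argv) < 2:
        raise SystemExit("usage: python species_count.py <input_file>")
    return argv[1]

# Simulates what sys.argv holds for: python species_count.py iris.csv
example_argv = ["species_count.py", "iris.csv"]
input_file = get_input_file(example_argv)
print(input_file)  # iris.csv
```

In a real script you would pass sys.argv to the helper; the check simply gives a readable message instead of an IndexError when the file name is left off.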

Line five uses curly braces to create an empty dictionary called all_species.

Line six uses a with statement to open the input file for reading, 'r', and creates a file handle for reading the data in the file; the with statement also closes the file automatically when the indented block ends. Line seven uses the file handle’s readline method to read the first row of the input file, i.e. the header row, into a variable called header. Reading the header here skips it, so the loop that follows only sees the data rows I need to answer the question.
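
Here is a small sketch of that header-skipping behavior. It uses io.StringIO as a stand-in for the opened file so it can run without iris.csv; the header names and the two data rows are assumptions for illustration:

```python
import io

# Stand-in for the opened file: a header row plus two data rows.
# The header names are an assumption; the post does not list them.
fake_file = io.StringIO(
    u"sepal_length,sepal_width,petal_length,petal_width,species\n"
    u"5.1,3.5,1.4,0.2,setosa\n"
    u"4.9,3.0,1.4,0.2,setosa\n"
)

header = fake_file.readline()            # consumes the header row only
data_rows = [row for row in fake_file]   # iteration resumes at the first data row

print(len(data_rows))  # 2
```

Because readline has already moved past the first line, the loop never sees the header.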

Line eight is a for loop that loops over the data rows, one row at a time; therefore, lines nine to fourteen run once for every data row in the input file (you can also tell that this is the case because lines nine to fourteen are indented beneath line eight). Each data row arrives at line nine as a string, so first the strip method removes the newline character from the end of the string and then the split method converts the string into a list (a.k.a. array) by splitting the string on commas, which are the column delimiters. The resulting list is assigned to a variable called row_list.
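
To see what line nine does to a single row, here is a minimal sketch with one illustrative row in the same five-column layout. The name sample_row is just for this example:

```python
# A sample data row as it would arrive from the file, newline included.
# The values are illustrative; the layout (four measurements, then species) follows the post.
sample_row = "5.1,3.5,1.4,0.2,setosa\n"

# strip() removes the trailing newline; split(',') breaks the string on commas.
row_list = sample_row.strip().split(',')

print(row_list)
# ['5.1', '3.5', '1.4', '0.2', 'setosa']
```

Note that every element is still a string, including the numbers, which is why the second script later needs the float function.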

Line ten uses list indexing to capture the fifth element in the list, i.e. the type of flower, and assign the value to a variable called species. Remember, index values start at zero, i.e. 0, 1, 2, 3, 4, …, so index number four refers to the fifth element in the list.
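
As a quick illustration of zero-based indexing, this sketch prints each index next to its value for one illustrative row that has already been split:

```python
row_list = ['5.1', '3.5', '1.4', '0.2', 'setosa']  # illustrative row, already split

# Indexes start at zero, so a five-element list has indexes 0 through 4.
for index, value in enumerate(row_list):
    print("%d: %s" % (index, value))

species = row_list[4]  # the fifth element, i.e. the type of flower
print(species)         # setosa
```

Since the species is the last column, row_list[-1] would return the same value; the script uses the explicit index 4.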

Line eleven is an if-else statement that tests whether the value in the variable species, i.e. the type of flower, is not already a key in the dictionary called all_species. If it is not, then line twelve adds the value as a key in the dictionary and assigns the key a value of one. If the value in the variable species is already a key in the dictionary, then line twelve is skipped and line fourteen adds one to the value associated with the key.

Let’s discuss a specific example to make sure the operations in lines eleven to fourteen are clear. Let’s say the first two data rows in the input file are for setosa flowers. When the first data row is processed, there are no keys in the dictionary so line eleven will be true and line twelve will add “setosa” as a key in the all_species dictionary and set the associated value to one. When the second data row, which is also a setosa row, is processed, “setosa” is already a key in the dictionary so line twelve is skipped and line fourteen adds one to the value associated with the key “setosa”.
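
The same walk-through can be sketched in code. This example feeds three species values through the if-else logic from lines eleven to fourteen and records the dictionary after each one; the snapshots list is only there for illustration:

```python
all_species = {}
snapshots = []  # illustration only: a copy of the dictionary after each row

for species in ["setosa", "setosa", "versicolor"]:
    if species not in all_species:
        all_species[species] = 1     # first time we see this species
    else:
        all_species[species] += 1    # species is already a key
    snapshots.append(dict(all_species))

print(snapshots[0])  # {'setosa': 1}
print(snapshots[1])  # {'setosa': 2}
```

After the third row the dictionary holds two keys: setosa with a count of two and versicolor with a count of one.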

Finally, lines sixteen and seventeen are a for loop and print statement that loop over the dictionary’s keys and values and print them to the screen. The for loop uses the dictionary’s items method to unpack each key and its value into the variables species and count. Then line seventeen uses string formatting to format the value in species, i.e. the type of flower, as a string, %s, and the value in count as an integer, %d.
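
Here is a small sketch of that loop on its own, starting from a dictionary that already holds the final counts of 50 rows per species. The call to sorted is my addition so the output order is predictable; the submitted script loops over items() directly:

```python
all_species = {"setosa": 50, "versicolor": 50, "virginica": 50}

lines = []
# items() yields (key, value) pairs; sorted() is an addition for a stable order.
for species, count in sorted(all_species.items()):
    lines.append("%s: %d" % (species, count))

print("\n".join(lines))
# setosa: 50
# versicolor: 50
# virginica: 50
```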

The following is the result of running this script on the input file:

[Image: output of species_count.py]

As you can see, there are 50 data rows for each type of flower.

Question #2: Calculate the average petal length for each type of flower

Let’s take a look at the Python code I submitted to answer this question, and then I’ll explain the code:

#!/usr/bin/env python
from string import strip
import sys
input_file = sys.argv[1]
all_species = {}
with open(input_file, 'r') as file_handle:
    header = file_handle.readline()
    for row in file_handle:
        row_list = row.strip().split(',')
        petal_length = row_list[2]
        species = row_list[4]
        if species not in all_species:
            all_species[species] = [float(petal_length), 1]
        else:
            all_species[species][0] += float(petal_length)
            all_species[species][1] += 1

for species, values in all_species.items():
    print '%s: %0.2f' % (species, values[0]/values[1])

Lines one to nine are identical to the lines of code we’ve already discussed, so I’ll focus on the new lines of code I used to calculate the average petal length for each type of flower.

Lines ten and eleven capture two specific elements from each data row once the row has been converted into a list variable. Line ten captures the third element in the list, i.e. the petal length, and line eleven captures the fifth element, i.e. the type of flower.

Line twelve is identical to line eleven in the previous script we discussed. It is an if-else statement that tests whether the value in the variable species, i.e. the type of flower, is not already a key in the dictionary called all_species. If it is not, then line thirteen adds the value as a key in the dictionary and assigns a list with two elements as the key’s associated value. The first element in the list is the value in the variable petal_length, converted to a floating-point number with the float function, and the second element is the number one.

If the value in the variable species is already a key in the dictionary, then line thirteen is skipped. Line fifteen adds the floating-point version of the value in the variable petal_length to the first element in the list, [0], associated with the key, and line sixteen adds one to the second element in the list, [1].

Lines twelve to sixteen ensure that after all of the data rows are processed the dictionary contains three keys, i.e. for the three species of flowers in the input file, and each key’s value is a list with two elements. The first element in the list contains the sum of all of the petal lengths for that type of flower and the second element in the list contains the number of rows associated with that type of flower. Next, we’ll use these two values to calculate the average petal length for each type of flower.
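
To show how the sum and the count build up, here is a minimal sketch that runs lines nine to sixteen over three illustrative rows held in a list instead of a file:

```python
# Illustrative rows in the post's column layout; a list stands in for the file.
rows = [
    "5.1,3.5,1.4,0.2,setosa",
    "4.9,3.0,1.4,0.2,setosa",
    "7.0,3.2,4.7,1.4,versicolor",
]

all_species = {}
for row in rows:
    row_list = row.strip().split(',')
    petal_length = float(row_list[2])
    species = row_list[4]
    if species not in all_species:
        all_species[species] = [petal_length, 1]   # [running sum, row count]
    else:
        all_species[species][0] += petal_length
        all_species[species][1] += 1

print(all_species["setosa"])      # [2.8, 2]
print(all_species["versicolor"])  # [4.7, 1]
```

Each key ends up with a two-element list: the running sum of petal lengths and the number of rows seen for that species.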

Lines eighteen and nineteen are a for loop and print statement that loop over the dictionary’s keys and values and print them to the screen. The for loop uses the dictionary’s items method to unpack each key and its value into the variables species and values. Line nineteen uses string formatting to format the two values passed into the print statement from the parentheses. The first value is the name of the species, formatted as a string, %s. The second value is the average petal length for the species, which is the result of dividing the first element of each list (the sum of the petal lengths) by the second element (the number of rows) and formatting the result as a floating-point number rounded to two decimal places, %0.2f.
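
Here is a short sketch of the division and the %0.2f formatting. The sum and count are made-up numbers chosen only to show the rounding, and the label "example" is a placeholder rather than a real species:

```python
values = [7.0, 3]  # made-up [sum, count] pair, not taken from the Iris data

# The sum is a float, so this is true division in both Python 2 and Python 3.
average = values[0] / values[1]

formatted = '%s: %0.2f' % ('example', average)
print(formatted)  # example: 2.33
```

Storing the sum as a float matters in Python 2, where dividing one integer by another would discard the fractional part.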

The following is the result of running this script on the input file:

[Image: output of the average petal length script]

It was a lot of fun creating these two scripts to register for the class. Despite being short, they illustrate how you can use a Python dictionary to group or bin data into unique categories and how to calculate basic statistics. If you’re new to Python or the dictionary data structure, try downloading the Iris data set or another text or CSV file from the Internet and re-creating these scripts to practice organizing data into a dictionary. Once you get the hang of it, you’ll have another powerful data structure you can use to organize and analyze data. Have fun!
