Visualizing Global Land Temperatures in Python with scrapy, xarray, and cartopy

A few years ago, I worked on a project that involved collecting data on a variety of global environmental conditions over time. Some of the data sets included cloud cover, rainfall, types of land cover, sea temperature, and land temperature. I enjoyed developing a greater understanding of our Earth by visualizing how these conditions vary over time around the planet. To get a sense of how fun and informative it can be to analyze environmental data over time, let’s work on visualizing global land surface temperatures from 2001 to 2016.

12-monthly-averages

Data

The data we’ll use in this post are NASA Earth Observation’s Land Science Team’s daytime land surface temperatures, “temperatures of the “skin” (or top 1 millimeter) of the land surface during the daytime, collected by the Moderate Resolution Imaging Spectroradiometer (MODIS), an instrument on NASA’s Terra and Aqua satellites”. Temperatures in the data range from -25 ºC (-13 ºF) to 45 ºC (113 ºF).

The data are available at resolutions of 1.0, 0.5, 0.25, and 0.1 degrees. Degrees of latitude are approximately 69 miles (111 kilometers) apart, so the 0.1 degrees files contain land temperature readings spaced approximately 6.9 miles (11.1 km) apart north and south. Unlike latitude, the distance between degrees of longitude varies by latitude. The distance is greatest at the equator and gradually shrinks to zero at the poles. For instance, at the equator, degrees of longitude are approximately 69 miles (111 km) apart; whereas, at 40º north and south, degrees of longitude are approximately 53 miles (85 km) apart. To take advantage of the most fine-grained data available, we’ll use the 0.1 degrees files in this post.

nasa_neo_land_surface_temp_website

Create Environment

To begin, let’s create a dedicated folder and Python environment for this project. The following commands create a new folder named land_temperature and, inside it, another folder named input_files and then move you into the land_temperature folder:

mkdir -p land_temperature/input_files

cd land_temperature

The following conda commands create and activate a new Python 3.5 environment named land_temp that includes the listed packages, as well as their dependencies. If you’re not using the Anaconda distribution of Python, you can use the venv module in Python 3’s standard library to create a similar dedicated environment:

conda create --name land_temp python=3.5 pandas xarray scrapy matplotlib seaborn cartopy jupyter

source activate land_temp

Create Web Page URLs

Now that we’ve activated our dedicated Python environment, let’s inspect NASA NEO’s land surface temperature web page URL to determine how we’ll need to change it to access all of the web pages we need to visit. The URL for the month-level data for 2001-01-01 is:

"https://neo.sci.gsfc.nasa.gov/view.php?datasetId=MOD11C1_M_LSTDA&date=2001-01-01"

If you change the 2001 to 2002 and refresh the page, you’ll see you’re now viewing the month-level data for 2002-01-01. If you make a similar change to the month, you’ll see you’re now viewing data for a different month. It appears we can simply change the date in the URL to access all of the month-level files from 2001 to 2016. Let’s use pandas in the ipython interactive shell to generate this list of URLs:

ipython

import pandas as pd

start_date = '2001-01-01'
end_date = '2016-12-01'

dates = pd.date_range(start=start_date, end=end_date, freq='MS')

dates = [dt.strftime('%Y-%m-%d') for dt in dates]

url_base = "https://neo.sci.gsfc.nasa.gov/view.php?datasetId=MOD11C1_M_LSTDA&date="

urls = [url_base+dt for dt in dates]

create_webpage_urls

Inspect Web Page HTML

Now that we have the web page URLs we need, let’s use Chrome’s element inspection tool and the scrapy interactive shell to determine how to extract the links to the data files from the web pages. To start, let’s click on the File Type dropdown menu to see what file types are available. There are several options, but let’s plan to download the type, CSV for Excel.

Below the File Type dropdown menu, there are four geographic resolution options, 1.0, 0.5, 0.25, and 0.1 degrees, which provide increasingly granular data. Let’s right-click on the tiny, right-facing arrow to the right of 0.1 degrees 3600 x 1800 and select Inspect to inspect the HTML near the link in Chrome’s element inspection tool.

chrome_inspection_tools

The HTML shows us the link to the data file is in a table. Moreover, the link is in a row that has class=”size-option” and, within the data cell (td) element, it is in a hyperlink (a) element’s href attribute. With this understanding of the HTML path to the data file link, let’s use scrapy’s interactive shell to figure out how to extract the link:

scrapy shell "https://neo.sci.gsfc.nasa.gov/view.php?datasetId=MOD11C1_M_LSTDA&date=2001-01-01"

response.css('tr.size-option td a::attr(href)').extract()

response.css('tr.size-option td a::attr(href)')[-1].extract()

scrapy_extract_links

If you inspect a few of the data file links, you’ll notice an issue with them (i.e. a number in the middle of the URL that varies) that we need to address if we want to download the files programmatically:

"http://neo.sci.gsfc.nasa.gov/servlet/RenderData?si=869628&cs=rgb&format=SS.CSV&width=3600&height=1800"

In the previous section, when we generated the web page URLs, the portion of the URL that needed to change was the date at the end of the URL, and it needed to change in an understandable way. In this case, I don’t know which number is associated with each URL (and I can’t guess the underlying pattern if there is one), so I can’t generate them programmatically. Instead of generating the data file links like the web page URLs in the previous section, let’s simply scrape the actual data file links from the web pages.

Scrape Data File URLs

Now that we know how to select the data file links, let’s use scrapy to extract them from the web pages so we can then use them to download the data files. In total, there will be 192 URLs and files (12 months per year x 16 years = 192 monthly files).

From inside the land_temperature folder, type the following commands:

scrapy startproject scrape_land_temps

cd scrape_land_temps

Now that we’re inside the first scrape_land_temps folder, let’s create a scrapy spider, a Python file, named land_temp_csv_files_spider.py inside the scrape_land_temps/spiders folder. In the spider, let’s combine our web page URL generation code with our href link extraction code to instruct the spider to visit each of the 192 month-level web pages and extract the link to the 0.1 degrees data file from each page. Then we can use these URLs to download the CSV files:

import scrapy
import pandas as pd

class LandTempCSVFilesSpider(scrapy.Spider):
        name = "land_temp_csv_files"

        def start_requests(self):
                start_date = '2001-01-01'
                end_date = '2016-12-01'
                dates = pd.date_range(start=start_date, end=end_date, freq='MS')
                dates = [dt.strftime('%Y-%m-%d') for dt in dates]
                url_base = 'https://neo.sci.gsfc.nasa.gov/view.php?datasetId=MOD11C1_M_LSTDA&date='
                urls = [url_base+dt for dt in dates]

                for url in urls:
                        yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
                dt = response.url.split("=")[-1]
                url = response.css('tr.size-option td a::attr(href)')[-1].extract()
                url = url.replace('JPEG', 'SS.CSV')
                yield {'date': dt, 'url': url}

scrapy_spider

Let’s use the following command to run the spider and extract the links to the data files. The result is a JSON file named land_temp_csv_files_urls.json that contains an array of 192 objects, each containing a date and the link to the data file associated with the date:

scrapy crawl land_temp_csv_files -o ../land_temp_csv_files_urls.json

cd ..

Download Data Files

We’re finally ready to download the 192 month-level land surface temperature data files. Let’s return to the ipython interactive shell and use the following code to iterate through the array of URLs in our JSON file to download the CSV files.

First, we read the pairs of dates and URLs in the JSON file into a dataframe named ‘df’. Next, we loop over these pairs (i.e. rows in the dataframe) and, for each one, use the URL to read the remote CSV file into a dataframe named ‘dat’ and then write the dataframe to a local file in the input_files folder.

We insert the date, e.g. 2001-01-01, into the filenames so we know which date each file represents. Also, we use try-except blocks around the reading and writing operations so the loop won’t terminate if it runs into any issues (instead, it will print messages to the screen):

ipython

import pandas as pd

df = pd.read_json('land_temp_csv_files_urls.json')

for index, row in df.iterrows():
        print(index, row['url'])
        try:
                dat = pd.read_csv(row['url'], header=0, index_col=0)
        except:
                print('Error reading: {}'.format(row['url']))
                continue

        filename = 'MOD11C1_M_LSTDA_{}_rgb_3600x1800.SS.CSV'.format(row['date'].strftime('%Y-%m-%d'))
        try:
                dat.to_csv('input_files/'+filename)
        except:
                print('Error writing: {}'.format(filename))
                continue

download_and_write_csv_files

Combine Data

Now that we have all of our data files, let’s return to the ipython interactive shell and use the following code to read and combine all of the CSV files into a three-dimensional array (i.e. x = longitude, y = latitude, z = date):

import xarray as xr
import numpy as np
import pandas as pd
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
from pathlib import Path

The following code snippet is a helper function we’ll use to make the file-reading code shown below easier to read. This function takes a file name as input, splits it into pieces at the underscores, extracts the piece with index position 3 (this piece is the date, e.g. 2001-01-01), and converts the date into a datetime object:

def date_from_filename(filename):
        fn = filename.name
        dt = pd.to_datetime(fn.split('_')[3])
        return dt

The following code snippet is another helper function we’ll use to make the file-reading code easier to read. This function takes an array as input, converts all of the array elements into floating-point numbers, rounds all of the numbers to a specified number of decimal places (the default is 2 decimal places), and then converts the elements to string type:

def round_coords(arr, d=2):
        vals = map(str, [round(float(val), d) for val in arr])
        return vals

The following line of code uses the pathlib module in Python’s standard library to create and return a sorted list of paths to all of the CSV files in the input_files folder:

files = sorted(Path("input_files/").glob("*.CSV"))

The block of code shown below reads all of the CSV files and combines them into a three-dimensional array (i.e. x = longitude, y = latitude, z = date). We’ll use the list named ‘das’ to collect the 192 individual arrays. Later, we’ll pass this list of arrays to xarray’s concat function to concatenate them into a new, combined array. Similarly, we’ll use the list named ‘dts’ to collect the 192 dates so we can use them as the new dimension in the combined array.

Next, we start to loop through each of the CSV files. For each file, we use the date_from_filename function to extract the date from the filename and append it into the dts list. Next, we read the CSV file, noting that the first row is the header row of longitude values, the first column is the index of latitude values, and NA data values are coded as 99999.0. The next three lines round the data, longitude, and latitude values to two decimal places.

Next, we input these values into xarray’s DataArray constructor to create a two-dimensional array and add it to the das list. Finally, we use xarray’s concat function to combine the 192 two-dimensional arrays into a three-dimensional array with the new dimension named ‘date’:

das = []
dts = []
for input_file in files:
        dt = date_from_filename(input_file)
        dts.append(dt)
        df = pd.read_csv(input_file, header=0, index_col=0, na_values=99999.0)
        df = df.round(2)
        df.columns = round_coords(df.columns)
        df.index = round_coords(df.index)
        da = xr.DataArray(df.values,
                coords=[[float(lat) for lat in df.index], [float(lon) for lon in df.columns]],
                dims=['latitude', 'longitude'])
        das.append(da)

da = xr.concat(das, pd.Index(dts, name='date'))

xarray_read_all_csv_files

At this point, we should have a three-dimensional array named ‘da’ we can use to analyze and visualize land surface temperatures from 2001 to 2016. Let’s check to make sure the array has the expected dimensions and appears to have the right content:

da.shape

da

The temperature values are in degrees Celsius. Nearly everyone in the world learns this temperature scale, except for people in the United States. Since many readers live outside the United States, I’m going to leave the values in degrees Celsius; however, converting them to degrees Fahrenheit is straightforward:

da.values = (da.values * 1.8) + 32
da = da.round(2)

Average Land Surface Temperatures

xarray extends pandas and numpy functionality to facilitate multi-dimensional indexing, grouping, and computing. As an example, we can calculate the average land surface temperatures across all 192 months and display them on a map with the following code:

da.mean(dim='date').plot(figsize=(10, 6));
plt.show()

192-month-average-plate-caree

The PlateCaree projection is a nice default, but let’s explore some other map projections.

cartography_west_wing

Remembering the scene about projections in the television show, The West Wing, here is the same data displayed on a map with the Mollweide projection:

plt.figure(figsize=(10, 6))
ax_p = plt.gca(projection=ccrs.Mollweide(), aspect='auto')
da.mean(dim='date').plot.imshow(ax=ax_p, transform=ccrs.PlateCarree());
plt.show()

192-month-average-mollweide

As an additional example, the following code block displays the data on a map with the Robinson projection. This example also illustrates some of the additional arguments you can supply to the plot.imshow function to customize the plot:

fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111, projection=ccrs.Robinson(), aspect='auto')
da.mean(dim='date').plot.imshow(ax=ax, transform=ccrs.PlateCarree(),
                x='longitude', y='latitude',
                robust=True, # vmin=-25, vmax=40,
                cmap='RdBu_r',
                add_colorbar=True,
                extend='both');
plt.show()

192-month-average-robinson

By Specific Geographic Area

The previous examples displayed maps of the entire Earth. In some cases, you may only be interested in a specific segment of the globe. In these cases, you can use array indexing to filter for the subset of data you want or use cartopy’s set_extent function to restrict the map to a specific geographic area.

If you use array indexing, be sure to check the ordering of your array’s axes so you place your index values or ranges in the right positions. For example, in our ‘da’ array the ordered dimensions are date, latitude, and longitude (which we can check with da.shape), so the indexing in the following command selects all dates, latitudes between 20.05 and 50.05, and longitudes between -66.50 and -125.05:

usa = da.loc[:, 50.05:20.05, -125.05:-66.50]
usa.mean(dim='date').plot();
plt.show()

usa-192-month-average-plate-caree

Alternatively, we can use cartopy’s set_extent function to restrict the map to a specific segment of the globe:

plt.figure(figsize=(10, 6))
ax_p = plt.gca(projection=ccrs.LambertConformal(), aspect='auto')
ax_p.set_extent([-125.05, -66.50, 20.05, 50.05])
usa.mean(dim='date').plot.imshow(ax=ax_p, transform=ccrs.PlateCarree());
plt.show()

usa-192-month-average-lambert-conformal

By Month of the Year

The previous plots calculated average land surface temperatures across all 192 months, which doesn’t let us see temperature differences among months of the year, i.e. January, February, …, December. To calculate average temperatures for each month, we can use xarray’s groupby function to group our data by month of the year and then calculate average temperatures for these groups:

by_month = da.groupby(da.date.dt.month).mean(dim='date')
by_month.plot(x='longitude', y='latitude', col='month', col_wrap=4);
plt.show()

12-monthly-averages-plate-caree-square

By Season

In xarray’s documentation, Joe Hamman demonstrates how to calculate season averages with weighted averages that account for the fact that months have different numbers of days. Slightly adapting his code for our dataset, we can view how global land surface temperatures vary across seasons (to run the code shown below, you’ll first need to copy and paste Joe’s dpm dictionary and leap_year and get_dpm functions):

month_length = xr.DataArray(get_dpm(da.date.to_index(), calendar='noleap'),
                coords=[da.date], name='month_length')
weights = month_length.groupby('date.season') / month_length.groupby('date.season').sum()
np.testing.assert_allclose(weights.groupby('date.season').sum().values, np.ones(4))

da_weighted = (da * weights).groupby('date.season').sum(dim='date')

fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(15,4))
for i, season in enumerate(('DJF', 'MAM', 'JJA', 'SON')):
        if i == 3:
                da_weighted.sel(season=season).plot(
                        ax=axes[i], robust=True, cmap='RdBu_r', #'Spectral_r',
                        add_colorbar=True, extend='both')
        else:
                da_weighted.sel(season=season).plot(
                        ax=axes[i], robust=True, cmap='RdBu_r',
                        add_colorbar=False, extend='both')

for i, ax in enumerate(axes.flat):
        if i > 0:
                ax.axes.get_xaxis().set_ticklabels([])
                ax.axes.get_yaxis().set_ticklabels([])
                ax.set_ylabel('')
                ax.set_xlabel('')

axes[0].set_title('Weighted by DPM')

4-seasonal-averages-plate-caree-wide

Looping through Months of the Year

The previous examples generated static images. While you can certainly scan over the month of year and season-based plots to inspect differences among the time periods, it can be helpful to generate the plots in a loop so you can focus on a geographic area of interest and let the program handle transitioning from one time period to the next.

Since we’re going to loop over time periods, e.g. months of the year, I’d like to label each plot so we know which time period is being displayed. In our dataset, the months are numbered from 1 to 12. I want to be able to refer to January instead of month 1, so let’s create a dictionary that maps the month number to the corresponding name:

months = {
        1: 'January',
        2: 'February',
        3: 'March',
        4: 'April',
        5: 'May',
        6: 'June',
        7: 'July',
        8: 'August',
        9: 'September',
        10: 'October',
        11: 'November',
        12: 'December'
}

Next, let’s write a function that will generate each plot, e.g. one for each month of the year. Inside the function, let’s use matplotlib’s clf function to clear the current figure, create a plot axis with the Robinson projection, filter for the subset of arrays with the specified month of the year, create a plot of the average land surface temperatures in the month across all sixteen years, and finally use the name of the month as the plot title:

def plot_monthly(month_number):
        plt.clf()
        ax_p = plt.gca(projection=ccrs.Robinson(), aspect='auto')
        d = da.loc[da.date.dt.month == month_number]
        d.mean(dim='date').plot.imshow(ax=ax_p, transform=ccrs.PlateCarree())
        plt.title('{}'.format(months[month_number]))

To generate the twelve month-based plots, let’s use matplotlib’s ion and figure functions to turn on matplotlib’s interactive mode and to create an initial figure. Next, let’s establish a for loop to iterate through the integers 1 to 12. As we loop through the integers, we’ll pass each one into our plot_monthly function so it creates a plot based on the data for that month. Since we’re using interactive mode, we need to use matplotlib’s pause function, which pauses the figure for the specified number of seconds, to facilitate the transition behavior. Similarly, we need to use the draw function to update the figure after each transition:

plt.ion()
plt.figure(figsize=(10, 6))
for month in range(1,13):
        plot_monthly(month)
        plt.pause(0.1)
        plt.draw()

12-monthly-averages

Conclusion

This post demonstrated how to acquire, analyze, and visualize sixteen years’ worth of global land surface temperature data with Python. Along the way, it illustrated ways you can use xarray, matplotlib, and cartopy to select, group, aggregate, and plot multi-dimensional data.

The data set we used in this post required a considerable portion of my laptop’s memory, but it still fit in memory. When the data set you want to use doesn’t fit in your computer’s memory, you may want to consider the Python package, Dask, “a flexible parallel computing library for analytic computing”. Dask extends numpy and pandas functionality to larger-than-memory data collections, such as arrays and data frames, so you can analyze your larger-than-memory data with familiar commands.

Finally, while we focused on land surface temperature data in this post, you can use the analysis and visualization techniques we covered here on other data sets. In fact, you don’t even have to leave the website we relied on for this post, NASA’s NEO website. It offers dozens of other environmental data sets, categorized under atmosphere, energy, land, life, and ocean. This post only scratched the surface of what is possible with xarray and NASA’s data. I look forward to hearing about the cool, or hot, ways you use these resources to study our planet : )

Advertisements

Scraping, Geocoding, and Mapping Points with Scrapy, Geopy, and Leaflet

Displaying points of interest on maps is fun and can be an informative first step in geospatial analysis.  The task is relatively straightforward when the data already contain the points’ latitudes and longitudes.  Sometimes, however, the data don’t contain this information, e.g. when you simply have a list of addresses.  In this case, you can geocode the addresses to determine their latitudes and longitudes in order to display the points on a map.

Let’s tackle this situation to demonstrate how to geocode addresses and display the points on a map.  Since I live in California, I searched online for interesting points of interest in the western region of the United States and found this page of attractions for RV travelers, organized by state.  If you prefer to skip this web scraping section, you can find the resulting data, along with all of the files associated with this tutorial, in this Github repository.

Screen Shot 2017-12-12 at 12.17.18 PM

Each listing includes a name/title and the address, phone number, cost, website, and latitude and longitude for the attraction.  Let’s scrape this information from the page so we can geocode the addresses and also display the additional details about the attractions.  While the listings include latitude and longitude information, let’s ignore it for now and pretend we only have the addresses so we can demonstrate geocoding.

Project Environment

To begin, let’s create a project folder and a Python virtual environment for this tutorial:

mkdir points_of_interest
cd points_of_interest

conda create -n poi python=2.7 scrapy pandas geopy

source activate poi

pip install --upgrade lxml
pip install --upgrade cryptography
pip install --upgrade parsel
pip install --upgrade w3lib
pip install --upgrade twisted
pip install --upgrade pyOpenSSL
conda install -c conda-forge scrapy
pip install --upgrade geopy

The conda create statement creates a Python 2.7 virtual environment named poi and installs scrapy, pandas, and geopy in the environment.  The source activate statement activates the environment.  The additional pip install –upgrade statements ensure the main underlying packages are up-to-date (some of them were not for me and I needed to run these commands before scrapy and geopy would work correctly).

scrapy

scrapy is a great Python package for web scraping. Let’s use it to scrape the data from the page of western attractions. To create a new scrapy project named western_attractions, run the following command:

scrapy startproject western_attractions

To scrape the page, we need to create a spider, so run the following command to create a spider named attractions_spider.py:

touch western_attractions/spiders/attractions_spider.py

Before attempting to scrape the data from the page, let’s inspect the page’s element’s using Chrome’s inspect console. From here, we can see that the attraction titles are in h4 elements and the data are inside p elements inside blockquote elements.

Screen Shot 2017-12-12 at 12.14.26 PM

Now that we know which elements contain the data we want to extract, we can use scrapy shell to test selector commands, methods of extracting specific pieces of data from a page, before incorporating them into the spider. To use scrapy shell on this page, run the following command:

scrapy shell 'http://www.jurnii.com/rv/rv_guide/us_western_regions_attractions.php'

Now let’s see if we can extract the attraction titles by selecting the text inside h4 elements. To do so, run the following command:

response.css("h4::text").extract()

Similarly, we can select all of the blockquotes and website links associated with the attractions with the following commands:

response.css("blockquote").extract()

response.css('blockquote p a::attr(href)').extract()

Now that we have an idea of the commands we’ll need to use to extract the data we’re interested in, let’s start incorporating them into a spider. Open western_attractions/spiders/attractions_spider.py in an editor and add the following code:

import re
import scrapy

class WesternAttractionsSpider(scrapy.Spider):
    name = 'western_attractions'
    attractions_url = 'http://www.jurnii.com/rv/rv_guide/us_western_regions_attractions.php'
    start_urls = [attractions_url]

    def parse(self, response):
        # Extract names/titles of attractions
        h4s = response.css("h4::text").extract()
        h4s = [val.encode('ascii','ignore') for val in h4s]
        h4s = [re.sub("\s+", " ", val).strip() for val in h4s]
        h4s = [val for val in h4s if "Address" not in val and "Phone" not in val]
        h4s = filter(None, h4s)

        # Extract website URLs for attractions
        links = response.css('blockquote p a::attr(href)').extract()

        # Extract details associated with each attraction
        for idx, bq in enumerate(response.css("blockquote")):
            data = bq.css("p.smaller_par::text").extract()
            data = [val.encode('ascii','ignore') for val in data]
            data = [re.sub("\s+", " ", val).strip() for val in data]
            data = filter(None, data)
            data.append(links[idx])
            if len(data) == 5:
                address = data[0]
                phone = data[1].replace("Phone: ", "")
                cost = data[2]
                lat_lon = data[3]
                website = data[4]
                yield { "title": h4s[idx], "address": address, "phone": phone,
                    "cost": cost, "lat_lon": lat_lon, "website": website }

To explain this code, let’s start at the top and work our way down. First, we import the re and scrapy packages so we can use them in the script. We need the re package to perform some pattern-based substitutions to clean the raw data. We need the scrapy package create the spider.

We create a spider by creating a class named WesternAttractionsSpider, which is a subclass of scrapy.Spider. We give our spider a name, western_attractions, and provide the url of the page we want to scrape.

The rest of the code, in the parse method, specifies how to parse the page content and extract the data we’re interested in. The first code block extracts the names/titles of the attractions contained in the h4 elements. You’ll notice the first line of code is the same line we used in scrapy shell. The remaining four lines of code clean the data — the first one removes non-ascii characters, the second one removes extra spacing in the strings, the third one removes strings that don’t contain the names/titles of attractions, and the fourth one removes blank elements in each list.

The middle line of code extracts all of the website urls for the attractions. While this list of urls is currently separate from the rest of the attractions data, we’ll associate the urls with the attractions in the next code block.

The final code block extracts the all of the details associated with each attraction, which are contained in blockquote elements. We iterate over the blockquote elements to extract the details associated with each attraction. By inspecting the elements in Chrome’s inspect console, we know the details are in p elements that all have the same class, smaller_par. Therefore, the CSS selector, bq.css(“p.smaller_par::text”).extract(), generates a list of all of the details inside the blockquote element.

Similar to the code for the h4 elements, the next three lines clean the values in the list of details. The next line uses the blockquote index to identify the website url associated with the attraction and appends the url to the list of details associated with the attraction. Given the details we want to extract from the page, i.e. address, phone, cost, website, and latitude/longitude, the list associated with each attraction should contain five elements. We test the length of the list to ensure we only extract records that contain five data elements.

Finally, to extract the data we need to yield a dictionary, so we separate the data in the list into five variables, clean up a few phone number entries, and yield a dictionary with the attraction’s title and five details.

Screen Shot 2017-12-12 at 12.25.37 PM

To scrape the page and save the data in a JSON file, run the following command:

scrapy crawl western_attractions -o western_attractions.json

geopy

Now that we have addresses in a JSON file we can focus on geocoding the addresses. We’ll use geopy, and Google Maps V3 API, to geocode the addresses. To use Google Maps V3 API, you need to acquire an API key, which you can do here.

Now that you have a Google Maps V3 API key, create a new script named geocode_points_of_interest.py and add the lines of code shown below.

First, we import GoogleV3 and GeocoderTimedOut from geopy to perform geocoding and catch timeout errors. Next, we import pandas to manage the data, including reading the input JSON file, merging the input data with the newly geocoded data, and writing the data to output files.

#!/usr/bin/env python
from geopy.geocoders import GoogleV3
from geopy.exc import GeocoderTimedOut

import pandas as pd

The next line initializes a Google locator / geocoder we’ll use to identify the latitudes and longitudes associated with our addresses. Be sure to replace YOUR_API_KEY with the API key you generated in the previous step.

google_locator = GoogleV3(api_key="YOUR_API_KEY")

Next, let’s create a function we can use to geocode the addresses. The geocoder may not be able to locate and geocode an address. It might time out as well. We’ll use try except blocks to handle these cases so the script doesn’t fail for one of these reasons.

If the geocoder returns a location, then we’ll separate the address, latitude, and longitude into separate variables and return them.

def geocode_address(address, geolocator):
    """Google Maps v3 API: https://developers.google.com/maps/documentation/geocoding/"""
    # https://stackoverflow.com/questions/27914648/geopy-catch-timeout-error
    try:
        location = geolocator.geocode(address, exactly_one=True, timeout=5)
    except GeocoderTimedOut as e:
        print("GeocoderTimedOut: geocode failed on input %s with message %s" % (address, e.msg))
    except AttributeError as e:
        print("AttributeError: geocode failed on input %s with message %s" % (address, e.msg))
    if location:
        address_geo = location.address
        latitude = location.latitude
        longitude = location.longitude
        return address_geo, latitude, longitude
    else:
        print("Geocoder couldn't geocode the following address: %s" % address)

When we map the data, it will be fun and helpful to be able to color it by the state in which the attraction in located. We can extract the two-letter state abbreviation from the address with basic string parsing, but through trial and error I found some of the addresses didn’t contain the two-letter state abbreviation. Let’s create the following helper function to convert all of the state locations into their two-letter abbreviations.

def convert_state_to_two_letter(state_abbreviation):
    if state_abbreviation == 'California':
        state_abbreviation = 'CA'
    if state_abbreviation == 'Idaho':
        state_abbreviation = 'ID'
    if state_abbreviation == 'Boulder' or state_abbreviation == 'Tahoe,':
        state_abbreviation = 'NV'
    else:
        state_abbreviation = state_abbreviation
    return state_abbreviation

Screen Shot 2017-12-12 at 1.20.33 PM

To begin, let’s read the JSON data we generated in the web scraping section into a pandas DataFrame and then use our convert_state_to_two_letter function to create a new column that contains the two-letter state abbreviations.

df = pd.read_json('western_attractions/western_attractions.json', orient='records')
df['state'] = df['address'].apply(lambda address: convert_state_to_two_letter(address.split()[-2]))

Now it’s time to use our geocode_address function to identify the latitudes and longitudes of our addresses. We use a for loop to iterate over the DataFrame rows, each of which represents an attraction, and use the geocode_address function to geocode the address. We collect the geocoding results into a dictionary and, if the function returns a geocoded address, we append the results into a list so the final result will be a list of dictionaries we can convert into a new pandas DataFrame.

geo_results = []
for index, row in df.iterrows():
    try:
        result = geocode_address(row.loc['address'], google_locator)
        d = {'index': index, 'address_geo': result[0], 'latitude': result[1],
            'longitude': result[2]}
        if d['address_geo'] is not None:
            geo_results.append(d)
            print(d)
    except:
        print(row)
        continue

We want to merge the latitude and longitude data with the existing data about each attraction, so we convert the geocoded data into a DataFrame and then inner join the two DataFrames together. We’re using an inner join in this tutorial so we can proceed with attractions that were successfully geocoded. If you need to keep all of your original addresses, even if they can’t be geocoded, then you can use a left join.

geo = pd.DataFrame(geo_results)
geo.set_index('index', inplace=True)
df_geo = df.merge(geo, how='inner', left_index=True, right_index=True)

Now that the resulting DataFrame contains latitude and longitude data for each attraction, in addition to the original details, we can write the data to files. The CSV file is simply a convenient format for tabular data and spreadsheet programs. In this case, since our intention is to map the data, the more important file to write is the JSON file.

df_geo.to_csv('western_attractions_geocoded.csv', index=False)
df_geo.to_json('western_attractions_geocoded.json', orient='records')

Screen Shot 2017-12-12 at 1.21.49 PM

The input and output filenames are hardcoded in the script (feel free make the script more flexible with sys.argv[]). To geocode the addresses, run the following command:

./geocode_points_of_interest.py

The script will read in western_attractions/western_attractions.json and then write out western_attractions_geocoded.csv and western_attractions_geocoded.json.

Convert JSON to GeoJSON

A JSON file isn’t quite what we need to start mapping the data. To map the data, we need to convert the JSON data into GeoJSON. You can choose from several tools to convert JSON to GeoJSON, including:

OGR
GeoJSON
Javascript geojson
Python Script

Since this probably won’t be the last time we work on a mapping project and need to convert JSON to GeoJSON, let’s copy the Python code from the last link listed above into a script that we’ll be able to reuse. Here’s the script we’ll use to create the GeoJSON:

#!/usr/bin/env python
from sys import argv
from os.path import exists
import simplejson as json

script, in_file, out_file = argv

data = json.load(open(in_file))

geojson = {
    "type": "FeatureCollection",
    "features": [
    {
        "type": "Feature",
        "geometry" : {
            "type": "Point",
            "coordinates": [d["longitude"], d["latitude"]],
            },
        "properties" : d,
    } for d in data]
}

output = open(out_file, 'w')
json.dump(geojson, output)

print geojson

Screen Shot 2017-12-12 at 1.22.54 PM

To convert the JSON into GeoJSON, run the following command:

./json2geojson.py western_attractions_geocoded.json western_attractions_geocoded.geojson

The script prints the resulting GeoJSON to the screen, in addition to writing it to the output file, so you’ll know when the script’s finished.

Make a Map

Now that we have a GeoJSON file that contains details about the western attractions, including their latitudes and longitudes, we can work on displaying the data on a map. Here again we have lots of options:

GeoPandas
Mapbox
Mapzen
D3
Leaflet

All of these tools are great options. In this tutorial, we’ll use Leaflet, along with Mapbox, to display our attractions on a map. Leaflet is convenient because it has easy-to-learn syntax and helpful tutorials, but it’s an arbitrary choice, so feel free to use a different mapping tool.

To begin, let’s create an HTML file named western-attractions.html and add the code in the screen shots. Most of the code is HTML boilerplate. Within the head section, we need to add Leaflet’s JS and CSS files. We’ll also add D3’s JS file so we can use it to read the GeoJSON data file. Inside the body section we add a div element with an id=”westernAttractionsMap” to contain the map we’re going to create.

Let’s add a small amount of styling in the head section to specify the document margins, the dimensions of the map, and the size and weight of the text in the popups.

Screen Shot 2017-12-12 at 1.24.32 PM

Finally, let’s add the Javascript-Leaflet code we need to generate the map of the western attractions. First, we create a variable reference to our map, westernAttractionsMap, and specify the initial latitude, longitude, and zoom level.

Next, we use the Mapbox API to add a tile layer to the map. If you don’t already have a Mapbox API access token, you need to go here to create a free account and generate a free access token. Once you have an API token, be sure to replace YOUR_MAPBOX_ACCESS_TOKEN in the Mapbox API URL in the L.tileLayer() call with your actual Mapbox API token. We set the minimum and maximum zoom levels, use the id to specify the tile layer style, and then add the layer to the map.

Finally, we use d3.json to read the GeoJSON file and extract the data we want to display on the map. The onEachFeature function generates a popup for each attraction containing the attraction’s title, cost, address, and website. We use a capitalizeWords function, defined below, on the title variable to capitalize each word in the title. In addition, we use the state attribute and a color_state function, defined below, inside a style element to color the text inside the popup based on the state in which the attraction is located. We use ternary operators for each detail to display the data if it’s available or an empty string if it isn’t available.

Leaflet’s L.geoJSON function adds the attractions, i.e. GeoJSON objects, to a layer, applies the onEachFeature function to each object to associate a popup with each attraction, and then adds the layer to the map.

Screen Shot 2017-12-12 at 1.26.30 PM

The capitalizeWords and color_state functions are simple helper functions to format the attraction titles and to color the popup text. The capitalizeWords function ensures the titles are displayed consistently by capitalizing each word in the title. The color_state function makes it easier to differentiate between states by using different colors for the text in the popups for attractions in different states.

Screen Shot 2017-12-12 at 1.27.44 PM

There are different ways to view your map. For example, you can use a Python-based server with one of the following commands, depending on your version of Python:

python -m http.server 3031 (Python 3.x) or
python -m SimpleHTTPServer 3031 (Python 2.x)

Alternatively, you can install http-server and then run the server with the following command:

http-server -p 3031

Screen Shot 2017-12-12 at 1.12.58 PM

Once the server is running, click on western-attractions.html to open your map in a browser. Click on a few of the pins to view the details associated with the attractions, and click on pins in different states to see the text color change. We’ve also made the website’s active URLs so you can click on them to go to the attraction’s official web page.

Screen Shot 2017-12-12 at 1.14.19 PM

Screen Shot 2017-12-12 at 1.15.48 PM

Conclusion

This tutorial covered scraping data from a web page, geocoding addresses, and displaying points on a map.

In some cases, your project may only require readily-available geographic data, in which case you can skip to the final section of this tutorial and focus on displaying the data on a map. In other cases, you may only have addresses or no geographic data at all, and in these cases the first two sections on scraping web data and geocoding it will be more valuable.

I hope that by following along with this tutorial and experimenting with the techniques you now feel more comfortable scraping, geocoding, and mapping data.

Data Wrangling in Julia based on dplyr Flights Tutorials

A couple of my favorite tutorials for wrangling data in R with dplyr are Hadley Wickham’s dplyr package vignette and Kevin Markham’s dplyr tutorial. I enjoy the tutorials because they concisely illustrate how to use a small set of verb-based functions to carry out common data wrangling tasks.

I tend to use Python to wrangle data, but I’m exploring the Julia programming language so I thought creating a similar dplyr-based tutorial in Julia would be a fun way to examine Julia’s capabilities and syntax. Julia has several packages that make it easier to deal with tabular data, including DataFrames and DataFramesMeta.

The DataFrames package provides functions for reading and writing, split-apply-combining, reshaping, joining, sorting, querying, and grouping tabular data. The DataFramesMeta package provides a set of macros that are similar to dplyr’s verb-based functions in that they offer a more convenient, readable syntax for munging data and chaining together multiple operations.

Data

For this tutorial, let’s following along with Kevin’s tutorial and use the hflights dataset. You can obtain the dataset from R with the following commands or simply download it here: hflights.csv

install.packages("hflights")
library(hflights)
write.csv(hflights, "hflights.csv")

Load packages and example dataset

To begin, let’s start the Julia REPL, load the DataFrames and DataFramesMeta packages, and load and inspect the hflights dataset:

using DataFrames
using DataFramesMeta

hflights = readtable("/Users/clinton/Documents/Julia/hflights.csv");
size(hflights)
names(hflights)
head(hflights)
describe(hflights)

hflights1b

The semicolon on the end of the readtable command prevents it from printing the dataset to the screen. The size command returns the number of rows and columns in the dataset. You can specify you only want the number of rows with size(hflights, 1) or columns with size(hflights, 2). This dataset contains 227,496 rows and 21 columns. The names command lists the column headings. By default, the head command prints the header row and six data rows. You can specify the number of data rows to display by adding a second argument, e.g. head(hflights, 10). The describe command prints summary statistics for each column.

@where: Keep rows matching criteria

AND: All of the conditions must be true for the returned rows

# Julia DataFrames approach to view all flights on January 1
hflights[.&(hflights[:Month] .== 1, hflights[:DayofMonth] .== 1), :]

# DataFramesMeta approach
@where(hflights, :Month .== 1, :DayofMonth .== 1)

hflights2

Julia’s DataFrames’ row filtering syntax is similar to R’s syntax. To specify multiple AND conditions, use “.&()” and place the filtering conditions, separated by commas, between the parentheses. Like dplyr’s filter function, DataFramesMeta’s @where macro simplifies the syntax and makes the command easier to read.

OR: One of the conditions must be true for the returned rows

# Julia DataFrames approach to view all flights where either AA or UA is the carrier
hflights[.|(hflights[:UniqueCarrier] .== "AA", hflights[:UniqueCarrier] .== "UA"), :]

# DataFramesMeta approach
@where(hflights, .|(:UniqueCarrier .== "AA", :UniqueCarrier .== "UA"))

hflights3

To specify multiple OR conditions, use “.|()” and place the filtering conditions between the parentheses. Again, the DataFramesMeta approach is more concise.

SET: The values in a column are in a set of interest

# Julia DataFrames approach to view all flights where the carrier is in Set(["AA", "UA"])
carriers_set = Set(["AA", "UA"])
hflights[findin(hflights[:UniqueCarrier], carriers_set), :]

# DataFramesMeta approach
@where(hflights, findin(:UniqueCarrier, carriers_set))

hflights4

To filter for rows where the values in a particular column are in a specific set of interest, create a Set with the values you’re interested in and then specify the column and your set of interest in the findin function.

PATTERN / REGULAR EXPRESSION: The values in a column match a pattern

# Julia DataFrames approach to view all flights where the carrier matches the regular expression r"AA|UA"
carriers_pattern = r"AA|UA"
hflights[[ismatch(carriers_pattern, String(carrier)) for carrier in hflights[:UniqueCarrier]], :]

# DataFramesMeta approach
@where(hflights, [ismatch(carriers_pattern, String(carrier)) for carrier in :UniqueCarrier])

hflights5

To filter for rows where the values in a particular column match a pattern, create a regular expression and then use it in the ismatch function in an array comprehension.

@select: Pick columns by name

# Julia DataFrames approach to selecting columns
hflights[:, [:DepTime, :ArrTime, :FlightNum]]

# DataFramesMeta approach
@select(hflights, :DepTime, :ArrTime, :FlightNum)

Julia’s DataFrames’ syntax for selecting columns is similar to R’s syntax. Like dplyr’s select function, DataFramesMeta’s @select macro simplifies the syntax and makes the command easier to read.

# Julia DataFrames approach to selecting columns
# first three columns
hflights[:, 1:3]
# pattern / regular expression
heading_pattern = r"Taxi|Delay"
hflights[:, [ismatch(heading_pattern, String(name)) for name in names(hflights)]]
# startswith
hflights[:, filter(name -> startswith(String(name), "Arr"), names(hflights))]
# endswith
hflights[:, filter(name -> endswith(String(name), "Delay"), names(hflights))]
# contains
hflights[:, filter(name -> contains(String(name), "Month"), names(hflights))]

# AND conditions
hflights[:, filter(name -> startswith(String(name), "Arr") && endswith(String(name), "Delay"), names(hflights))]
# OR conditions
hflights[:, filter(name -> startswith(String(name), "Arr") || contains(String(name), "Cancel"), names(hflights))]

hflights6

# DataFramesMeta approach
# first three columns
@select(hflights, 1:3)
# pattern / regular expression
heading_pattern = r"Taxi|Delay"
@select(hflights, [ismatch(heading_pattern, String(name)) for name in names(hflights)])
# startswith
@select(hflights, filter(name -> startswith(String(name), "Arr"), names(hflights)))
# endswith
@select(hflights, filter(name -> endswith(String(name), "Delay"), names(hflights)))
# contains
@select(hflights, filter(name -> contains(String(name), "Month"), names(hflights)))

# AND conditions
@select(hflights, filter(name -> startswith(String(name), "Arr") && endswith(String(name), "Delay"), names(hflights)))
# OR conditions
@select(hflights, filter(name -> startswith(String(name), "Arr") || contains(String(name), "Cancel"), names(hflights)))

hflights7

# Kevin Markham's multiple select conditions example
# select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))
# Julia Version of Kevin's Example
# Taxi or Delay in column heading
mask = [ismatch(r"Taxi|Delay", String(name)) for name in names(hflights)]
# Also include first three columns, i.e. Year, Month, DayofMonth
mask[1:3] = true
@select(hflights, mask)

These examples show you can select columns by position and name, and you can combine multiple conditions with AND, “&&”, or OR, “||”. Similar to filtering rows, you can select specific columns based on a pattern by using the ismatch function in an array comprehension. You can also use contains, startswith, and endswith in the filter function to select columns that contain, start with, or end with a specific text pattern.

“Chaining” or “Pipelining”

In R, dplyr provides, via the magrittr package, the %>% operator, which enables you to chain together multiple commands into a single data transformation pipeline in a very readable fashion. In Julia, the DataFramesMeta package provides the @linq macro and |> symbol to enable similar functionality. Alternatively, you can load the Lazy package and use an @> begin end block to chain together multiple commands.

# Chaining commands with DataFrameMeta’s @linq macro
@linq hflights[find(.!isna.(hflights[:,:DepDelay])), :] |>
@where(:DepDelay .> 60) |>
@select(:UniqueCarrier, :DepDelay)

# Chaining commands with Lazy’s @> begin end block
using Lazy
@> begin
hflights[find(.!isna.(hflights[:,:DepDelay])), :]
@where(:DepDelay .> 60)
@select(:UniqueCarrier, :DepDelay)
end

hflights8

These two blocks of code produce the same result, a DataFrame containing carrier names and departure delays for which the departure delay is greater than 60. In each chain, the first expression is the input DataFrame, e.g. hflights. In these examples, I use the find and !isna. functions to start with a DataFrame that doesn’t contain NA values in the DepDelay column because the commands fail when NAs are present. I prefer the @linq macro version over the @> begin end version because it’s so similar to the dplyr-magrittr syntax, but both versions are more succinct and readable than their non-chained versions. The screen shot shows how to assign the pipeline results to variables.

@orderby: Reorder rows

Both DataFrames and DataFramesMeta provide functions for sorting rows in a DataFrame by values in one or more columns. In the first pair of examples, we want to select the UniqueCarrier and DepDelay columns and then sort the results by the values in the DepDelay column in descending order. The last example shows how to sort by multiple columns with the @orderby macro.

# Julia DataFrames approach to sorting
sort(hflights[find(.!isna.(hflights[:,:DepDelay])), [:UniqueCarrier, :DepDelay]], cols=[order(:DepDelay, rev=true)])

# DataFramesMeta approach (add a minus sign before the column symbol for descending)
@linq hflights[find(.!isna.(hflights[:,:DepDelay])), :] |>
@select(:UniqueCarrier, :DepDelay) |>
@orderby(-:DepDelay)

# Sort hflights dataset by Month, descending, and then by DepDelay, ascending
@linq hflights |>
@orderby(-:Month, :DepDelay)

hflights9

DataFrames provides the sort and sort! functions for ordering rows in a DataFrame. sort! orders the rows, inplace. The DataFrames user guide provides additional examples of ordering rows, in ascending and descending order, based on multiple columns, as well as applying functions to columns, e.g. uppercase, before using the column for sorting.

DataFramesMeta provides the @orderby macro for ordering rows in a DataFrame. Specify multiple column names in the @orderby macro to sort the rows by multiple columns. Use a minus sign before a column name to sort in descending order.

@transform: Add new variables

Creating new variables in Julia DataFrames is similar to creating new variables in Python and R. You specify a new column name in square brackets after the name of the DataFrame and assign it a collection of values, sometimes based on values in other columns. DataFramesMeta’s @transform macro simplifies the syntax and makes the transformation more readable.

# Julia DataFrames approach to creating new variable
hflights[:Speed] = hflights[:Distance] ./ hflights[:AirTime] .* 60
hflights[:, [:Distance, :AirTime, :Speed]]

# Delete the variable so we can recreate it with DataFramesMeta approach
delete!(hflights, :Speed)

# DataFramesMeta approach
@linq hflights |>
@select(:Distance, :AirTime) |>
@transform(Speed = :Distance ./ :AirTime .* 60) |>
@select(:Distance, :AirTime, :Speed)

# Save the new column in the original DataFrame
hflights = @linq hflights |>
@transform(Speed = :Distance ./ :AirTime .* 60)

hflights10

The first code block illustrates how to create a new column in a DataFrame and assign it values based on values in other columns. The second code block shows you can use delete! to delete a column. The third example demonstrates the DataFramesMeta approach to creating a new column using the @transform macro. The last example shows how to save a new column in an existing DataFrame using the @transform macro by assigning the result of the transformation to the existing DataFrame.

@by: Reduce variables to values (Grouping and Summarizing)

dplyr provides group_by and summarise functions for grouping and summarising data. DataFrames and DataFramesMeta also support the split-apply-combine strategy with the by function and the @by macro, respectively. Here Julia versions of Kevin’s summarise examples.

# Julia DataFrames approach to grouping and summarizing
by(hflights[complete_cases(hflights[[Symbol(name) for name in names(hflights)]]), :],
:Dest,
df -> DataFrame(meanArrDelay = mean(df[:ArrDelay])))

# DataFramesMeta approach
@linq hflights[complete_cases(hflights[[Symbol(name) for name in names(hflights)]]), :] |>
@by(:Dest, meanArrDelay = mean(:ArrDelay))

hflights11

DataFrames and DataFramesMeta don’t have dplyr’s summarise_each function, but it’s easy to apply different functions to multiple columns inside the @by macro.

@linq hflights |>
@by(:UniqueCarrier,
meanCancelled = mean(:Cancelled), meanDiverted = mean(:Diverted))

@linq hflights[complete_cases(hflights[[Symbol(name) for name in names(hflights)]]), :] |>
@by(:UniqueCarrier,
minArrDelay = minimum(:ArrDelay), maxArrDelay = maximum(:ArrDelay),
minDepDelay = minimum(:DepDelay), maxDepDelay = maximum(:DepDelay))

hflights12

DataFrames and DataFramesMeta also don’t have dplyr’s n and n_distinct functions, but you can count the number of rows in a group with size(df, 1) or nrow(df), and you can count the number of distinct values in a group with countmap.

# Group by Month and DayofMonth, count the number of flights, and sort descending
# Count the number of rows with size(df, 1)
sort(by(hflights, [:Month,:DayofMonth], df -> DataFrame(flight_count = size(df, 1))), cols=[order(:flight_count, rev=true)])

# Group by Month and DayofMonth, count the number of flights, and sort descending
# Count the number of rows with nrow(df)
sort(by(hflights, [:Month,:DayofMonth], df -> DataFrame(flight_count = nrow(df))), cols=[order(:flight_count, rev=true)])

# Split grouping and sorting into two separate operations
g = by(hflights, [:Month,:DayofMonth], df -> DataFrame(flight_count = nrow(df)))
sort(g, cols=[order(:flight_count, rev=true)])

# For each destination, count the total number of flights and the number of distinct planes
by(hflights[find(.!isna.(hflights[:,:TailNum])),:], :Dest) do df
DataFrame(flight_count = size(df,1), plane_count = length(keys(countmap(df[:,:TailNum]))))
end

hflights13

While these examples reproduce the results in Kevin’s dplyr tutorial, they’re definitely not as succinct and readable as the dplyr versions. Grouping by multiple columns, summarizing with counts and distinct counts, and gracefully chaining these operations are areas where DataFrames and DataFramesMeta can improve.

Other useful convenience functions

Randomly sampling a fixed number or fraction of rows from a DataFrame can be a helpful operation. dplyr offers the sample_n and sample_frac functions to perform these operations. In Julia, StatsBase provides the sample function, which you can repurpose to achieve similar results.


using StatsBase
# randomly sample a fixed number of rows
hflights[sample(1:nrow(hflights), 5), :]
hflights[sample(1:size(hflights,1), 5), :]

# randomly sample a fraction of rows
hflights[sample(1:nrow(hflights), ceil(Int,0.0001*nrow(hflights))), :]
hflights[sample(1:size(hflights,1), ceil(Int,0.0001*size(hflights,1))), :]

hflights14

Randomly sampling a fixed number of rows is fairly straightforward. You use the sample function to randomly select a fixed number of rows, in this case five, from the DataFrame. Randomly sampling a fraction of rows is slightly more complicated because, since the sample function takes an integer for the number of rows to return, you need to use the ceil function to convert the fraction of rows, in this case 0.0001*nrow(hflights), into an integer.

Conclusion

In R, dplyr sets a high bar for wrangling data well with succinct, readable code. In Julia, DataFrames and DataFramesMeta provide many useful functions and macros that produce similar results; however, some of the syntax isn’t as concise and clear as it is with dplyr, e.g. selecting columns in different ways and chaining together grouping and summarizing operations. These are areas where Julia’s packages can improve.

I enjoyed becoming more familiar with Julia by reproducing much of Kevin’s dplyr tutorial. It was also informative to see differences in functionality and readability between dplyr and Julia’s packages. I hope you enjoyed this tutorial and find it to be a useful reference for wrangling data in Julia.

Foundations for Analytics with Python: From Non-programmer to Hacker

I’m excited to share that O’Reilly Media is about to publish my new book, Foundations for Analytics with Python: From Non-programmer to Hacker. The book is geared toward people who have no prior programming experience but deal with data every day and are interested in learning how to scale and automate their work.

Foundations for Analytics with Python by Clinton Brownley, PhD

I did not have a background in programming. I learned it on the job because I recognized it would enable me to automate repetitive actions and accomplish tasks that would be time-consuming or impossible with my current skill set. I read countless books, online tutorials, and blog posts in those first few weeks and months as I attempted to get my first program for work to do something useful for me. It’s difficult to fully describe how exhilarating and empowering it was when I finally got the program to work correctly. Needless to say, I was hooked, and I haven’t looked back.

I wrote the book with a few objectives in mind:

  • Be accessible to ambitious non-programmers
  • Be practical, so you can immediately see how you can use the code at work
  • Teach fundamental programming concepts and techniques and also provide alternative, shorter code that performs the same actions
  • Make the learning curve as short and shallow as possible so you can enjoy the fruits of your labor as quickly as possible

The book’s features reflect these objectives:

  • Each section focuses on one specific task, so you can learn how to accomplish that task without distractions
  • Each section is a complete, self-contained program, so you don’t have to remember to combine a bunch of code snippets to make them work
  • In the CSV and Excel chapters, each section of code has two versions, a base Python version and a Pandas version. The base Python version teaches you fundamental concepts and techniques. The Pandas version shortens and simplifies the code you need to write to accomplish the task
  • Uses the Anaconda Python 3 distribution, which bundles the newest version of Python with some of the most popular add-in packages
  • Includes screen shots of the input files, Python code, command line, and output files
  • Common data formats, including plain text, CSV, and Excel files, and databases
  • Common data processing tasks, including filtering for specific rows, selecting specific columns, and calculating summary statistics
  • Chapters on data analysis, plotting and graphing, and automation
  • Three real-world applications that illustrate how you can combine and extend techniques from earlier chapters to accomplish important data processing tasks
  • Both Windows and Mac commands and screen shots

To give you a feel for the book, let me provide a few sections of code from the book and the table of contents. The first section of code comes from the CSV chapter, the second section of code from the Excel chapter, and the third section of code from the Database chapter.  The brief comments after each section of code are for this blog post, they are not in the book.  If you want to see what other topics are included in the book, please see the table of contents at the bottom of this post.

Example Section #1: CSV Files

Reading and Writing a CSV File

Version #1: Base Python

#!/usr/bin/env python3
import csv
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

with open(input_file, 'r', newline='') as csv_in_file:
    with open(output_file, 'w', newline='') as csv_out_file:
        filereader = csv.reader(csv_in_file, delimiter=',')
        filewriter = csv.writer(csv_out_file, delimiter=',')
        for row_list in filereader:
            filewriter.writerow(row_list)

Version #1 demonstrates how to read a CSV input file with base Python’s standard csv module and write the contents to a CSV output file. In the book, I explain every line of code. This first example gives you the ability to transfer all of your data to an output file. The subsequent examples in the chapter show you how to select specific data to write to the output file and how to process multiple CSV files.

Version #2: Pandas Add-in Module

#!/usr/bin/env python3
import sys
import pandas as pd

input_file = sys.argv[1]
output_file = sys.argv[2]

data_frame = pd.read_csv(input_file)
print(data_frame)
data_frame.to_csv(output_file, index=False)

Version #2 demonstrates how to accomplish the same task with Pandas. As you can see, you simply use read_csv to read the input file and to_csv to write to the output file.

Example Section #2: Excel Files

Reading and Writing an Excel Worksheet

Version #1: xlrd and xlwt Add-in Modules

#!/usr/bin/env python3
import sys
from xlrd import open_workbook
from xlwt import Workbook

input_file = sys.argv[1]
output_file = sys.argv[2]

output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('output_worksheet_name')

with open_workbook(input_file) as workbook:
    worksheet = workbook.sheet_by_name('input_worksheet_name')
    for row_index in range(worksheet.nrows):
        for column_index in range(worksheet.ncols):
            output_worksheet.write(row_index, column_index, \
                worksheet.cell_value(row_index, column_index))
output_workbook.save(output_file)

Version #1 demonstrates how to read and write an Excel worksheet with base Python and the xlrd and xlwt add-in modules. Again, this first example gives you the ability to transfer all of the data on one worksheet to an output file. The subsequent examples in the chapter show you how to select specific data to write to the output file, how to process multiple worksheets, and how to process multiple workbooks.

Version #2: Pandas Add-in Module

#!/usr/bin/env python3
import pandas as pd
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

data_frame = pd.read_excel(input_file, sheetname='input_worksheet_name')
writer = pd.ExcelWriter(output_file)
data_frame.to_excel(writer, sheet_name='output_worksheet_name', index=False)
writer.save()

Version #2 demonstrates how to accomplish the same task with Pandas. Again, you simply use read_excel to read the input worksheet and to_excel to write to the output worksheet.

Example Section #3: Databases

Query a table and write results to a file

#!/usr/bin/env python
import csv
import MySQLdb
import sys

output_file = sys.argv[1]

con = MySQLdb.connect(host='localhost', port=3306, db='my_suppliers', user='my_username', passwd='my_password')
c = con.cursor()

filewriter = csv.writer(open(output_file, 'wb'), delimiter=',')
header = ['Supplier Name','Invoice Number','Part Number','Cost','Purchase Date']
filewriter.writerow(header)

c.execute("""SELECT * FROM Suppliers WHERE Cost > 700.0;""")
rows = c.fetchall()
for row in rows:
    filewriter.writerow(row)

This example demonstrates how to connect to a database, query a table, and write the resulting data to a CSV output file. Other examples in the chapter explain how to load data into a database table from a file and update records in a table based on data in a file.

I hope these examples give you a feel for the book. If you want to see what other topics are included in the book, please see the table of contents shown below. Foundations for Analytics with Python is scheduled to be available in May 2016. Please keep an eye out for it, and if you know other people who may be interested please point them to this blog post and the Amazon link.  Thank you : )

 

TABLE OF CONTENTS

CHAPTER
Introduction
Why Read This Book/Why Learn These Skills
Who Is This Book For
Why Windows
Why Python
Install Anaconda Python
Text Editors
Download Book Materials
Base Python and Pandas
Overview of Chapters

CHAPTER
Python Basics
How To Create a Python Script
How To Run a Python Script
Numbers
Strings
Regular Expressions/Pattern Matching
Dates
Lists
Tuples
Dictionaries
Control Flow
Functions
Exceptions
Reading a Text File
Reading Multiple Text Files with Glob
Writing to a Text File
Writing to a Comma Separated Values “CSV” File
Print Statements

CHAPTER
Comma Separated Values “CSV” Text Files
Reading and Writing a CSV File (String Manipulation)
Reading and Writing a CSV File (Standard csv Module)
Filtering for Rows
    Value in Row Meets a Condition
    Value in Row is in a Set of Interest
    Value in Row Matches a Pattern (Regular Expression)
Selecting Columns
    Index Values
    Column Headings
Reading Multiple CSV Files
    Count Number of Files and Rows and Columns in Each File
    Concatenate Data From Multiple Files
    Sum and Average a Set of Values Per File
Selecting Contiguous Rows
Adding a Header Row

CHAPTER
Microsoft Excel Files
Introspecting an Excel Workbook
Reading a Single Worksheet
    Formatting Dates
    Filtering for Rows
        Value in Row Meets a Condition
        Value in Row is in a Set of Interest
        Value in Row Matches a Pattern (Regular Expression)
    Selecting Columns
        Index Values
        Column Headings
Reading All Worksheets
    Filtering for Rows from All Worksheets
    Selecting Columns from All Worksheets
Reading a Subset of Worksheets
    Filtering for Rows from Subset of Worksheets
    Selecting Columns from Subset of Worksheets
Reading Multiple Workbooks
    Count Number of Workbooks and Rows and Columns in Each Workbook
    Concatenate Data from Multiple Workbooks
    Sum and Average a Set of Values Per Worksheet Per Workbook

CHAPTER
Databases
Python’s Standard sqlite3 Module
    Create a Database
    Create a Database Table
    Insert Hand-written Data into a Database Table
    Query a Database Table
    Insert Data from a CSV File into a Database Table
    Update Records in a Database Table with Data from a CSV File
MySQL Database
    Create a Database
    Create a Database Table
    Insert Data from a CSV File into a Database Table
    Query a Database Table and Write Output to a CSV File
    Update Records in a Database Table with Data from a CSV File

CHAPTER
Applications
Find a Set of Items in a Large Collection of Excel and CSV Files
Parse a CSV File and Calculate a Statistic for Any Number of Categories
Parse a Text File and Calculate a Statistic for Any Number of Categories

CHAPTER
Graphing and Plotting
matplotlib
pandas
ggplot
seaborn

CHAPTER
Data Analysis
Descriptive statistics
Regression
Classification

CHAPTER
Automation
Windows: scheduled tasks
Mac: cron jobs

CHAPTER
Conclusion
Where To Go From Here
    Additional Built-Ins/Standard Modules
    Additional Add-In Modules
    Data Structures
How To Go From Here

APPENDIX
Downloads
Python
xlrd
mysqlclient/MySQL-python/MySQLdb
MySQL

Intro to Julia: Filtering Rows with R, Python, and Julia

In one of my earlier posts I introduced the Julia programming language by comparing how you can read and write CSV files in R, Python, and Julia. In this post I’d like to build on that comparison by describing how you can filter for specific rows in a data set in each language based on a filtering condition, set of interest, and pattern (i.e. regular expression). We’ll be using the same wine quality data set we used in the earlier post, which is available here: wine quality

We’ll use the same three row filters in all of the examples so it’s easy to confirm that the output is consistent across all three languages:

  1. The first row filter specifies a condition that restricts the output to rows where the value in the quality column is greater than 7.
  2. The second row filter specifies a set of interest that restricts the output to rows where the value in the quality column is either 7 or 8.
  3. The third row filter specifies a pattern that restricts the output to rows where the value in the quality column contains an 8.

R

To begin, let’s see how you can filter for specific rows in R. First we’ll cover how to filter for rows with base R, and then we’ll describe how to accomplish the same task with the data.table and dplyr packages, which are popular packages for managing data in R.

Base R

The following script illustrates how to read data into a data frame, filter for specific rows based on a filter condition, set of interest, and pattern, and write the output of interest to an output file. All three filtering sections show two different ways to filter the rows, first using row indexing and then using the subset function.

Copy and paste the following code into a text file and then save the file as filter_rows.R

#!/usr/bin/env Rscript
# For more information, visit: https://cbrownley.wordpress.com/

#Collect the command line arguments into a variable called args
args <- commandArgs(trailingOnly = TRUE)
# Assign the first command line argument to a variable called input_file
input_file <- args[1]
# Assign the second command line argument to a variable called output_file
output_file <- args[2]

# Use R’s read.csv function to read the data into a variable called wine
# read.csv expects a CSV file with a header row, so
# sep = ',' and header = TRUE are default values
# stringsAsFactors = FALSE means don’t convert character vectors into factors
wine <- read.csv(input_file, sep = ',', header = TRUE, stringsAsFactors = FALSE)

# Row Value Meets Specific Condition
wine_condition <- wine[which(wine$quality > 7), ]
# Using subset function
wine_condition <- subset(wine, quality > 7)
#print(wine_condition)

# Row Value In Set of Interest
set_of_interest <- c(7, 8)
wine_set <- wine[which(wine$quality %in% set_of_interest), ]
# Using subset function
wine_set <- subset(wine, quality %in% set_of_interest)
#print(wine_set)

# Row Value Matches Specific Pattern
pattern <- '^8$'
wine_pattern <- wine[grep(pattern, wine$quality, ignore.case=TRUE, perl=TRUE), ]
# Using subset function
#wine_pattern <- subset(wine, quality==pattern)
print(wine_pattern)

# Use R’s write.csv function to write the data in the variable wine to the output file
write.csv(wine_pattern, file = output_file, row.names = FALSE)

The section that filters for rows based on a condition only includes one condition, i.e. wine$quality > 7, but you can add more conditions with ANDs, &, and ORs, |. For example, to filter for rows where quality > 7 AND alcohol > 13.0 you can use:
wine_condition <- wine[which(wine$quality > 7 & wine$alcohol > 13.0), ]

Similarly, to filter for rows where quality < 4 OR alcohol > 13.0 you can use:
wine_condition <- wine[which(wine$quality < 4 | wine$alcohol > 13.0), ]

If instead you want to exclude a selection of rows you can negate the which function with a dash, -, like this:
wine_condition <- wine[-which(wine$quality > 7), ]

The section that filters for rows based on a set of interest uses the which function and the %in% binary operator to select rows where the value in the quality column is one of the two values in the set of interest. It’s convenient to assign the values of interest to a variable and then use the variable in the filtering condition so that if the values of interest change you only have to make one change where the values are assigned to the variable.

The section that filters for rows based on a pattern uses the Unix-inspired grep command to select rows where the pattern appears somewhere in the value in the quality column. The ^ metacharacter indicates that the 8 appears at the beginning of the value and the $ metacharacter indicates that the 8 appears at the end of the value, so enclosing the 8 between both ensures that grep looks for rows where 8 is the only value in the quality column. The ignore.case argument isn’t necessary in this case since we’re looking for number, but I included it to show you that it’s available and where to put it if you need it. Similarly, the ^ and $ metacharacters and perl argument aren’t necessary either since we’re searching for a simple number, but I included them to demonstrate how you can use a regular expression and the perl argument to search for a specific pattern.

Now run the following two commands in a Terminal window to make the script executable and to run the script:

chmod +x filter_rows.R
./filter_rows.R winequality-red.csv output/output_R.csv

When you run these commands you’ll see the following output printed to your Terminal screen. In addition, the set of rows matching the pattern in the final filtering section have been written to a CSV file in the output folder inside your current folder.

Base R

R package: data.table

Now that we know how to filter for specific rows in base R, let’s discuss how to filter for rows with the data.table package. The following script illustrates how to do so. Copy and paste the following code into a text file and then save the file as filter_rows_data_table.R

#!/usr/bin/env Rscript
require(data.table)

args <- commandArgs(trailingOnly = TRUE)
input_file <- args[1]
output_file <- args[2]

wine <- fread(input_file)

# Row Value Meets Specific Condition
wine_condition <- wine[quality > 7]

# Row Value In Set of Interest
set_of_interest <- c(7, 8)
wine_set <- wine[quality %in% set_of_interest]

# Row Value Matches Specific Pattern
pattern <- '8'
wine_pattern <- wine[quality==pattern]
print(wine_pattern)

write.csv(wine_pattern, file = output_file, row.names = FALSE)

The filtering syntax is very similar to the syntax we used in the base R versions, so you can look in the base R section above for some explanations about the syntax. Now make the script executable and then run the script:

chmod +x filter_rows_data_table.R
./filter_rows_data_table.R winequality-red.csv output/output_R_data_table.csv

When you run these commands you’ll see the same output as you saw with base R printed to your Terminal screen and you’ll have written another CSV file in the output folder.

R package: dplyr

Now let’s see how to filter for rows with the dplyr package. Copy and paste the following code into a text file and then save the file as filter_rows_dplyr.R

#!/usr/bin/env Rscript
require(data.table)
require(dplyr)

args <- commandArgs(trailingOnly = TRUE)
input_file <- args[1]
output_file <- args[2]

wine <- fread(input_file)
wine <- tbl_df(wine)

# Row Value Meets Specific Condition
wine_condition <- wine %>% filter(quality > 7)

# Row Value In Set of Interest
set_of_interest <- c(7, 8)
wine_set <- wine %>% filter(quality %in% set_of_interest)

# Row Value Matches Specific Pattern
pattern <- '8'
wine_pattern <- wine %>% filter(quality==pattern)
print(wine_pattern)

write.csv(wine_pattern, file = output_file, row.names = FALSE)

All three filtering sections use the %>% operator, pulled into dplyr from the magrittr package, and dplyr’s own filter function. The %>% operator is pronounced, “then”, as in “do this, then, do that”. It takes the result of the operation on the left-hand-side of the operator and passes it as the first argument to the operation on the right-hand-side of the operator.

In all three filtering sections we’re simply using it to pass the data set into the filter function. Chaining these two operations doesn’t gain us much – it’s simply to demonstrate how you can use the %>% operator to chain operations together to make your code easier to read and understand. As you’d guess, the filter function filters for rows in the data set with a value that meets the filtering criterion or criteria. Now make the script executable and then run the script:

chmod +x filter_rows_dplyr.R
./filter_rows_dplyr.R winequality-red.csv output/output_R_dplyr.csv

As before, when you run these commands you’ll see the same output as you saw with base R and the data.table package printed to your Terminal screen and you will have written another CSV file in the output folder.

PYTHON

Now that we know how to filter for specific rows in R, let’s discuss how to filter for rows in Python. First we’ll cover how to filter for rows with base Python, and then we’ll describe how to accomplish the same tasks with Pandas, which is a popular package for managing data in Python.

Base Python

The following script illustrates how to process a CSV file line by line, filter for specific rows based on a filter condition, set of interest, and pattern, and write the results to an output file.

Copy and paste the following code into a text file and then save the file as filter_rows.py

#!/usr/bin/env python
# For more information, visit: https://cbrownley.wordpress.com/
# Import Python's built-in csv and sys modules, which have functions
# for processing CSV files and command line arguments, respectively
import csv
import re
import sys

# Assign the first command line argument to a variable called input_file
input_file = sys.argv[1]
# Assign the second command line argument to a variable called output_file
output_file = sys.argv[2]

header_row = True

# Open the input file for reading and close automatically at end
with open(input_file, 'rU') as csv_in_file:
    # Open the output file for writing and close automatically at end
    with open(output_file, 'wb') as csv_out_file:
        # Create a file reader object for reading all of the input data
        filereader = csv.reader(csv_in_file)
        # Create a file writer object for writing to the output file
        filewriter = csv.writer(csv_out_file)
        # Use a for loop to process the rows in the input file one-by-one
        for row in filereader:
            # Process the header row separately from the data rows
            # Print it to the screen, write it to the output file, and then
            # indicate that you're finished with the header row
            if header_row == True:
                print row
                filewriter.writerow(row)
                header_row = False
            # Process the data rows according to three filtering conditions
            else:
                # Row Value Meets Specific Condition
                #if int(row[11]) > 7:
                    #print row
                    #filewriter.writerow(row)

                # Row Value In Set of Interest
                #set_of_interest = [7, 8]
                #if int(row[11]) in set_of_interest:
                    #print row
                    #filewriter.writerow(row)

                # Row Value Matches Specific Pattern
                pattern = re.compile(r'(?P<my_pattern>8)', re.I)
                result = pattern.search(row[11])
                if result == None:
                    pass
                else:
                    print row
                    filewriter.writerow(row)

We process the header row separately from the data rows because we don’t want to test the header row against the filtering conditions. The built-in csv module reads each row from the input file as a list, a.k.a. array, so we use list indexing, row[11], to access the values in the quality column, which is the twelfth column in the data set (in Python, the first array index is 0). The first two filtering sections are fairly straightforward, i.e. in the first section, the integer version of the value in the quality column is > 7 and in the second section it is one of the values in the set of interest. These two sections are currently commented out with # symbols, but you can uncomment the sections one-at-a-time to see how the output changes.

The third section uses the re module to create a regular expression, search for the pattern in the quality column, and print and write the row when the value in the quality column matches the pattern. The re.I argument makes the pattern case-insensitive. As we said in the R section, we don’t need the argument in this case but it’s helpful to know where to include it if you need it. Now make the script executable and then run the script:

chmod +x filter_rows.py
./filter_rows.py winequality-red.csv output/output_Python.csv

When you run these commands you’ll see the following output printed to your Terminal screen. In addition, the header row and the set of rows matching the pattern in the final filtering section have been written to a CSV file in the output folder.

Base Python

Python package: Pandas

Now let’s see how to filter for rows with Pandas. Copy and paste the following code into a text file and then save the file as filter_rows_pandas.py

#!/usr/bin/env python
import sys
import string
import pandas as pd

input_file = sys.argv[1]
output_file = sys.argv[2]

data_frame = pd.read_csv(input_file)

# Row Value Meets Specific Condition
data_frame_value_meets_condition = data_frame[data_frame['quality'].astype(int) > 7]

# Row Value In Set of Interest
set_of_interest = [7, 8]
data_frame_value_in_set = data_frame[data_frame['quality'].isin(set_of_interest)]

# Row Value Matches Specific Pattern
data_frame_value_matches_pattern = data_frame[data_frame['quality'].astype(str).str.contains("8")]
print(data_frame_value_matches_pattern)

data_frame_value_matches_pattern.to_csv(output_file, index=False)

In the first filtering section we select the quality column, convert the values into integers, and then test whether they’re greater than 7. In the second section we use the isin function to test whether the value in the quality column is one of the values in the set of interest. Finally, in the third section, we use the contains function to test whether the value in the quality column contains 8. There are also startswith and endswith functions in case you need to test whether the value starts with or ends with a specific pattern.

Pandas also has a convenient .ix function that you can use to filter for specific rows and columns at the same time. Here’s how you could modify the first filtering section to use the .ix function: data_frame_value_meets_condition = data_frame.ix[data_frame.quality.astype(int) > 7, :]

You can select the column by typing data_frame.column. Like R, you need to separate the rows and columns sections with a comma, and you use a colon to indicate that you want to select all of the rows or columns (In this case we want to select all of the columns). Now run the following two commands to make the script executable and to run the script:

chmod +x filter_rows_pandas.py
./filter_rows_pandas.py winequality-red.csv output/output_Python_Pandas.csv

When you run these commands you’ll see similar output as you saw with base Python printed to your Terminal screen, although it will be formatted differently. In addition, the header row and the set of rows matching the pattern in the final filtering section have been written to a CSV file in the output folder.

JULIA

Now that we know how to filter for specific rows in Python, let’s discuss how to filter for rows in Julia. First we’ll cover how to filter for rows with base Julia, and then we’ll describe how to accomplish the same tasks with DataFrames, which is a popular package for managing data in Julia.

Base Julia

The following script illustrates how to read a CSV file line by line, filter for specific rows based on a filter condition, set of interest, and pattern, and write the output of interest to an output file.

Copy and paste this code into a text file and then save the file as filter_rows.jl

#!/usr/bin/env julia
# For more information, visit: https://cbrownley.wordpress.com/

# Assign the first command line argument to a variable called input_file
input_file = ARGS[1]
# Assign the second command line argument to a variable called output_file
output_file = ARGS[2]

# Open the output file for writing
out_file = open(output_file, "w")

header_row = true
# Open the input file for reading and close automatically at end
open(input_file, "r") do in_file
    # Use a for loop to process the rows in the input file one-by-one
    for row in eachline(in_file)
        if header_row == true
            print(row)
            write(out_file, row)
            global header_row = false
        else
            row_array = map(float, split(strip(row), ","))
            #println(row_array)

            # Row Value Meets Specific Condition
            if row_array[12] > 7.0
                #print(join(row_array, ",") * "\n")
                #write(out_file, join(row_array, ",") * "\n")
            end

            # Row Value In Set of Interest
            set_of_interest = Set(7.0, 8.0)
            if in(row_array[12], set_of_interest)
                #print(join(row_array, ",") * "\n")
                #write(out_file, join(row_array, ",") * "\n")
            end

            # Row Value Matches Specific Pattern
            pattern = r"8$"
            if ismatch(pattern, row)
                print(row)
                write(out_file, row)
            end

        # Close the if-else statement
        end
    # Close the for loop
    end
# Close the input file handle
end
# Close the output file handle
#close(out_file)

Let’s explain some of the syntax in this script that’s different from R and Python. For example, the open(…) do statement creates an anonymous function with its own scope, so when we initially define the variable header_row above the open(…) do statement and then assign a new value to the variable inside the open(…) do statement we have to precede the variable name with the keyword global.

It’s helpful to keep in mind that for, while, try, and let blocks also default to local scopes, but they do inherit from a parent scope like the one created by the open(…) do statement. Therefore, if we initially define the variable header_row right beneath the open(…) do statement, then the for loop will inherit the variable from the parent scope and we won’t need to precede the variable name with the keyword global. That is, the following alternative syntax would work too:

open(input_file, "r") do in_file
    header_row = true
    # Use a for loop to process the rows in the input file one-by-one
    for row in eachline(in_file)
        if header_row == true
            print(row)
            write(out_file, row)
            header_row = false

Like base Python without the csv module, Julia reads each row from the file in as a string, so we use the strip function to remove the trailing newline character, then the split function to split the string on commas and convert it into an array, and finally we map the float function to each of the elements in the array to convert all of the values to floating-point numbers.

The println function adds a newline character on the end of the line before printing the line to the screen whereas the print function does not, it prints the line as-is.

The first two row filtering sections use the join function and the string concatenation symbol, *, to create the row of output that will be printed to the screen and written to the output file. In this case, the join function places commas between each of the elements in the array and converts it to a string. Then we add a newline character to the end of the string with the * concatenation symbol.

We test whether the value in the quality column (in Julia, the first array index is 1), is in the set of interest with the in function. Similarly, we test whether the pattern appears in the row using the ismatch function. You’ll notice that we leave the row as a string, i.e. we don’t convert it into an array, to use the ismatch function since ismatch looks for the pattern in a string, not an array.

Now run the following two commands to make the script executable and to run the script:

chmod +x filter_rows.jl
./filter_rows.jl winequality-red.csv output/output_Julia.csv

When you run these commands you’ll see the following printed to your Terminal screen. In addition, the header row and the set of rows matching the pattern in the final filtering section have been written to a CSV file in the output folder.

Base Julia

Julia packages: DataFrames and DataFramesMeta

Now that we know how to filter for specific rows in base Julia, let’s discuss how to filter for rows with DataFrames, a popular package for managing tabular data in Julia. The following script illustrates how to do so. Copy and paste this code into a text file and then save the file as filter_rows_data_frames.jl

#!/usr/bin/env julia
using DataFrames
using DataFramesMeta

input_file = ARGS[1]
output_file = ARGS[2]

data_frame = readtable(input_file, separator = ',')

# Row Value Meets Specific Condition
data_frame_value_meets_condition = data_frame[data_frame[:quality] .> 7, :]
#data_frame_value_meets_condition = data_frame[(data_frame[:quality] .== 7) | (data_frame[:quality] .== 8), :]
#println(data_frame_value_meets_condition)

# Row Value In Set of Interest
set_of_interest = Set(7, 8)
data_frame_value_in_set = data_frame[findin(data_frame[:quality], set_of_interest), :]
#println(data_frame_value_in_set)

# Row Value Matches Specific Pattern
pattern = r"8"
data_frame_value_matches_pattern = data_frame[[ismatch(pattern, string(value)) for value in data_frame[:quality]], :]
#data_frame_value_matches_pattern = @where(data_frame, [ismatch(pattern, string(value)) for value in :quality])
println(data_frame_value_matches_pattern)

writetable(output_file, data_frame_value_matches_pattern)

The first filtering section demonstrates how you can use one condition or multiple conditions to filter for specific rows. In Julia, you precede the comparison operator with a period, for example .==, to do element-wise comparisons. To use multiple conditions you wrap each one in parentheses and combine them with ANDs, &, or ORs, |. The colon has the same meaning that it does in R and Python. In this case, we’re using it to retain all of the columns.

We use the findin function to determine which rows have the value 7 or 8 in the quality column in order to retain these rows.

The third filtering section demonstrates two slightly different ways to filter for rows based on a pattern. Both methods use row indexing, the ismatch function, and array comprehensions to look for the pattern in each of the elements in the quality column/array. The only real difference between the two methods is that in the first method we have to specify data_frame[…] twice, whereas the second method uses the @where meta-command from the DataFramesMeta package to enable us to refer to the data_frame once and then refer to the quality column with :quality instead of the slightly more cumbersome data_frame[quality].

Now run the following two commands to make the script executable and to run the script:

chmod +x filter_rows_data_frames.jl
./filter_rows_data_frames.jl winequality-red.csv output/output_Julia_DataFrames.csv

When you run these commands you’ll see similar output as you saw with base Julia printed to your Terminal screen, although it will be formatted differently. In addition, the header row and the set of rows matching the pattern in the final filtering section have been written to a CSV file in the output folder.

As you can see, when it comes to filtering for specific rows, the differences in syntax between Python and Julia are very slight. For example, Python’s “if value in set_of_interest” statements are “in(value, set_of_interest)” statements in Julia, and Python’s pattern matching “pattern.search()” statements are “ismatch()” statements in Julia. On the other hand, one difference to keep in mind is that for loops in Julia default to local scope so if you’ve defined a variable outside of a for loop and you need to use it inside the for loop, then you need to precede the variable name with the word global.

Now that we know how to read and write data in a CSV-formatted input file and filter for specific rows with R, Python, and Julia, the next step is to figure out how to filter for specific columns in these languages. Then we can move on to processing lots of files in a directory and also dealing with Excel files. We’ll cover these topics in future posts.

I’d like to thank the Julia users group, especially Nils Gudat and David Gold, for helping me figure out how to use the findin and ismatch functions to filter for specific rows while using the DataFrames package.

All Scripts and Output Files

Simplifying scikit-learn Predictive Modeling with skll APIs and Configuration Files

One blog post I really enjoyed reading last year is yhat’s post, Predicting customer churn with scikit-learn. In the post, the author demonstrates how to estimate, cross-validate, and measure the performance of three predictive models for classification in Python with the scikit-learn package. At the time, I was more familiar with how to do predictive modeling in R with the caret package, as described in the well-written book Applied Predictive Modeling. However, since I also knew some Python and was interested in learning how to use it for predictive modeling, yhat’s post was exactly what I needed to get started.

More recently, I read and followed along with the examples in Jereon Janssens’s excellent book, Data Science at the Command Line. In Chapter 9, the author demonstrates how to use another Python package, the SciKit-Learn Laboratory (a.k.a. skll) package, to implement predictive modeling with scikit-learn from the command line. The skll package is very helpful because it provides an API interface to scikit-learn that simplifies the code you need to implement your predictive models. Additionally, it also enables you to specify your modeling parameters in a configuration file so you can run your models from the command line.

Since I’d become familiar with how to implement predictive modeling with scikit-learn, I really enjoyed learning to use skll’s API and configuration file interfaces. To see how skll’s interfaces simplify the process of using the scikit-learn package to implement predictive modeling, let’s estimate, cross-validate, and measure the performance of four predictive models with scikit-learn and then perform similar operations with skll’s interfaces.

DATA SET

The data set we’ll use in this post comes from the publicly available wine quality data sets, which are available here. There is a file for red wines and a file for white wines. The data set used in this post is these two files concatenated together (with only one header row). The dependent variable in the original research took on integer values representing wine quality, so the data set lends itself to regression tasks. However, in this post we’ll demonstrate how to perform classification tasks by classifying wines as either red or white. You can download the data and prepare the final data set by entering the following commands in a Terminal window:

# Download two CSV files, winequality-red.csv and winequality-white.csv
# from the UCI Machine Learning Repository
# and save the files as wine-red.csv and wine-white.csv
parallel "curl -sL http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-{}.csv > wine-{}.csv" ::: red white

# Check that you now have wine-red.csv and wine-white.csv
# in your current working directory
find . -maxdepth 1 -name "wine*"

# For each file, convert uppercase to lowercase,
# convert semicolons to commas,
# convert spaces to underscores,
# remove double quotes, and
# add a column that says 'red' or 'white' depending on the file
for T in red white; do < wine-$T.csv tr '[A-Z]; ' '[a-z],_' | tr -d \" | sed "s/$/,${T}/" > wine-${T}-clean.csv; done

# Review the first five lines in each file to ensure the changes are correct
head -n 5 wine-{red,white}-clean.csv | fold

# Concatenate the red and white wine files into one file, wine-both-clean.csv
# Retain only one column heading
# Name the last column 'type' since it contains the words 'red' and 'white'
head -n 1 wine-red-clean.csv > wine-both-clean.csv; grep -v quality wine-red-clean.csv >> wine-both-clean.csv; grep -v quality wine-white-clean.csv >> wine-both-clean.csv; sed -i -e "1s/,red/,type/" wine-both-clean.csv

# Review the row counts for red (1,599) and white (4,898) wines
parallel --tag "grep -c {} wine-both-clean.csv" ::: red white

Now that we have our data set, wine-both-clean.csv, let’s move on to estimating our predictive models with scikit-learn.

SCIKIT-LEARN

In order to understand the ways in which skll simplifies the syntax needed to perform predictive modeling with the scikit-learn package, let’s first estimate, cross-validate, and measure the performance of four predictive models with scikit-learn. To do so, copy and paste the following code into a text editor and save the file as classify_wine_scikit_learn.py:

#!/usr/bin/env python
import sys
import time
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import KFold
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression as LogR
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

input_file = sys.argv[1]

# Read the wine quality data into a Pandas data frame
wine_data_frame = pd.read_csv(input_file)

# Create a red wine binary variable named 'y' for classification
wine_type = wine_data_frame['type']
y = np.where(wine_type == 'red', 1., 0.)

# Specify X, the matrix of predictor variables
features = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
        'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
        'ph', 'sulphates', 'alcohol']
wine_features = wine_data_frame[features]
X = wine_features.as_matrix().astype(np.float)
print "\nFeature space holds %d observations and %d features" % (X.shape)
print "\nAccuracy if you predict all 0s (baseline or benchmark): %0.3f" % (accuracy_score(y, [0 for value in y.tolist()]))
print "\n*****************************************\n"

# Center and scale the predictor variables
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Specify a function for k-fold cross-validation
def run_cv(X, y, clf, **kwargs):
    kfold = KFold(len(y), n_folds=10, shuffle=True, random_state=123456789)
    print str(clf)
    y_pred = y.copy()
    from time import time
    t0 = time()
    fold = 0
    for train_index, test_index in kfold:
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        clf.fit(X_train, y_train)
        y_pred[test_index] = clf.predict(X_test)
        fold += 1
        print "Finished fold:", str(fold)
    print "Cross-validation took %0.2f seconds." % (time() - t0)
    return y_pred

# K NEAREST NEIGHBORS
# Estimate predicted values for k-nearest-neighbors classification model with cross-validation
y_pred_knn = run_cv(X, y, KNN(n_neighbors=6))

# Calculate performance metrics for k-nearest-neighbors classification model
print "\nK-nearest-neighbors (accuracy): " + "%.3f" % accuracy_score(y, y_pred_knn)
print "\nKNN confusion matrix: "
print(confusion_matrix(y_pred_knn, y))
target_names = ['White (0)', 'Red (1)']
print "\nKNN classification report: "
print(classification_report(y, y_pred_knn, target_names=target_names))
print "\n*****************************************\n"

# LOGISTIC REGRESSION
# Estimate predicted values for logistic/logit classification model with cross-validation
y_pred_logr = run_cv(X, y, LogR(random_state=123456789))

# Calculate performance metrics for logistic/logit classification model
print "\nLogistic (accuracy): " + "%.3f" % accuracy_score(y, y_pred_logr)
print "\nLogistic confusion matrix: "
print(confusion_matrix(y_pred_logr, y))
target_names = ['White (0)', 'Red (1)']
print "\nLogistic classification report: "
print(classification_report(y, y_pred_logr, target_names=target_names))
print "\n*****************************************\n"

# RANDOM FOREST
# Estimate predicted values for random forest classification model with cross-validation
y_pred_rf = run_cv(X, y, RFC(n_estimators=500, random_state=123456789))

# Calculate performance metrics for random forest classification model
print "\nRandom forest (accuracy): " + "%.3f" % accuracy_score(y, y_pred_rf)
print "\nRF confusion matrix: "
print(confusion_matrix(y_pred_rf, y))
target_names = ['White (0)', 'Red (1)']
print "\nRF classification report: "
print(classification_report(y, y_pred_rf, target_names=target_names))
print "\n*****************************************\n"

# SUPPORT VECTOR MACHINES
# Specify grid parameters for the support vector machines classifier (SVC)
# Reference: http://scikit-learn.org/0.11/tutorial/statistical_inference/model_selection.html#grid-search
C_values = [0.01, 0.1, 1.0, 10.0, 100.0]
gamma_values = [0.01, 0.1, 1.0, 10.0, 100.0]
param_grid = dict(kernel=['rbf'], gamma=gamma_values, C=C_values)
svc_gscv = GridSearchCV(SVC(cache_size=1000, class_weight='auto', random_state=123456789), param_grid=param_grid, cv=3, scoring='accuracy', n_jobs=-1)

# Estimate predicted values for support vector machines classification model with cross-validation
y_pred_svc = run_cv(X, y, svc_gscv)

# Calculate performance metrics for support vector machines classification model
print "\nSupport vector machines (accuracy): " + "%.3f" % accuracy_score(y, y_pred_svc)
print "\nSVM confusion matrix: "
print(confusion_matrix(y_pred_svc, y))
target_names = ['White (0)', 'Red (1)']
print "\nSVM classification report: "
print(classification_report(y, y_pred_svc, target_names=target_names))

This code is similar to the code discussed in yhat’s post, Predicting customer churn with scikit-learn, so instead of describing the lines of code in detail I’ll summarize the sections of code. At the top, we import all of the functions we’re going to use to manage and transform the data, estimate the models, perform k-fold cross-validation, and evaluate the performance of the models.

Next, we read the data into a Pandas data frame and create a new binary variable named ‘y’ that equals 1.0 for red wines and 0.0 for white wines. Next, we specify which variables are going to serve as the independent, explanatory, predictor variables and transform them into a matrix. At this point, we print the number of rows and columns in the analysis data set and print the baseline, benchmark accuracy we achieve by naïvely predicting all of the observations to be zeros (i.e. the majority of the observations are zeros). Hopefully, the performance of our predictive models will surpass this benchmark performance!

Next, we center and scale the predictor variables so they all have a mean of zero and a standard deviation of one. These transformations help ensure that the learning algorithms and predictive performance of the statistical models aren’t influenced by the predictor variables’ units of measurement.

Next, we define a function for performing k-fold cross-validation. The function specifies a particular random state to be consistent with skll’s default random state. It also includes a few print statements so you can see the folds being processed and see how long the cross-validation takes for different models. You can find an example of stratified k-fold cross-validation, a related procedure that attempts to balance the percentage of observations from each class in each fold, in Bugra’s blog post, An Introduction to Supervised Learning via Scikit Learn. This post also demonstrates how to produce plots of training and test errors, confusion matrices, and variable importance scores.

Finally, we run cross-validation with four predictive models, (1) k-nearest neighbors, (2) logistic regression, (3) random forest, and (4) support vector machines and report on their performance with an accuracy score, a confusion matrix, and a classification report. The confusion matrix shows the number of correctly classified observations along the main diagonal and the classification report includes information on precision, recall, f1-score, and support for each category.

One final point is that the support vector machine section demonstrates how to use grid search to select optimal parameters for the model. To be consistent with skll’s grid search defaults, I instructed grid search to use 3-fold cross-validation, a particular random state, and accuracy to determine the optimal values for C and gamma. This 3-fold cross-validation takes place within the 10-fold cross-validation taking place to measure the model’s predictive performance.

Make the script executable by typing the following on the command line and then hitting Enter:
chmod +x classify_wine_scikit_learn.py

Run the script by typing the following on the command line and then hitting Enter:
./classify_wine_scikit_learn.py wine-both-clean.csv

When you run the script you should see the following output printed to your Terminal window:

scikit-learn script output

As you can see, you can achieve an accuracy score of 0.75 by naïvely predicting all of the observations to be zero. Since the rest of the output runs off of the screen, here are the accuracy scores and processing times for the four predictive models:

Logistic Regression
Accuracy: 0.994
Processing time: 0.18 seconds

K-Nearest Neighbors
Accuracy: 0.994
Processing time: 0.50 seconds

Random Forest
Accuracy: 0.995
Processing time: 36.44 seconds

Support Vector Machines
Accuracy: 0.996
Processing time: 225.42 seconds

The output shows that, in this example, all of the models achieve similar accuracy scores. It also shows that logistic regression completes 3 times faster than k-nearest-neighbors and 1,252 times faster than support vector machines.

Now that we’ve seen how to estimate, cross-validate, and measure the performance of four predictive models in a Python script with scikit-learn, let’s take a look at how to do so with skll’s APIs to scikit-learn.

SKLL API

If you want to follow along with the next two sections you’ll need to install skll. Here are instructions for doing so: skll There can be compatibility issues between scikit-learn and skll, depending on the versions you have installed. To follow along with this post, make sure you have scikit-learn==0.15.2 and skll==1.0.1. You can check which versions you have by typing the following on the command line and then hitting Enter: pip freeze. Once you’ve installed skll, copy and paste the following code into a text editor and save the file as classify_wine_skll.py:

#!/usr/bin/env python
import sys
import time
from skll.data.readers import Reader
from skll.learner import Learner

input_file = sys.argv[1]

file_reader = Reader.for_path(input_file, label_col='type')

training_data = file_reader.read()

number_of_folds = 10

def average_accuracy(fold_results):
    import numpy as np
    accuracy = []
    for fold_index in range(number_of_folds):
        accuracy.append(fold_result_list[fold_index][1])
    return np.mean(accuracy)

print "\nLogistic:"
logistic_learner = Learner('LogisticRegression', probability=False, feature_scaling=u'both')
t0 = time.time()
fold_result_list, grid_search_scores = logistic_learner.cross_validate(training_data, stratified=True, \
cv_folds=number_of_folds, grid_search=False, shuffle=True)
print "%d-fold cross-validation took %0.2f seconds" % (number_of_folds, time.time() - t0)
print "Results for each fold:"
print fold_result_list
print "Grid search scores, if used:"
print grid_search_scores
print "Average accuracy: %.3f" % average_accuracy(fold_result_list)

print "\nKNN:"
knn_learner = Learner('KNeighborsClassifier', probability=False, feature_scaling=u'both')
t0 = time.time()
fold_result_list, grid_search_scores = knn_learner.cross_validate(training_data, stratified=True, \
cv_folds=number_of_folds, grid_search=False, shuffle=True)
print "%d-fold cross-validation took %0.2f seconds" % (number_of_folds, time.time() - t0)
print "Results for each fold:"
print fold_result_list
print "Grid search scores, if used:"
print grid_search_scores
print "Average accuracy: %.3f" % average_accuracy(fold_result_list)

print "\nRF:"
rf_learner = Learner('RandomForestClassifier', probability=False, feature_scaling=u'both')
t0 = time.time()
fold_result_list, grid_search_scores = logistic_learner.cross_validate(training_data, stratified=True, \
cv_folds=number_of_folds, grid_search=False, shuffle=True)
print "%d-fold cross-validation took %0.2f seconds" % (number_of_folds, time.time() - t0)
print "Results for each fold:"
print fold_result_list
print "Grid search scores, if used:"
print grid_search_scores
print "Average accuracy: %.3f" % average_accuracy(fold_result_list)

print "\nSVC:"
svc_learner = Learner('SVC', probability=False, feature_scaling=u'both')
t0 = time.time()
fold_result_list, grid_search_scores = svc_learner.cross_validate(training_data, stratified=True, \
cv_folds=number_of_folds, grid_search=True, grid_search_folds=3, grid_objective=u'f1_score_micro', \
param_grid=[{'C':[0.01, 0.1, 1.0, 10.0, 100.0]}, {'gamma':[0.01, 0.1, 1.0, 10.0, 100.0]}], shuffle=True)
print "Grid search %d-fold cross-validation took %0.2f seconds" % (number_of_folds, time.time() - t0)
print "Results for each fold:"
print fold_result_list
print "Grid search scores, if used:"
print grid_search_scores
print "Average accuracy: %.3f" % average_accuracy(fold_result_list)

One of the first things you’ll notice is how much less code we need to run the same cross-validation exercise we ran in the previous section. At the top, we import skll’s Reader and Learner objects, which we’ll use to read in the training data and to specify the statistical learning models we want to cross-validate, respectively.

You’ll notice that when we use the Reader object to create a file reader object we specify the name of the column that’s to be used as the dependent, ‘y’ variable. In this example, we specify that the label column is ‘type’ because it contains the class labels. We also create a variable for the number of cross-validation folds so the code is a little more flexible and you can experiment with 5-fold, 10-fold, or 20-fold cross-validation more easily.

The next block of code defines a function to calculate the average accuracy of the model based on cross-validation. We create this simple helper function because, as described in the API’s documentation, the output of cross-validation is a pair of lists. The first list contains the confusion matrix, overall accuracy, per-label precision, recall, and F-measures, and model parameters for each fold. The second list contains the grid search scores, if any, for each fold. This helper function collects all of the accuracy scores from the cross-validation output and returns the average accuracy across all of the folds.

The next four blocks of code initialize the four statistical learning models, cross-validate the models, and report their results. For example, the first learner is logistic regression. We’re not interested in calculating probabilities, so we set probability equal to False. To be consistent with the centering and scaling we did to the predictor variables in base scikit-learn, we set feature scaling to both.

Next, we perform cross-validation with the model. In this case, we set stratified equal to True to use stratified k-fold cross-validation so each fold contains nearly the same number of red and white wines. Since we set number_of_folds to 10 and cv_folds equals number_of_folds, we’re performing 10-fold cross-validation. For logistic regression, we’re not employing grid search, and we set shuffle equal to True so the observations are shuffled before they’re split into folds for cross-validation. Some of these options are the defaults, and some of them are changed to be consistent with the scikit-learn code we used above.

The k-nearest neighbors and random forest sections are identical to the logistic regression section, except for employing different predictive models. To be consistent with the scikit-learn code we used above, the support vector machines section contains some additional code to employ grid search for optimal C and gamma values.

Finally, we print how long cross-validation takes to complete, all of the results for each of the folds, the grid search scores (if any), and the average accuracy across all of the folds.

Make the script executable by typing the following on the command line and then hitting Enter:
chmod +x classify_wine_skll.py

Run the script by typing the following on the command line and then hitting Enter:
./classify_wine_skll.py wine-both-clean.csv

When you run the script you should see the following output printed to your Terminal window:

skll script output

Since the output runs off of the screen, here are the accuracy scores and processing times for the four predictive models:

Logistic Regression
Accuracy: 0.994
Processing time: 0.85 seconds

K-Nearest Neighbors
Accuracy: 0.993
Processing time: 1.22 seconds

Random Forest
Accuracy: 0.994
Processing time: 0.88 seconds

Support Vector Machines
Accuracy: 0.996
Processing time: 57.60 seconds

Once again the output shows that, in this example, all of the models achieve similar accuracy scores. In this case, logistic regression completes 1.4 times faster than k-nearest-neighbors and 68 times faster than support vector machines.

Now that we’ve seen how to estimate, cross-validate, and measure the performance of four predictive models in a Python script with skll’s APIs, let’s take a look at how to do so on the command line with skll’s configuration file.

SKLL CONFIGURATION FILE

As I mentioned at the top of this post, Jereon Janssens demonstrates how to use skll’s configuration file set-up in his book, Data Science at the Command Line. You can also read skll’s own tutorial for using the configuration file set-up here.

skll’s configuration file set-up requires an input file to be formatted slightly differently than the way ours is now, so we need to modify our input file. Specifically, we need (1) an additional row index column with unique numbers for each row, (2) the wine type column to contain the floating-point numbers 1.0 and 0.0 instead of the strings red and white, and (3) the file should only contain the binary ‘type’ column and the predictor variables we want to use (i.e. it shouldn’t contain additional variables we don’t intend to use). Since our input file contains the ‘quality’ column and we don’t want to use it as a predictor we need to remove it from the file.

You can create the modified version of the input file by entering the following one-line, piped command in a Terminal window:

< wine-both-clean.csv nl -s, -w4 -n rz -v0 | sed 's/0000,/id,/' | sed 's/red/1\./' | sed 's/white/0\./' | cut -d, -f1-12,14 > wine-both-clean-ids.csv

The nl command adds line numbers to each of the rows. -s, says add a comma after the line number (since we’re using a CSV file). -w4 says make the line numbers four digits wide (we know the input file contains 6,498 rows, so four digits wide will be sufficient). -n rz says insert line numbers according to the rz format, which is right-justified with leading zeros. -v0 says start the line numbering at 0, or in this case, 0000.

Next, the first sed command changes the line number in the first line from 0000, to the column heading id,. The second sed command changes the word red into ‘1.’, and the third sed command changes the word white into ‘0.’. The cut command separates the file into columns based on the comma delimiter and selects, or keeps, columns one to twelve and column fourteen (thereby removing the ‘quality’ variable in column thirteen). The result of these operations is redirected to a new output file called wine-both-clean-ids.csv. We’ll use this new file as our input file.

Now that our input file is ready to be processed, let’s create our skll configuration file. To do so, copy and paste the following code into a text editor and save the file as classify_wine_skll.cfg (Note that the file extension is .cfg instead of .py):

[General]
    experiment_name = Wine
    task = cross_validate

[Input]
    train_directory = .
    featuresets = [["wine-both-clean-ids.csv"]]
    learners = ["RandomForestClassifier", "SVC", "KNeighborsClassifier", "LogisticRegression"]
    label_col = type
    id_col = id
    shuffle = True
    feature_scaling = both

[Tuning]
    objective = accuracy
    grid_search = True
    param_grids = [[], [{'C': [0.01, 0.1, 1.0, 10.0, 100.0],'gamma': [0.01, 0.1, 1.0, 10.0, 100.0]}], [], []]

[Output]
    log = results
    results = results
    predictions = results

skll’s configuration file contains four sections, General, Input, Tuning, and Output. You can learn more about each of these sections here. In the General section, we specify a name for the experiment. All of the output file names will start with the word we use. We also specify the task we want to perform, which is cross_validate.

In the Input section, we specify the train_directory, the folder where the configuration file can find the training data set. The period, ‘.’, is shorthand for the current folder. If you save the input file in a different folder, then you’ll have to supply the folder name (e.g. training_data or my_input_files). featuresets contains the name of the input file(s) that contain your features, i.e. explanatory variables. In this case they’re all in one file but, as skll’s tutorial demonstrates, they can be spread across multiple input files. learners is a list of the learners we want to use. label_col indicates which column contains the class labels. id_col indicates which column contains the unique row numbers. Once again, to be consistent with the preceding examples, shuffle equals True and feature_scaling equals both.

In the Tuning section, we specify we want to use accuracy as our objective. Since we want to use grid search for one of our models, we set grid_search to True. When grid search is True, you have to supply a list of lists to param_grids, one list for each model. A list can be empty if you don’t want to perform grid search for the model, but you still need a set of square brackets for each model. If you do want to perform grid search for a model, then you supply a dictionary of the parameters and values you want to search over and optimize. Since we’re performing grid search for the support vector machines model we supply ‘C’ and ‘gamma’ as two keys and lists of values to search over as the values associated with each key.

Finally, in the Output section, we specify the name of the folder we want all of the output to be saved in. In this case, all of the logs, results, and predictions will be saved in a folder named ‘results’ inside our current folder.

Run this configuration file (a.k.a. experiment) by typing the following on the command line and then hitting Enter:
run_experiment classify_wine_skll.cfg

After you hit Enter, you’ll see the following output printed to your command prompt window after all four of the models have completed:

Loading ./wine-both-clean-ids.csv… done
Loading ./wine-both-clean-ids.csv… done
Loading ./wine-both-clean-ids.csv… done
Loading ./wine-both-clean-ids.csv… done

In addition, several output files have been written in the ‘results’ folder inside your current folder. You can cd and/or ls into the results folder to view the output. skll creates four files for each model: a log file, a predictions file, a results file, and a results file formatted as JSON. skll also creates a summary file that contains details on each of the models and each of the cross-validation folds.

One of the measures we’ve been using to compare the models is average accuracy. This value is available in the summary file. Jereon Janssens demonstrates how to access and print this value using some helpful command line tools in his book, but you can also print it out with basic Unix commands. To view the average accuracy for each of the models, type the following command on the command line and then hit Enter (assuming you named the output folder ‘results’ and it’s inside your current folder):

grep average results/Wine_summary.tsv | cut -f1,13 | awk -F\t '{ print $2 ": " $1 }'

The grep command filters for rows in the tab-delimited summary file that contain the word average. The cut command separates the columns in the file based on tabs, which we didn’t have to specify because it’s the default in cut. Then we select the first column (i.e. the average accuracy score) and the thirteenth column (i.e. the name of the classifier). Finally, the awk command re-arranges the two pieces of information so that what’s printed to the screen is the name of the classifier, followed by a colon and a space, and then the average accuracy score. When you run this command you’ll see the following output printed to the Terminal window:

skll configuration file output

Once again, all of the models have similar accuracy scores.

As we’ve seen, skll’s APIs and configuration file set-up encapsulate and simplify a lot of the basic scikit-learn code you need to read input data, transform variables, and estimate, cross-validate, and measure the performance of predictive models. Now that you’re familiar with skll’s general interfaces and syntax, try modifying the code to use your own input data, estimate different models, measure performance with a different metric, or perform other tasks, like predict or evaluate instead of cross_validate. Also be sure to check out the additional resources noted throughout this post for supplementary explanations and examples. Have a great time experimenting with skll’s interfaces!

To Read or Not to Read? Hopefully the Former!

It has been far too long since I posted to this blog.  It’s time to let you know what I’ve been working on over the past few months.  I’ve been writing a book on a topic that is near and dear to my heart.

The book is titled Multi-objective Decision Analysis: Managing Trade-offs and Uncertainty.  It is an applied, concise book that explains how to conduct multi-objective decision analyses using spreadsheets.

The book is scheduled to be published by Business Expert Press in 2013.  For a little more information about my forthcoming book, please read the abstract shown below:

“Whether managing strategy, operations, or products, making the best decision in a complex uncertain business environment is challenging.  One of the major difficulties facing decision makers is that they often have multiple, competing objectives, which means trade-offs will need to be made.  To further complicate matters, uncertainty in the business environment makes it hard to explicitly understand how different objectives will impact potential outcomes.  Fortunately, these problems can be solved with a structured framework for multi-objective decision analysis that measures trade-offs among objectives and incorporates uncertainties and risk preferences.

This book is designed to help decision makers by providing such an analysis framework implemented as a simple spreadsheet tool.  This framework helps structure the decision making process by identifying what information is needed in order to make the decision, defining how that information should be combined to make the decision and, finally, providing quantifiable evidence to clearly communicate and justify the final decision.

The process itself involves minimal overhead and is perfect for busy professionals who need a simple, structured process for making, tracking, and communicating decisions.  With this process, decision making is made more efficient by focusing only on information and factors that are well-defined, measureable, and relevant to the decision at hand.  The clear characterization of the decision required by the framework ensures that a decision can be traced and is consistent with the intended objectives and organizational values.  Using this structured decision-making framework, anyone can effectively and consistently make better decisions to gain a competitive and strategic advantage.”

Look for my forthcoming book, Multi-objective Decision Analysis, on the bookshelves in 2013!

Source: favim.com