Visualizing Global Land Temperatures in Python with scrapy, xarray, and cartopy

A few years ago, I worked on a project that involved collecting data on a variety of global environmental conditions over time. Some of the data sets included cloud cover, rainfall, types of land cover, sea temperature, and land temperature. I enjoyed developing a greater understanding of our Earth by visualizing how these conditions vary over time around the planet. To get a sense of how fun and informative it can be to analyze environmental data over time, let’s work on visualizing global land surface temperatures from 2001 to 2016.

12-monthly-averages

Data

The data we’ll use in this post are NASA Earth Observation’s Land Science Team’s daytime land surface temperatures, “temperatures of the “skin” (or top 1 millimeter) of the land surface during the daytime, collected by the Moderate Resolution Imaging Spectroradiometer (MODIS), an instrument on NASA’s Terra and Aqua satellites”. Temperatures in the data range from -25 ºC (-13 ºF) to 45 ºC (113 ºF).

The data are available at resolutions of 1.0, 0.5, 0.25, and 0.1 degrees. Degrees of latitude are approximately 69 miles (111 kilometers) apart, so the 0.1 degrees files contain land temperature readings spaced approximately 6.9 miles (11.1 km) apart north and south. Unlike latitude, the distance between degrees of longitude varies by latitude. The distance is greatest at the equator and gradually shrinks to zero at the poles. For instance, at the equator, degrees of longitude are approximately 69 miles (111 km) apart; whereas, at 40º north and south, degrees of longitude are approximately 53 miles (85 km) apart. To take advantage of the most fine-grained data available, we’ll use the 0.1 degrees files in this post.

nasa_neo_land_surface_temp_website

Create Environment

To begin, let’s create a dedicated folder and Python environment for this project. The following commands create a new folder named land_temperature and, inside it, another folder named input_files and then move you into the land_temperature folder:

mkdir -p land_temperature/input_files

cd land_temperature

The following conda commands create and activate a new Python 3.5 environment named land_temp that includes the listed packages, as well as their dependencies. If you’re not using the Anaconda distribution of Python, you can use the venv module in Python 3’s standard library to create a similar dedicated environment:

conda create --name land_temp python=3.5 pandas xarray scrapy matplotlib seaborn cartopy jupyter

source activate land_temp

Create Web Page URLs

Now that we’ve activated our dedicated Python environment, let’s inspect NASA NEO’s land surface temperature web page URL to determine how we’ll need to change it to access all of the web pages we need to visit. The URL for the month-level data for 2001-01-01 is:

"https://neo.sci.gsfc.nasa.gov/view.php?datasetId=MOD11C1_M_LSTDA&date=2001-01-01"

If you change the 2001 to 2002 and refresh the page, you’ll see you’re now viewing the month-level data for 2002-01-01. If you make a similar change to the month, you’ll see you’re now viewing data for a different month. It appears we can simply change the date in the URL to access all of the month-level files from 2001 to 2016. Let’s use pandas in the ipython interactive shell to generate this list of URLs:

ipython

import pandas as pd

start_date = '2001-01-01'
end_date = '2016-12-01'

dates = pd.date_range(start=start_date, end=end_date, freq='MS')

dates = [dt.strftime('%Y-%m-%d') for dt in dates]

url_base = "https://neo.sci.gsfc.nasa.gov/view.php?datasetId=MOD11C1_M_LSTDA&date="

urls = [url_base+dt for dt in dates]

create_webpage_urls

Inspect Web Page HTML

Now that we have the web page URLs we need, let’s use Chrome’s element inspection tool and the scrapy interactive shell to determine how to extract the links to the data files from the web pages. To start, let’s click on the File Type dropdown menu to see what file types are available. There are several options, but let’s plan to download the type, CSV for Excel.

Below the File Type dropdown menu, there are four geographic resolution options, 1.0, 0.5, 0.25, and 0.1 degrees, which provide increasingly granular data. Let’s right-click on the tiny, right-facing arrow to the right of 0.1 degrees 3600 x 1800 and select Inspect to inspect the HTML near the link in Chrome’s element inspection tool.

chrome_inspection_tools

The HTML shows us the link to the data file is in a table. Moreover, the link is in a row that has class=”size-option” and, within the data cell (td) element, it is in a hyperlink (a) element’s href attribute. With this understanding of the HTML path to the data file link, let’s use scrapy’s interactive shell to figure out how to extract the link:

scrapy shell "https://neo.sci.gsfc.nasa.gov/view.php?datasetId=MOD11C1_M_LSTDA&date=2001-01-01"

response.css('tr.size-option td a::attr(href)').extract()

response.css('tr.size-option td a::attr(href)')[-1].extract()

scrapy_extract_links

If you inspect a few of the data file links, you’ll notice an issue with them (i.e. a number in the middle of the URL that varies) that we need to address if we want to download the files programmatically:

"http://neo.sci.gsfc.nasa.gov/servlet/RenderData?si=869628&cs=rgb&format=SS.CSV&width=3600&height=1800"

In the previous section, when we generated the web page URLs, the portion of the URL that needed to change was the date at the end of the URL, and it needed to change in an understandable way. In this case, I don’t know which number is associated with each URL (and I can’t guess the underlying pattern if there is one), so I can’t generate them programmatically. Instead of generating the data file links like the web page URLs in the previous section, let’s simply scrape the actual data file links from the web pages.

Scrape Data File URLs

Now that we know how to select the data file links, let’s use scrapy to extract them from the web pages so we can then use them to download the data files. In total, there will be 192 URLs and files (12 months per year x 16 years = 192 monthly files).

From inside the land_temperature folder, type the following commands:

scrapy startproject scrape_land_temps

cd scrape_land_temps

Now that we’re inside the first scrape_land_temps folder, let’s create a scrapy spider, a Python file, named land_temp_csv_files_spider.py inside the scrape_land_temps/spiders folder. In the spider, let’s combine our web page URL generation code with our href link extraction code to instruct the spider to visit each of the 192 month-level web pages and extract the link to the 0.1 degrees data file from each page. Then we can use these URLs to download the CSV files:

import scrapy
import pandas as pd

class LandTempCSVFilesSpider(scrapy.Spider):
        name = "land_temp_csv_files"

        def start_requests(self):
                start_date = '2001-01-01'
                end_date = '2016-12-01'
                dates = pd.date_range(start=start_date, end=end_date, freq='MS')
                dates = [dt.strftime('%Y-%m-%d') for dt in dates]
                url_base = 'https://neo.sci.gsfc.nasa.gov/view.php?datasetId=MOD11C1_M_LSTDA&date='
                urls = [url_base+dt for dt in dates]

                for url in urls:
                        yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
                dt = response.url.split("=")[-1]
                url = response.css('tr.size-option td a::attr(href)')[-1].extract()
                url = url.replace('JPEG', 'SS.CSV')
                yield {'date': dt, 'url': url}

scrapy_spider

Let’s use the following command to run the spider and extract the links to the data files. The result is a JSON file named land_temp_csv_files_urls.json that contains an array of 192 objects, each containing a date and the link to the data file associated with the date:

scrapy crawl land_temp_csv_files -o ../land_temp_csv_files_urls.json

cd ..

Download Data Files

We’re finally ready to download the 192 month-level land surface temperature data files. Let’s return to the ipython interactive shell and use the following code to iterate through the array of URLs in our JSON file to download the CSV files.

First, we read the pairs of dates and URLs in the JSON file into a dataframe named ‘df’. Next, we loop over these pairs (i.e. rows in the dataframe) and, for each one, use the URL to read the remote CSV file into a dataframe named ‘dat’ and then write the dataframe to a local file in the input_files folder.

We insert the date, e.g. 2001-01-01, into the filenames so we know which date each file represents. Also, we use try-except blocks around the reading and writing operations so the loop won’t terminate if it runs into any issues (instead, it will print messages to the screen):

ipython

import pandas as pd

df = pd.read_json('land_temp_csv_files_urls.json')

for index, row in df.iterrows():
        print(index, row['url'])
        try:
                dat = pd.read_csv(row['url'], header=0, index_col=0)
        except:
                print('Error reading: {}'.format(row['url']))
                continue

        filename = 'MOD11C1_M_LSTDA_{}_rgb_3600x1800.SS.CSV'.format(row['date'].strftime('%Y-%m-%d'))
        try:
                dat.to_csv('input_files/'+filename)
        except:
                print('Error writing: {}'.format(filename))
                continue

download_and_write_csv_files

Combine Data

Now that we have all of our data files, let’s return to the ipython interactive shell and use the following code to read and combine all of the CSV files into a three-dimensional array (i.e. x = longitude, y = latitude, z = date):

import xarray as xr
import numpy as np
import pandas as pd
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
from pathlib import Path

The following code snippet is a helper function we’ll use to make the file-reading code shown below easier to read. This function takes a file name as input, splits it into pieces at the underscores, extracts the piece with index position 3 (this piece is the date, e.g. 2001-01-01), and converts the date into a datetime object:

def date_from_filename(filename):
        fn = filename.name
        dt = pd.to_datetime(fn.split('_')[3])
        return dt

The following code snippet is another helper function we’ll use to make the file-reading code easier to read. This function takes an array as input, converts all of the array elements into floating-point numbers, rounds all of the numbers to a specified number of decimal places (the default is 2 decimal places), and then converts the elements to string type:

def round_coords(arr, d=2):
        vals = map(str, [round(float(val), d) for val in arr])
        return vals

The following line of code uses the pathlib module in Python’s standard library to create and return a sorted list of paths to all of the CSV files in the input_files folder:

files = sorted(Path("input_files/").glob("*.CSV"))

The block of code shown below reads all of the CSV files and combines them into a three-dimensional array (i.e. x = longitude, y = latitude, z = date). We’ll use the list named ‘das’ to collect the 192 individual arrays. Later, we’ll pass this list of arrays to xarray’s concat function to concatenate them into a new, combined array. Similarly, we’ll use the list named ‘dts’ to collect the 192 dates so we can use them as the new dimension in the combined array.

Next, we start to loop through each of the CSV files. For each file, we use the date_from_filename function to extract the date from the filename and append it into the dts list. Next, we read the CSV file, noting that the first row is the header row of longitude values, the first column is the index of latitude values, and NA data values are coded as 99999.0. The next three lines round the data, longitude, and latitude values to two decimal places.

Next, we input these values into xarray’s DataArray constructor to create a two-dimensional array and add it to the das list. Finally, we use xarray’s concat function to combine the 192 two-dimensional arrays into a three-dimensional array with the new dimension named ‘date’:

das = []
dts = []
for input_file in files:
        dt = date_from_filename(input_file)
        dts.append(dt)
        df = pd.read_csv(input_file, header=0, index_col=0, na_values=99999.0)
        df = df.round(2)
        df.columns = round_coords(df.columns)
        df.index = round_coords(df.index)
        da = xr.DataArray(df.values,
                coords=[[float(lat) for lat in df.index], [float(lon) for lon in df.columns]],
                dims=['latitude', 'longitude'])
        das.append(da)

da = xr.concat(das, pd.Index(dts, name='date'))

xarray_read_all_csv_files

At this point, we should have a three-dimensional array named ‘da’ we can use to analyze and visualize land surface temperatures from 2001 to 2016. Let’s check to make sure the array has the expected dimensions and appears to have the right content:

da.shape

da

The temperature values are in degrees Celsius. Nearly everyone in the world learns this temperature scale, except for people in the United States. Since many readers live outside the United States, I’m going to leave the values in degrees Celsius; however, converting them to degrees Fahrenheit is straightforward:

da.values = (da.values * 1.8) + 32
da = da.round(2)

Average Land Surface Temperatures

xarray extends pandas and numpy functionality to facilitate multi-dimensional indexing, grouping, and computing. As an example, we can calculate the average land surface temperatures across all 192 months and display them on a map with the following code:

da.mean(dim='date').plot(figsize=(10, 6));
plt.show()

192-month-average-plate-caree

The PlateCaree projection is a nice default, but let’s explore some other map projections.

cartography_west_wing

Remembering the scene about projections in the television show, The West Wing, here is the same data displayed on a map with the Mollweide projection:

plt.figure(figsize=(10, 6))
ax_p = plt.gca(projection=ccrs.Mollweide(), aspect='auto')
da.mean(dim='date').plot.imshow(ax=ax_p, transform=ccrs.PlateCarree());
plt.show()

192-month-average-mollweide

As an additional example, the following code block displays the data on a map with the Robinson projection. This example also illustrates some of the additional arguments you can supply to the plot.imshow function to customize the plot:

fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111, projection=ccrs.Robinson(), aspect='auto')
da.mean(dim='date').plot.imshow(ax=ax, transform=ccrs.PlateCarree(),
                x='longitude', y='latitude',
                robust=True, # vmin=-25, vmax=40,
                cmap='RdBu_r',
                add_colorbar=True,
                extend='both');
plt.show()

192-month-average-robinson

By Specific Geographic Area

The previous examples displayed maps of the entire Earth. In some cases, you may only be interested in a specific segment of the globe. In these cases, you can use array indexing to filter for the subset of data you want or use cartopy’s set_extent function to restrict the map to a specific geographic area.

If you use array indexing, be sure to check the ordering of your array’s axes so you place your index values or ranges in the right positions. For example, in our ‘da’ array the ordered dimensions are date, latitude, and longitude (which we can check with da.shape), so the indexing in the following command selects all dates, latitudes between 20.05 and 50.05, and longitudes between -66.50 and -125.05:

usa = da.loc[:, 50.05:20.05, -125.05:-66.50]
usa.mean(dim='date').plot();
plt.show()

usa-192-month-average-plate-caree

Alternatively, we can use cartopy’s set_extent function to restrict the map to a specific segment of the globe:

plt.figure(figsize=(10, 6))
ax_p = plt.gca(projection=ccrs.LambertConformal(), aspect='auto')
ax_p.set_extent([-125.05, -66.50, 20.05, 50.05])
usa.mean(dim='date').plot.imshow(ax=ax_p, transform=ccrs.PlateCarree());
plt.show()

usa-192-month-average-lambert-conformal

By Month of the Year

The previous plots calculated average land surface temperatures across all 192 months, which doesn’t let us see temperature differences among months of the year, i.e. January, February, …, December. To calculate average temperatures for each month, we can use xarray’s groupby function to group our data by month of the year and then calculate average temperatures for these groups:

by_month = da.groupby(da.date.dt.month).mean(dim='date')
by_month.plot(x='longitude', y='latitude', col='month', col_wrap=4);
plt.show()

12-monthly-averages-plate-caree-square

By Season

In xarray’s documentation, Joe Hamman demonstrates how to calculate season averages with weighted averages that account for the fact that months have different numbers of days. Slightly adapting his code for our dataset, we can view how global land surface temperatures vary across seasons (to run the code shown below, you’ll first need to copy and paste Joe’s dpm dictionary and leap_year and get_dpm functions):

month_length = xr.DataArray(get_dpm(da.date.to_index(), calendar='noleap'),
                coords=[da.date], name='month_length')
weights = month_length.groupby('date.season') / month_length.groupby('date.season').sum()
np.testing.assert_allclose(weights.groupby('date.season').sum().values, np.ones(4))

da_weighted = (da * weights).groupby('date.season').sum(dim='date')

fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(15,4))
for i, season in enumerate(('DJF', 'MAM', 'JJA', 'SON')):
        if i == 3:
                da_weighted.sel(season=season).plot(
                        ax=axes[i], robust=True, cmap='RdBu_r', #'Spectral_r',
                        add_colorbar=True, extend='both')
        else:
                da_weighted.sel(season=season).plot(
                        ax=axes[i], robust=True, cmap='RdBu_r',
                        add_colorbar=False, extend='both')

for i, ax in enumerate(axes.flat):
        if i > 0:
                ax.axes.get_xaxis().set_ticklabels([])
                ax.axes.get_yaxis().set_ticklabels([])
                ax.set_ylabel('')
                ax.set_xlabel('')

axes[0].set_title('Weighted by DPM')

4-seasonal-averages-plate-caree-wide

Looping through Months of the Year

The previous examples generated static images. While you can certainly scan over the month of year and season-based plots to inspect differences among the time periods, it can be helpful to generate the plots in a loop so you can focus on a geographic area of interest and let the program handle transitioning from one time period to the next.

Since we’re going to loop over time periods, e.g. months of the year, I’d like to label each plot so we know which time period is being displayed. In our dataset, the months are numbered from 1 to 12. I want to be able to refer to January instead of month 1, so let’s create a dictionary that maps the month number to the corresponding name:

months = {
        1: 'January',
        2: 'February',
        3: 'March',
        4: 'April',
        5: 'May',
        6: 'June',
        7: 'July',
        8: 'August',
        9: 'September',
        10: 'October',
        11: 'November',
        12: 'December'
}

Next, let’s write a function that will generate each plot, e.g. one for each month of the year. Inside the function, let’s use matplotlib’s clf function to clear the current figure, create a plot axis with the Robinson projection, filter for the subset of arrays with the specified month of the year, create a plot of the average land surface temperatures in the month across all sixteen years, and finally use the name of the month as the plot title:

def plot_monthly(month_number):
        plt.clf()
        ax_p = plt.gca(projection=ccrs.Robinson(), aspect='auto')
        d = da.loc[da.date.dt.month == month_number]
        d.mean(dim='date').plot.imshow(ax=ax_p, transform=ccrs.PlateCarree())
        plt.title('{}'.format(months[month_number]))

To generate the twelve month-based plots, let’s use matplotlib’s ion and figure functions to turn on matplotlib’s interactive mode and to create an initial figure. Next, let’s establish a for loop to iterate through the integers 1 to 12. As we loop through the integers, we’ll pass each one into our plot_monthly function so it creates a plot based on the data for that month. Since we’re using interactive mode, we need to use matplotlib’s pause function, which pauses the figure for the specified number of seconds, to facilitate the transition behavior. Similarly, we need to use the draw function to update the figure after each transition:

plt.ion()
plt.figure(figsize=(10, 6))
for month in range(1,13):
        plot_monthly(month)
        plt.pause(0.1)
        plt.draw()

12-monthly-averages

Conclusion

This post demonstrated how to acquire, analyze, and visualize sixteen years’ worth of global land surface temperature data with Python. Along the way, it illustrated ways you can use xarray, matplotlib, and cartopy to select, group, aggregate, and plot multi-dimensional data.

The data set we used in this post required a considerable portion of my laptop’s memory, but it still fit in memory. When the data set you want to use doesn’t fit in your computer’s memory, you may want to consider the Python package, Dask, “a flexible parallel computing library for analytic computing”. Dask extends numpy and pandas functionality to larger-than-memory data collections, such as arrays and data frames, so you can analyze your larger-than-memory data with familiar commands.

Finally, while we focused on land surface temperature data in this post, you can use the analysis and visualization techniques we covered here on other data sets. In fact, you don’t even have to leave the website we relied on for this post, NASA’s NEO website. It offers dozens of other environmental data sets, categorized under atmosphere, energy, land, life, and ocean. This post only scratched the surface of what is possible with xarray and NASA’s data. I look forward to hearing about the cool, or hot, ways you use these resources to study our planet : )

Advertisements

Lives On The [Fault] Line: A Geospatial Analysis of the San Andreas Fault in Python

Have you watched the action-packed movie, San Andreas? In the movie, California’s San Andreas Fault triggers a devastating, magnitude nine earthquake, the largest in the state’s history. Rather than critique the movie’s thrilling scenes, in this post, I’d like to explore answers to the question, how many people live within ¼, ½, ¾, and 1 mile of the fault, to demonstrate conducting a geospatial analysis in Python.

To start, let’s set up a dedicated analysis environment and download the input data, including shapefiles for California’s census tracts and the San Andreas Fault, as well as 2016 population data for the census tracts.

Project Environment

To create a dedicated analysis environment, let’s create a new folder for our files and an isolated Python environment with conda or virtualenv. The following commands create a new folder named san_andreas and activate an isolated Python 2.7 environment named geo that contains packages we’ll need, such as pandas, matplotlib, and geopandas.

mkdir san_andreas
cd san_andreas

conda create -n geo python=2.7 pandas matplotlib gdal geopandas ipython jupyter notebook

source activate geo

Now that we’ve set up our analysis environment, let’s download the shapefiles and population data we’ll need for our analysis.

Data

Download Quaternary Faults Shapefile

The geographic data for the San Andreas Fault are available within a shapefile available at the U.S. Geological Survey: https://earthquake.usgs.gov/hazards/qfaults/ The shapefile contains information on many large faults and associated folds in the United States, so we’ll have to extract the specific records associated with the San Andreas fault. For now, let’s download and unzip the USGS’s file with the following two commands:

curl 'https://earthquake.usgs.gov/static/lfs/nshm/qfaults/qfaults.zip' -o qfaults.zip
unzip qfaults.zip -d qfaults

Citation: U.S. Geological Survey and California Geological Survey, 2006, Quaternary fault and fold database for the United States, accessed Jan 4, 2018, from USGS web site: https://earthquake.usgs.gov/hazards/qfaults/

qfaults

Download California Census Tracts Shapefile

The geographic data for California’s census tracts are available within a shapefile available at the U.S. Census Bureau: https://www2.census.gov/geo/tiger/GENZ2016/shp/ The Census Bureau provides cartographic boundary files, simplified representations of geographic areas, for various geographies, such as state, county, census tract, legislative district, school district, and block group. We’ll conduct our analysis at the census tract level so our measurements occur over relatively small geographic areas, but we’ll present our results at the county level since people are more familiar with California’s counties. Let’s download and unzip the Census Bureau’s file with the following two commands:

curl 'https://www2.census.gov/geo/tiger/GENZ2016/shp/cb_2016_06_tract_500k.zip' -o cb_2016_06_tract_500k.zip
unzip cb_2016_06_tract_500k.zip -d cb_2016_06_tract_500k

The filename describes the data in the file. It’s a cartographic boundary (cb_) file from 2016 (2016_) for the State of California (06_) at the census tract level (tract_) at a resolution level of 1:500,000 (500k).

california_census_tracts

Download California Census Tracts’ Populations

The 2016 population data for California’s census tracts are available from the U.S. Census Bureau’s American Community Survey. You can use the American FactFinder’s Guided Search (https://factfinder.census.gov/faces/nav/jsf/pages/guided_search.xhtml) to download the data, or you can use the following command:

curl 'https://api.census.gov/data/2016/acs/acs5?get=B01003_001E&for=tract:*&in=state:06' -o cb_2016_06_tract_B01003.json

The API call describes the data we’re requesting. We’re requesting 2016 total population estimates (B01003_001E) for California’s (state:06) census tracts (tract:*) from the American Community Survey’s 2012-2016 5-year Estimates (acs5). You can view additional API call examples for the ACS’s 5-year estimates at: https://api.census.gov/data/2016/acs/acs5/examples.html

california_census_tracts_population

Geospatial Analysis

We’re finally ready to begin our analysis and estimate how many people live within ¼, ½, ¾, and 1 mile of the San Andreas Fault! To begin, let’s open IPython or a Jupyter Notebook and import the packages we’ll need:

ipython

Let’s import geopandas, pandas, and matplotlib. We’ll need geopandas to read and write spatial data, manage data projections (i.e. mapping coordinates to locations on Earth), and to merge, manipulate, and aggregate spatial data. We’ll need pandas to read the population data and to select, merge, and manage multiple data files. We’ll need matplotlib to create plots of the data and geometries.

from geopandas import read_file
import pandas as pd
import matplotlib.pyplot as plt

Process San Andreas Fault Shapefile

Now we can import, select, and clean the data associated with the San Andreas Fault. We’ll use geopandas’ read_file function to read the shapefile. The file contains data for several large faults and folds in the United States, so let’s search for “san andreas” in the faultname column to filter for the data associated with the San Andreas Fault. The file also includes more columns than we need, so let’s select and rename the columns we want to retain. Finally, let’s use geopandas’ to_crs function to project the data to EPSG:3310, California Albers, which is appropriate for displaying and calculating distances in California.

qfaults = read_file('/Users/clinton/Downloads/qfaults/qfaults.shp')
san_andreas = qfaults.loc[qfaults['faultname'].str.contains('san andreas', case=False), :]
san_andreas_columns_to_keep = ['fault_id', 'section_id', 'faultname', 'sectionnam', 'geometry']
san_andreas = san_andreas[san_andreas_columns_to_keep]
san_andreas.columns = ['fault_id', 'section_id', 'fault_name', 'section_name', 'geometry']
san_andreas = san_andreas.to_crs('+init=epsg:3310')

san_andreas_fault

Create San Andreas Fault Buffers

We’re going to create buffers of varying distances around the San Andreas linestring to calculate the amount of overlap between each buffer and census tract. We’ll use this amount of overlap to estimate the portion of the population in each census tract that’s within a specific distance of the fault. Since we’re going to create several buffers, let’s write a function to create the buffers.

Inside the function, create_mp_buffer, we use geopandas’ buffer method to create a buffer around the San Andreas linestring that’s a specific number of meters away from the linestring’s coordinates. Once we’ve created this new set of geometries, we use geopandas’ unary_union method to combine them into a single multipolygon.

def create_mp_buffer(geo, meters):
    segments_with_buffers = geo.buffer(meters)
    multi_polygon = segments_with_buffers.unary_union
    return multi_polygon

Now that we have a function to create buffers around the San Andreas Fault, let’s use it to create buffers that are ¼, ½, ¾, and 1 mile away from the fault. The function uses meters instead of miles, so the numbers in the functions are the respective distances in meters.

quarter_mile = create_mp_buffer(san_andreas, 402.336)
half_mile = create_mp_buffer(san_andreas, 804.672)
three_quarter_mile = create_mp_buffer(san_andreas, 1207.008)
one_mile = create_mp_buffer(san_andreas, 1609.34)

san_andreas_buffers_in_san_mateo_and_san_bernardino_counties

Process California Census Tracts Shapefile

Now that we’ve processed the fault data, let’s turn our attention to the California census tracts data. The processing is similar to the fault data. We read the shapefile, rename the columns, convert the county and tract IDs to integers, and project the data to EPSG:3310, California Albers. We need to convert the county and tract IDs to a specific data type to facilitate the merge between these data and the population data. Finally, we need to project these data to California Albers because all of our geographic data need to be in the same projection to ensure our geometric manipulations, set operations, and distance calculations are correct for our area of interest, California.

ca_tracts = read_file('/Users/clinton/Downloads/cb_2016_06_tract_500k/cb_2016_06_tract_500k.shp')
ca_tracts.columns = ['state_id', 'county_id', 'tract_id', 'aff_geo_id', 'geo_id', 'tract_id_float', 'lsad', 'land_area', 'water_area', 'geometry']
ca_tracts['county_id'] = ca_tracts.county_id.astype(int)
ca_tracts['tract_id'] = ca_tracts.tract_id.astype(int)
ca_tracts = ca_tracts.to_crs('+init=epsg:3310')

Process California Census Tracts’ Populations

Now we can turn our attention to the population data. Let’s use pandas to read the data into a data frame, skipping the first row and selecting the population, county ID, and tract ID columns. Finally, let’s convert the population data into floating-point numbers and the county and tract IDs into integers to facilitate calculations and data frame merges, respectively.

ca_tracts_population = pd.read_json('/Users/clinton/Downloads/cb_2016_06_tract_500k/cb_2016_06_tract_B01003.json')
ca_tracts_population = ca_tracts_population.iloc[1:,[0,2,3]]
ca_tracts_population.columns = ['population_2016', 'county_id', 'tract_id']
ca_tracts_population['population_2016'] = ca_tracts_population.population_2016.astype(float)
ca_tracts_population['county_id'] = ca_tracts_population.county_id.astype(int)
ca_tracts_population['tract_id'] = ca_tracts_population.tract_id.astype(int)

Merge California Census Tracts and Populations

Now that we have a GeoDataFrame with California’s census tracts and a separate DataFrame with the census tracts’ 2016 population values, let’s merge the two data frames so all of the data are in one GeoDataFrame. Since there are similar tract ID numbers for different counties, e.g. county 1 tract 1 and county 2 tract 1, we need to merge the data frames on both county ID and tract ID.

ca_tracts_merged = ca_tracts.merge(ca_tracts_population, on=['county_id', 'tract_id'])

Calculate Populations In The ¼, ½, ¾, and 1 Mile Buffers

Now that we have our California census tracts data and our four San Andreas Fault buffers, let’s calculate, for each buffer region, how much of the buffer area overlaps with each census tract area. Then we can multiply the amount of area overlap by the census tract population to estimate the number of people in the census tract who live within that distance of the San Andreas Fault.

This calculation assumes the population is evenly distributed across the census tract, which isn’t necessarily true, so the result is only an approximation. At the same time, we’re using census tracts instead of counties for this calculation because, since their geographic areas are smaller, the error in this assumption shouldn’t be as great as it would be with counties.

The following for loop iterates over the four fault buffers (i.e. ¼, ½, ¾, and 1 mile from the fault) and, for each one, calculates the area of intersection between the buffer and each census tract, divides the intersection area by the census tract area to calculate the fraction of the census tract area contained in the intersection, and then multiplies this decimal number by the census tract population to estimate the number of people who live within the specified distance from the fault. The code also adds all of these calculated geometries and values as columns in a new GeoDataFrame named merged.

overlap_mps = [quarter_mile, half_mile, three_quarter_mile, one_mile]
overlap_mps_str = ['quarter_mile', 'half_mile', 'three_quarter_mile', 'one_mile']

for idx, mp in enumerate(overlap_mps):
    overlap = ca_tracts_merged['geometry'].intersection(mp)
    overlap.name = overlap_mps_str[idx]
    if idx == 0:
        merged = ca_tracts_merged.join(overlap)
        merged['tract_area'] = merged.geometry.area
    else:
        merged = merged.join(overlap)
    merged[overlap_mps_str[idx]+'_buffer_area'] = [geo.area for geo in merged[overlap_mps_str[idx]]]
    merged[overlap_mps_str[idx]+'_pct_overlap'] = merged[overlap_mps_str[idx]+'_buffer_area'] /     merged['tract_area']
    merged[overlap_mps_str[idx]+'_affected_pop'] = [round(val) for val in     merged[overlap_mps_str[idx]+'_pct_overlap'] * merged['population_2016']]

Up to this point, we’ve been focused on the census tracts so we haven’t concerned ourselves with having easy-to-read county names. However, since people are more familiar with counties than census tracts, let’s map the county IDs to county names so we can present the results at the county level. Let’s extract the county ID from the geo_id and then create a new column named county that contains the county name, mapped from a dictionary that associates county IDs with county names. The comment line shows where we need to create the dictionary, but I’m going to provide the dictionary at the bottom of this post because it’s long and may be distracting here.

merged['county_id'] = merged['geo_id'].str.slice(2,5)
# CREATE county_mapping HERE
merged['county'] = merged['county_id'].apply(lambda id: county_mapping[id])

Aggregate Data To County Level

Now that we have a column of county names, we can use geopandas’ dissolve function to aggregate the data from the census tract level to the county level. We’ll use the sum function to sum the population values for each of the distances from the fault within each county.

counties = merged.dissolve(by='county', aggfunc='sum')

Results

We’re finally in a position to explore answers to the question that prompted this analysis, namely, how many people live within ¼, ½, ¾, and 1 mile from the San Andreas Fault! First, let’s review the state-wide results. The results suggest that approximately 120,000 people live within ¼ mile, 209,000 people live within ½ mile, 300,000 people live within ¾ mile, and 389,000 people live within one mile of the fault.

counties.loc[counties['one_mile_affected_pop'] > 0.0, ['quarter_mile_affected_pop', 'half_mile_affected_pop', 'three_quarter_mile_affected_pop', 'one_mile_affected_pop']].sum()

population_within_distances_of_san_andreas_fault

Next, let’s review the results by county, for counties where the approximate number of people living within one mile of the fault is greater than 1,000. The results suggest the four counties with the most people living close to the fault are San Mateo, San Bernardino, Los Angeles, and Riverside, with the close populations numbering in the tens of thousands. The remaining counties with close populations over 1,000 include Santa Cruz, Santa Clara, Kern, San Benito, San Luis Obispo, Sonoma, Marin, and Monterey.

counties.loc[counties['one_mile_affected_pop'] > 1000.0, ['quarter_mile_affected_pop', 'half_mile_affected_pop', 'three_quarter_mile_affected_pop', 'one_mile_affected_pop']].sort_values(by=['one_mile_affected_pop'], ascending=False)

populations_within_distances_of_san_andreas_fault_by_county

Conclusion

This post explored the question of how many people live within ¼, ½, ¾, and 1 mile from the San Andreas Fault to demonstrate how to use geopandas to conduct a geospatial analysis in Python. The post is meant to illustrate the functionality you can use to explore interesting geospatial questions, rather than provide robust answers to this specific question. There are many other applications for this type of analysis, e.g. exploring the number of people or houses near a coastline, a roadway or transit line, or a utility line. I hope this post has piqued your interest in conducting your own geospatial analysis. If you do have an example to share, please share it because I enjoy reading about others’ projects. Thank you for reading!

San Andreas Fault

By IkluftOwn work, GFDL, Link

county_mapping = {
'001': 'Alameda',
'003': 'Alpine',
'005': 'Amador',
'007': 'Butte',
'009': 'Calaveras',
'011': 'Colusa',
'013': 'Contra Costa',
'015': 'Del Norte',
'017': 'El Dorado',
'019': 'Fresno',
'021': 'Glenn',
'023': 'Humboldt',
'025': 'Imperial',
'027': 'Inyo',
'029': 'Kern',
'031': 'Kings',
'033': 'Lake',
'035': 'Lassen',
'037': 'Los Angeles',
'039': 'Madera',
'041': 'Marin',
'043': 'Mariposa',
'045': 'Mendocino',
'047': 'Merced',
'049': 'Modoc',
'051': 'Mono',
'053': 'Monterey',
'055': 'Napa',
'057': 'Nevada',
'059': 'Orange',
'061': 'Placer',
'063': 'Plumas',
'065': 'Riverside',
'067': 'Sacramento',
'069': 'San Benito',
'071': 'San Bernardino',
'073': 'San Diego',
'075': 'San Francisco',
'077': 'San Joaquin',
'079': 'San Luis Obispo',
'081': 'San Mateo',
'083': 'Santa Barbara',
'085': 'Santa Clara',
'087': 'Santa Cruz',
'089': 'Shasta',
'091': 'Sierra',
'093': 'Siskiyou',
'095': 'Solano',
'097': 'Sonoma',
'099': 'Stanislaus',
'101': 'Sutter',
'103': 'Tehama',
'105': 'Trinity',
'107': 'Tulare',
'109': 'Tuolumne',
'111': 'Ventura',
'113': 'Yolo',
'115': 'Yuba'
}

Parsing PDFs in Python with Tika

A few months ago, one of my friends asked me if I could help him extract some data from a collection of PDFs. The PDFs contained records of his financial transactions over a period of years and he wanted to analyze them. Unfortunately, Excel and plain text versions of the files were no longer available, so the PDFs were his only option.

I reviewed a few Python-based PDF parsers and decided to try Tika, which is a port of Apache Tika.  Tika parsed the PDFs quickly and accurately. I extracted the data my friend needed and sent it to him in CSV format so he could analyze it with the program of his choice. Tika was so fast and easy to use that I really enjoyed the experience. I enjoyed it so much I decided to write a blog post about parsing PDFs with Tika.

tika

California Budget PDFs

To demonstrate parsing PDFs with Tika, I knew I’d need some PDFs. I was thinking about which ones to use and remembered a blog post I’d read on scraping budget data from a government website. Governments also provide data in PDF format, so I decided it would be helpful to demonstrate how to parse data from PDFs available on a government website. This way, with these two blog posts, you have examples of acquiring government data, even if it’s embedded in HTML or PDFs. The three PDFs we’ll parse in this post are:

2015-16 State of California Enacted Budget Summary Charts
2014-15 State of California Enacted Budget Summary Charts
2013-14 State of California Enacted Budget Summary Charts

ca_budget

Each of these PDFs contains several tables that summarize total revenues and expenditures, general fund revenues and expenditures, expenditures by agency, and revenue sources. For this post, let’s extract the data on expenditures by agency and revenue sources. In the 2015-16 Budget PDF, the titles for these two tables are:

2015-16 Total State Expenditures by Agency

expenditures

2015-16 Revenue Sources

revenues

To follow along with the rest of this tutorial you’ll need to download the three PDFs and ensure you’ve installed Tika. You can download the three PDFs here:

http://www.ebudget.ca.gov/2015-16/pdf/Enacted/BudgetSummary/SummaryCharts.pdf
http://www.ebudget.ca.gov/2014-15/pdf/Enacted/BudgetSummary/SummaryCharts.pdf
http://www.ebudget.ca.gov/2013-14/pdf/Enacted/BudgetSummary/SummaryCharts.pdf

You can install Tika by running the following command in a Terminal window:

pip install --user tika

IPython

Before we dive into parsing all of the PDFs, let’s use one of the PDFs, 2015-16CABudgetSummaryCharts.pdf, to become familiar with Tika and its output. We can use IPython to explore Tika’s output interactively:

ipython

from tika import parser

parsedPDF = parser.from_file("2015-16CABudgetSummaryCharts.pdf")

You can type the name of the variable, a period, and then hit tab to view a list of all of the methods available to you:

parsedPDF.

ipython1

There are many options related to keys and values, so it appears the variable contains a dictionary. Let’s view the dictionary’s keys:

parsedPDF.viewkeys()

parsedPDF.keys()

The dictionary’s keys are metadata and content. Let’s take a look at the values associated with these keys:

parsedPDF["metadata"]

The value associated with the key “metadata” is another dictionary. As you’d expect based on the name of the key, its key-value pairs provide metadata about the parsed PDF.

ipython2

Now let’s take a look at the value associated with “content”.

parsedPDF["content"]

The value associated with the key “content” is a string. As you’d expect, the string contains the PDF’s text content.

ipython3

Now that we know the types of objects and values Tika provides to us, let’s write a Python script to parse all three of the PDFs. The script will iterate over the PDF files in a folder and, for each one, parse the text from the file, select the lines of text associated with the expenditures by agency and revenue sources tables, convert each of these selected lines of text into a Pandas DataFrame, display the DataFrame, and create and save a horizontal bar plot of the totals column for the expenditures and revenues. So, after you run this script, you’ll have six new plots, one for revenues and one for expenditures for each of the three PDF files, in the folder in which you ran the script.

Python Script

To parse the three PDFs, create a new Python script named parse_pdfs_with_tika.py and add the following lines of code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
import glob
import os
import re
import sys
import pandas as pd
import matplotlib
matplotlib.use('AGG')
import matplotlib.pyplot as plt
pd.options.display.mpl_style = 'default'

from tika import parser

input_path = sys.argv[1]

def create_df(pdf_content, content_pattern, line_pattern, column_headings):
    """Create a Pandas DataFrame from lines of text in a PDF.

    Arguments:
    pdf_content -- all of the text Tika parses from the PDF
    content_pattern -- a pattern that identifies the set of lines
    that will become rows in the DataFrame
    line_pattern -- a pattern that separates the agency name or revenue source
    from the dollar values in the line
    column_headings -- the list of column headings for the DataFrame
    """
    list_of_line_items = []
    # Grab all of the lines of text that match the pattern in content_pattern
    content_match = re.search(content_pattern, pdf_content, re.DOTALL)
    # group(1): only keep the lines between the parentheses in the pattern
    content_match = content_match.group(1)
    # Split on newlines to create a sequence of strings
    content_match = content_match.split('\n')
    # Iterate over each line
    for item in content_match:
        # Create a list to hold the values in the line we want to retain
        line_items = []
        # Use line_pattern to separate the agency name or revenue source
        # from the dollar values in the line
        line_match = re.search(line_pattern, item, re.I)
        # Grab the agency name or revenue source, strip whitespace, and remove commas
        # group(1): the value inside the first set of parentheses in line_pattern
        agency = line_match.group(1).strip().replace(',', '')
        # Grab the dollar values, strip whitespace, replace dashes with 0.0, and remove $s and commas
        # group(2): the value inside the second set of parentheses in line_pattern
        values_string = line_match.group(2).strip().\
        replace('- ', '0.0 ').replace('$', '').replace(',', '')
        # Split on whitespace and convert to float to create a sequence of floating-point numbers
        values = map(float, values_string.split())
        # Append the agency name or revenue source into line_items
        line_items.append(agency)
        # Extend the floating-point numbers into line_items so line_items remains one list
        line_items.extend(values)
        # Append line_item's values into list_of_line_items to generate a list of lists;
        # all of the lines that will become rows in the DataFrame
        list_of_line_items.append(line_items)
    # Convert the list of lists into a Pandas DataFrame and specify the column headings
    df = pd.DataFrame(list_of_line_items, columns=column_headings)
    return df

def create_plot(df, column_to_sort, x_val, y_val, type_of_plot, plot_size, the_title):
    """Create a plot from data in a Pandas DataFrame.

    Arguments:
    df -- A Pandas DataFrame
    column_to_sort -- The column of values to sort
    x_val -- The variable displayed on the x-axis
    y_val -- The variable displayed on the y-axis
    type_of_plot -- A string that specifies the type of plot to create
    plot_size -- A list of 2 numbers that specifies the plot's size
    the_title -- A string to serve as the plot's title
    """
    # Create a figure and an axis for the plot
    fig, ax = plt.subplots()
    # Sort the values in the column_to_sort column in the DataFrame
    df = df.sort_values(by=column_to_sort)
    # Create a plot with x_val on the x-axis and y_val on the y-axis
    # type_of_plot specifies the type of plot to create, plot_size
    # specifies the size of the plot, and the_title specifies the title
    df.plot(ax=ax, x=x_val, y=y_val, kind=type_of_plot, figsize=plot_size, title=the_title)
    # Adjust the plot's parameters so everything fits in the figure area
    plt.tight_layout()
    # Create a PNG filename based on the plot's title, replace spaces with underscores
    pngfile = the_title.replace(' ', '_') + '.png'
    # Save the plot in the current folder
    plt.savefig(pngfile)

# In the Expenditures table, grab all of the lines between Totals and General Government
expenditures_pattern = r'Totals\n+(Legislative, Judicial, Executive.*?)\nGeneral Government:'

# In the Revenues table, grab all of the lines between 2015-16 and either Subtotal or Total
revenues_pattern = r'\d{4}-\d{2}\n(Personal Income Tax.*?)\n +[Subtotal|Total]'

# For the expenditures, grab the agency name in the first set of parentheses
# and grab the dollar values in the second set of parentheses
expense_pattern = r'(K-12 Education|[a-z,& -]+)([$,0-9 -]+)'

# For the revenues, grab the revenue source in the first set of parentheses
# and grab the dollar values in the second set of parentheses
revenue_pattern = r'([a-z, ]+)([$,0-9 -]+)'

# Column headings for the Expenditures DataFrames
expense_columns = ['Agency', 'General', 'Special', 'Bond', 'Totals']

# Column headings for the Revenues DataFrames
revenue_columns = ['Source', 'General', 'Special', 'Total', 'Change']

# Iterate over all PDF files in the folder and process each one in turn
for input_file in glob.glob(os.path.join(input_path, '*.pdf')):
    # Grab the PDF's file name
    filename = os.path.basename(input_file)
    print filename
    # Remove .pdf from the filename so we can use it as the name of the plot and PNG
    plotname = filename.strip('.pdf')

    # Use Tika to parse the PDF
    parsedPDF = parser.from_file(input_file)
    # Extract the text content from the parsed PDF
    pdf = parsedPDF["content"]
    # Convert double newlines into single newlines
    pdf = pdf.replace('\n\n', '\n')

    # Create a Pandas DataFrame from the lines of text in the Expenditures table in the PDF
    expense_df = create_df(pdf, expenditures_pattern, expense_pattern, expense_columns)
    # Create a Pandas DataFrame from the lines of text in the Revenues table in the PDF
    revenue_df = create_df(pdf, revenues_pattern, revenue_pattern, revenue_columns)
    print expense_df
    print revenue_df

    # Print the total expenditures and total revenues in the budget to the screen
    print "Total Expenditures: {}".format(expense_df["Totals"].sum())
    print "Total Revenues: {}\n".format(revenue_df["Total"].sum())

    # Create and save a horizontal bar plot based on the data in the Expenditures table
    create_plot(expense_df, "Totals", ["Agency"], ["Totals"], 'barh', [20,10], \
    plotname+"Expenditures")
    # Create and save a horizontal bar plot based on the data in the Revenues table
    create_plot(revenue_df, "Total", ["Source"], ["Total"], 'barh', [20,10], \
    plotname+"Revenues")

Save this code in a file named parse_pdfs_with_tika.py in the same folder as the one containing the three CA Budget PDFs. Then you can run the script on the command line with the following command:

./parse_pdfs_with_tika.py .

I added docstrings to the two functions, create_df and create_plot, and comments above nearly every line of code in an effort to make the code as self-explanatory as possible. I created the two functions to avoid duplicating code because we perform these operations twice for each file, once for revenues and once for expenditures. We use a for loop to iterate over the PDFs and for each one we extract the lines of text we care about, convert the text into a Pandas DataFrame, display some of the DataFrame’s information, and save plots of the total values in the revenues and expenditures tables.

Results

Terminal Output
(1 of 3 pairs of DataFrames)

terminal_output

PNG File: Expenditures by Agency 2015-16
(1 of 6 PNG Files)

2015-16CABudgetSummaryChartsExpenditures

In this post I’ve tried to convey that Tika is a great resource for parsing PDFs by demonstrating how you can use it to parse budget data from PDF documents provided by a government agency. As my friend’s experience illustrates, there may be other situations in which you need to extract data from PDFs. With Tika, PDFs become another rich source of data for your analysis.

Pandashells: Data Science with Python on the Command Line

I often find myself using a variety of unix commands, perl / sed / awk one-liners, and snippets of Python code to combine, clean, analyze, and visualize data. Switching between the command line tools and Python breaks up my workflow because I have to step away from the command line to run the Python code in the interpreter or a script.

That’s why, when I learned about Pandashells last year I got excited because it’s a set of tools for using Python, Pandas, and other members of the Python data stack on the command line. Since Pandashells is a bash API to Pandas, Statsmodels, Seaborn, and other libraries, it’s easy to integrate the work you’d do with these Python packages into your command line workflow.

pandashells_overview

Pandashells has a range of tools that enable you to accomplish many common data processing, analysis, and visualization tasks. The main tool is p.df, which loads your tabular data into a Pandas dataframe so you can use your favorite Pandas commands right on the command line. In addition, p.merge enables you to merge files. p.linspace and p.rand enable you to create linearly-spaced and random numbers. p.regress and p.regplot, p.plot, p.hist, p.facet_grid, and p.cdf enable you to perform multivariate linear regression and create a collection of standard plots.

A nice feature of Pandashells is that it comes with several example datasets, including election, electoral_college, sealevel, and tips, and its GitHub page presents several well-commented examples that help you get familiar with the syntax. The examples show you how to chain multiple Pandashells commands and combine them with other command line tools like head.

pandashells_example_data

Ever since I started using Pandashells, I’ve enjoyed being able to integrate Python code into my command line workflow. I’ve used it to merge datasets, parse and re-format dates, filter for specific rows and columns, create new columns and dummy variables, explore and summarize the data, create graphs and plots, and prepare the data for predictive modeling.

In this post I’d like to demonstrate how to use Pandashells to accomplish a variety of common data processing, analysis, and visualization tasks on the command line. First, I’ll present the commands according to the tasks you’re going to accomplish so the commands are short and you can skip to the tasks you’re interested in. Then, at the end of this post, I’ll provide an example of chaining several commands to prepare a dataset for predictive modeling.

 

INSTALL PANDASHELLS

To follow along with this post, you will need to install Pandashells, which you can do with one of the following commands, documented on its GitHub page:

pip install pandashells

pip install pandashells[console]

pip install pandashells[full]

conda install -c https://conda.anaconda.org/robdmc pandashells

 

 

DATA

The dataset we’ll use in these examples is the familiar customer churn dataset. You can download a copy of the dataset here: churn.csv

pandashells_churn_dataset

 

TUTORIAL

DELETE AN APOSTROPHE

To begin, one of the column headings in the file, Int’l Plan, contains an apostrophe. Pandashells has trouble with apostrophes in column headings because you enclose Pandashells commands in apostrophes, so let’s delete it.

The first sed command prints the result to the screen so you can confirm that it’s going to delete the apostrophe in the column heading. The second sed command uses the -i flag to actually make the change to the file in-place.

sed -e "s/'//g" churn.csv | head

sed -i -e "s/'//g" churn.csv

 

VIEW FIRST FEW ROWS OF THE DATAFRAME

Now that the file is ready, let’s use Pandashells to read the data into a Pandas dataframe and take a look at the header row and the first five data rows.

cat churn.csv | p.df 'df.head()'

pandashells_head

 

VIEW NUMBER OF ROWS AND COLUMNS

How many rows and columns does the dataset have? The shape command returns the number of rows and columns as a tuple. However, the p.df tool wants to return output as a dataframe and it has trouble converting the tuple into a dataframe, so assign the row value into a new column called rows, assign the column value into a new column called columns, and view the numbers in the first row using head(1). The “-o table” means you want to output the results in table format.

cat churn.csv | p.df 'df["rows"], df["columns"] = df.shape' 'df[["rows", "columns"]]' 'df.head(1)' -o table

pandashells_rows_columns

 

CHANGE COLUMN HEADINGS

The column headings contain a mix of uppercase and lowercase letters and the headings that are two words contain a space between the words. Let’s standardize the column headings by changing all of them to uppercase and converting any spaces to underscores.

cat churn.csv | p.df 'df.rename(columns={"State":"STATE", "Account Length":"ACCOUNT_LENGTH", "Area Code":"AREA_CODE", "Phone":"PHONE", "Intl Plan":"INTL_PLAN", "VMail Plan":"VMAIL_PLAN", "VMail Message":"VMAIL_MESSAGE", "Day Mins":"DAY_MINS", "Day Calls":"DAY_CALLS", "Day Charge":"DAY_CHARGE", "Eve Mins":"EVE_MINS", "Eve Calls":"EVE_CALLS", "Eve Charge":"EVE_CHARGE", "Night Mins":"NIGHT_MINS", "Night Calls":"NIGHT_CALLS", "Night Charge":"NIGHT_CHARGE", "Intl Mins":"INTL_MINS", "Intl Calls":"INTL_CALLS", "Intl Charge":"INTL_CHARGE", "CustServ Calls":"CUSTSERV_CALLS", "Churn?":"CHURN?"})' 'df.head()'

 

REMOVE ROWS THAT CONTAIN NaNs

This dataset doesn’t contain NaNs, but here are examples of how to eliminate rows that contain NaNs. The first command eliminates rows that have NaNs in any columns. The second command ensures there aren’t NaNs in specific columns.

cat churn.csv | p.df 'df[df.notnull()]' 'df.head()'

cat churn.csv | p.df 'df[df["Churn?"].notnull()]' 'df[df["Account Length"].notnull()]' 'df.head()'

 

KEEP ROWS WHERE VALUES MEET CONDITIONS

You often want to filter a dataset for rows with values that meet specific conditions. The first command filters for rows where the Account Length is greater than 145. The second command filters for rows where the International Charge is less than 2 and the Day Charge is greater than 45.

cat churn.csv | p.df 'df[df["Account Length"] > 145]' 'df.head()'

cat churn.csv | p.df 'df[(df["Intl Charge"] < 2.0) & (df["Day Charge"] > 45.0)]' 'df.head()'

 

KEEP ROWS WHERE VALUES ARE / ARE NOT IN A SET

In some cases you want to filter for rows where the values in a column are or are not in a specific set. The first command filters for rows where the value in the International Plan column is “yes”. The second command uses a tilde ‘~’ to negate the expression and filter for rows where the value in the column is NOT “yes”. I’ve found this second syntax useful in situations where a column contains a small set of invalid values. Then you can use the second command to eliminate rows that have these invalid values in the column.

cat churn.csv | p.df 'df[df["Intl Plan"].isin(["yes"])]' 'df.head()'

cat churn.csv | p.df 'df[~df["Intl Plan"].isin(["yes"])]' 'df.head()'

 

KEEP ROWS WHERE VALUES MATCH A PATTERN

In some cases you want to filter for rows where the values in a column match a specific pattern. You can filter for rows matching a pattern using startswith, endswith, and contains to specify where to look for the pattern. The first command filters for rows where the first letter in the State column is a capital K. The second command finds rows where the text in the State column contains a capital K.

cat churn.csv | p.df 'df[df["State"].str.startswith("K")]' 'df.head()'

cat churn.csv | p.df 'df[df["State"].str.contains("K")]' 'df.head()'

 

KEEP SPECIFIC COLUMNS

Sometimes a dataset contains more columns than you need. You can specify which columns to retain by specifying them as a list. The following command restricts the output to nine specific columns.

cat churn.csv | p.df 'df[["Account Length", "Intl Plan", "VMail Plan", "Day Charge", "Eve Charge", "Night Charge", "Intl Charge", "CustServ Calls", "Churn?"]]' 'df.head()'

 

CREATE NEW VARIABLES / COLUMNS

One common operation is creating new columns. You can create a new column by writing an expression on the right hand side of the equals sign that generates the values for the column and then assigning the values to the new column you specify on the left hand side of the equals sign.

The first command uses the existing “Churn?” column to create a new column called “churn”. The values in the “Churn?” column are True. and False., so the expression uses NumPy’s “where” function to convert the True.s and False.s into 1s and 0s, respectively. The second command creates a new column called “total_calls” that is the sum of the values in the day, evening, night, and international calls columns. Similarly, the third command creates a new column called “total_charges” that is the sum of the values in the day, evening, night, and international charges columns.

cat churn.csv | p.df 'df["churn"] = np.where(df["Churn?"] == "True.", 1, 0)' 'df.head()'

cat churn.csv | p.df 'df["total_calls"] = df["Day Calls"] + df["Eve Calls"] + df["Night Calls"] + df["Intl Calls"]' 'df.head()'

cat churn.csv | p.df 'df["total_charges"] = df["Day Charge"] + df["Eve Charge"] + df["Night Charge"] + df["Intl Charge"]' 'df.head()'

 

CREATE CATEGORICAL VARIABLE FROM VALUES IN ANOTHER COLUMN

One fun operation is creating a column for a new categorical variable that’s based on values in another column. You can do so using a list comprehension and if-else logic. For example, the following command uses the values in the “State” column to create a new categorical variable called “us_regions” that categorizes the states into the Census Bureau’s four designated regions: Northeast, Midwest, South, and West. As another example, I’ve used this type of command create a categorical variable of failure types based on keywords / substrings in another column containing verbose failure descriptions.

cat churn.csv | p.df 'df["us_regions"] = ["Northeast" if ("CT" in str(state).upper() or "ME" in state or "MA" in state or "NH" in state or "RI" in state or "VT" in state or "NJ" in state or "NY" in state or "PA" in state) else "Midwest" if ("IL" in state or "IN" in state or "MI" in state or "OH" in state or "WI" in state or "IA" in state or "KS" in state or "MN" in state or "MO" in state or "NE" in state or "ND" in state or "SD" in state) else "South" if ("DE" in state or "FL" in state or "GA" in state or "MD" in state or "NC" in state or "SC" in state or "VA" in state or "DC" in state or "WV" in state or "AL" in state or "KY" in state or "MS" in state or "TN" in state or "AR" in state or "LA" in state or "OK" in state or "TX" in state) else "West" for state in df["State"]]' 'df.head()'

 

CREATE INDICATOR / DUMMY VARIABLES

Sometimes you want to convert a categorical variable into a set of indicator / dummy variables and add them to the existing dataframe. You can use Pandas’ get_dummies() function to create dummy variables and the concat() function to add them as new columns to the existing dataframe. For example, the following command uses our previous code to create a categorical variable called “us_regions” and then uses the get_dummies() and concat() functions to create four new indicator variables based on the values in “us_regions” and add them to the existing dataframe.

cat churn.csv | p.df 'df["us_regions"] = ["Northeast" if ("CT" in str(state).upper() or "ME" in state or "MA" in state or "NH" in state or "RI" in state or "VT" in state or "NJ" in state or "NY" in state or "PA" in state) else "Midwest" if ("IL" in state or "IN" in state or "MI" in state or "OH" in state or "WI" in state or "IA" in state or "KS" in state or "MN" in state or "MO" in state or "NE" in state or "ND" in state or "SD" in state) else "South" if ("DE" in state or "FL" in state or "GA" in state or "MD" in state or "NC" in state or "SC" in state or "VA" in state or "DC" in state or "WV" in state or "AL" in state or "KY" in state or "MS" in state or "TN" in state or "AR" in state or "LA" in state or "OK" in state or "TX" in state) else "West" for state in df["State"]]' 'pd.concat([df, pd.get_dummies(df.us_regions)], axis=1)' 'df.head()'

 

ENSURE SPECIFIC DATE FORMAT (YYYY-MM-DD HH:MM:SS)

This dataset doesn’t contain dates, but here’s an example of ensuring that the dates in a column have a specific format. I added a newline after the word “contains” to make the command easier to read, but you wouldn’t include the newline when you use the command.

cat my_data.csv | p.df 'df[df["date_column"].str.contains
("[0-9]{1,4}-[0-9]{1,2}-[0-9]{1,2} [0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2}")]'

 

RESTRICT TO SPECIFIC DATETIME RANGE

Here is an example of restricting your dataset to a specific date range. The following command ensures the values in the “date_column” column are more recent than “2009-12-31 23:59:59”. That is, it eliminates data from before 2010.

cat my_data.csv | p.df 'df[df["date_column"] > "2009-12-31 23:59:59"]'

 

CALCULATE NUMBER OF DAYS BETWEEN TWO DATES

Here is an example of calculating the number of days between two dates. The new column is called “diff_in_days”. The values in the column are the number of days between the values in two columns called “recent_date_column” and “older_date_column”. To calculate the difference, I use the strptime() function inside two list comprehensions to convert the text values in the two date columns into datetime objects. Next, I use the zip function to pair up the two datetime objects in each row. Finally, I use the expression “str(i-j).split()[0] if “days” in str(i-j) else 1″ to subtract one date, i, from the other, j, convert the result of the subtraction into a string, and split the string on whitespace and extract the number portion if it contains the word “days” otherwise assign the value 1. For example, if the result of the subtraction is “10 days” I want the new column “diff_in_days” to contain the number 10. I added several newlines to make the command easier to read, but you wouldn’t include the newlines when you use the command.

cat my_data.csv | p.df 'df["diff_in_days"] =
[str(i-j).split()[0] if "days" in str(i-j) else 1
for i, j in zip(
[datetime.datetime.strptime(recent_date, "%Y-%m-%d %H:%M:%S")
for recent_date in df.recent_date_column],
[datetime.datetime.strptime(older_date, "%Y-%m-%d %H:%M:%S")
for older_date in df.older_date_column])]'

pandashells_dates_csv

pandashells_dates_output

 

UNIQUE VALUES IN A COLUMN

Now we can return to analyzing our churn dataframe. The following command enables you to view the unique values in the “Churn?” column. The “-o table” option means you want to display the output in table format, as opposed to csv or another format, and the “index” option means you want to display the titles for the rows in the output.

cat churn.csv | p.df 'sorted(df["Churn?"].unique())' -o table index

 

VALUE COUNTS FOR UNIQUE VALUES IN A COLUMN

The following command enables you to view the unique values in the “VMail Plan” column, as well as the number of times each of the values appears in the dataset. As with the previous command, it’s helpful to display the output in table format and to display the titles for the rows in the output.

cat churn.csv | p.df 'df["VMail Plan"].value_counts()' -o table index

 

DESCRIPTIVE STATISTICS FOR A COLUMN

The following commands demonstrate how to compute descriptive statistics for categorical and numeric columns. The statistics for a categorical variable are the count of observations, the number of unique values in the column, the top / most frequently occurring value, and the frequency of the top occurring value. The statistics for a numeric variable are count of observations, mean, standard deviation, minimum, 25th percentile, 50th percentile / median, 75th percentile, and maximum. You can use “.T” to transpose the output, as shown in the second command.

cat churn.csv | p.df 'df[["Churn?"]].describe()' -o table index

cat churn.csv | p.df 'df[["Intl Charge"]].describe().T' -o table index

 

CROSSTABS

The following command shows how to create a crosstab table. The values in the “Churn?” column are the rows in the table, and the values in the “Intl Plan” column are the columns in the table. By default, the values in the table are counts of observations in each category, but you can specify additional data and an aggregation function to calculate different statistics for each category.

cat churn.csv | p.df 'pd.crosstab(df["Churn?"], df["Intl Plan"])' -o table index

 

GROUP BY

Sometimes you want to view separate statistics for different categories in a categorical variable, e.g. separate descriptive statistics for men and women. The following two commands show you how to use the groupby() function to group the data by the values in the “Churn?” column and calculate descriptive statistics for the two groups. The first command calculates descriptive statistics for a categorical variable, “Intl Plan”, separately for those who churned, “True.”, and those who did not churn, “False.”. Similarly, the second command calculates descriptive statistics for a numeric variable, “Intl Charge”.

cat churn.csv | p.df 'df.groupby("Churn?")[["Intl Plan"]].describe().unstack("Churn?")' -o table index

cat churn.csv | p.df 'df.groupby("Churn?")[["Intl Charge"]].describe().unstack("Churn?")' -o table index

 

PIVOT TABLES

The following two commands illustrate how to create pivot tables. Both commands display statistics about the values in the “Intl Charge” column, grouped by two categorical variables, “Churn?” and “Intl Plan”. The “Churn?” values are the rows in the output table and the “Intl Plan” values are the columns in the output table. The first command displays the “count” of the “Intl Charge” values in each of the categories, and the second command displays the “mean” of the “Intl Charge” values in each of the categories.

cat churn.csv | p.df 'df.pivot_table(values=["Intl Charge"], index=["Churn?"], columns=["Intl Plan"], aggfunc="count")' -o table index

cat churn.csv | p.df 'df.pivot_table(values=["Intl Charge"], index=["Churn?"], columns=["Intl Plan"], aggfunc="mean")' -o table index

 

BAR CHART

The following command combines the p.df tool with the p.hist tool to display a histogram of the values in the “churn” column. The command uses NumPy’s where() function to ensure the values in the new column are numeric. The “-o csv” option means the column should be outputted in csv format. I added a newline before the “–savefig” argument so it’s separate from the rest of the command, but you can remove the newline and include it in the command if you want to save the figure to a file called bar_chart.png in a folder called plots.

cat churn.csv | p.df 'df["churn"] = np.where(df["Churn?"] == "True.", 1, 0)' 'df["churn"]' -o csv | p.hist --ylabel 'Count' --xlabel 'Churn? (0: No; 1: Yes)' --title 'Bar Chart of Dependent Variable: Churn?' --theme 'darkgrid' --palette 'muted'
--savefig 'plots/bar_chart.png'

pandashells_bar_chart

 

FACET GRID

The following command shows how you can use facet grid to create separate plots based on a categorical variable. The –col “Intl Plan” argument indicates you want to create separate plots for the categories in the “Intl Plan” column. The –args “churn” argument indicates you want to display the “churn” data in the plots. The –map pl.hist argument indicates you want to display histograms of the “churn” data. Again, I added a newline before the “–savefig” argument.

cat churn.csv | p.df 'df["churn"] = np.where(df["Churn?"] == "True.", 1, 0)' 'df[["churn", "Intl Plan"]]' -o csv | p.facet_grid --col "Intl Plan" --args "churn" --map pl.hist
--savefig 'plots/bar_chart_facet.png'

pandashells_bar_chart_facet

 

WRITE DATA TO A FILE

A very common operation is to write your cleaned data to a new output file once you’re finished processing it. The following two commands show different ways to write to an output file. The first command uses Pandas’ to_csv() function to write the data to a file called “dataset_cleaned.csv”. I include the index=False argument so it doesn’t write an additional column of row index values to the output file. The second command uses the “-o csv” option to output the data in CSV format and the greater than sign to redirect the output into the output file.

cat churn.csv | p.df 'df[["Account Length", "Intl Plan", "VMail Plan", "Day Charge", "Eve Charge", "Night Charge", "Intl Charge", "CustServ Calls", "Churn?"]]' 'df.to_csv("dataset_cleaned.csv", index=False)'

cat churn.csv | p.df 'df[["Account Length", "Intl Plan", "VMail Plan", "Day Charge", "Eve Charge", "Night Charge", "Intl Charge", "CustServ Calls", "Churn?"]]' -o csv > dataset_clean.csv

 

PUTTING IT ALL TOGETHER

I’ve demonstrated a variety of ways to clean, analyze, and visualize your data on the command line with Pandashells. In the preceding examples, I refrained from chaining several commands together into a long workflow so we could focus on short snippets of code that accomplish individual tasks.

In this last example, I present an extended workflow that prepares our churn dataset for predictive modeling. I present the workflow twice. The first time I add newlines after each of the commands so each one and the entire workflow is easier to read. The second time I present the commands as you’d actually write them so it’s easy to copy and paste them into a Terminal window.

The first command renames the column headings to uppercase and replaces spaces with underscores. The second command deletes any rows containing NaN values.

The next three commands use NumPy’s where() function to create three new numeric variables based on columns that contain text values. The next two commands create two new variables that are the sum of the day, evening, night, and international calls and charges columns, respectively. The next command creates a categorical variable that categorizes the values in the States column into the Census Bureau’s four designated regions. The next command uses Pandas’ get_dummies() and concat() functions to create indicator variables for the four regions and add them to the existing dataframe.

The penultimate command selects a subset of the columns in the dataframe. The final command writes the eleven selected columns to an output file called “churn_cleaned.csv”. Since I didn’t add the argument index=False, the file also contains an additional column of row index values. The argument index_label=”ID” gives that column a heading.

NEWLINES AFTER EACH COMMAND

cat churn.csv | p.df
'df.rename(columns={"State":"STATE", "Account Length":"ACCOUNT_LENGTH", "Area Code":"AREA_CODE", "Phone":"PHONE", "Intl Plan":"INTL_PLAN", "VMail Plan":"VMAIL_PLAN", "VMail Message":"VMAIL_MESSAGE", "Day Mins":"DAY_MINS", "Day Calls":"DAY_CALLS", "Day Charge":"DAY_CHARGE", "Eve Mins":"EVE_MINS", "Eve Calls":"EVE_CALLS", "Eve Charge":"EVE_CHARGE", "Night Mins":"NIGHT_MINS", "Night Calls":"NIGHT_CALLS", "Night Charge":"NIGHT_CHARGE", "Intl Mins":"INTL_MINS", "Intl Calls":"INTL_CALLS", "Intl Charge":"INTL_CHARGE", "CustServ Calls":"CUSTSERV_CALLS", "Churn?":"CHURN?"})'
'df[df.notnull()]'
'df["CHURN"] = np.where(df["CHURN?"] == "True.", 1, 0)'
'df["INT_PLAN"] = np.where(df["INTL_PLAN"] == "yes", 1, 0)'
'df["VM_PLAN"] = np.where(df["VMAIL_PLAN"] == "yes", 1, 0)'
'df["TOTAL_CALLS"] = df["DAY_CALLS"] + df["EVE_CALLS"] + df["NIGHT_CALLS"] + df["INTL_CALLS"]'
'df["TOTAL_CHARGES"] = df["DAY_CHARGE"] + df["EVE_CHARGE"] + df["NIGHT_CHARGE"] + df["INTL_CHARGE"]'
'df["USA_REGIONS"] = ["NORTHEAST" if ("CT" in str(state).upper() or "ME" in state or "MA" in state or "NH" in state or "RI" in state or "VT" in state or "NJ" in state or "NY" in state or "PA" in state) else "MIDWEST" if ("IL" in state or "IN" in state or "MI" in state or "OH" in state or "WI" in state or "IA" in state or "KS" in state or "MN" in state or "MO" in state or "NE" in state or "ND" in state or "SD" in state) else "SOUTH" if ("DE" in state or "FL" in state or "GA" in state or "MD" in state or "NC" in state or "SC" in state or "VA" in state or "DC" in state or "WV" in state or "AL" in state or "KY" in state or "MS" in state or "TN" in state or "AR" in state or "LA" in state or "OK" in state or "TX" in state) else "WEST" for state in df["STATE"]]'
'pd.concat([df, pd.get_dummies(df.USA_REGIONS)], axis=1)'
'df[["CHURN", "ACCOUNT_LENGTH", "INT_PLAN", "VM_PLAN", "TOTAL_CALLS", "TOTAL_CHARGES", "CUSTSERV_CALLS", "NORTHEAST", "MIDWEST", "SOUTH", "WEST"]]'
'df.to_csv("churn_cleaned.csv", index_label="ID")'

 

ACTUAL COMMANDS (READY FOR COPY AND PASTE)

cat churn.csv | p.df 'df.rename(columns={"State":"STATE", "Account Length":"ACCOUNT_LENGTH", "Area Code":"AREA_CODE", "Phone":"PHONE", "Intl Plan":"INTL_PLAN", "VMail Plan":"VMAIL_PLAN", "VMail Message":"VMAIL_MESSAGE", "Day Mins":"DAY_MINS", "Day Calls":"DAY_CALLS", "Day Charge":"DAY_CHARGE", "Eve Mins":"EVE_MINS", "Eve Calls":"EVE_CALLS", "Eve Charge":"EVE_CHARGE", "Night Mins":"NIGHT_MINS", "Night Calls":"NIGHT_CALLS", "Night Charge":"NIGHT_CHARGE", "Intl Mins":"INTL_MINS", "Intl Calls":"INTL_CALLS", "Intl Charge":"INTL_CHARGE", "CustServ Calls":"CUSTSERV_CALLS", "Churn?":"CHURN?"})' 'df[df.notnull()]' 'df["CHURN"] = np.where(df["CHURN?"] == "True.", 1, 0)' 'df["INT_PLAN"] = np.where(df["INTL_PLAN"] == "yes", 1, 0)' 'df["VM_PLAN"] = np.where(df["VMAIL_PLAN"] == "yes", 1, 0)' 'df["TOTAL_CALLS"] = df["DAY_CALLS"] + df["EVE_CALLS"] + df["NIGHT_CALLS"] + df["INTL_CALLS"]' 'df["TOTAL_CHARGES"] = df["DAY_CHARGE"] + df["EVE_CHARGE"] + df["NIGHT_CHARGE"] + df["INTL_CHARGE"]' 'df["USA_REGIONS"] = ["NORTHEAST" if ("CT" in str(state).upper() or "ME" in state or "MA" in state or "NH" in state or "RI" in state or "VT" in state or "NJ" in state or "NY" in state or "PA" in state) else "MIDWEST" if ("IL" in state or "IN" in state or "MI" in state or "OH" in state or "WI" in state or "IA" in state or "KS" in state or "MN" in state or "MO" in state or "NE" in state or "ND" in state or "SD" in state) else "SOUTH" if ("DE" in state or "FL" in state or "GA" in state or "MD" in state or "NC" in state or "SC" in state or "VA" in state or "DC" in state or "WV" in state or "AL" in state or "KY" in state or "MS" in state or "TN" in state or "AR" in state or "LA" in state or "OK" in state or "TX" in state) else "WEST" for state in df["STATE"]]' 'pd.concat([df, pd.get_dummies(df.USA_REGIONS)], axis=1)' 'df[["CHURN", "ACCOUNT_LENGTH", "INT_PLAN", "VM_PLAN", "TOTAL_CALLS", "TOTAL_CHARGES", "CUSTSERV_CALLS", "NORTHEAST", "MIDWEST", "SOUTH", "WEST"]]' 'df.to_csv("churn_cleaned.csv", index_label="ID")'

pandashells_workflow

With Pandashells, we were able to quickly read the raw data into a Pandas dataframe, clean the data, create new variables, filter for specific rows and columns, and write the cleaned data to a new output file without leaving the command line. Now, if we were so inclined, we could write a skll configuration file and run a collection of predictive models on the data from the command line. Conveniently, if our workflow involves additional command line operations or tools it’s easy to combine them with the code we’ve presented because Pandashells was designed to integrate well with existing command line tools.

I hope this post has given you ideas on how to use Pandashells to integrate Python, Pandas, and other data stack commands into your existing command line workflows. I’ve enjoyed using Pandashells for data merging and cleaning, quick ad-hoc analysis, and analysis and workflow proto-typing. I still use Python scripts when they’re more appropriate for the project, but it’s been a lot of fun performing common analysis tasks on the command line with Pandashells.

Foundations for Analytics with Python: From Non-programmer to Hacker

I’m excited to share that O’Reilly Media is about to publish my new book, Foundations for Analytics with Python: From Non-programmer to Hacker. The book is geared toward people who have no prior programming experience but deal with data every day and are interested in learning how to scale and automate their work.

Foundations for Analytics with Python by Clinton Brownley, PhD

I did not have a background in programming. I learned it on the job because I recognized it would enable me to automate repetitive actions and accomplish tasks that would be time-consuming or impossible with my current skill set. I read countless books, online tutorials, and blog posts in those first few weeks and months as I attempted to get my first program for work to do something useful for me. It’s difficult to fully describe how exhilarating and empowering it was when I finally got the program to work correctly. Needless to say, I was hooked, and I haven’t looked back.

I wrote the book with a few objectives in mind:

  • Be accessible to ambitious non-programmers
  • Be practical, so you can immediately see how you can use the code at work
  • Teach fundamental programming concepts and techniques and also provide alternative, shorter code that performs the same actions
  • Make the learning curve as short and shallow as possible so you can enjoy the fruits of your labor as quickly as possible

The book’s features reflect these objectives:

  • Each section focuses on one specific task, so you can learn how to accomplish that task without distractions
  • Each section is a complete, self-contained program, so you don’t have to remember to combine a bunch of code snippets to make them work
  • In the CSV and Excel chapters, each section of code has two versions, a base Python version and a Pandas version. The base Python version teaches you fundamental concepts and techniques. The Pandas version shortens and simplifies the code you need to write to accomplish the task
  • Uses the Anaconda Python 3 distribution, which bundles the newest version of Python with some of the most popular add-in packages
  • Includes screen shots of the input files, Python code, command line, and output files
  • Common data formats, including plain text, CSV, and Excel files, and databases
  • Common data processing tasks, including filtering for specific rows, selecting specific columns, and calculating summary statistics
  • Chapters on data analysis, plotting and graphing, and automation
  • Three real-world applications that illustrate how you can combine and extend techniques from earlier chapters to accomplish important data processing tasks
  • Both Windows and Mac commands and screen shots

To give you a feel for the book, let me provide a few sections of code from the book and the table of contents. The first section of code comes from the CSV chapter, the second section of code from the Excel chapter, and the third section of code from the Database chapter.  The brief comments after each section of code are for this blog post, they are not in the book.  If you want to see what other topics are included in the book, please see the table of contents at the bottom of this post.

Example Section #1: CSV Files

Reading and Writing a CSV File

Version #1: Base Python

#!/usr/bin/env python3
import csv
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

with open(input_file, 'r', newline='') as csv_in_file:
    with open(output_file, 'w', newline='') as csv_out_file:
        filereader = csv.reader(csv_in_file, delimiter=',')
        filewriter = csv.writer(csv_out_file, delimiter=',')
        for row_list in filereader:
            filewriter.writerow(row_list)

Version #1 demonstrates how to read a CSV input file with base Python’s standard csv module and write the contents to a CSV output file. In the book, I explain every line of code. This first example gives you the ability to transfer all of your data to an output file. The subsequent examples in the chapter show you how to select specific data to write to the output file and how to process multiple CSV files.

Version #2: Pandas Add-in Module

#!/usr/bin/env python3
import sys
import pandas as pd

input_file = sys.argv[1]
output_file = sys.argv[2]

data_frame = pd.read_csv(input_file)
print(data_frame)
data_frame.to_csv(output_file, index=False)

Version #2 demonstrates how to accomplish the same task with Pandas. As you can see, you simply use read_csv to read the input file and to_csv to write to the output file.

Example Section #2: Excel Files

Reading and Writing an Excel Worksheet

Version #1: xlrd and xlwt Add-in Modules

#!/usr/bin/env python3
import sys
from xlrd import open_workbook
from xlwt import Workbook

input_file = sys.argv[1]
output_file = sys.argv[2]

output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('output_worksheet_name')

with open_workbook(input_file) as workbook:
    worksheet = workbook.sheet_by_name('input_worksheet_name')
    for row_index in range(worksheet.nrows):
        for column_index in range(worksheet.ncols):
            output_worksheet.write(row_index, column_index, \
                worksheet.cell_value(row_index, column_index))
output_workbook.save(output_file)

Version #1 demonstrates how to read and write an Excel worksheet with base Python and the xlrd and xlwt add-in modules. Again, this first example gives you the ability to transfer all of the data on one worksheet to an output file. The subsequent examples in the chapter show you how to select specific data to write to the output file, how to process multiple worksheets, and how to process multiple workbooks.

Version #2: Pandas Add-in Module

#!/usr/bin/env python3
import pandas as pd
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

data_frame = pd.read_excel(input_file, sheetname='input_worksheet_name')
writer = pd.ExcelWriter(output_file)
data_frame.to_excel(writer, sheet_name='output_worksheet_name', index=False)
writer.save()

Version #2 demonstrates how to accomplish the same task with Pandas. Again, you simply use read_excel to read the input worksheet and to_excel to write to the output worksheet.

Example Section #3: Databases

Query a table and write results to a file

#!/usr/bin/env python
import csv
import MySQLdb
import sys

output_file = sys.argv[1]

con = MySQLdb.connect(host='localhost', port=3306, db='my_suppliers', user='my_username', passwd='my_password')
c = con.cursor()

filewriter = csv.writer(open(output_file, 'wb'), delimiter=',')
header = ['Supplier Name','Invoice Number','Part Number','Cost','Purchase Date']
filewriter.writerow(header)

c.execute("""SELECT * FROM Suppliers WHERE Cost > 700.0;""")
rows = c.fetchall()
for row in rows:
    filewriter.writerow(row)

This example demonstrates how to connect to a database, query a table, and write the resulting data to a CSV output file. Other examples in the chapter explain how to load data into a database table from a file and update records in a table based on data in a file.

I hope these examples give you a feel for the book. If you want to see what other topics are included in the book, please see the table of contents shown below. Foundations for Analytics with Python is scheduled to be available in May 2016. Please keep an eye out for it, and if you know other people who may be interested please point them to this blog post and the Amazon link.  Thank you : )

 

TABLE OF CONTENTS

CHAPTER
Introduction
Why Read This Book/Why Learn These Skills
Who Is This Book For
Why Windows
Why Python
Install Anaconda Python
Text Editors
Download Book Materials
Base Python and Pandas
Overview of Chapters

CHAPTER
Python Basics
How To Create a Python Script
How To Run a Python Script
Numbers
Strings
Regular Expressions/Pattern Matching
Dates
Lists
Tuples
Dictionaries
Control Flow
Functions
Exceptions
Reading a Text File
Reading Multiple Text Files with Glob
Writing to a Text File
Writing to a Comma Separated Values “CSV” File
Print Statements

CHAPTER
Comma Separated Values “CSV” Text Files
Reading and Writing a CSV File (String Manipulation)
Reading and Writing a CSV File (Standard csv Module)
Filtering for Rows
    Value in Row Meets a Condition
    Value in Row is in a Set of Interest
    Value in Row Matches a Pattern (Regular Expression)
Selecting Columns
    Index Values
    Column Headings
Reading Multiple CSV Files
    Count Number of Files and Rows and Columns in Each File
    Concatenate Data From Multiple Files
    Sum and Average a Set of Values Per File
Selecting Contiguous Rows
Adding a Header Row

CHAPTER
Microsoft Excel Files
Introspecting an Excel Workbook
Reading a Single Worksheet
    Formatting Dates
    Filtering for Rows
        Value in Row Meets a Condition
        Value in Row is in a Set of Interest
        Value in Row Matches a Pattern (Regular Expression)
    Selecting Columns
        Index Values
        Column Headings
Reading All Worksheets
    Filtering for Rows from All Worksheets
    Selecting Columns from All Worksheets
Reading a Subset of Worksheets
    Filtering for Rows from Subset of Worksheets
    Selecting Columns from Subset of Worksheets
Reading Multiple Workbooks
    Count Number of Workbooks and Rows and Columns in Each Workbook
    Concatenate Data from Multiple Workbooks
    Sum and Average a Set of Values Per Worksheet Per Workbook

CHAPTER
Databases
Python’s Standard sqlite3 Module
    Create a Database
    Create a Database Table
    Insert Hand-written Data into a Database Table
    Query a Database Table
    Insert Data from a CSV File into a Database Table
    Update Records in a Database Table with Data from a CSV File
MySQL Database
    Create a Database
    Create a Database Table
    Insert Data from a CSV File into a Database Table
    Query a Database Table and Write Output to a CSV File
    Update Records in a Database Table with Data from a CSV File

CHAPTER
Applications
Find a Set of Items in a Large Collection of Excel and CSV Files
Parse a CSV File and Calculate a Statistic for Any Number of Categories
Parse a Text File and Calculate a Statistic for Any Number of Categories

CHAPTER
Graphing and Plotting
matplotlib
pandas
ggplot
seaborn

CHAPTER
Data Analysis
Descriptive statistics
Regression
Classification

CHAPTER
Automation
Windows: scheduled tasks
Mac: cron jobs

CHAPTER
Conclusion
Where To Go From Here
    Additional Built-Ins/Standard Modules
    Additional Add-In Modules
    Data Structures
How To Go From Here

APPENDIX
Downloads
Python
xlrd
mysqlclient/MySQL-python/MySQLdb
MySQL

Intro to Julia: Reading and Writing CSV Files with R, Python, and Julia

Last year I read yhat’s blog post, Neural networks and a dive into Julia, which provides an engaging introduction to Julia, a high-level, high-performance programming language for technical computing.

One aspect of the language I found intriguing was its aim to be as fast as C, as easy to use as Python, and as easy for statistics as R. I enjoyed seeing that Julia’s syntax is similar to Python, it has several graphing packages, including a ggplot2-inspired package called Gadfly, and it has a several structured data, statistics, and machine learning packages, including DataFrames for dealing with tabular data and StatsBase and MLBase that provide tools for statistics and machine learning operations.

There are lots of great resources for learning Julia. There are introductory books, like “Getting Started with Julia Programming,” by Ivo Balbaert, and “The Julia Express,” by Bogomil Kaminski. There are online tutorials, like Programming in Julia, Julia by Example, Learn Julia in Y minutes, and Learn Julia the Hard Way. There are also video tutorials, including two “Introduction to Julia” videos by David Sanders at SciPy 2014 and a set of ten Julia video tutorials recorded at MIT in 2013.

Since I’ve been using Python and R to analyze data, and Julia aspires to make the best features of these languages available in one place, I decided to try Julia to see if it would be worthwhile to incorporate it into my toolbox. One of the first things I wanted to learn was the new Julia syntax I’d need to use to perform the operations I’ve been carrying out in Python and R. Some of the most common operations I perform are reading text and delimited input files and writing results to output files. Since these are very common operations, let’s discuss how to perform these operations in R, Python, and Julia. In a later post we can discuss different ways to filter for specific rows and columns in these languages.

To begin, let’s create a folder to work in and name it “workspace”. Next, let’s download a publicly-available data set, e.g. wine-quality, into the folder. Let’s also create another folder called “output” inside the workspace folder where we can save the output files. At this point, we have the following set up:

folder_structure

R
Now that we have our workspace and an input file, let’s create R, Python, and Julia scripts to read the input data and write it to an output file. To create the R script, open a text editor and enter the following code:

#!/usr/bin/env Rscript
# For more information, visit: cbrownley.wordpress.com

#Collect the command line arguments into a variable called args
args <- commandArgs(trailingOnly = TRUE)
# Assign the first command line argument to a variable called input_file
input_file <- args[1]
# Assign the second command line argument to a variable called output_file
output_file <- args[2]

# Use R’s read.csv function to read the data into a variable called wine
# read.csv expects a CSV file with a header row, so
# sep = ',' and header = TRUE are default values
# stringsAsFactors = FALSE means don’t convert character vectors into factors
wine <- read.csv(input_file, sep = ',', header = TRUE, stringsAsFactors = FALSE)

# Use R’s write.csv function to write the data in the variable wine to the output file
# row.names = FALSE means don’t write an extra column of row names
# to the output file; we only want the original data columns
write.csv(wine, file = output_file, row.names = FALSE)

read_csv_R

Once you’ve pasted this code into the file, save the file as “read_csv.R” in the workspace folder and close the file. You can run this script by typing the following two commands on the command line, hitting Enter after each one:
chmod +x read_csv.R
./read_csv.R winequality-red.csv output/output_R.csv

When you run the script you won’t see any output printed to the screen, but the input data was written to a file called output_R.csv in the output folder.

A popular R package for reading and managing data is the data.table package. To use the data.table package instead of base R in the script, all you would need to do is add one require statement and edit the line that reads the contents of the input file into a variable:

#!/usr/bin/env Rscript
require(data.table)

args <- commandArgs(trailingOnly = TRUE)
input_file <- args[1]
output_file <- args[2]

wine <- fread(input_file)

write.csv(wine, file = output_file, row.names = FALSE)

To use this script instead of the first version, all you would need to do is save the file, e.g. as “read_csv_data_table.R”, run the same chmod command on this file, and then substitute this R script in the last command shown above:
./read_csv_data_table.R winequality-red.csv output/output_R_data_table.csv

Python
Now let’s create a Python script to perform the same operations. To create the Python script, open a text editor and enter the following code:

#!/usr/bin/env python
# For more information, visit: cbrownley.wordpress.com

# Import Python's built-in csv and sys modules, which have functions
# for processing CSV files and command line arguments, respectively
import csv
import sys

# Assign the first command line argument to a variable called input_file
input_file = sys.argv[1]
# Assign the second command line argument to a variable called output_file
output_file = sys.argv[2]

# Open the input file for reading and close automatically at end
with open(input_file, 'rU') as csv_in_file:
    # Open the output file for writing and close automatically at end
    with open(output_file, 'wb') as csv_out_file:
        # Create a file reader object for reading all of the input data
        filereader = csv.reader(csv_in_file)
        # Create a file writer object for writing to the output file
        filewriter = csv.writer(csv_out_file)
        # Use a for loop to process the rows in the input file one-by-one
        for row in filereader:
            # Write the row of data to the output file
            filewriter.writerow(row)

read_csv_Python

Once you’ve pasted this code into the file, save the file as “read_csv.py” and close the file. You can run this script by typing the following two commands on the command line, hitting Enter after each one:
chmod +x read_csv.py
./read_csv.py winequality-red.csv output/output_Python.csv

When you run the script you won’t see any output printed to the screen, but the input data was written to a file called output_Python.csv in the output folder.

A popular Python package for reading and managing tabular data is Pandas. Pandas provides many helpful functions, a couple of which simplify the syntax needed to read and write CSV files. For example, to perform the same reading and writing operations we performed above, the Pandas syntax is:

#!/usr/bin/env python
import sys
import pandas as pd

input_file = sys.argv[1]
output_file = sys.argv[2]

data_frame = pd.read_csv(input_file)
data_frame.to_csv(output_file, index=False)

To use this script instead of the first version, all you would need to do is save the file, e.g. as “read_csv_pandas.py”, run the same chmod command on this file, and then substitute this Python script in the last command shown above:
./read_csv_pandas.py winequality-red.csv output/output_Python_Pandas.csv

Julia
Now let’s create a Julia script to perform the same operations. To create the Julia script, open a text editor and enter the following code:

#!/usr/bin/env julia
# For more information, visit: cbrownley.wordpress.com

# Assign the first command line argument to a variable called input_file
input_file = ARGS[1]
# Assign the second command line argument to a variable called output_file
output_file = ARGS[2]

# Open the output file for writing
out_file = open(output_file, "w")
# Open the input file for reading and close automatically at end
open(input_file, "r") do in_file
    # Use a for loop to process the rows in the input file one-by-one
    for line in eachline(in_file)
        # Write the row of data to the output file
        write(out_file, line)
    # Close the for loop
    end
# Close the input file handle
end
# Close the output file handle
close(out_file)

read_csv_Julia

Once you’ve pasted this code into the file, save the file as “read_csv.jl” and close the file. You can run this script by typing the following two commands on the command line, hitting Enter after each one:
chmod +x read_csv.jl
./read_csv.jl winequality-red.csv output/output_Julia.csv

When you run the script you won’t see any output printed to the screen, but the input data was written to a file called output_Julia.csv in the output folder.

A popular Julia package for reading and managing tabular data, especially when the data may contain NAs, is DataFrames. DataFrames provides many helpful functions, a couple of which simplify the syntax needed to read and write CSV files. For example, to perform the same reading and writing operations we performed above, the DataFrames syntax is:

#!/usr/bin/env julia
using DataFrames

input_file = ARGS[1]
output_file = ARGS[2]

data_frame = readtable(input_file, separator = ',')
writetable(output_file, data_frame)

To use this script instead of the first version, all you would need to do is save the file, e.g. as “read_csv_data_frames.jl”, run the same chmod command on this file, and then substitute this Julia script in the last command shown above:
./read_csv_data_frames.jl winequality-red.csv output/output_Julia_DataFrames.csv

folder_structure_all_files

As you can see, when it comes to reading, processing, and writing CSV files, the differences in syntax between Python and Julia are very slight. For example, Python’s “with open()” statements are “open() do … end” statements in Julia, and for loops in Julia drop the colon required in Python and instead require the end keyword. These differences are so minor that I’ve found it very easy to pick up Julia syntax and transition back and forth between Python and Julia.

Now that we know how to read and write all of the data in a CSV-formatted input file with R, Python, and Julia, the next step is to figure out how to filter for specific rows and columns in these languages. Then we can move on to processing lots of files in a directory and also dealing with Excel files. We’ll cover these topics in future posts.

To Read or Not to Read? Hopefully the Former!

It has been far too long since I posted to this blog.  It’s time to let you know what I’ve been working on over the past few months.  I’ve been writing a book on a topic that is near and dear to my heart.

The book is titled Multi-objective Decision Analysis: Managing Trade-offs and Uncertainty.  It is an applied, concise book that explains how to conduct multi-objective decision analyses using spreadsheets.

The book is scheduled to be published by Business Expert Press in 2013.  For a little more information about my forthcoming book, please read the abstract shown below:

“Whether managing strategy, operations, or products, making the best decision in a complex uncertain business environment is challenging.  One of the major difficulties facing decision makers is that they often have multiple, competing objectives, which means trade-offs will need to be made.  To further complicate matters, uncertainty in the business environment makes it hard to explicitly understand how different objectives will impact potential outcomes.  Fortunately, these problems can be solved with a structured framework for multi-objective decision analysis that measures trade-offs among objectives and incorporates uncertainties and risk preferences.

This book is designed to help decision makers by providing such an analysis framework implemented as a simple spreadsheet tool.  This framework helps structure the decision making process by identifying what information is needed in order to make the decision, defining how that information should be combined to make the decision and, finally, providing quantifiable evidence to clearly communicate and justify the final decision.

The process itself involves minimal overhead and is perfect for busy professionals who need a simple, structured process for making, tracking, and communicating decisions.  With this process, decision making is made more efficient by focusing only on information and factors that are well-defined, measureable, and relevant to the decision at hand.  The clear characterization of the decision required by the framework ensures that a decision can be traced and is consistent with the intended objectives and organizational values.  Using this structured decision-making framework, anyone can effectively and consistently make better decisions to gain a competitive and strategic advantage.”

Look for my forthcoming book, Multi-objective Decision Analysis, on the bookshelves in 2013!

Source: favim.com