Simplifying scikit-learn Predictive Modeling with skll APIs and Configuration Files

One blog post I really enjoyed reading last year is yhat’s post, Predicting customer churn with scikit-learn. In the post, the author demonstrates how to estimate, cross-validate, and measure the performance of three predictive models for classification in Python with the scikit-learn package. At the time, I was more familiar with how to do predictive modeling in R with the caret package, as described in the well-written book Applied Predictive Modeling. However, since I also knew some Python and was interested in learning how to use it for predictive modeling, yhat’s post was exactly what I needed to get started.

More recently, I read and followed along with the examples in Jeroen Janssens’s excellent book, Data Science at the Command Line. In Chapter 9, the author demonstrates how to use another Python package, the SciKit-Learn Laboratory (a.k.a. skll) package, to implement predictive modeling with scikit-learn from the command line. The skll package is very helpful because it provides an API for scikit-learn that simplifies the code you need to implement your predictive models. It also enables you to specify your modeling parameters in a configuration file so you can run your models from the command line.

Since I’d become familiar with how to implement predictive modeling with scikit-learn, I really enjoyed learning to use skll’s API and configuration file interfaces. To see how skll’s interfaces simplify the process of using the scikit-learn package to implement predictive modeling, let’s estimate, cross-validate, and measure the performance of four predictive models with scikit-learn and then perform similar operations with skll’s interfaces.

DATA SET

The data set we’ll use in this post comes from the publicly available wine quality data sets in the UCI Machine Learning Repository. There is a file for red wines and a file for white wines. The data set used in this post is these two files concatenated together (with only one header row). The dependent variable in the original research took on integer values representing wine quality, so the data set lends itself to regression tasks. However, in this post we’ll demonstrate how to perform classification tasks by classifying wines as either red or white. You can download the data and prepare the final data set by entering the following commands in a Terminal window:

# Download two CSV files, winequality-red.csv and winequality-white.csv
# from the UCI Machine Learning Repository
# and save the files as wine-red.csv and wine-white.csv
parallel "curl -sL http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-{}.csv > wine-{}.csv" ::: red white

# Check that you now have wine-red.csv and wine-white.csv
# in your current working directory
find . -maxdepth 1 -name "wine*"

# For each file, convert uppercase to lowercase,
# convert semicolons to commas,
# convert spaces to underscores,
# remove double quotes, and
# add a column that says 'red' or 'white' depending on the file
for T in red white; do < wine-$T.csv tr '[A-Z]; ' '[a-z],_' | tr -d \" | sed "s/$/,${T}/" > wine-${T}-clean.csv; done

# Review the first five lines in each file to ensure the changes are correct
head -n 5 wine-{red,white}-clean.csv | fold

# Concatenate the red and white wine files into one file, wine-both-clean.csv
# Retain only one column heading
# Name the last column 'type' since it contains the words 'red' and 'white'
head -n 1 wine-red-clean.csv > wine-both-clean.csv; grep -v quality wine-red-clean.csv >> wine-both-clean.csv; grep -v quality wine-white-clean.csv >> wine-both-clean.csv; sed -i -e "1s/,red/,type/" wine-both-clean.csv

# Review the row counts for red (1,599) and white (4,898) wines
parallel --tag "grep -c {} wine-both-clean.csv" ::: red white

Now that we have our data set, wine-both-clean.csv, let’s move on to estimating our predictive models with scikit-learn.

SCIKIT-LEARN

In order to understand the ways in which skll simplifies the syntax needed to perform predictive modeling with the scikit-learn package, let’s first estimate, cross-validate, and measure the performance of four predictive models with scikit-learn. To do so, copy and paste the following code into a text editor and save the file as classify_wine_scikit_learn.py:

#!/usr/bin/env python
import sys
import time
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import KFold
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression as LogR
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

input_file = sys.argv[1]

# Read the wine quality data into a Pandas data frame
wine_data_frame = pd.read_csv(input_file)

# Create a red wine binary variable named 'y' for classification
wine_type = wine_data_frame['type']
y = np.where(wine_type == 'red', 1., 0.)

# Specify X, the matrix of predictor variables
features = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
        'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
        'ph', 'sulphates', 'alcohol']
wine_features = wine_data_frame[features]
X = wine_features.as_matrix().astype(np.float)
print "\nFeature space holds %d observations and %d features" % (X.shape)
print "\nAccuracy if you predict all 0s (baseline or benchmark): %0.3f" % (accuracy_score(y, [0 for value in y.tolist()]))
print "\n*****************************************\n"

# Center and scale the predictor variables
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Specify a function for k-fold cross-validation
def run_cv(X, y, clf, **kwargs):
    kfold = KFold(len(y), n_folds=10, shuffle=True, random_state=123456789)
    print str(clf)
    y_pred = y.copy()
    t0 = time.time()
    fold = 0
    for train_index, test_index in kfold:
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        clf.fit(X_train, y_train)
        y_pred[test_index] = clf.predict(X_test)
        fold += 1
        print "Finished fold:", str(fold)
    print "Cross-validation took %0.2f seconds." % (time.time() - t0)
    return y_pred

# K NEAREST NEIGHBORS
# Estimate predicted values for k-nearest-neighbors classification model with cross-validation
y_pred_knn = run_cv(X, y, KNN(n_neighbors=6))

# Calculate performance metrics for k-nearest-neighbors classification model
print "\nK-nearest-neighbors (accuracy): " + "%.3f" % accuracy_score(y, y_pred_knn)
print "\nKNN confusion matrix: "
print(confusion_matrix(y, y_pred_knn))
target_names = ['White (0)', 'Red (1)']
print "\nKNN classification report: "
print(classification_report(y, y_pred_knn, target_names=target_names))
print "\n*****************************************\n"

# LOGISTIC REGRESSION
# Estimate predicted values for logistic/logit classification model with cross-validation
y_pred_logr = run_cv(X, y, LogR(random_state=123456789))

# Calculate performance metrics for logistic/logit classification model
print "\nLogistic (accuracy): " + "%.3f" % accuracy_score(y, y_pred_logr)
print "\nLogistic confusion matrix: "
print(confusion_matrix(y, y_pred_logr))
target_names = ['White (0)', 'Red (1)']
print "\nLogistic classification report: "
print(classification_report(y, y_pred_logr, target_names=target_names))
print "\n*****************************************\n"

# RANDOM FOREST
# Estimate predicted values for random forest classification model with cross-validation
y_pred_rf = run_cv(X, y, RFC(n_estimators=500, random_state=123456789))

# Calculate performance metrics for random forest classification model
print "\nRandom forest (accuracy): " + "%.3f" % accuracy_score(y, y_pred_rf)
print "\nRF confusion matrix: "
print(confusion_matrix(y, y_pred_rf))
target_names = ['White (0)', 'Red (1)']
print "\nRF classification report: "
print(classification_report(y, y_pred_rf, target_names=target_names))
print "\n*****************************************\n"

# SUPPORT VECTOR MACHINES
# Specify grid parameters for the support vector machines classifier (SVC)
# Reference: http://scikit-learn.org/0.11/tutorial/statistical_inference/model_selection.html#grid-search
C_values = [0.01, 0.1, 1.0, 10.0, 100.0]
gamma_values = [0.01, 0.1, 1.0, 10.0, 100.0]
param_grid = dict(kernel=['rbf'], gamma=gamma_values, C=C_values)
svc_gscv = GridSearchCV(SVC(cache_size=1000, class_weight='auto', random_state=123456789), param_grid=param_grid, cv=3, scoring='accuracy', n_jobs=-1)

# Estimate predicted values for support vector machines classification model with cross-validation
y_pred_svc = run_cv(X, y, svc_gscv)

# Calculate performance metrics for support vector machines classification model
print "\nSupport vector machines (accuracy): " + "%.3f" % accuracy_score(y, y_pred_svc)
print "\nSVM confusion matrix: "
print(confusion_matrix(y, y_pred_svc))
target_names = ['White (0)', 'Red (1)']
print "\nSVM classification report: "
print(classification_report(y, y_pred_svc, target_names=target_names))

This code is similar to the code discussed in yhat’s post, Predicting customer churn with scikit-learn, so instead of describing the lines of code in detail I’ll summarize the sections of code. At the top, we import all of the functions we’re going to use to manage and transform the data, estimate the models, perform k-fold cross-validation, and evaluate the performance of the models.

Next, we read the data into a Pandas data frame and create a new binary variable named ‘y’ that equals 1.0 for red wines and 0.0 for white wines. We then specify which variables are going to serve as the independent, explanatory, predictor variables and transform them into a matrix. At this point, we print the number of rows and columns in the analysis data set, along with the baseline, benchmark accuracy we achieve by naïvely predicting all of the observations to be zeros (i.e. the majority of the observations are zeros). Hopefully, the performance of our predictive models will surpass this benchmark!

Next, we center and scale the predictor variables so they all have a mean of zero and a standard deviation of one. These transformations help ensure that the learning algorithms and predictive performance of the statistical models aren’t influenced by the predictor variables’ units of measurement.

Next, we define a function for performing k-fold cross-validation. The function specifies a particular random state to be consistent with skll’s default random state. It also includes a few print statements so you can see the folds being processed and see how long the cross-validation takes for different models. You can find an example of stratified k-fold cross-validation, a related procedure that attempts to balance the percentage of observations from each class in each fold, in Bugra’s blog post, An Introduction to Supervised Learning via Scikit Learn. This post also demonstrates how to produce plots of training and test errors, confusion matrices, and variable importance scores.
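For reference, here is a minimal sketch (not part of the original script) of what a stratified variant of the run_cv function could look like with the same scikit-learn 0.15-era API used above:

from sklearn.cross_validation import StratifiedKFold

# A sketch of a stratified version of run_cv (not from the original post):
# StratifiedKFold keeps the proportion of red and white wines roughly the same in every fold
def run_stratified_cv(X, y, clf):
    skfold = StratifiedKFold(y, n_folds=10)
    y_pred = y.copy()
    for train_index, test_index in skfold:
        clf.fit(X[train_index], y[train_index])
        y_pred[test_index] = clf.predict(X[test_index])
    return y_pred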

Finally, we run cross-validation with four predictive models, (1) k-nearest neighbors, (2) logistic regression, (3) random forest, and (4) support vector machines, and report on their performance with an accuracy score, a confusion matrix, and a classification report. The confusion matrix shows the number of correctly classified observations along the main diagonal, and the classification report includes information on precision, recall, f1-score, and support for each category.

One final point is that the support vector machine section demonstrates how to use grid search to select optimal parameters for the model. To be consistent with skll’s grid search defaults, I instructed grid search to use 3-fold cross-validation, a particular random state, and accuracy to determine the optimal values for C and gamma. This 3-fold cross-validation takes place within the 10-fold cross-validation taking place to measure the model’s predictive performance.

Make the script executable by typing the following on the command line and then hitting Enter:
chmod +x classify_wine_scikit_learn.py

Run the script by typing the following on the command line and then hitting Enter:
./classify_wine_scikit_learn.py wine-both-clean.csv

When you run the script you should see the following output printed to your Terminal window:

scikit-learn script output

As you can see, you can achieve an accuracy score of 0.75 by naïvely predicting all of the observations to be zero (4,898 of the 6,497 wines are white, so always predicting ‘white’ is correct about 75 percent of the time). Since the rest of the output runs off of the screen, here are the accuracy scores and processing times for the four predictive models:

Logistic Regression
Accuracy: 0.994
Processing time: 0.18 seconds

K-Nearest Neighbors
Accuracy: 0.994
Processing time: 0.50 seconds

Random Forest
Accuracy: 0.995
Processing time: 36.44 seconds

Support Vector Machines
Accuracy: 0.996
Processing time: 225.42 seconds

The output shows that, in this example, all of the models achieve similar accuracy scores. It also shows that logistic regression completes 3 times faster than k-nearest-neighbors and 1,252 times faster than support vector machines.

Now that we’ve seen how to estimate, cross-validate, and measure the performance of four predictive models in a Python script with scikit-learn, let’s take a look at how to do so with skll’s APIs to scikit-learn.

SKLL API

If you want to follow along with the next two sections, you’ll need to install skll; you can find installation instructions in skll’s documentation. There can be compatibility issues between scikit-learn and skll, depending on the versions you have installed. To follow along with this post, make sure you have scikit-learn==0.15.2 and skll==1.0.1. You can check which versions you have installed by typing pip freeze on the command line and then hitting Enter.
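For example, one way to install those specific versions with pip (a sketch, assuming you’re installing into your active Python environment) is:

pip install scikit-learn==0.15.2
pip install skll==1.0.1

Once you’ve installed skll, copy and paste the following code into a text editor and save the file as classify_wine_skll.py: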

#!/usr/bin/env python
import sys
import time
import numpy as np
from skll.data.readers import Reader
from skll.learner import Learner

input_file = sys.argv[1]

file_reader = Reader.for_path(input_file, label_col='type')

training_data = file_reader.read()

number_of_folds = 10

def average_accuracy(fold_results):
    accuracy = []
    for fold_index in range(number_of_folds):
        accuracy.append(fold_results[fold_index][1])
    return np.mean(accuracy)

print "\nLogistic:"
logistic_learner = Learner('LogisticRegression', probability=False, feature_scaling=u'both')
t0 = time.time()
fold_result_list, grid_search_scores = logistic_learner.cross_validate(training_data, stratified=True, \
cv_folds=number_of_folds, grid_search=False, shuffle=True)
print "%d-fold cross-validation took %0.2f seconds" % (number_of_folds, time.time() - t0)
print "Results for each fold:"
print fold_result_list
print "Grid search scores, if used:"
print grid_search_scores
print "Average accuracy: %.3f" % average_accuracy(fold_result_list)

print "\nKNN:"
knn_learner = Learner('KNeighborsClassifier', probability=False, feature_scaling=u'both')
t0 = time.time()
fold_result_list, grid_search_scores = knn_learner.cross_validate(training_data, stratified=True, \
cv_folds=number_of_folds, grid_search=False, shuffle=True)
print "%d-fold cross-validation took %0.2f seconds" % (number_of_folds, time.time() - t0)
print "Results for each fold:"
print fold_result_list
print "Grid search scores, if used:"
print grid_search_scores
print "Average accuracy: %.3f" % average_accuracy(fold_result_list)

print "\nRF:"
rf_learner = Learner('RandomForestClassifier', probability=False, feature_scaling=u'both')
t0 = time.time()
fold_result_list, grid_search_scores = rf_learner.cross_validate(training_data, stratified=True, \
cv_folds=number_of_folds, grid_search=False, shuffle=True)
print "%d-fold cross-validation took %0.2f seconds" % (number_of_folds, time.time() - t0)
print "Results for each fold:"
print fold_result_list
print "Grid search scores, if used:"
print grid_search_scores
print "Average accuracy: %.3f" % average_accuracy(fold_result_list)

print "\nSVC:"
svc_learner = Learner('SVC', probability=False, feature_scaling=u'both')
t0 = time.time()
fold_result_list, grid_search_scores = svc_learner.cross_validate(training_data, stratified=True, \
cv_folds=number_of_folds, grid_search=True, grid_search_folds=3, grid_objective=u'f1_score_micro', \
param_grid=[{'C': [0.01, 0.1, 1.0, 10.0, 100.0], 'gamma': [0.01, 0.1, 1.0, 10.0, 100.0]}], shuffle=True)
print "Grid search %d-fold cross-validation took %0.2f seconds" % (number_of_folds, time.time() - t0)
print "Results for each fold:"
print fold_result_list
print "Grid search scores, if used:"
print grid_search_scores
print "Average accuracy: %.3f" % average_accuracy(fold_result_list)

One of the first things you’ll notice is how much less code we need to run the same cross-validation exercise we ran in the previous section. At the top, we import skll’s Reader and Learner objects, which we’ll use to read in the training data and to specify the statistical learning models we want to cross-validate, respectively.

You’ll notice that when we use the Reader object to create a file reader object we specify the name of the column that’s to be used as the dependent, ‘y’ variable. In this example, we specify that the label column is ‘type’ because it contains the class labels. We also create a variable for the number of cross-validation folds so the code is a little more flexible and you can experiment with 5-fold, 10-fold, or 20-fold cross-validation more easily.

The next block of code defines a function to calculate the average accuracy of the model based on cross-validation. We create this simple helper function because, as described in the API’s documentation, the output of cross-validation is a pair of lists. The first list contains the confusion matrix, overall accuracy, per-label precision, recall, and F-measures, and model parameters for each fold. The second list contains the grid search scores, if any, for each fold. This helper function collects all of the accuracy scores from the cross-validation output and returns the average accuracy across all of the folds.

The next four blocks of code initialize the four statistical learning models, cross-validate the models, and report their results. For example, the first learner is logistic regression. We’re not interested in calculating probabilities, so we set probability equal to False. To be consistent with the centering and scaling we did to the predictor variables in base scikit-learn, we set feature scaling to both.

Next, we perform cross-validation with the model. In this case, we set stratified equal to True to use stratified k-fold cross-validation so each fold contains nearly the same number of red and white wines. Since we set number_of_folds to 10 and cv_folds equals number_of_folds, we’re performing 10-fold cross-validation. For logistic regression, we’re not employing grid search, and we set shuffle equal to True so the observations are shuffled before they’re split into folds for cross-validation. Some of these options are the defaults, and some of them are changed to be consistent with the scikit-learn code we used above.

The k-nearest neighbors and random forest sections are identical to the logistic regression section, except for employing different predictive models. To be consistent with the scikit-learn code we used above, the support vector machines section contains some additional code to employ grid search for optimal C and gamma values.

Finally, we print how long cross-validation takes to complete, all of the results for each of the folds, the grid search scores (if any), and the average accuracy across all of the folds.

Make the script executable by typing the following on the command line and then hitting Enter:
chmod +x classify_wine_skll.py

Run the script by typing the following on the command line and then hitting Enter:
./classify_wine_skll.py wine-both-clean.csv

When you run the script you should see the following output printed to your Terminal window:

skll script output

Since the output runs off of the screen, here are the accuracy scores and processing times for the four predictive models:

Logistic Regression
Accuracy: 0.994
Processing time: 0.85 seconds

K-Nearest Neighbors
Accuracy: 0.993
Processing time: 1.22 seconds

Random Forest
Accuracy: 0.994
Processing time: 0.88 seconds

Support Vector Machines
Accuracy: 0.996
Processing time: 57.60 seconds

Once again the output shows that, in this example, all of the models achieve similar accuracy scores. In this case, logistic regression completes 1.4 times faster than k-nearest-neighbors and 68 times faster than support vector machines.

Now that we’ve seen how to estimate, cross-validate, and measure the performance of four predictive models in a Python script with skll’s APIs, let’s take a look at how to do so on the command line with skll’s configuration file.

SKLL CONFIGURATION FILE

As I mentioned at the top of this post, Jeroen Janssens demonstrates how to use skll’s configuration file set-up in his book, Data Science at the Command Line. You can also read skll’s own tutorial for using the configuration file set-up here.

skll’s configuration file set-up requires an input file to be formatted slightly differently than the way ours is now, so we need to modify our input file. Specifically, we need (1) an additional row index column with unique numbers for each row, (2) the wine type column to contain the floating-point numbers 1.0 and 0.0 instead of the strings red and white, and (3) the file should only contain the binary ‘type’ column and the predictor variables we want to use (i.e. it shouldn’t contain additional variables we don’t intend to use). Since our input file contains the ‘quality’ column and we don’t want to use it as a predictor we need to remove it from the file.

You can create the modified version of the input file by entering the following one-line, piped command in a Terminal window:

< wine-both-clean.csv nl -s, -w4 -n rz -v0 | sed 's/0000,/id,/' | sed 's/red/1\./' | sed 's/white/0\./' | cut -d, -f1-12,14 > wine-both-clean-ids.csv

The nl command adds line numbers to each of the rows. -s, says add a comma after the line number (since we’re using a CSV file). -w4 says make the line numbers four digits wide (we know the input file contains 6,498 rows, so four digits wide will be sufficient). -n rz says insert line numbers according to the rz format, which is right-justified with leading zeros. -v0 says start the line numbering at 0, or in this case, 0000.

Next, the first sed command changes the line number in the first line from 0000, to the column heading id,. The second sed command changes the word red into ‘1.’, and the third sed command changes the word white into ‘0.’. The cut command separates the file into columns based on the comma delimiter and selects, or keeps, columns one to twelve and column fourteen (thereby removing the ‘quality’ variable in column thirteen). The result of these operations is redirected to a new output file called wine-both-clean-ids.csv. We’ll use this new file as our input file.
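If you’d rather prepare the modified input file in Python instead of at the command line, here is a minimal pandas sketch (not from the original post) that produces an equivalent file:

#!/usr/bin/env python
# A pandas alternative (a sketch, not from the original post) to the nl/sed/cut pipeline above
import pandas as pd

wine = pd.read_csv('wine-both-clean.csv')
# Convert the 'type' column to 1.0 for red wines and 0.0 for white wines
wine['type'] = (wine['type'] == 'red').astype(float)
# Drop the 'quality' column since we don't want to use it as a predictor
wine = wine.drop('quality', axis=1)
# Add a leading 'id' column of unique, zero-padded row numbers
wine.insert(0, 'id', ['%04d' % row_number for row_number in range(len(wine))])
wine.to_csv('wine-both-clean-ids.csv', index=False)

The only cosmetic difference is that pandas writes the labels as 1.0 and 0.0 rather than the 1. and 0. produced by sed; both represent the same floating-point values.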

Now that our input file is ready to be processed, let’s create our skll configuration file. To do so, copy and paste the following code into a text editor and save the file as classify_wine_skll.cfg (Note that the file extension is .cfg instead of .py):

[General]
    experiment_name = Wine
    task = cross_validate

[Input]
    train_directory = .
    featuresets = [["wine-both-clean-ids.csv"]]
    learners = ["RandomForestClassifier", "SVC", "KNeighborsClassifier", "LogisticRegression"]
    label_col = type
    id_col = id
    shuffle = True
    feature_scaling = both

[Tuning]
    objective = accuracy
    grid_search = True
    param_grids = [[], [{'C': [0.01, 0.1, 1.0, 10.0, 100.0],'gamma': [0.01, 0.1, 1.0, 10.0, 100.0]}], [], []]

[Output]
    log = results
    results = results
    predictions = results

skll’s configuration file contains four sections, General, Input, Tuning, and Output. You can learn more about each of these sections here. In the General section, we specify a name for the experiment. All of the output file names will start with the word we use. We also specify the task we want to perform, which is cross_validate.

In the Input section, we specify the train_directory, the folder where the configuration file can find the training data set. The period, ‘.’, is shorthand for the current folder. If you save the input file in a different folder, then you’ll have to supply the folder name (e.g. training_data or my_input_files). featuresets contains the name of the input file(s) that contain your features, i.e. explanatory variables. In this case they’re all in one file but, as skll’s tutorial demonstrates, they can be spread across multiple input files. learners is a list of the learners we want to use. label_col indicates which column contains the class labels. id_col indicates which column contains the unique row numbers. Once again, to be consistent with the preceding examples, shuffle equals True and feature_scaling equals both.

In the Tuning section, we specify we want to use accuracy as our objective. Since we want to use grid search for one of our models, we set grid_search to True. When grid search is True, you have to supply a list of lists to param_grids, one list for each model. A list can be empty if you don’t want to perform grid search for the model, but you still need a set of square brackets for each model. If you do want to perform grid search for a model, then you supply a dictionary of the parameters and values you want to search over and optimize. Since we’re performing grid search for the support vector machines model we supply ‘C’ and ‘gamma’ as two keys and lists of values to search over as the values associated with each key.

Finally, in the Output section, we specify the name of the folder we want all of the output to be saved in. In this case, all of the logs, results, and predictions will be saved in a folder named ‘results’ inside our current folder.

Run this configuration file (a.k.a. experiment) by typing the following on the command line and then hitting Enter:
run_experiment classify_wine_skll.cfg

After you hit Enter, you’ll see the following output printed to your command prompt window after all four of the models have completed:

Loading ./wine-both-clean-ids.csv… done
Loading ./wine-both-clean-ids.csv… done
Loading ./wine-both-clean-ids.csv… done
Loading ./wine-both-clean-ids.csv… done

In addition, several output files have been written in the ‘results’ folder inside your current folder. You can cd and/or ls into the results folder to view the output. skll creates four files for each model: a log file, a predictions file, a results file, and a results file formatted as JSON. skll also creates a summary file that contains details on each of the models and each of the cross-validation folds.

One of the measures we’ve been using to compare the models is average accuracy. This value is available in the summary file. Jeroen Janssens demonstrates how to access and print this value using some helpful command line tools in his book, but you can also print it out with basic Unix commands. To view the average accuracy for each of the models, type the following command on the command line and then hit Enter (assuming you named the output folder ‘results’ and it’s inside your current folder):

grep average results/Wine_summary.tsv | cut -f1,13 | awk -F'\t' '{ print $2 ": " $1 }'

The grep command filters for rows in the tab-delimited summary file that contain the word average. The cut command separates the columns in the file based on tabs, which we didn’t have to specify because it’s the default in cut. Then we select the first column (i.e. the average accuracy score) and the thirteenth column (i.e. the name of the classifier). Finally, the awk command re-arranges the two pieces of information so that what’s printed to the screen is the name of the classifier, followed by a colon and a space, and then the average accuracy score. When you run this command you’ll see the following output printed to the Terminal window:

skll configuration file output

Once again, all of the models have similar accuracy scores.
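If you prefer to pull the same numbers out with a small Python script instead of grep, cut, and awk, here is a minimal sketch (not from the original post, with the column positions taken from the description above):

#!/usr/bin/env python
# Print each classifier's name and its average accuracy from skll's summary file
with open('results/Wine_summary.tsv') as summary_file:
    for line in summary_file:
        if 'average' in line:
            fields = line.rstrip('\n').split('\t')
            # field 1 holds the average accuracy and field 13 holds the classifier name (1-indexed)
            print fields[12] + ': ' + fields[0]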

As we’ve seen, skll’s APIs and configuration file set-up encapsulate and simplify a lot of the basic scikit-learn code you need to read input data, transform variables, and estimate, cross-validate, and measure the performance of predictive models. Now that you’re familiar with skll’s general interfaces and syntax, try modifying the code to use your own input data, estimate different models, measure performance with a different metric, or perform other tasks, like predict or evaluate instead of cross_validate. Also be sure to check out the additional resources noted throughout this post for supplementary explanations and examples. Have a great time experimenting with skll’s interfaces!

Intro to Julia: Reading and Writing CSV Files with R, Python, and Julia

Last year I read yhat’s blog post, Neural networks and a dive into Julia, which provides an engaging introduction to Julia, a high-level, high-performance programming language for technical computing.

One aspect of the language I found intriguing was its aim to be as fast as C, as easy to use as Python, and as easy for statistics as R. I enjoyed seeing that Julia’s syntax is similar to Python’s, that it has several graphing packages, including a ggplot2-inspired package called Gadfly, and that it has several structured data, statistics, and machine learning packages, including DataFrames for dealing with tabular data and StatsBase and MLBase, which provide tools for statistics and machine learning operations.

There are lots of great resources for learning Julia. There are introductory books, like “Getting Started with Julia Programming,” by Ivo Balbaert, and “The Julia Express,” by Bogumił Kamiński. There are online tutorials, like Programming in Julia, Julia by Example, Learn Julia in Y minutes, and Learn Julia the Hard Way. There are also video tutorials, including two “Introduction to Julia” videos by David Sanders at SciPy 2014 and a set of ten Julia video tutorials recorded at MIT in 2013.

Since I’ve been using Python and R to analyze data, and Julia aspires to make the best features of these languages available in one place, I decided to try Julia to see if it would be worthwhile to incorporate it into my toolbox. One of the first things I wanted to learn was the new Julia syntax I’d need to use to perform the operations I’ve been carrying out in Python and R. Some of the most common operations I perform are reading text and delimited input files and writing results to output files. Since these are very common operations, let’s discuss how to perform these operations in R, Python, and Julia. In a later post we can discuss different ways to filter for specific rows and columns in these languages.

To begin, let’s create a folder to work in and name it “workspace”. Next, let’s download a publicly-available data set, e.g. wine-quality, into the folder. Let’s also create another folder called “output” inside the workspace folder where we can save the output files. At this point, we have the following set up:

folder_structure
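If you’d like to create this set-up from the command line, one way (a sketch that reuses the UCI download URL from the first post above) is:

# Create the workspace and output folders
mkdir workspace && cd workspace
mkdir output

# Download the red wine file; the raw UCI file is semicolon-delimited,
# so convert the semicolons to commas for the comma-separated readers below
curl -sL http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv | tr ';' ',' > winequality-red.csv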

R
Now that we have our workspace and an input file, let’s create R, Python, and Julia scripts to read the input data and write it to an output file. To create the R script, open a text editor and enter the following code:

#!/usr/bin/env Rscript
# For more information, visit: cbrownley.wordpress.com

#Collect the command line arguments into a variable called args
args <- commandArgs(trailingOnly = TRUE)
# Assign the first command line argument to a variable called input_file
input_file <- args[1]
# Assign the second command line argument to a variable called output_file
output_file <- args[2]

# Use R’s read.csv function to read the data into a variable called wine
# read.csv expects a CSV file with a header row, so
# sep = ',' and header = TRUE are default values
# stringsAsFactors = FALSE means don’t convert character vectors into factors
wine <- read.csv(input_file, sep = ',', header = TRUE, stringsAsFactors = FALSE)

# Use R’s write.csv function to write the data in the variable wine to the output file
# row.names = FALSE means don’t write an extra column of row names
# to the output file; we only want the original data columns
write.csv(wine, file = output_file, row.names = FALSE)

read_csv_R

Once you’ve pasted this code into the file, save the file as “read_csv.R” in the workspace folder and close the file. You can run this script by typing the following two commands on the command line, hitting Enter after each one:
chmod +x read_csv.R
./read_csv.R winequality-red.csv output/output_R.csv

When you run the script you won’t see any output printed to the screen, but the input data was written to a file called output_R.csv in the output folder.

A popular R package for reading and managing data is the data.table package. To use the data.table package instead of base R in the script, all you would need to do is add one require statement and edit the line that reads the contents of the input file into a variable:

#!/usr/bin/env Rscript
require(data.table)

args <- commandArgs(trailingOnly = TRUE)
input_file <- args[1]
output_file <- args[2]

wine <- fread(input_file)

write.csv(wine, file = output_file, row.names = FALSE)

To use this script instead of the first version, all you would need to do is save the file, e.g. as “read_csv_data_table.R”, run the same chmod command on this file, and then substitute this R script in the last command shown above:
./read_csv_data_table.R winequality-red.csv output/output_R_data_table.csv

Python
Now let’s create a Python script to perform the same operations. To create the Python script, open a text editor and enter the following code:

#!/usr/bin/env python
# For more information, visit: cbrownley.wordpress.com

# Import Python's built-in csv and sys modules, which have functions
# for processing CSV files and command line arguments, respectively
import csv
import sys

# Assign the first command line argument to a variable called input_file
input_file = sys.argv[1]
# Assign the second command line argument to a variable called output_file
output_file = sys.argv[2]

# Open the input file for reading and close automatically at end
with open(input_file, 'rU') as csv_in_file:
    # Open the output file for writing and close automatically at end
    with open(output_file, 'wb') as csv_out_file:
        # Create a file reader object for reading all of the input data
        filereader = csv.reader(csv_in_file)
        # Create a file writer object for writing to the output file
        filewriter = csv.writer(csv_out_file)
        # Use a for loop to process the rows in the input file one-by-one
        for row in filereader:
            # Write the row of data to the output file
            filewriter.writerow(row)

read_csv_Python

Once you’ve pasted this code into the file, save the file as “read_csv.py” and close the file. You can run this script by typing the following two commands on the command line, hitting Enter after each one:
chmod +x read_csv.py
./read_csv.py winequality-red.csv output/output_Python.csv

When you run the script you won’t see any output printed to the screen, but the input data was written to a file called output_Python.csv in the output folder.

A popular Python package for reading and managing tabular data is Pandas. Pandas provides many helpful functions, a couple of which simplify the syntax needed to read and write CSV files. For example, to perform the same reading and writing operations we performed above, the Pandas syntax is:

#!/usr/bin/env python
import sys
import pandas as pd

input_file = sys.argv[1]
output_file = sys.argv[2]

data_frame = pd.read_csv(input_file)
data_frame.to_csv(output_file, index=False)

To use this script instead of the first version, all you would need to do is save the file, e.g. as “read_csv_pandas.py”, run the same chmod command on this file, and then substitute this Python script in the last command shown above:
./read_csv_pandas.py winequality-red.csv output/output_Python_Pandas.csv

Julia
Now let’s create a Julia script to perform the same operations. To create the Julia script, open a text editor and enter the following code:

#!/usr/bin/env julia
# For more information, visit: cbrownley.wordpress.com

# Assign the first command line argument to a variable called input_file
input_file = ARGS[1]
# Assign the second command line argument to a variable called output_file
output_file = ARGS[2]

# Open the output file for writing
out_file = open(output_file, "w")
# Open the input file for reading and close automatically at end
open(input_file, "r") do in_file
    # Use a for loop to process the rows in the input file one-by-one
    for line in eachline(in_file)
        # Write the row of data to the output file
        write(out_file, line)
    # Close the for loop
    end
# Close the input file handle
end
# Close the output file handle
close(out_file)

read_csv_Julia

Once you’ve pasted this code into the file, save the file as “read_csv.jl” and close the file. You can run this script by typing the following two commands on the command line, hitting Enter after each one:
chmod +x read_csv.jl
./read_csv.jl winequality-red.csv output/output_Julia.csv

When you run the script you won’t see any output printed to the screen, but the input data was written to a file called output_Julia.csv in the output folder.

A popular Julia package for reading and managing tabular data, especially when the data may contain NAs, is DataFrames. DataFrames provides many helpful functions, a couple of which simplify the syntax needed to read and write CSV files. For example, to perform the same reading and writing operations we performed above, the DataFrames syntax is:

#!/usr/bin/env julia
using DataFrames

input_file = ARGS[1]
output_file = ARGS[2]

data_frame = readtable(input_file, separator = ',')
writetable(output_file, data_frame)

To use this script instead of the first version, all you would need to do is save the file, e.g. as “read_csv_data_frames.jl”, run the same chmod command on this file, and then substitute this Julia script in the last command shown above:
./read_csv_data_frames.jl winequality-red.csv output/output_Julia_DataFrames.csv

folder_structure_all_files

As you can see, when it comes to reading, processing, and writing CSV files, the differences in syntax between Python and Julia are very slight. For example, Python’s “with open()” statements are “open() do … end” statements in Julia, and for loops in Julia drop the colon required in Python and instead require the end keyword. These differences are so minor that I’ve found it very easy to pick up Julia syntax and transition back and forth between Python and Julia.

Now that we know how to read and write all of the data in a CSV-formatted input file with R, Python, and Julia, the next step is to figure out how to filter for specific rows and columns in these languages. Then we can move on to processing lots of files in a directory and also dealing with Excel files. We’ll cover these topics in future posts.

Don’t Whine Just Because You Can’t Drink Your Wine and Have It Too

A few weeks ago, I visited a charming wine and beer tasting room, located in a small wooden cottage, in Santa Barbara, CA.  The face of the cottage is flat, except for a small wooden porch at the entryway that has a few steps on either side to invite visitors up from the sidewalk below.  On each side of the entryway, about chest-high, there is one square, multi-paned window.  And in the center, like an elongated nose above a smile on a person’s face, is the entryway, a pair of thick wooden double doors left wide open to welcome all passersby.

Image Source: http://www.yelp.com

Inside the cottage there is a single rectangular room.  On the left-hand side of the room there is a long waist-high bar that stretches about ¾ of the length of the room.  Below the bar, on the customers’ side, is a row of wooden chairs.  Above the bar, hanging from the ceiling, is a row of four lights evenly spaced along the length of the bar.  In the back of the room there is a warm fire-place, some comfortable chairs, and several tall wooden wine racks filled with colorful wine bottles.  Finally, on the right-hand side of the room, there are some small round waist-high tables that customers can use to rest their drinking glasses and some rectangular tables that display complementary products like meats and cheeses, cookbooks, and bottle openers.

Ordering a Flight, Receiving a Chance to Upgrade

My friends and I huddled in a space at the bar and each ordered a wine tasting.  As the bartender began to fill my first glass, she used up the last of the wine in the bottle.  At that point she told me about an interesting challenge – involving both risks and rewards – that they offer to whoever receives the last of the wine from a bottle.

The challenge is as follows: I am given one chance to toss the bottle’s cork into a giant wine glass (the circumference around the opening of the glass is probably similar to that of a basketball) located behind the bar at one end.  I will receive a reward for successfully tossing the cork into the giant wine glass, and the value of my reward is proportional to my distance from the giant wine glass.

How Many Light Bulbs Does It Take To…?

Remember the four hanging lights evenly spaced along the length of the bar?  The first light is almost directly across the bar from the giant wine glass.  The second light is a bit farther away.  The third light is even farther away.  And the fourth light, almost at the other end of the bar, is farthest away.  If I toss the cork from the first light and it goes in, I win one free tasting.  If I toss the cork from the second light, I win two free tastings.  From the third light, three free tastings.  And from the fourth light, four free tastings.  As you can see, my reward increases as my distance from the giant wine glass increases.

At the same time, the likelihood of me actually tossing the cork successfully into the giant wine glass drops precipitously as I step back from one light to the next.  Let’s say (optimistically) I have a 25 percent chance of making the shot from the first light.  That means I think I can make (on average) one out of every four shots from that location.  Again, for sake of discussion, let’s say I have a 1 percent chance of making the shot from the fourth light, meaning I think I can make one out of every one hundred shots from that location.  If I assume the likelihood of me making a shot decreases linearly, then from the second light I have a 17 percent chance of making the shot and from the third light I have a 9 percent chance.[1]  As you can see, my risk increases (i.e. the likelihood of a successful toss decreases) as my distance from the giant wine glass increases.

Based on all of this information, from which light should I toss the cork?  Take a moment and think about the situation.  How would you analyze the risks and rewards?  From which light would you toss the cork?

Image Source: apbrwww5.apsu.edu

If You Disregard Expected Value, Instead of a Flight You May Get Taken for a Ride – But It’s Your Fault, So You Can’t Whine

One way to make a decision under uncertainty is to calculate the expected value of your alternatives and select the alternative with the highest expected value.  In my situation, I have four alternatives (i.e. tossing the cork from one of the four lights).  The expected value of tossing the cork from the first light is 0.25.[2]  The expected value of tossing the cork from the second light is 0.34.  The expected value from the third light is 0.27.  And the expected value from the fourth light is 0.04.  Based on the expected values of the alternatives, I should toss the cork from the second light because it gives me the largest expected payout.

So, did I make my decision based on expected value and toss the cork from the second light?  Of course not.  I was there with three friends, which means there were four of us all together.  My friends were either looking at me with cute puppy dog eyes or cheering me on to encourage me to toss the cork from the fourth light so that all four of us would receive free tastings.  I took a moment to visualize a successful toss from the fourth light and then let the cork fly.  How did it turn out?  All I’ll say is, “Next time I’m going to pay a bit more attention to the expected value of my alternatives.”

[1] First light: 25 percent chance; Fourth light: 1 percent chance

(Distance between first and fourth) / (Number of segments) = Distance between segments

(25 – 1) / 3 = 8

Second light: 25 – 8 = 17 percent chance

Third light: 17 – 8 = 9 percent chance

[2] EV = Sum (Probability * Value of Outcome) over the entire chance event.  In this case, “value of outcome” equals the number of free tastings.

EV(first light) = (0.25 * 1) + (0.75 * 0) = 0.25

EV(second light) = (0.17 * 2) + (0.83 * 0) = 0.34

EV(third light) = (0.09 * 3) + (0.91 * 0) = 0.27

EV(fourth light) = (0.01 * 4) + (0.99 * 0) = 0.04