Intro to Julia: Reading and Writing CSV Files with R, Python, and Julia

Last year I read yhat’s blog post, Neural networks and a dive into Julia, which provides an engaging introduction to Julia, a high-level, high-performance programming language for technical computing.

One aspect of the language I found intriguing was its aim to be as fast as C, as easy to use as Python, and as easy for statistics as R. I enjoyed seeing that Julia’s syntax is similar to Python, it has several graphing packages, including a ggplot2-inspired package called Gadfly, and it has a several structured data, statistics, and machine learning packages, including DataFrames for dealing with tabular data and StatsBase and MLBase that provide tools for statistics and machine learning operations.

There are lots of great resources for learning Julia. There are introductory books, like “Getting Started with Julia Programming,” by Ivo Balbaert, and “The Julia Express,” by Bogomil Kaminski. There are online tutorials, like Programming in Julia, Julia by Example, Learn Julia in Y minutes, and Learn Julia the Hard Way. There are also video tutorials, including two “Introduction to Julia” videos by David Sanders at SciPy 2014 and a set of ten Julia video tutorials recorded at MIT in 2013.

Since I’ve been using Python and R to analyze data, and Julia aspires to make the best features of these languages available in one place, I decided to try Julia to see if it would be worthwhile to incorporate it into my toolbox. One of the first things I wanted to learn was the new Julia syntax I’d need to use to perform the operations I’ve been carrying out in Python and R. Some of the most common operations I perform are reading text and delimited input files and writing results to output files. Since these are very common operations, let’s discuss how to perform these operations in R, Python, and Julia. In a later post we can discuss different ways to filter for specific rows and columns in these languages.

To begin, let’s create a folder to work in and name it “workspace”. Next, let’s download a publicly-available data set, e.g. wine-quality, into the folder. Let’s also create another folder called “output” inside the workspace folder where we can save the output files. At this point, we have the following set up:

folder_structure

R
Now that we have our workspace and an input file, let’s create R, Python, and Julia scripts to read the input data and write it to an output file. To create the R script, open a text editor and enter the following code:

#!/usr/bin/env Rscript
# For more information, visit: cbrownley.wordpress.com

#Collect the command line arguments into a variable called args
args <- commandArgs(trailingOnly = TRUE)
# Assign the first command line argument to a variable called input_file
input_file <- args[1]
# Assign the second command line argument to a variable called output_file
output_file <- args[2]

# Use R’s read.csv function to read the data into a variable called wine
# read.csv expects a CSV file with a header row, so
# sep = ',' and header = TRUE are default values
# stringsAsFactors = FALSE means don’t convert character vectors into factors
wine <- read.csv(input_file, sep = ',', header = TRUE, stringsAsFactors = FALSE)

# Use R’s write.csv function to write the data in the variable wine to the output file
# row.names = FALSE means don’t write an extra column of row names
# to the output file; we only want the original data columns
write.csv(wine, file = output_file, row.names = FALSE)

read_csv_R

Once you’ve pasted this code into the file, save the file as “read_csv.R” in the workspace folder and close the file. You can run this script by typing the following two commands on the command line, hitting Enter after each one:
chmod +x read_csv.R
./read_csv.R winequality-red.csv output/output_R.csv

When you run the script you won’t see any output printed to the screen, but the input data was written to a file called output_R.csv in the output folder.

A popular R package for reading and managing data is the data.table package. To use the data.table package instead of base R in the script, all you would need to do is add one require statement and edit the line that reads the contents of the input file into a variable:

#!/usr/bin/env Rscript
require(data.table)

args <- commandArgs(trailingOnly = TRUE)
input_file <- args[1]
output_file <- args[2]

wine <- fread(input_file)

write.csv(wine, file = output_file, row.names = FALSE)

To use this script instead of the first version, all you would need to do is save the file, e.g. as “read_csv_data_table.R”, run the same chmod command on this file, and then substitute this R script in the last command shown above:
./read_csv_data_table.R winequality-red.csv output/output_R_data_table.csv

Python
Now let’s create a Python script to perform the same operations. To create the Python script, open a text editor and enter the following code:

#!/usr/bin/env python
# For more information, visit: cbrownley.wordpress.com

# Import Python's built-in csv and sys modules, which have functions
# for processing CSV files and command line arguments, respectively
import csv
import sys

# Assign the first command line argument to a variable called input_file
input_file = sys.argv[1]
# Assign the second command line argument to a variable called output_file
output_file = sys.argv[2]

# Open the input file for reading and close automatically at end
with open(input_file, 'rU') as csv_in_file:
    # Open the output file for writing and close automatically at end
    with open(output_file, 'wb') as csv_out_file:
        # Create a file reader object for reading all of the input data
        filereader = csv.reader(csv_in_file)
        # Create a file writer object for writing to the output file
        filewriter = csv.writer(csv_out_file)
        # Use a for loop to process the rows in the input file one-by-one
        for row in filereader:
            # Write the row of data to the output file
            filewriter.writerow(row)

read_csv_Python

Once you’ve pasted this code into the file, save the file as “read_csv.py” and close the file. You can run this script by typing the following two commands on the command line, hitting Enter after each one:
chmod +x read_csv.py
./read_csv.py winequality-red.csv output/output_Python.csv

When you run the script you won’t see any output printed to the screen, but the input data was written to a file called output_Python.csv in the output folder.

A popular Python package for reading and managing tabular data is Pandas. Pandas provides many helpful functions, a couple of which simplify the syntax needed to read and write CSV files. For example, to perform the same reading and writing operations we performed above, the Pandas syntax is:

#!/usr/bin/env python
import sys
import pandas as pd

input_file = sys.argv[1]
output_file = sys.argv[2]

data_frame = pd.read_csv(input_file)
data_frame.to_csv(output_file, index=False)

To use this script instead of the first version, all you would need to do is save the file, e.g. as “read_csv_pandas.py”, run the same chmod command on this file, and then substitute this Python script in the last command shown above:
./read_csv_pandas.py winequality-red.csv output/output_Python_Pandas.csv

Julia
Now let’s create a Julia script to perform the same operations. To create the Julia script, open a text editor and enter the following code:

#!/usr/bin/env julia
# For more information, visit: cbrownley.wordpress.com

# Assign the first command line argument to a variable called input_file
input_file = ARGS[1]
# Assign the second command line argument to a variable called output_file
output_file = ARGS[2]

# Open the output file for writing
out_file = open(output_file, "w")
# Open the input file for reading and close automatically at end
open(input_file, "r") do in_file
    # Use a for loop to process the rows in the input file one-by-one
    for line in eachline(in_file)
        # Write the row of data to the output file
        write(out_file, line)
    # Close the for loop
    end
# Close the input file handle
end
# Close the output file handle
close(out_file)

read_csv_Julia

Once you’ve pasted this code into the file, save the file as “read_csv.jl” and close the file. You can run this script by typing the following two commands on the command line, hitting Enter after each one:
chmod +x read_csv.jl
./read_csv.jl winequality-red.csv output/output_Julia.csv

When you run the script you won’t see any output printed to the screen, but the input data was written to a file called output_Julia.csv in the output folder.

A popular Julia package for reading and managing tabular data, especially when the data may contain NAs, is DataFrames. DataFrames provides many helpful functions, a couple of which simplify the syntax needed to read and write CSV files. For example, to perform the same reading and writing operations we performed above, the DataFrames syntax is:

#!/usr/bin/env julia
using DataFrames

input_file = ARGS[1]
output_file = ARGS[2]

data_frame = readtable(input_file, separator = ',')
writetable(output_file, data_frame)

To use this script instead of the first version, all you would need to do is save the file, e.g. as “read_csv_data_frames.jl”, run the same chmod command on this file, and then substitute this Julia script in the last command shown above:
./read_csv_data_frames.jl winequality-red.csv output/output_Julia_DataFrames.csv

folder_structure_all_files

As you can see, when it comes to reading, processing, and writing CSV files, the differences in syntax between Python and Julia are very slight. For example, Python’s “with open()” statements are “open() do … end” statements in Julia, and for loops in Julia drop the colon required in Python and instead require the end keyword. These differences are so minor that I’ve found it very easy to pick up Julia syntax and transition back and forth between Python and Julia.

Now that we know how to read and write all of the data in a CSV-formatted input file with R, Python, and Julia, the next step is to figure out how to filter for specific rows and columns in these languages. Then we can move on to processing lots of files in a directory and also dealing with Excel files. We’ll cover these topics in future posts.

Advertisements

To Read or Not to Read? Hopefully the Former!

It has been far too long since I posted to this blog.  It’s time to let you know what I’ve been working on over the past few months.  I’ve been writing a book on a topic that is near and dear to my heart.

The book is titled Multi-objective Decision Analysis: Managing Trade-offs and Uncertainty.  It is an applied, concise book that explains how to conduct multi-objective decision analyses using spreadsheets.

The book is scheduled to be published by Business Expert Press in 2013.  For a little more information about my forthcoming book, please read the abstract shown below:

“Whether managing strategy, operations, or products, making the best decision in a complex uncertain business environment is challenging.  One of the major difficulties facing decision makers is that they often have multiple, competing objectives, which means trade-offs will need to be made.  To further complicate matters, uncertainty in the business environment makes it hard to explicitly understand how different objectives will impact potential outcomes.  Fortunately, these problems can be solved with a structured framework for multi-objective decision analysis that measures trade-offs among objectives and incorporates uncertainties and risk preferences.

This book is designed to help decision makers by providing such an analysis framework implemented as a simple spreadsheet tool.  This framework helps structure the decision making process by identifying what information is needed in order to make the decision, defining how that information should be combined to make the decision and, finally, providing quantifiable evidence to clearly communicate and justify the final decision.

The process itself involves minimal overhead and is perfect for busy professionals who need a simple, structured process for making, tracking, and communicating decisions.  With this process, decision making is made more efficient by focusing only on information and factors that are well-defined, measureable, and relevant to the decision at hand.  The clear characterization of the decision required by the framework ensures that a decision can be traced and is consistent with the intended objectives and organizational values.  Using this structured decision-making framework, anyone can effectively and consistently make better decisions to gain a competitive and strategic advantage.”

Look for my forthcoming book, Multi-objective Decision Analysis, on the bookshelves in 2013!

Source: favim.com