The Two Rs of Python: Reading and Writing CSV Files

A coworker recently emailed me a folder full of over two hundred Excel files and asked me to extract some relevant data from each file.  A few years ago that task would have been daunting.  Like many other business people, I would have had to manually open each of the files and extract (copy and paste) the relevant data into a new output file.  With so many files, the undertaking would have required a lot of time and been prone to errors, even with careful attention to detail.

Thankfully, I had already learned how to code in Python, an easy-to-learn programming language that is great for processing files and manipulating data.  With Python, I was able to write a short script that read each of the files, extracted the relevant data from each file, and wrote that data to an output file.  The whole process took a couple of hours, compared to the days or weeks it would have taken to do it manually.

This type of capability, reducing the time it takes to complete a task from weeks to hours, is exhilarating.   That’s one of the reasons why I’m going to start including “analytics” topics like this one in this blog.  Another benefit of learning to do your work with Python is that the method scales.  As I described in my story, as the number of files grows it simply becomes more and more difficult to complete the work in a timely fashion.  Once you learn how to use Python to process and manipulate files and data, you won’t want to return to tiresome manual processes.

One of the ways I learned to code in Python was to create small datasets in a folder on my laptop and then write a Python script to process or manipulate them in some way.  Since I found this method to be really helpful, that’s how I’m going to present the example in this blog post.  Processing multiple files isn’t that complicated once you understand how all of the lines of code work together, but it does involve several concepts that are easier to understand once you know how to process a single file, so that’s where we’ll start.

The following example demonstrates one way to read a single comma separated values (CSV) file and write its contents to an output file.  One assumption I make in this example is that you’ve already visited http://www.python.org/ and downloaded and installed the version of Python that is compatible with your computer’s operating system.  Another assumption is that none of the values in the spreadsheet contains commas.  Ok, in order to work with a CSV file we need to create a CSV file.  Open Microsoft Excel and add the following data:

Figure I: Input File - customers.csv

Figure I: Input File – customers.csv

Now open the ‘Save As’ dialog box.  In the location box, navigate to your Desktop so the file will be saved on your Desktop.  In the format box, select ‘Comma Separated Values (.csv)’ so that Excel saves the file in the CSV format.  Finally, in the ‘Save As’ or ‘File Name’ box, type “customers”.  Click ‘Save’.  That’s all there is to it.  Now you have a CSV file called customers.csv that you can read into Python.

Now that we have a CSV file to work with, let’s create a Python script to read the file and write its contents to an output file.  Open your favorite text editor (e.g. Notepad) and add the following lines of code:

#!/usr/bin/python
import string
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

filewriter = open(output_file, ‘wb’)

with open(input_file, ‘rU’) as filereader:
        for row in filereader:
                row = row.strip()
                row_list = row.split(‘,’)
                filewriter.write(‘,’.join(map(str,row_list))+’\n’)
filewriter.close()

The first line is a comment line that makes the script transferable across operating systems.  The second line imports Python’s built-in string module so that we can use its methods and functions, e.g. strip() and split(), to manipulate strings.  The third line imports Python’s built-in sys module so that we can supply additional information to the script on the command line in either the Command Prompt (Windows) or Terminal (Mac) window.

The fourth line uses argv from the sys module to grab the first piece of information after the script name on the command line, the path to and name of the input file, and assigns it to the variable input_file.  Similarly, the fifth line grabs the second piece of information after the script name, the path to and name of the output file, and assigns it to the variable output_file.

The sixth line opens the output file in write ‘w’ mode and creates a writer object, filewriter, for writing to the output file.  The ‘b’ enables a distinction between binary and text files for systems that differentiate between binary and text files, but for systems that do not, the ‘b’ has no effect.

The seventh line uses a “with” statement to open the input file in read ‘r’ mode and creates a reader object, filereader, for reading the input file.  The ‘U’ mode helps recognize newlines in case your version of Python is built without universal newlines.

The eighth line uses a “for” loop to loop through the rows in the input file one by one.  The next three lines of code occur for each row in the input file since they’re contained in the body of the “for” loop (you can tell they’re in the body of the loop because they’re indented).  Each row of data in the input file enters the “for” loop as a string that contains column values separated by commas.

The ninth line uses the string module’s strip() function to remove whitespace, tabs, and newline characters from both ends of the string and then assigns that stripped version of the string to the variable row.  The tenth line uses the string module’s split() function to split the string on the embedded commas into a list of column values and then assigns the list to the variable row_list.

The eleventh line uses the map() and str() functions to convert each of the values in row_list to a string.  Then the join() function adds commas between the strings and joins all of the strings and commas together into a single string.  Next, a newline character is added to the end of the string.  Then the filewriter’s write() method writes the string to the output file.  This process of stripping, splitting, manipulating, and writing occurs for each of the rows in the input file.  Finally, in the twelfth line, the close() method closes the filewriter object.

Now that we understand what the code is supposed to do, let’s save this file as a Python script and use it to process the data in our input file, customers.csv.  To save the file as a Python script, open the ‘Save As’ dialog box.  In the location box, navigate to your Desktop so the file will be saved on your Desktop.  In the format box, select ‘All Files’ so that the dialog box doesn’t select a specific file type.  Finally, in the ‘Save As’ or ‘File Name’ box, type process_csv_file.py.  Click ‘Save’.  Now you have a Python script you can use to read and write the contents of a CSV file.

Figure II: Python Script - process_csv_file.py

Figure II: Python Script – process_csv_file.py

To use process_csv_file.py to read and write the contents of customers.csv, open a Command Prompt (Windows) or Terminal (Mac) window.  When the window opens the prompt will be in a particular folder, also known as a directory (e.g. “C:\Users\Clinton\Documents”).  The next step is to navigate to the Desktop, where we saved the Python script.

To move between folders, you can use the ‘cd’ command, a Unix command which stands for change directory.  To move up and out of the ‘Documents’ folder into the ‘Clinton’ folder, type the following and then hit Enter:

cd ..

That is, the letters ‘cd’ together followed by one space followed by two periods.  The two periods ‘..’ stand for up one level.  At this point, the prompt should look like “C:\Users\Clinton”.  Now, to move down into a specific folder you use the same ‘cd’ command followed by the name of the folder you want to move into.  Since the ‘Desktop’ folder resides in the ‘Clinton’ folder, you can move down into the ‘Desktop’ folder by typing the following and then hitting Enter:

cd Desktop

At this point, the prompt should look like “C:\Users\Clinton\Desktop” in the Command Prompt and we are exactly where we need to be since this is where we saved the Python script and CSV input file.  The next step is to run the Python script.

To run the Python script, type one of the following commands on the command line, depending on your operating system, and then hit Enter:

Windows:

python process_csv_file.py customers.csv my_output.csv

That is, type python, followed by a single space, followed by process_csv_file.py, followed by a single space, followed by customers.csv, followed by a single space, followed by my_output.csv, and then hit Enter:

Mac:

chmod +x process_csv_file.py

./process_csv_file.py customers.csv my_output.csv

That is, type chmod, followed by a single space, followed by +x, followed by a single space, followed by process_csv_file.py, and then hit Enter.  This command makes the Python script executable.  Then type ./process_csv_file.py, followed by a single space, followed by customers.csv, followed by a single space, followed by my_output.csv, and then hit Enter:

After you hit Enter, you should not see any new output in the Command Prompt or Terminal window.  However, if you minimize all of your open windows and look at your Desktop there should be a new CSV file called my_output.csv.  Open the file.  The contents should look like:

Figure III: Output file - my_output.csv

Figure III: Output File – my_output.csv

As you can see, the data in the input file, customers.csv, was successfully written to the output file, my_output.csv.  This example demonstrates the basic process of reading a CSV file and writing data to an output file.  The example is slightly unrealistic in that there isn’t any reason to re-write the data in the input file to an output file.  However, it’s easy to imagine business logic that would make the output more interesting.  Perhaps you only need the rows where the cost is greater than a specific threshold or where the purchase date is before a particular cut-off date.  You would only need to make slight modifications to the code discussed above to add this business logic to your script.

This blog post covered processing a single CSV file, but one of the real advantages of learning to code in Python is the ability to automate and scale your processes to handle many files at a time.  In a future post, I’ll cover modifications you can make to the code discussed in this post to create a Python script that processes multiple CSV files.  Having read this post, I hope you’re now more familiar with Python and excited to start using it.  If you have any questions, please reply to this post.

Advertisements

7 thoughts on “The Two Rs of Python: Reading and Writing CSV Files

  1. Thank you for you very detailed instruction. I love it!
    I have follow-up question.
    #1. After I strip and slit the data from file 1, I want to (let say) sum col 1 + col 2 (imagine that they are numbers) of the file 1. Then open file 2, grab a col 1 from this file, then write the previous sum, and this col 1 to an output file. What should I do?
    #2. If I want to append the column in file 2 into file 1
    #3. If I have too many column in file 1 and I just want a subset of them into the output files.
    Thank you for your helps.

    • Hi Trung,

      Thank you for your questions. For #1, try the following code with two CSV input files. The code sums the first two columns in the first input file and writes the column of sums and the first column from the second input file to the output file. It also prints the information being written to the output file to the command prompt / terminal window. I’ll try to answer your second two questions shortly.

      Sincerely,
      Clinton

      #!/usr/bin/python
      import csv
      import glob
      import os
      import sys

      input_path = sys.argv[1]
      output_file = sys.argv[2]

      filewriter = open(output_file,’wb’)
      file_counter = 0
      my_sum = [ ]
      one_column = [ ]
      for input_file in glob.glob(os.path.join(input_path,’*.csv’)):
          print input_file
          with open(input_file,’rU’) as csv_file:
              filereader = csv.reader(csv_file)
              if file_counter < 1:
                  row_counter = 0
                  for row in filereader:
                      if row_counter == 0:
                          pass
                      else:
                          # Add columns 1 and 2 as floating-point (i.e. decimal) numbers
                          sum = float(row[0])+float(row[1])
                          print row[0], row[1], sum
                          my_sum.append(sum)
                          # Write my_sum to the output file
                          #filewriter.write(str(my_sum)+','+'\n')
                      row_counter += 1
              else:
                  row_counter = 0
                  for row in filereader:
                      if row_counter == 0:
                          pass
                      else:
                          print row[0]
                          one_column.append(row[0])
                          # Write column 1 from the second file
                          #filewriter.write(str(row[0])+','+'\n')
                      row_counter += 1
          file_counter += 1

      # Write the two columns to the output file
      for x in range(len(my_sum)):
          filewriter.write(str(my_sum[x])+','+str(one_column[x]+'\n'))
      filewriter.close()

    • Thanks so much for your feedback. I haven’t written another blog post in a while, but that’s because I’ve been working on an introductory book on data processing and analysis in Python. It will be published by O’Reilly later this summer/fall. Then I can get back to the blog posts. Thanks again for taking the time to share your comments.

  2. Hi Clinton,
    Thanks for the post! I’ve been looking for something like this to answer my question. So, in your example, you join the values back together after taking them apart. What I’d like to do is iterate across each value and paste it into a text. For example, I want to start with a list of dates, and put them into a statement. Let’s say I have 3 dates in a csv file (one column), and i want the output to read “My birthday is [date1], my anniversary is [date2], and my dog’s birthday is [date3].”

    How would I approach that? It seems like an iteration of what you’re doing here.

    Thanks!
    Anna

    • Hi Anna,

      I created a CSV file with the following input in the first three rows and columns:
      1-Jan-15 15-Jan-15 30-Jan-15
      1-Feb-15 15-Feb-15 28-Feb-15
      1-Mar-15 15-Mar-15 30-Mar-15

      The following code prints three sentences to the screen and also writes the sentences to a text file. To run the script, I used the following on the command line: ./date_sentence.py dates.csv sentences.txt

      The output printed to the screen and written to the output file, sentences.txt, is:
      My birthday is 1-Jan-15, my anniversary is 15-Jan-15, and my dog’s birthday is 30-Jan-15.
      My birthday is 1-Feb-15, my anniversary is 15-Feb-15, and my dog’s birthday is 28-Feb-15.
      My birthday is 1-Mar-15, my anniversary is 15-Mar-15, and my dog’s birthday is 30-Mar-15.

      The code is:
      #!/usr/bin/env python
      import string
      import sys

      input_file = sys.argv[1]
      output_file = sys.argv[2]

      filewriter = open(output_file, ‘wb’)

      with open(input_file, ‘rU’) as filereader:
          for row in filereader:
              row = row.strip()
              row_list = row.split(‘,’)
              date1 = str(row_list[0])
              date2 = str(row_list[1])
              date3 = str(row_list[2])
              print “My birthday is %s, my anniversary is %s, and my dog’s birthday is %s.” % (date1, date2, date3)
              filewriter.write(“My birthday is %s, my anniversary is %s, and my dog’s birthday is %s.\n” % (date1, date2, date3))
      filewriter.close()

      Please let me know if this isn’t quite what you’re looking for.

      Sincerely,
      Clinton

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s