A coworker recently emailed me a folder full of over two hundred Excel files and asked me to extract some relevant data from each file. A few years ago that task would have been daunting. Like many other business people, I would have had to manually open each of the files and extract (copy and paste) the relevant data into a new output file. With so many files, the undertaking would have required a lot of time and been prone to errors, even with careful attention to detail.
Thankfully, I had already learned how to code in Python, an easy-to-learn programming language that is great for processing files and manipulating data. With Python, I was able to write a short script that read each of the files, extracted the relevant data from each file, and wrote that data to an output file. The whole process took a couple of hours, compared to the days or weeks it would have taken to do it manually.
This type of capability, reducing the time it takes to complete a task from weeks to hours, is exhilarating. That’s one of the reasons why I’m going to start including “analytics” topics like this one in this blog. Another benefit of learning to do your work with Python is that the method scales. As I described in my story, as the number of files grows it simply becomes more and more difficult to complete the work in a timely fashion. Once you learn how to use Python to process and manipulate files and data, you won’t want to return to tiresome manual processes.
One of the ways I learned to code in Python was to create small datasets in a folder on my laptop and then write a Python script to process or manipulate them in some way. Since I found this method to be really helpful, that’s how I’m going to present the example in this blog post. Processing multiple files isn’t that complicated once you understand how all of the lines of code work together, but it does involve several concepts that are easier to understand once you know how to process a single file, so that’s where we’ll start.
The following example demonstrates one way to read a single comma separated values (CSV) file and write its contents to an output file. One assumption I make in this example is that you’ve already visited http://www.python.org/ and downloaded and installed the version of Python that is compatible with your computer’s operating system. Another assumption is that none of the values in the spreadsheet contains commas. Ok, in order to work with a CSV file we need to create a CSV file. Open Microsoft Excel and add the following data:
Now open the ‘Save As’ dialog box. In the location box, navigate to your Desktop so the file will be saved on your Desktop. In the format box, select ‘Comma Separated Values (.csv)’ so that Excel saves the file in the CSV format. Finally, in the ‘Save As’ or ‘File Name’ box, type “customers”. Click ‘Save’. That’s all there is to it. Now you have a CSV file called customers.csv that you can read into Python.
Now that we have a CSV file to work with, let’s create a Python script to read the file and write its contents to an output file. Open your favorite text editor (e.g. Notepad) and add the following lines of code:
input_file = sys.argv
output_file = sys.argv
filewriter = open(output_file, ‘wb’)
with open(input_file, ‘rU’) as filereader:
for row in filereader:
row = row.strip()
row_list = row.split(‘,’)
The first line is a comment line that makes the script transferable across operating systems. The second line imports Python’s built-in string module so that we can use its methods and functions, e.g. strip() and split(), to manipulate strings. The third line imports Python’s built-in sys module so that we can supply additional information to the script on the command line in either the Command Prompt (Windows) or Terminal (Mac) window.
The fourth line uses argv from the sys module to grab the first piece of information after the script name on the command line, the path to and name of the input file, and assigns it to the variable input_file. Similarly, the fifth line grabs the second piece of information after the script name, the path to and name of the output file, and assigns it to the variable output_file.
The sixth line opens the output file in write ‘w’ mode and creates a writer object, filewriter, for writing to the output file. The ‘b’ enables a distinction between binary and text files for systems that differentiate between binary and text files, but for systems that do not, the ‘b’ has no effect.
The seventh line uses a “with” statement to open the input file in read ‘r’ mode and creates a reader object, filereader, for reading the input file. The ‘U’ mode helps recognize newlines in case your version of Python is built without universal newlines.
The eighth line uses a “for” loop to loop through the rows in the input file one by one. The next three lines of code occur for each row in the input file since they’re contained in the body of the “for” loop (you can tell they’re in the body of the loop because they’re indented). Each row of data in the input file enters the “for” loop as a string that contains column values separated by commas.
The ninth line uses the string module’s strip() function to remove whitespace, tabs, and newline characters from both ends of the string and then assigns that stripped version of the string to the variable row. The tenth line uses the string module’s split() function to split the string on the embedded commas into a list of column values and then assigns the list to the variable row_list.
The eleventh line uses the map() and str() functions to convert each of the values in row_list to a string. Then the join() function adds commas between the strings and joins all of the strings and commas together into a single string. Next, a newline character is added to the end of the string. Then the filewriter’s write() method writes the string to the output file. This process of stripping, splitting, manipulating, and writing occurs for each of the rows in the input file. Finally, in the twelfth line, the close() method closes the filewriter object.
Now that we understand what the code is supposed to do, let’s save this file as a Python script and use it to process the data in our input file, customers.csv. To save the file as a Python script, open the ‘Save As’ dialog box. In the location box, navigate to your Desktop so the file will be saved on your Desktop. In the format box, select ‘All Files’ so that the dialog box doesn’t select a specific file type. Finally, in the ‘Save As’ or ‘File Name’ box, type process_csv_file.py. Click ‘Save’. Now you have a Python script you can use to read and write the contents of a CSV file.
To use process_csv_file.py to read and write the contents of customers.csv, open a Command Prompt (Windows) or Terminal (Mac) window. When the window opens the prompt will be in a particular folder, also known as a directory (e.g. “C:\Users\Clinton\Documents”). The next step is to navigate to the Desktop, where we saved the Python script.
To move between folders, you can use the ‘cd’ command, a Unix command which stands for change directory. To move up and out of the ‘Documents’ folder into the ‘Clinton’ folder, type the following and then hit Enter:
That is, the letters ‘cd’ together followed by one space followed by two periods. The two periods ‘..’ stand for up one level. At this point, the prompt should look like “C:\Users\Clinton”. Now, to move down into a specific folder you use the same ‘cd’ command followed by the name of the folder you want to move into. Since the ‘Desktop’ folder resides in the ‘Clinton’ folder, you can move down into the ‘Desktop’ folder by typing the following and then hitting Enter:
At this point, the prompt should look like “C:\Users\Clinton\Desktop” in the Command Prompt and we are exactly where we need to be since this is where we saved the Python script and CSV input file. The next step is to run the Python script.
To run the Python script, type one of the following commands on the command line, depending on your operating system, and then hit Enter:
python process_csv_file.py customers.csv my_output.csv
That is, type python, followed by a single space, followed by process_csv_file.py, followed by a single space, followed by customers.csv, followed by a single space, followed by my_output.csv, and then hit Enter:
chmod +x process_csv_file.py
./process_csv_file.py customers.csv my_output.csv
That is, type chmod, followed by a single space, followed by +x, followed by a single space, followed by process_csv_file.py, and then hit Enter. This command makes the Python script executable. Then type ./process_csv_file.py, followed by a single space, followed by customers.csv, followed by a single space, followed by my_output.csv, and then hit Enter:
After you hit Enter, you should not see any new output in the Command Prompt or Terminal window. However, if you minimize all of your open windows and look at your Desktop there should be a new CSV file called my_output.csv. Open the file. The contents should look like:
As you can see, the data in the input file, customers.csv, was successfully written to the output file, my_output.csv. This example demonstrates the basic process of reading a CSV file and writing data to an output file. The example is slightly unrealistic in that there isn’t any reason to re-write the data in the input file to an output file. However, it’s easy to imagine business logic that would make the output more interesting. Perhaps you only need the rows where the cost is greater than a specific threshold or where the purchase date is before a particular cut-off date. You would only need to make slight modifications to the code discussed above to add this business logic to your script.
This blog post covered processing a single CSV file, but one of the real advantages of learning to code in Python is the ability to automate and scale your processes to handle many files at a time. In a future post, I’ll cover modifications you can make to the code discussed in this post to create a Python script that processes multiple CSV files. Having read this post, I hope you’re now more familiar with Python and excited to start using it. If you have any questions, please reply to this post.