In my previous post I mentioned that a coworker had recently emailed me a folder full of over two hundred Excel files and asked me to extract some relevant data from each file. I noted how undertaking that task manually would have been time-consuming and error-prone and described how exhilarating it was to accomplish the task quickly by writing some Python code.
I didn’t show how to process multiple files with Python in that post because the code for processing multiple files is easier to understand once you’re familiar with the code for processing one file. That’s why that post demonstrated how to read and write a single CSV file with Python. With that knowledge under our belts, we’re now prepared to understand Python code for processing multiple CSV files.
One good way to learn to code in Python is to create small datasets on your laptop and then write a Python script to process or manipulate them in some way, so that’s what we’ll do here. The following example demonstrates one way to read multiple CSV files (with similarly formatted data), concatenate the data from the files, and write the results to an output file.
One assumption I make in this example is that you’ve already visited http://www.python.org/ and downloaded and installed the version of Python that is compatible with your computer’s operating system. Another assumption is that all of the input files are located in the same folder. Also, unlike in my previous post, the script in this example can handle commas embedded in column values, e.g. $1,563.25, because it uses Python’s built-in csv module to parse each row instead of splitting on every comma.
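To see why the csv module matters here, the following is a minimal sketch that compares naive comma splitting with csv.reader. The sample line is hypothetical (it is not taken from the input files in this post), and the sketch is written for Python 3, which is an assumption on my part.

```python
import csv
import io

# Hypothetical sample line: the amount contains a comma, so it is wrapped in quotes.
line = 'Jane Doe,"$1,563.25",1/15/2014\n'

# Naive splitting cuts the amount into two pieces at the embedded comma.
naive_fields = line.strip().split(',')

# csv.reader respects the quotes and keeps the amount in one piece.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(naive_fields)
print(parsed_fields)
```

The naive split produces four fields, while csv.reader returns the three fields you would expect.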
Ok, in order to process multiple CSV files we need to create multiple CSV files. Open Microsoft Excel and add the following data:
Now open the ‘Save As’ dialog box. In the location box, navigate to your Desktop so the file will be saved on your Desktop. In the format box, select ‘Comma Separated Values (.csv)’ so that Excel saves the file in the CSV format. Finally, in the ‘Save As’ or ‘File Name’ box, type “sales_january2014”. Click ‘Save’.
Ok, that’s one input file. Now let’s create a second input file. Open a new Excel workbook and add the following data:
Now open the ‘Save As’ dialog box. In the location box, navigate to your Desktop so the file will be saved on your Desktop. In the format box, select ‘Comma Separated Values (.csv)’ so that Excel saves the file in the CSV format. Finally, in the ‘Save As’ or ‘File Name’ box, type “sales_february2014”. Click ‘Save’. Ok, now we have two CSV input files, one for January and one for February. We’ll stick with two input files in this example to keep it simple, but please keep in mind that the code in this example can handle many more files; that is, it will scale well.
Now that we have two CSV files to work with, let’s create a Python script to read the files and write their contents to an output file. Open your favorite text editor (e.g. Notepad) and add the following lines of code:
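#!/usr/bin/env python
import csv
import glob
import os
import sys
input_path = sys.argv[1]    # folder that contains the input files
output_file = sys.argv[2]   # path to and name of the output file
filewriter = csv.writer(open(output_file,'wb'))
file_counter = 0
for input_file in glob.glob(os.path.join(input_path,'*.csv')):
    with open(input_file,'rU') as csv_file:
        filereader = csv.reader(csv_file)
        if file_counter < 1:    # first file: keep the header row
            for row in filereader:
                filewriter.writerow(row)
        else:                   # later files: skip the header row
            header = next(filereader,None)
            for row in filereader:
                filewriter.writerow(row)
    file_counter += 1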
The first line is a comment line that makes the script transferable across operating systems. The next four lines import additional built-in Python modules so that we can use their methods and functions. You can read more about these and other built-in modules at: http://docs.python.org/2/library/index.html.
The sixth line uses argv from the sys module to grab the first piece of information after the script name on the command line, the path to and name of the input folder, and assigns it to the variable input_path. Similarly, the seventh line grabs the second piece of information after the script name, the path to and name of the output file, and assigns it to the variable output_file.
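Here is a small sketch of how the two command line values could be pulled out of a sys.argv-style list. The helper name parse_arguments and the usage message are hypothetical additions of mine, and the example list stands in for what sys.argv would hold when the script is run with a folder and an output file name.

```python
def parse_arguments(argv):
    """Return (input_path, output_file) from a sys.argv-style list."""
    # argv[0] is the script name; the two values we want come after it.
    if len(argv) != 3:
        raise SystemExit('Usage: python process_many_csv_files.py <input_folder> <output_file>')
    return argv[1], argv[2]

# Stand-in for sys.argv when the script is run with "." and "sales_summary.csv".
example_argv = ['process_many_csv_files.py', '.', 'sales_summary.csv']
input_path, output_file = parse_arguments(example_argv)

print(input_path)
print(output_file)
```

Checking the number of arguments up front is optional, but it turns a confusing IndexError into a readable usage message when an argument is forgotten.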
The eighth line uses Python’s built-in open() function to open the output file in write ‘w’ mode and passes the open file to the csv module to create a writer object, filewriter, for writing to the output file. The ‘b’ opens the file in binary mode, which matters on systems that differentiate between binary and text files (e.g. in Python 2 on Windows it keeps extra blank lines out of the output); on systems that do not, the ‘b’ has no effect. The ninth line creates a variable, file_counter, to store the count of the number of files processed and initializes it to zero.
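The ‘wb’ mode is a Python 2 convention. If you are on Python 3 (an assumption on my part; this post’s script uses Python 2 file modes), a rough equivalent is to open the output file in text mode with newline=''. The sketch below writes two hypothetical rows to a temporary folder and also uses a “with” statement so the output file is closed when the block ends.

```python
import csv
import os
import tempfile

# Hypothetical output location inside a temporary folder.
output_file = os.path.join(tempfile.mkdtemp(), 'sales_summary.csv')

# In Python 3 the csv module expects a text-mode file opened with newline=''.
with open(output_file, 'w', newline='') as out_file:
    filewriter = csv.writer(out_file)
    filewriter.writerow(['Customer', 'Amount'])
    filewriter.writerow(['Jane Doe', '$1,563.25'])

# Read the raw text back, again without newline translation, to inspect it.
with open(output_file, 'r', newline='') as in_file:
    written_text = in_file.read()

print(repr(written_text))
```

Notice that the writer adds quotes around the amount on its own because the value contains a comma.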
The tenth line creates a list of the input files to be processed and also starts a “for” loop for looping through each of the input files. There is a lot going on in this one line, so let’s talk about how it works. os.path.join joins the two components between its parentheses. input_path is the path to the folder that contains the input files and ‘*.csv’ represents any file name that ends in ‘.csv’.
glob.glob expands the asterisk ‘*’, a Unix shell wildcard character, in ‘*.csv’ into the names of the files that actually match. Together, glob.glob and os.path.join create a list of our two input files, e.g. [‘C:\Users\Clinton\Desktop\sales_january2014.csv’, ‘C:\Users\Clinton\Desktop\sales_february2014.csv’]. Note that glob.glob does not guarantee the order of the files in this list. Finally, the “for” loop syntax executes the lines of code beneath this line for each of the input files in this list.
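As a quick illustration of how os.path.join and glob.glob work together, here is a sketch that creates three empty files in a temporary folder and lists only the ones ending in ‘.csv’. The temporary folder and the extra notes.txt file are hypothetical, and sorting the result is my addition to make the order predictable.

```python
import glob
import os
import tempfile

# Hypothetical input folder with two CSV files and one unrelated file.
input_path = tempfile.mkdtemp()
for name in ('sales_january2014.csv', 'sales_february2014.csv', 'notes.txt'):
    open(os.path.join(input_path, name), 'w').close()

# Build the search pattern, e.g. <folder>/*.csv
pattern = os.path.join(input_path, '*.csv')

# glob.glob returns matches in arbitrary order, so sort them for predictable results.
matches = sorted(glob.glob(pattern))
file_names = [os.path.basename(path) for path in matches]

print(file_names)
```

Only the two CSV files are returned; notes.txt does not match the pattern.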
The eleventh line uses a “with” statement to open each input file in read ‘r’ mode. The ‘U’ enables universal newlines mode, so lines ending in ‘\n’, ‘\r’, or ‘\r\n’ are all recognized regardless of the operating system that created the file. The twelfth line uses the csv module to create a reader object, filereader, for reading each input file.
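The ‘rU’ mode belongs to Python 2. In Python 3 (again an assumption on my part), opening the file with newline='' and letting csv.reader handle the line endings gives a similar result. The following sketch uses a hypothetical file whose three rows end in three different newline styles.

```python
import csv
import os
import tempfile

# Hypothetical file whose rows end in \r, \r\n, and \n respectively.
mixed_path = os.path.join(tempfile.mkdtemp(), 'mixed_newlines.csv')
with open(mixed_path, 'wb') as raw_file:
    raw_file.write(b'Customer,Amount\r' b'A,10\r\n' b'B,20\n')

# newline='' passes line endings through untranslated; rows are still split on all three styles.
with open(mixed_path, 'r', newline='') as csv_file:
    rows = list(csv.reader(csv_file))

print(rows)
```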
The thirteenth line creates an “if-else” statement for distinguishing between the first input file and all subsequent input files. The first time through the “for” loop file_counter equals zero, which is less than one, so the “if” block is executed. The code in the “if” block writes every row of data in the first input file, including the header row, to the output file.
At the bottom of the “for” loop, after processing the first input file, we add one to file_counter. Therefore, the second time through the “for” loop file_counter is not less than one, so the “else” block is executed. The code in the “else” block uses Python’s built-in next() function to read the first row, i.e. the header row, of the second and subsequent input files into the variable, header, so that it is not written to the output file. The remaining code in the “else” block writes the remaining rows in the input file, the rows of data beneath the header row, to the output file.
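To tie the pieces together, here is a sketch of the same header-handling logic wrapped in a function and written for Python 3 (my assumption). The function name concatenate_csv_files and the sample rows are hypothetical. I also added one step that is not in the script described above: skipping the output file if it sits in the input folder, since it would otherwise match ‘*.csv’ too.

```python
import csv
import glob
import os
import tempfile

def concatenate_csv_files(input_path, output_file):
    """Write one header row plus every data row from the CSV files in input_path."""
    output_abspath = os.path.abspath(output_file)
    file_counter = 0
    with open(output_file, 'w', newline='') as out_file:
        filewriter = csv.writer(out_file)
        for input_file in sorted(glob.glob(os.path.join(input_path, '*.csv'))):
            # Added step: do not read the output file back in as an input file.
            if os.path.abspath(input_file) == output_abspath:
                continue
            with open(input_file, 'r', newline='') as csv_file:
                filereader = csv.reader(csv_file)
                if file_counter < 1:
                    # First file: write every row, including the header row.
                    for row in filereader:
                        filewriter.writerow(row)
                else:
                    # Later files: read past the header row, then write the data rows.
                    header = next(filereader, None)
                    for row in filereader:
                        filewriter.writerow(row)
            file_counter += 1
    return file_counter

# Hypothetical sample data written to a temporary folder.
folder = tempfile.mkdtemp()
samples = {
    'sales_january2014.csv': [['Customer', 'Amount'], ['A', '10'], ['B', '20']],
    'sales_february2014.csv': [['Customer', 'Amount'], ['C', '30']],
}
for name, sample_rows in samples.items():
    with open(os.path.join(folder, name), 'w', newline='') as sample_file:
        csv.writer(sample_file).writerows(sample_rows)

output_file = os.path.join(folder, 'sales_summary.csv')
files_processed = concatenate_csv_files(folder, output_file)
with open(output_file, 'r', newline='') as result_file:
    combined_rows = list(csv.reader(result_file))

print(files_processed)
print(combined_rows)
```

Because the file names are sorted alphabetically in this sketch, the February rows come before the January rows. If the order matters for your analysis, you would want to sort the files some other way.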
Now that we understand what the code is supposed to do, let’s save this file as a Python script and use it to process our two input files. To save the file as a Python script, open the ‘Save As’ dialog box. In the location box, navigate to your Desktop so the file will be saved on your Desktop. In the format box, select ‘All Files’ so that the dialog box doesn’t select a specific file type. Finally, in the ‘Save As’ or ‘File Name’ box, type process_many_csv_files.py. Click ‘Save’. Now you have a Python script you can use to process multiple CSV files.
To use process_many_csv_files.py to read and write the contents of our two input files, open a Command Prompt (Windows) or Terminal (Mac) window. When the window opens the prompt will be in a particular folder, also known as a directory (e.g. “C:\Users\Clinton\Documents”). The next step is to navigate to the Desktop, where we saved the Python script.
To move between folders, you can use the ‘cd’ command, a Unix command which stands for change directory. To move up and out of the ‘Documents’ folder into the ‘Clinton’ folder, type the following and then hit Enter:
cd ..
That is, the letters ‘cd’ together followed by one space followed by two periods. The two periods ‘..’ stand for up one level. At this point, the prompt should look like “C:\Users\Clinton”. Now, to move down into a specific folder you use the same ‘cd’ command followed by the name of the folder you want to move into. Since the ‘Desktop’ folder resides in the ‘Clinton’ folder, you can move down into the ‘Desktop’ folder by typing the following and then hitting Enter:
cd Desktop
At this point, the prompt should look like “C:\Users\Clinton\Desktop” in the Command Prompt and we are exactly where we need to be since this is where we saved the Python script and two CSV input files. The next step is to run the Python script.
To run the Python script, type one of the following commands on the command line, depending on your operating system, and then hit Enter:
Windows:
python process_many_csv_files.py . sales_summary.csv
That is, type python, followed by a single space, followed by process_many_csv_files.py, followed by a single space, followed by a single period, followed by a single space, followed by sales_summary.csv, and then hit Enter. The single period refers to the current directory, your Desktop (i.e. the folder that contains your two input files).
Mac:
chmod +x process_many_csv_files.py
./process_many_csv_files.py . sales_summary.csv
That is, type chmod, followed by a single space, followed by +x, followed by a single space, followed by process_many_csv_files.py, and then hit Enter. This command makes the Python script executable. Then type ./process_many_csv_files.py, followed by a single space, followed by a single period, followed by a single space, followed by sales_summary.csv, and then hit Enter.
After you hit Enter, you should not see any new output in the Command Prompt or Terminal window. However, if you minimize all of your open windows and look at your Desktop there should be a new CSV file called sales_summary.csv. Open the file. The contents should look like:
As you can see, a single header row and the six rows of data from the two input files were successfully written to the output file, sales_summary.csv. Often, this procedure of concatenating multiple input files into a single output file is all you need to do to begin your analysis. However, sometimes you may not need all of the rows or columns in the output file. Or you may need to modify the data or perform a calculation before writing it to the output file. In many cases, you would only need to make slight modifications to the code discussed above to alter the data written to your output file.
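As one example of such a slight modification, here is a sketch that keeps only the rows whose amount is at least some minimum before they would be written out. The column layout, the sample values, and the helper names parse_amount and rows_at_least are all hypothetical; your files will likely need a different condition.

```python
import csv
import io

def parse_amount(text):
    """Convert a string such as '$1,563.25' to a float."""
    return float(text.replace('$', '').replace(',', ''))

def rows_at_least(rows, minimum_amount, amount_index):
    """Return the rows whose amount column is at least minimum_amount."""
    return [row for row in rows if parse_amount(row[amount_index]) >= minimum_amount]

# Hypothetical data standing in for one input file.
sample_file = io.StringIO(
    'Customer,Amount\n'
    'A,"$1,563.25"\n'
    'B,$20.00\n'
    'C,$250.00\n'
)
filereader = csv.reader(sample_file)
header = next(filereader)  # set the header row aside
large_sales = rows_at_least(filereader, 100.0, amount_index=1)

print(header)
print(large_sales)
```

In the full script, you would apply a filter like this inside the “for row in filereader” loops and call filewriter.writerow only for the rows you want to keep.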
In this example there were only two input files, but the code generalizes to as many input files as your computer can handle. So if you need to concatenate the data in a few dozen, hundred, or thousand CSV files, the code is reusable essentially as-is. This ability to automate and scale repetitive procedures on files and data is one of the great advantages of learning to code (even a little bit).
Now you know that it only takes a few lines of code to automate a process that, given a larger number of files, would be time-consuming and error-prone or even infeasible. Being able to quickly accomplish a task that would be difficult or impossible to do manually is empowering. Add in the benefit of eliminating manual “copy/paste” errors (once you debug your code), and the new capability is really exciting. Having read this post, I hope you’re now more familiar with Python and eager to begin using it. If you have any questions, please reply to this post.