Too Many Files? Feed Them to Python’s Voracious Glob Module

In my previous post I mentioned that a coworker had recently emailed me a folder full of over two hundred Excel files and asked me to extract some relevant data from each file. I noted how undertaking that task manually would have been time-consuming and error-prone and described how exhilarating it was to accomplish the task quickly by writing some Python code.

I didn’t show how to process multiple files with Python in that post because it is easier to understand the code for processing multiple files once you’re familiar with the code for processing one file. That’s why that post demonstrated how to read and write a single CSV file with Python. With that knowledge under our belts, we’re now prepared to understand Python code for processing multiple CSV files.

One good way to learn to code in Python is to create small datasets on your laptop and then write a Python script to process or manipulate them in some way, so that’s what we’ll do here. The following example demonstrates one way to read multiple CSV files (with similarly formatted data), concatenate the data from the files, and write the results to an output file.

One assumption I make in this example is that you’ve already visited http://www.python.org/ and downloaded and installed the version of Python that is compatible with your computer’s operating system. Another assumption is that all of the input files are located in the same folder. Also, unlike in my previous post, the script in this example can handle commas embedded in column values because it imports Python’s built-in csv module, which correctly parses quoted values that contain embedded commas, e.g. $1,563.25.
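To see why the csv module matters here, consider a value like $1,563.25. A naive split on commas would break that value apart, but csv.reader respects the quoting. Here is a small Python 3 sketch with made-up data (the customer name and amount are just illustrations):

```python
import csv
import io

# A CSV row in which one value contains an embedded comma.
# The value is quoted in the file, so line.split(',') would mangle it,
# but csv.reader keeps it together.
data = 'Customer,Amount\nSmith,"$1,563.25"\n'
for row in csv.reader(io.StringIO(data)):
    print(row)
# The second row comes back as ['Smith', '$1,563.25'], comma intact.
```

The same parsing happens automatically in the script below, because it reads each input file through csv.reader rather than splitting lines by hand.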

Ok, in order to process multiple CSV files we need to create multiple CSV files. Open Microsoft Excel and add the following data:

Figure I: 1st Input File – sales_january2014.csv

Now open the ‘Save As’ dialog box. In the location box, navigate to your Desktop so the file will be saved on your Desktop. In the format box, select ‘Comma Separated Values (.csv)’ so that Excel saves the file in the CSV format. Finally, in the ‘Save As’ or ‘File Name’ box, type “sales_january2014”. Click ‘Save’.

Ok, that’s one input file. Now let’s create a second input file. Open a new Excel workbook and add the following data:

Figure II: 2nd Input File – sales_february2014.csv

Now open the ‘Save As’ dialog box. In the location box, navigate to your Desktop so the file will be saved on your Desktop. In the format box, select ‘Comma Separated Values (.csv)’ so that Excel saves the file in the CSV format. Finally, in the ‘Save As’ or ‘File Name’ box, type “sales_february2014”. Click ‘Save’. Ok, now we have two CSV input files, one for January and one for February. We’ll stick with two input files in this example to keep it simple, but please keep in mind that the code in this example can handle many more files; that is, it will scale well.

Now that we have two CSV files to work with, let’s create a Python script to read the files and write their contents to an output file. Open your favorite text editor (e.g. Notepad) and add the following lines of code:

#!/usr/bin/python
import csv
import glob
import os
import sys

input_path = sys.argv[1]
output_file = sys.argv[2]

filewriter = csv.writer(open(output_file, 'wb'))
file_counter = 0
for input_file in glob.glob(os.path.join(input_path, '*.csv')):
    with open(input_file, 'rU') as csv_file:
        filereader = csv.reader(csv_file)
        if file_counter < 1:
            for row in filereader:
                filewriter.writerow(row)
        else:
            header = next(filereader, None)
            for row in filereader:
                filewriter.writerow(row)
    file_counter += 1

The first line is a “shebang” line: on Unix-like systems it tells the operating system which interpreter to use to run the script, and on Windows it is treated as an ordinary comment, so the script remains portable across operating systems. The next four lines import additional built-in Python modules so that we can use their methods and functions. You can read more about these and other built-in modules at: http://docs.python.org/2/library/index.html.

The sixth line uses argv from the sys module to grab the first piece of information after the script name on the command line, the path to and name of the input folder, and assigns it to the variable input_path. Similarly, the seventh line grabs the second piece of information after the script name, the path to and name of the output file, and assigns it to the variable output_file.

The eighth line uses the csv module to open the output file in write ‘w’ mode and create a writer object, filewriter, for writing to the output file. The ‘b’ enables a distinction between binary and text files for systems that differentiate between binary and text files, but for systems that do not, the ‘b’ has no effect. The ninth line creates a variable, file_counter, to store the count of the number of files processed and initializes it to zero.
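As a standalone illustration of the writer object, here is a Python 3 sketch that creates a writer, sends it two rows, and reads the file back (the temporary file name is my own invention, not part of the original script; newline='' is the Python 3 analogue of the ‘b’ flag discussed above):

```python
import csv
import os
import tempfile

# Write two rows to a throwaway CSV file with a csv.writer object.
path = os.path.join(tempfile.mkdtemp(), 'demo_output.csv')
with open(path, 'w', newline='') as f:
    filewriter = csv.writer(f)
    filewriter.writerow(['Customer', 'Sale Amount'])
    filewriter.writerow(['Smith', '$1,563.25'])

# Read the file back to confirm the round trip.
with open(path, newline='') as f:
    rows = list(csv.reader(f))
print(rows)
# The embedded comma in '$1,563.25' survives because the writer quoted it.
```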

The tenth line creates a list of the input files to be processed and also starts a “for” loop for looping through each of the input files. There is a lot going on in this one line, so let’s talk about how it works. os.path.join joins the two components between its parentheses. input_path is the path to the folder that contains the input files and ‘*.csv’ represents any file name that ends in ‘.csv’.

glob.glob expands the asterisk ‘*’, a Unix shell wildcard character, in ‘*.csv’ into the actual file names. Together, glob.glob and os.path.join create a list of our two input files, e.g. [‘C:\Users\Clinton\Desktop\sales_january2014.csv’, ‘C:\Users\Clinton\Desktop\sales_february2014.csv’]. Finally, the “for” loop syntax executes the lines of code beneath this line for each of the input files in this list.
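You can watch these two functions work together in a Python 3 sketch. It builds a throwaway folder with invented file names, so the exact paths on your system will differ:

```python
import glob
import os
import tempfile

# Create a temporary folder with two CSV files and one unrelated file.
folder = tempfile.mkdtemp()
for name in ['sales_january2014.csv', 'sales_february2014.csv', 'notes.txt']:
    open(os.path.join(folder, name), 'w').close()

# os.path.join glues the folder path onto the wildcard pattern,
# and glob.glob expands the pattern into matching file paths.
pattern = os.path.join(folder, '*.csv')
matches = sorted(glob.glob(pattern))
print([os.path.basename(m) for m in matches])
# Only the two .csv files match; notes.txt is excluded.
```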

The eleventh line uses a “with” statement to open each input file in read ‘r’ mode. The ‘U’ mode helps recognize newlines in case your version of Python is built without universal newlines. The twelfth line uses the csv module to create a reader object, filereader, for reading each input file.

The thirteenth line creates an “if-else” statement for distinguishing between the first input file and all subsequent input files. The first time through the “for” loop file_counter equals zero, which is less than one, so the “if” block is executed. The code in the “if” block writes every row of data in the first input file, including the header row, to the output file.

At the bottom of the “for” loop, after processing the first input file, we add one to file_counter. Therefore, the second time through the “for” loop file_counter is not less than one, so the “else” block is executed. The code in the “else” block uses the built-in next() function to read the first row, i.e. the header row, of the second and subsequent input files into the variable header, so that it is not written to the output file. The remaining code in the “else” block writes the remaining rows in the input file, the rows of data beneath the header row, to the output file.
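The header-skipping logic can be isolated in a tiny Python 3 sketch. Here the two “files” are inlined strings rather than files on disk, and the rows are collected in a list instead of being written out:

```python
import csv
import io

# Two CSV "files" with identical headers, inlined for the example.
first = 'Name,Amount\nA,1\nB,2\n'
second = 'Name,Amount\nC,3\n'

output_rows = []
for file_counter, text in enumerate([first, second]):
    filereader = csv.reader(io.StringIO(text))
    if file_counter < 1:
        # First file: keep every row, including the header.
        for row in filereader:
            output_rows.append(row)
    else:
        # Later files: consume the header with next() so only data rows remain.
        header = next(filereader, None)
        for row in filereader:
            output_rows.append(row)

print(output_rows)
# One header row followed by the three data rows from both files.
```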

Now that we understand what the code is supposed to do, let’s save this file as a Python script and use it to process our two input files. To save the file as a Python script, open the ‘Save As’ dialog box. In the location box, navigate to your Desktop so the file will be saved on your Desktop. In the format box, select ‘All Files’ so that the dialog box doesn’t select a specific file type. Finally, in the ‘Save As’ or ‘File Name’ box, type process_many_csv_files.py. Click ‘Save’. Now you have a Python script you can use to process multiple CSV files.

Figure III: Python Script – process_many_csv_files.py

To use process_many_csv_files.py to read and write the contents of our two input files, open a Command Prompt (Windows) or Terminal (Mac) window. When the window opens the prompt will be in a particular folder, also known as a directory (e.g. “C:\Users\Clinton\Documents”). The next step is to navigate to the Desktop, where we saved the Python script.

To move between folders, you can use the ‘cd’ command, a Unix command which stands for change directory. To move up and out of the ‘Documents’ folder into the ‘Clinton’ folder, type the following and then hit Enter:

cd ..

That is, the letters ‘cd’ together followed by one space followed by two periods. The two periods ‘..’ stand for up one level. At this point, the prompt should look like “C:\Users\Clinton”. Now, to move down into a specific folder you use the same ‘cd’ command followed by the name of the folder you want to move into. Since the ‘Desktop’ folder resides in the ‘Clinton’ folder, you can move down into the ‘Desktop’ folder by typing the following and then hitting Enter:

cd Desktop

At this point, the prompt should look like “C:\Users\Clinton\Desktop” in the Command Prompt and we are exactly where we need to be since this is where we saved the Python script and two CSV input files. The next step is to run the Python script.

To run the Python script, type one of the following commands on the command line, depending on your operating system, and then hit Enter:

Windows:
python process_many_csv_files.py . sales_summary.csv

That is, type python, followed by a single space, followed by process_many_csv_files.py, followed by a single space, followed by a single period, followed by a single space, followed by sales_summary.csv, and then hit Enter. The single period refers to the current directory, your Desktop (i.e. the folder that contains your two input files).

Mac:
chmod +x process_many_csv_files.py
./process_many_csv_files.py . sales_summary.csv

That is, type chmod, followed by a single space, followed by +x, followed by a single space, followed by process_many_csv_files.py, and then hit Enter. This command makes the Python script executable. Then type ./process_many_csv_files.py, followed by a single space, followed by a single period, followed by a single space, followed by sales_summary.csv, and then hit Enter:

After you hit Enter, you should not see any new output in the Command Prompt or Terminal window. However, if you minimize all of your open windows and look at your Desktop there should be a new CSV file called sales_summary.csv. Open the file. The contents should look like:

Figure IV: Output File – sales_summary.csv

As you can see, a single header row and the six rows of data from the two input files were successfully written to the output file, sales_summary.csv. Often, this procedure of concatenating multiple input files into a single output file is all you need to do to begin your analysis. However, sometimes you may not need all of the rows or columns in the output file. Or you may need to modify the data or perform a calculation before writing it to the output file. In many cases, you would only need to make slight modifications to the code discussed above to alter the data written to your output file.
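As one illustration of such a modification, the loop body can filter rows before writing them. This hypothetical Python 3 variant keeps only rows whose sale amount exceeds a threshold (the column position, threshold, and sample data are all invented for the example):

```python
import csv

# Sample rows as they would come out of csv.reader.
rows = [['Customer', 'Sale Amount'],
        ['Smith', '500'],
        ['Jones', '1500'],
        ['Lee', '2500']]

threshold = 1000.0
kept = [rows[0]]  # always keep the header row
for row in rows[1:]:
    # Convert the amount to a number before comparing; replace() strips
    # symbols like '$' and ',' in case they appear in the data.
    amount = float(row[1].replace('$', '').replace(',', ''))
    if amount > threshold:
        kept.append(row)

print(kept)
# Only Jones and Lee clear the threshold.
```

In the full script, the writerow() calls would simply move inside a condition like this one.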

In this example there were only two input files, but the code generalizes to as many input files as your computer can handle. So if you need to concatenate the data in a few dozen, hundred, or thousand CSV files, the code is essentially reusable as-is. This ability to automate and scale repetitive procedures on files and data is one of the great advantages of learning to code (even a little bit).

Now you know that it only takes a few lines of code to automate a process that, given a larger number of files, would be time-consuming and error-prone or even infeasible. Being able to quickly accomplish a task that would be difficult or impossible to do manually is empowering. Add in the benefit of eliminating manual “copy/paste” errors (once you debug your code), and the new capability is really exciting. Having read this post, I hope you’re now more familiar with Python and eager to begin using it. If you have any questions, please reply to this post.

33 thoughts on “Too Many Files? Feed Them to Python’s Voracious Glob Module”

  1. Hi Clinton- Thanks for sharing this post. I’m a newbie to Python and had a follow up question to your post.

    1) Your script does not indicate where your csv files are located that you are trying to combine. How does your script know which files to look at?

    • Hi Nikola,

      Thanks for your question. In the tutorial, right after Figure III, I ask the reader to navigate to his or her Desktop. The Desktop is the “folder” in which we saved the CSV files we want to process. The single period ‘.’ represents your current directory, that is, the folder you’re currently in. So, once you’ve navigated to your Desktop, the single period ‘.’ in the two equivalent commands shown below represents the path to the “folder” that’s holding the CSV files.

      Windows:
      python process_many_csv_files.py . sales_summary.csv

      Mac:
      ./process_many_csv_files.py . sales_summary.csv

      **To be more explicit, you can modify the two commands to:

      Windows:
      python process_many_csv_files.py “C:\Users\Clinton\Desktop\” sales_summary.csv

      Mac:
      ./process_many_csv_files.py /Users/[your name]/Desktop/ sales_summary.csv

      **For example, for me, the Mac command looks like: ./process_many_csv_files.py /Users/clinton/Desktop/ sales_summary.csv

      If you want to save your CSV files in a different folder, for example, in a folder called “my_csv_files” in your “Documents” folder, then the command would look like:

      ./process_many_csv_files.py /Users/clinton/Documents/my_csv_files/ sales_summary.csv

      Thanks again for your question. Please let me know if this response doesn’t answer your question or help you get the command working correctly for you.

      Sincerely,
      Clinton

    • Hi there,

      Thanks for your question. I called the variable input_path instead of input_file because it holds a path to a folder that contains multiple files instead of holding a path to a single file. However, since it’s simply a variable name, you can name the variable anything that makes sense to you. Please let me know if you have any additional questions.

      Sincerely,
      Clinton

  2. Hi Clinton-

    Thanks for this post! Very lucid and helpful. I’m a newbie to data analytics (I’m a PhD philosopher and cognitive scientist), and I’m glad there are people out there who take the time to share their expertise. All the best,

  3. Hi Clinton-

    I have a similar use case with one small tweak: I would like to read data only from certain columns (the headers are in a different order in different files) and write all the rows of data for those columns into the resulting CSV. You mention this possibility at the end of your post; is there an easy way to modify the code you’ve posted to be selective about column headers?

    Best

    • Hi Zack,

      Here is an example of finding specific columns by name/column heading, instead of by index position (since they can be in different orders in different files), in one CSV-formatted input file. If you need to process multiple input files, then use all of the lines from “my_columns = …” down to the bottom of the script and plug the code in beneath your glob line, e.g. for input_file in glob.glob(os.path.join(input_path,’*.csv’)): Please let me know if you have any questions:

      #!/usr/bin/python
      import csv
      import sys

      input_file = sys.argv[1]
      #output_file = sys.argv[2]

      my_columns = ['Invoice Number', 'Purchase Date']
      with open(input_file, 'rU') as csv_in_file:
          #with open(output_file, 'wb') as csv_out_file:
              filereader = csv.reader(csv_in_file, delimiter=',')
              #filewriter = csv.writer(csv_out_file, delimiter=',')
              header = next(filereader, None)
              header_list_output = []
              for index_value in range(len(header)):
                  if header[index_value] in my_columns:
                      header_list_output.append(header[index_value])
              print header_list_output
              #filewriter.writerow(header_list_output)
              for row in filereader:
                  row_list_output = []
                  for index_value in range(len(header)):
                      if header[index_value] in my_columns:
                          row_list_output.append(row[index_value])
                  print row_list_output
                  #filewriter.writerow(row_list_output)

      • Hi Zack,

        I think you’re on the right path. I had commented out the lines that actually write to an output file because I thought you might want to simply print to the screen while debugging, but in the new code that I show below I’ve uncommented the lines so you can write to an output file. I tested this code on a folder containing three CSV files, so it should work for you as-is if you have all of your CSV files in one folder, but please let me know if you have any questions. I ran the script on the command line as:

        ./CSVCombine.py /Users/clinton/Desktop/temp_folder/ combined_output.csv

        The code in CSVCombine.py is:

        #!/usr/bin/python
        import csv
        import glob
        import os
        import sys

        input_path = sys.argv[1]
        output_file = sys.argv[2]

        my_columns = ['Street', 'Booked?', 'Notes']
        first_file = True
        for input_file in glob.glob(os.path.join(input_path, '*.csv')):
            with open(input_file, 'rU') as csv_in_file:
                with open(output_file, 'ab') as csv_out_file:
                    filereader = csv.reader(csv_in_file, delimiter=',')
                    filewriter = csv.writer(csv_out_file, delimiter=',')
                    if not first_file:
                        header = next(filereader, None)
                        for row in filereader:
                            row_list_output = []
                            for index_value in range(len(header)):
                                if header[index_value] in my_columns:
                                    row_list_output.append(row[index_value])
                            print row_list_output
                            filewriter.writerow(row_list_output)
                    else:
                        header = next(filereader, None)
                        header_list_output = []
                        for index_value in range(len(header)):
                            if header[index_value] in my_columns:
                                header_list_output.append(header[index_value])
                        print header_list_output
                        filewriter.writerow(header_list_output)
                        for row in filereader:
                            row_list_output = []
                            for index_value in range(len(header)):
                                if header[index_value] in my_columns:
                                    row_list_output.append(row[index_value])
                            print row_list_output
                            filewriter.writerow(row_list_output)
                        first_file = False

  4. Thanks Clinton, it works perfectly!

    Out of curiosity, is there an easy way to start from a specific row when reading in additional csvs? For example, instead of just reading the header from the first row of the file, could I start from row 3 to define the header? I assumed I would modify header = next(filereader, None) to define the header as a certain set of values from a defined row. Could I also explicitly define the column names in the header at the beginning of the script as an alternative to reading it from the first file?

    Thanks again!

    • Hi Zack,

      You can definitely start from a different row. If the header row is consistently the same row across the files, e.g. row 3, then it’s easier; otherwise, you’re going to need a file_counter variable and a bunch of if-elif-else statements to treat different files differently. So, assuming the header row is consistent across the files, a hacky but easy thing to do is to add as many of these lines “throw_away = next(filereader, None)” above the two existing lines “header = next(filereader, None)” as you need to skip the number of rows you want to pass over. For example, if the header row is always in row 3, then add two throw_away lines above each of the header lines to “throw away” the rows in the input files that you want to skip over.

      You can definitely define the column names in the header at the beginning instead of using the line: header = next(filereader, None). For example, you can add header = [‘ColumnHeader1’, ‘ColumnHeader2’, …] at the top beneath the line my_columns = [blah, blah]. Just know that the script uses the length of the header variable to scan across the columns in the CSV files, so if you define the header variable to have only two values, e.g. header = [‘ColumnHeader1’, ‘ColumnHeader2’], but your input files have 11 columns, then the script might not scan across the input columns far enough to find the columns you’re looking for. I haven’t modified the script and experimented with the effect, but just be aware. Hope this info helps.
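A minimal Python 3 sketch of the throw-away approach described above, with invented file contents inlined rather than read from disk (assuming the header is always row 3):

```python
import csv
import io

# A "file" whose first two rows are preamble and whose header is row 3.
text = 'Report for January\nGenerated 2014-02-01\nName,Amount\nA,1\nB,2\n'
filereader = csv.reader(io.StringIO(text))

# Throw away the first two rows so the real header in row 3 is read next.
throw_away = next(filereader, None)
throw_away = next(filereader, None)
header = next(filereader, None)
data = list(filereader)
print(header, data)
# header is ['Name', 'Amount']; data holds only the rows beneath it.
```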

      • Thanks Clinton, I was running into a few other issues that I’ve been able to get past.

        1.) The script was correctly identifying data in columns that matched ‘My Columns’ but in some instances the order of columns was different in the input file than the output file so I was getting data written to a column that didn’t match the column name

        2.) There were a couple pieces of additional data that I wanted to grab from the second row and append to all the rows being written for that file

        I was able to follow your format and add some additional code for these issues, and everything is working! I was able to combine about 3.6K CSV files all into one:

        #!/usr/bin/python
        import csv
        import glob
        import os
        import sys

        input_path = sys.argv[1]
        output_file = sys.argv[2]

        my_columns = ['Street', 'Booked?', 'Notes', 'Contact?', 'CB?']
        with open(output_file, 'ab') as csv_out_file:
            filewriter = csv.writer(csv_out_file, delimiter=',')
            header = ['Street', 'Booked?', 'Notes', 'Contact?', 'CB?', 'Town', 'Name/Date']
            filewriter.writerow(header)
        for input_file in glob.glob(os.path.join(input_path, '*.csv')):
            with open(input_file, 'rU') as csv_in_file:
                if input_file == output_file:
                    break
                else:
                    with open(output_file, 'ab') as csv_out_file:
                        filereader = csv.reader(csv_in_file, delimiter=',')
                        filewriter = csv.writer(csv_out_file, delimiter=',')
                        throw_away = next(filereader, None)
                        extra_line = next(filereader, None)
                        city = ['0']
                        for index_value in range(len(extra_line)):
                            print('extra_line len')
                            print(len(extra_line))
                            print(extra_line)
                            print('index value 1')
                            print(index_value)
                            print(extra_line[index_value])
                            print(type(extra_line[index_value]))
                            if extra_line[index_value] == "":
                                print('is blank')
                                continue
                            elif "Turf" in extra_line[index_value]:
                                print('contains turf?')
                                city[0] = extra_line[index_value]
                            else:
                                print('continue')
                                continue
                        print 'city value'
                        print(city)
                        print('extra line')
                        print(extra_line)
                        canvasser_name = extra_line[1]
                        header = next(filereader, None)
                        print('header length')
                        print(len(header))
                        for row in filereader:
                            if row[1] == "":
                                break
                            if row[1] == 'Totals:':
                                break
                            else:
                                row_list_output = ["", "", "", "", ""]
                                for index_value in range(len(header)):
                                    print('index value')
                                    print(index_value)
                                    print('row value')
                                    print(row[index_value])
                                    print('header value')
                                    print(header[index_value])
                                    print('range(len(header))')
                                    print(range(len(header)))
                                    print('header full value')
                                    print(header)
                                    if header[index_value] == my_columns[0]:
                                        print('my_column 0')
                                        print(my_columns[0])
                                        print(header[index_value] == my_columns[0])
                                        row_list_output[0] = row[index_value]
                                        print('row_list_output 0')
                                        print(row_list_output)
                                    elif header[index_value] == my_columns[1]:
                                        print('my_column 1')
                                        print(my_columns[1])
                                        row_list_output[1] = row[index_value]
                                        print('row_list_output 1')
                                        print(row_list_output)
                                    elif header[index_value] == my_columns[2]:
                                        print('my_column 2')
                                        print(my_columns[2])
                                        row_list_output[2] = row[index_value]
                                        print('row_list_output 2')
                                        print(row_list_output)
                                    elif header[index_value] == my_columns[3]:
                                        print('my_column 3')
                                        print(my_columns[3])
                                        row_list_output[3] = row[index_value]
                                        print('row_list_output 3')
                                        print(row_list_output)
                                    elif header[index_value] == my_columns[4]:
                                        print('my_column 4')
                                        print(my_columns[4])
                                        print('index value')
                                        print(row[index_value])
                                        row_list_output[4] = row[index_value]
                                        print('row_list_output 4')
                                        print(row_list_output)
                                    else:
                                        print('Next!')
                                row_list_output.append(city)
                                row_list_output.append(canvasser_name)
                                print('final row list output')
                                print row_list_output
                                filewriter.writerow(row_list_output)

        Thanks for all the help!

      • That is awesome! So glad to hear everything worked out! It was fun working on the problem. Have a great one!

  5. I really like this code snippet, but unfortunately if I process anything larger than 10 KB it gets stuck and doesn’t finish, creating an ever-larger output CSV if I don’t cancel it. Otherwise, small combines are great, even though I do have to remove duplicates. Not sure why it is duplicating. This is for Python 2.7, correct? I’ve grown used to 3.4.

  6. What if the follow-on CSV input files do not contain header rows but only data in the same format as the first CSV? How do I modify the inner conditional to disregard that header check?
    My output seems to be running infinitely, too.

    • Hi Jeremy, If all of the files after the first file don’t have header rows, then you can simplify the code and get rid of the if-else statement. Then the code will write the header row and data rows from the first file and then all of the data rows from the subsequent files and you won’t have any troublesome extra header rows in the output file. Hope this helps:

      #!/usr/bin/python
      import csv
      import glob
      import os
      import sys

      input_path = sys.argv[1]
      output_file = sys.argv[2]

      filewriter = csv.writer(open(output_file, 'wb'))
      for input_file in glob.glob(os.path.join(input_path, '*.csv')):
          with open(input_file, 'rU') as csv_file:
              filereader = csv.reader(csv_file)
              for row in filereader:
                  filewriter.writerow(row)

  7. Reblogged this on jc0re and commented:
    Clinton’s post on using Python to process multiple CSVs was a huge help in conducting analysis! The write up is easy to follow, perfect for the novice user like me to follow and references an earlier post that covers the basics.

  8. Thanks for sharing your coding skill. I’m quite new to python and I’m using 2.7 and IDLE.
    I get a syntax error on the line “filewriter = csv.writer(open(output_file,’wb’))”. IDLE puts up a dialog box and inserts “# -*- coding: utf-8 -*-” above the imports.

    I can’t see where the non-ascii code might be that might be the cause of the problem. I don’t see anyone else complaining about this problem so I must have missed something quite basic. Can you advise me?

  9. I found the answer to my own question. I lazily copied and pasted your code into IDLE and somehow non-ASCII single quotes were included. Simply re-typing all the single quotes fixed the problem.

    Thanks again for your most instructive tutorial.

    • Hi Steve,

      Sorry for the late response. That’s awesome that you figured it out. I run into the same problem sometimes when I transfer code between PCs and Macs. I recently made the switch from Python 2 to Python 3, so I’m also getting used to some of the new/different syntax. It’s been a while since I posted on the blog because I’ve been working on a Python book, which will be released by O’Reilly Media later this year. I look forward to getting back to the blog. I wish you the best with Python. I’ve found it to be incredibly fun and empowering. All the best.

  10. Hi,
    thank you for this helpful post. I tried the code and it worked for files that had a few rows, but when I tried it on 2 files with 20,000 rows the file-writing process took a very long time, so I needed to break the process. When I broke the process, I found that the output file had a million rows or more when it was supposed to have fewer. I can’t figure out why this is happening. Any help would be greatly appreciated.

    I used this code
    #!/usr/bin/python
    import csv
    import glob
    import os
    import sys

    input_path = sys.argv[1]
    output_file = sys.argv[2]

    csv_out = open(output_file, 'wb')
    filewriter = csv.writer(csv_out)

    for input_file in glob.glob(os.path.join(input_path, '*.csv')):
        with open(input_file, 'rU') as csv_file:
            filereader = csv.reader(csv_file)
            for row in filereader:
                filewriter.writerow(row)

  11. Hello Clinton

    Many thanks for your post. This looks like just the code I need to merge a large number of CSV files. However, I get an error.

    When I execute python process_many_csv_files.py . combination.csv I get
    filewriter = csv.writer(open(output_file,\u2019wb\u2019))
    SyntaxError: invalid character in identifier.

    Might it be something to do with the encoding option for the Python code? Currently UTF-8?

    Many thanks, Chris

    • Hello Chris,

      Are you working with Python 2.x or 3.x and with Windows or Mac? Here are a few ideas:

      1. The Windows curvy single quote character isn’t valid:
      http://stackoverflow.com/questions/24915864/beautifulsoup-error-invalid-character-in-identifier

      2. You can try adding the following lines at the top of your script:

      #!/usr/bin/env python
      # -*- coding: utf-8 -*-

      3. You can try using codecs to read the file:
      https://docs.python.org/2/howto/unicode.html#reading-and-writing-unicode-data

      import codecs
      f = codecs.open('unicode.rst', encoding='utf-8')
      for line in f:
          print repr(line)

      4. Here is similar Python 3.x code:

      #!/usr/bin/env python3
      import csv
      import glob
      import os
      import sys

      input_path = sys.argv[1]
      output_file = sys.argv[2]

      first_file = True
      for input_file in glob.glob(os.path.join(input_path, 'sales_*')):
          print(os.path.basename(input_file))
          with open(input_file, 'r', newline='') as csv_in_file:
              with open(output_file, 'a', newline='') as csv_out_file:
                  filereader = csv.reader(csv_in_file)
                  filewriter = csv.writer(csv_out_file)
                  if first_file:
                      for row in filereader:
                          filewriter.writerow(row)
                      first_file = False
                  else:
                      header = next(filereader)
                      for row in filereader:
                          filewriter.writerow(row)

      I hope one of these suggestions solves your problem, but let me know if they don’t and you can share more info about the problem. Thanks for reading and using the post!

  12. Hey Clinton, thanks for posting the great code.

    I modified it a bit (below) but am having a hard time with header differentiation. I either get no headers (with it as is) or if I switch the file_counter to >1 (or 0 or >=1) I get all the headers.

    Any help would be greatly appreciated.

    #!/usr/bin/python
    import csv
    import glob
    import os
    import sys

    filewriter = csv.writer(open('output.csv', 'wt', newline=''))
    file_counter = 0
    for filename in glob.glob('*.csv'):
        with open(filename, 'r') as csv_file:
            filereader = csv.reader(csv_file)
            if file_counter < 1:
                for row in filereader:
                    filewriter.writerow(row)
                file_counter += 1
            else:
                header = next(filereader, None)
                for row in filereader:
                    filewriter.writerow(row)
                file_counter += 1

    • Hi Ronald,

      My guess is that having "file_counter += 1" in both sections is causing the issue. Here is an updated Python 3 version that uses a True/False flag variable to distinguish the first file from subsequent files instead of a counter. Please let me know if you run into any additional issues. Happy concatenating!

      #!/usr/bin/env python3
      import csv
      import glob
      import os
      import sys

      input_path = sys.argv[1]
      output_file = sys.argv[2]

      first_file = True
      for input_file in glob.glob(os.path.join(input_path, '*.csv')):
          print(os.path.basename(input_file))
          with open(input_file, 'r', newline='') as csv_in_file:
              with open(output_file, 'a', newline='') as csv_out_file:
                  filereader = csv.reader(csv_in_file)
                  filewriter = csv.writer(csv_out_file)
                  if first_file:
                      for row in filereader:
                          filewriter.writerow(row)
                      first_file = False
                  else:
                      header = next(filereader)
                      for row in filereader:
                          filewriter.writerow(row)

  13. Clinton,

    Thanks brother! The first_file syntax worked great.

    I still want to just run it in whatever folder the script resides in (the folks who will be using it aren’t going to be up for arguments 😉 ) and I wanted some insurance against it looping through the merged file a dozen times or appending the results to the existing merged file each time it ran, so I updated the code a bit (below for your reference).

    Thanks again!

    import csv
    import glob
    import os
    import sys

    for f in glob.glob("*.csv"): # find all csv files
        if f.startswith("merged.csv"):
            os.remove(f) # delete any existing merged.csv; this prevents appending multiple copies of the same data

    first_file = True
    for input_file in glob.glob("*.csv"):
        if input_file != 'merged.csv':
            print(os.path.basename(input_file))
            with open(input_file, 'r') as csv_in_file:
                with open('merged.csv', 'a', newline='') as csv_out_file:
                    filereader = csv.reader(csv_in_file)
                    filewriter = csv.writer(csv_out_file)
                    if first_file:
                        for row in filereader:
                            filewriter.writerow(row)
                        first_file = False
                    else:
                        header = next(filereader)
                        for row in filereader:
                            filewriter.writerow(row)

  14. Hi, thank you for this! I need to tweak this code to print the entire row in the sheets of all my files (I have 20 of them), but based on a keyword. If the column “Company” contains a certain company name, let’s say: ABC Company, then print the entire row for that company only into the output file. Can you please help?

    • Hi Gabriela,

      Assuming the following two CSV files:

      cat file1.csv
      ID,Company,Count
      1,"ABC Company",5
      2,"XYZ Company",7
      3,company3,15
      4,"ABC Company",4
      5,company4,11

      cat file2.csv
      ID,Company,Count
      6,"ABC Company",25
      7,"XYZ Company",7
      8,company3,15
      9,"ABC Company",24
      10,company4,11

      Here’s code that concatenates the rows containing “ABC Company” in the Company column into the output file. You’ll need to change the two instances of row[1] to row[x], where x is the index position of the Company column in your CSV files. In my example CSV files, the Company column is the second column, so its index position is 1. You can then run the script with a command like:

      python concat_specific_rows.py my_folder ABC_Company.csv

      where my_folder is the name of the folder that contains your CSV files and ABC_Company.csv is the CSV output file containing the concatenated data from your input files.

      #!/usr/bin/env python3
      import csv
      import glob
      import os
      import sys

      input_path = sys.argv[1]
      output_file = sys.argv[2]

      first_file = True
      for input_file in glob.glob(os.path.join(input_path, '*.csv')):
          print(os.path.basename(input_file))
          with open(input_file, 'r', newline='') as csv_in_file:
              with open(output_file, 'a', newline='') as csv_out_file:
                  filereader = csv.reader(csv_in_file)
                  filewriter = csv.writer(csv_out_file)
                  if first_file:
                      for row in filereader:
                          company_name = row[1]
                          if company_name == 'ABC Company':
                              filewriter.writerow(row)
                      first_file = False
                  else:
                      header = next(filereader)
                      for row in filereader:
                          company_name = row[1]
                          if company_name == 'ABC Company':
                              filewriter.writerow(row)
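      As a side note, if you'd rather not count index positions at all, csv.DictReader lets you look up the Company column by its header name. A minimal sketch of that idea (filter_company_rows and the sample data here are illustrative, not part of the script above):

```python
import csv
import io

def filter_company_rows(csv_text, company, column='Company'):
    """Return the rows whose `column` value equals `company`.

    Uses csv.DictReader so columns are referenced by header name
    instead of a hard-coded index position.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[column] == company]

sample = 'ID,Company,Count\n1,ABC Company,5\n2,XYZ Company,7\n3,ABC Company,4\n'
matches = filter_company_rows(sample, 'ABC Company')
print(matches)
```

      In a real script you would pass each input file's contents to the function (or build the DictReader directly from the open file object) instead of a string.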
