Python Forum

Full Version: Split csv file based on column value
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Hello Python experts,

I have very large csv file (millions of rows) that I need to split into about 300 files based on a column with names.
I have search for an solution and did find this script below, but as a rather new user of Python can't get it to work properly.
My column is string based and the script was looking for integer, I did change "askinteger" it to "askstring" and "column = string" but that did not help. Can it be that the file I'm using is ";" semicolon separated and not comma separated?
I can choose file and directory to save the file as well as the column the script should work with but there is no result at all no files... what wrong and what is the needed to changes for it to work?

Thanks

#!/usr/bin/env python3

import binascii
import csv
import os.path
import sys
from tkinter.filedialog import askopenfilename, askdirectory
from tkinter.simpledialog import askinteger
 
def split_csv_file(f, dst_dir, keyfunc):
    csv_reader = csv.reader(f)
    csv_writers = {}
    for row in csv_reader:
        k = keyfunc(row)
        if k not in csv_writers:
            csv_writers[k] = csv.writer(open(os.path.join(dst_dir, k), mode='w', newline=''))

        csv_writers[k].writerow(row)

def get_args_from_cli():
    input_filename = sys.argv[1]
    column = int(sys.argv[2])
    dst_dir = sys.argv[3]
    return (input_filename, column, dst_dir)

def get_args_from_gui():
    input_filename = askopenfilename(
        filetypes=(('CSV', '.csv'),),
        title='Select CSV Input File')
    column = askinteger('Choose Table Column', 'Table column')
    dst_dir = askdirectory(title='Select Destination Directory')
    return (input_filename, column, dst_dir)

if __name__ == '__main__':
    if len(sys.argv) == 1:
        input_filename, column, dst_dir = get_args_from_gui()
    elif len(sys.argv) == 4:
        input_filename, column, dst_dir = get_args_from_cli()
    else:
        raise Exception("Invalid number of arguments")
    with open(input_filename, mode='r', newline='') as f:
        split_csv_file(f, dst_dir, lambda r: r[column-1]+'.csv')

        # if the column has funky values resulting in invalid filenames
        # replace the line from above with:
        # split_csv_file(f, dst_dir, lambda r: binascii.b2a_hex(r[column-1].encode('utf-8')).decode('utf-8')+'.csv')
I suspect the problem is on line 16. Opening the file in write mode will cause each successive row to overwrite the previous one. So, each file would only have one row if it was outputting anything. Plus, "w" cannot create a file as I recall. Trying opening the files with "a+" mode for append and create.
I did change the mode to a+ as suggested below, but still there is no output. All seems to work OK when it comes to select file and where to save it but there is no output at all...
Thanks

if k not in csv_writers:
            csv_writers[k] = csv.writer(open(os.path.join(dst_dir, k), mode='a+', newline=''))
and also change this...
with open(input_filename, mode='a+', newline='') as f:
        split_csv_file(f, dst_dir, lambda r: r[column-1]+'.csv')
You also mentioned the CSV is separated by semi-colons. That would cause a parsing issue with the default csv.Reader. Per the documentation, change line 11 to:

csv_reader = csv.reader(f, delimiter=";")
Thanks, did add the delimiter but sorry to say, there is still no output, seems the process dies after I have pointed to the folder to save the output. I do run the script via the command-line