Python Forum
Split csv file based on column value
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Split csv file based on column value
#1
Hello Python experts,

I have very large csv file (millions of rows) that I need to split into about 300 files based on a column with names.
I have search for an solution and did find this script below, but as a rather new user of Python can't get it to work properly.
My column is string based and the script was looking for integer, I did change "askinteger" it to "askstring" and "column = string" but that did not help. Can it be that the file I'm using is ";" semicolon separated and not comma separated?
I can choose file and directory to save the file as well as the column the script should work with but there is no result at all no files... what wrong and what is the needed to changes for it to work?

Thanks

#!/usr/bin/env python3

import binascii
import csv
import os.path
import sys
from tkinter.filedialog import askopenfilename, askdirectory
from tkinter.simpledialog import askinteger
 
def split_csv_file(f, dst_dir, keyfunc):
    csv_reader = csv.reader(f)
    csv_writers = {}
    for row in csv_reader:
        k = keyfunc(row)
        if k not in csv_writers:
            csv_writers[k] = csv.writer(open(os.path.join(dst_dir, k), mode='w', newline=''))

        csv_writers[k].writerow(row)

def get_args_from_cli():
    input_filename = sys.argv[1]
    column = int(sys.argv[2])
    dst_dir = sys.argv[3]
    return (input_filename, column, dst_dir)

def get_args_from_gui():
    input_filename = askopenfilename(
        filetypes=(('CSV', '.csv'),),
        title='Select CSV Input File')
    column = askinteger('Choose Table Column', 'Table column')
    dst_dir = askdirectory(title='Select Destination Directory')
    return (input_filename, column, dst_dir)

if __name__ == '__main__':
    if len(sys.argv) == 1:
        input_filename, column, dst_dir = get_args_from_gui()
    elif len(sys.argv) == 4:
        input_filename, column, dst_dir = get_args_from_cli()
    else:
        raise Exception("Invalid number of arguments")
    with open(input_filename, mode='r', newline='') as f:
        split_csv_file(f, dst_dir, lambda r: r[column-1]+'.csv')

        # if the column has funky values resulting in invalid filenames
        # replace the line from above with:
        # split_csv_file(f, dst_dir, lambda r: binascii.b2a_hex(r[column-1].encode('utf-8')).decode('utf-8')+'.csv')
Reply
#2
I suspect the problem is on line 16. Opening the file in write mode will cause each successive row to overwrite the previous one. So, each file would only have one row if it was outputting anything. Plus, "w" cannot create a file as I recall. Trying opening the files with "a+" mode for append and create.
Reply
#3
I did change the mode to a+ as suggested below, but still there is no output. All seems to work OK when it comes to select file and where to save it but there is no output at all...
Thanks

if k not in csv_writers:
            csv_writers[k] = csv.writer(open(os.path.join(dst_dir, k), mode='a+', newline=''))
and also change this...
with open(input_filename, mode='a+', newline='') as f:
        split_csv_file(f, dst_dir, lambda r: r[column-1]+'.csv')
Reply
#4
You also mentioned the CSV is separated by semi-colons. That would cause a parsing issue with the default csv.Reader. Per the documentation, change line 11 to:

csv_reader = csv.reader(f, delimiter=";")
Reply
#5
Thanks, did add the delimiter but sorry to say, there is still no output, seems the process dies after I have pointed to the folder to save the output. I do run the script via the command-line
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Copy Paste excel files based on the first letters of the file name Viento 2 406 Feb-07-2024, 12:24 PM
Last Post: Viento
  Help copying a column from a csv to another file with some extras g0nz0uk 3 455 Feb-01-2024, 03:12 PM
Last Post: DeaD_EyE
  How to "tee" (=split) output to screen and into file? pstein 6 1,351 Jun-24-2023, 08:00 AM
Last Post: Gribouillis
  Split pdf in pypdf based upon file regex standenman 1 2,060 Feb-03-2023, 12:01 PM
Last Post: SpongeB0B
  How to read csv file update matplotlib column chart regularly SamLiu 2 1,053 Jan-21-2023, 11:33 PM
Last Post: SamLiu
  create new column based on condition arvin 12 2,216 Dec-13-2022, 04:53 PM
Last Post: jefsummers
  How to assign a value to pandas dataframe column rows based on a condition klllmmm 0 822 Sep-08-2022, 06:32 AM
Last Post: klllmmm
  Read xml column inside csv file with Python estertabita 2 1,349 Jul-26-2022, 06:09 PM
Last Post: Larz60+
  Python Split json into separate json based on node value CzarR 1 5,565 Jul-08-2022, 07:55 PM
Last Post: Larz60+
  Openpyxl-change value of cells in column based on value that currently occupies cells phillipaj1391 5 9,740 Mar-30-2022, 11:05 PM
Last Post: Pedroski55

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020