Python Forum

Full Version: Too big CSV file management
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Hey!

I am pretty new to "pandas" in python and I would like to ask for some help. I don't think it's complicated, I just can't figure it out. I have a huge CSV file (around 2 gigabytes, 4,4 million lines), excel cant open it fully. There is a very small part of it that I need, and everything else could be deleted. I only need the rows where "PUBLIC LIMITED COMPANY" or "PLC" appears as a substring in column A (I need the whole row where it does appear). These could be added to a new csv/excel file or it could be done in a way that everything else gets deleted in this one besides the ones we need. The filename is "AllCompanies.csv".

Thank you for your help!
Here's an outline of the cod you would want:

with open('AllCompanies.csv') as in_file:
    with open('PLCCompanies.csv', 'w') as out_file:
        for line in in_file:
            if line_matches_criteria:
                out_file.write(line)
The for loop will read the file one line at a time, so it doesn't clog your memory.
maybe you can cobvert it to sqlite (-x $'\t' means delimiter tab) with csv-to-sqlite

csv-to-sqlite -x $'\t' -f /path/to/file.csv -o /path/to/file.db
That was so simple I started to wonder how that not came to my mind lol. Thank you, it worked like a charm. Had to add UTF8 encoding to it ,in the end it looked like this:

with open('AllCompanies.csv', encoding="utf-8") as in_file:
    with open('PLCCompanies.csv', 'w', encoding="utf-8") as out_file:
        for line in in_file:
            if "PLC" in str(line) or "PUBLIC LIMITED COMPANY" in str(line):
                out_file.write(line)