Remove extra lines from .csv file in comparison with another
Dear all,

First of all, thanks for your help. I am really very new at this, so I don't even know if I'll be able to do it.

I have two .csv files containing a LOT of data, where each line is built like:

row1, row2, row3, row4

My problem is that one of those files contains extra lines that the other does not, and I wish to remove them.

For instance, the first looks like:
1, 1, 1, xx,
2, 2, 2, yy,
3, 3, 3, ab,
4, 4, 4, cd,

and the second looks like:

1, 1, 1, xx,
2, 2, 2, yy,
3, 3, 3, ab,
3.5, 3.5, 3.5, fg,
4, 4, 4, cd,

And I want to find an easy way to remove that line by comparing the two files. I tried with Excel, but it takes a seriously long time and eats up all the memory on my computer. After all, we're talking about roughly 700 thousand lines.

This is what I tried:

I merged the two files and wrote the script below simply by going through the forums,
but it's giving lots of errors; for starters, it does not recognise the quote signs.
import csv
from sys import argv

# usage: python <this script> <merged csv> <output csv>
_, in_path, out_path = argv

seen = set()

# Note the plain ASCII quotes here; the curly “r”/“w” quotes pasted in
# from a word processor are what Python refuses to recognise.
with open(in_path, "r", newline="") as fin, open(out_path, "w", newline="") as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for row in reader:
        key = tuple(row)      # use the whole row as the dedup key
        if key in seen:
            continue          # duplicate row: skip it
        writer.writerow(row)
        seen.add(key)         # sets use .add(), not .append()
can anyone please help?
The easy way to do it, with smallish files, is to do what you're doing.  Load one of the files completely into memory so you know what to ignore, and then process the other file line by line.  As long as the larger file is smaller than however much RAM you have, that should be fine (though it may take a while to run).
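
A minimal sketch of that approach (the file names are placeholders, and I'm assuming the file without the extra lines fits in memory as a set of row-tuples):

import csv

# Load every row of the file WITHOUT the extra lines; these are the
# rows we want to keep when filtering the other file.
with open("small.csv", newline="") as f:
    keep = {tuple(row) for row in csv.reader(f)}

# Stream the larger file and write out only the rows the smaller file has.
with open("big.csv", newline="") as fin, \
        open("cleaned.csv", "w", newline="") as fout:
    writer = csv.writer(fout)
    for row in csv.reader(fin):
        if tuple(row) in keep:
            writer.writerow(row)

A Python set gives constant-time lookups, so even with ~700 thousand rows this should finish in seconds rather than hours.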

Another way to do it, which would take an exceptionally long time but would work if you don't have enough RAM to load the entire file, would be to go line by line through the file containing the things you want to ignore, for EACH line in the other file.  You'd only ever have two lines in memory at a time, but it would also take hours to finish.  You almost definitely don't want to do this.

ANOTHER way to do it would be to create a small SQLite database, insert all the rows of each csv into separate tables, and then run a single query to let the database engine handle pruning the duplicates.  Something like:
insert into output_table
select * from infile2 as in2
where not exists (
    select 1
    from infile1 as in1
    where in1.col1 = in2.col1
        and in1.col2 = in2.col2
        and in1.col3 = in2.col3
        and in1.col4 = in2.col4
)
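
To round that out, here's a minimal end-to-end sketch using Python's sqlite3 module. The table names match the query above; the file names are placeholders, and I'm assuming every row has exactly the four columns from your example:

import csv
import sqlite3

conn = sqlite3.connect(":memory:")   # use a file path instead for huge data
cur = conn.cursor()

# Load each csv into its own table.
for table, path in (("infile1", "file1.csv"), ("infile2", "file2.csv")):
    cur.execute(f"create table {table} (col1, col2, col3, col4)")
    with open(path, newline="") as f:
        cur.executemany(f"insert into {table} values (?, ?, ?, ?)",
                        csv.reader(f))

# The query above: keep the rows of infile2 that have no match in infile1.
cur.execute("create table output_table (col1, col2, col3, col4)")
cur.execute("""
    insert into output_table
    select * from infile2 as in2
    where not exists (
        select 1 from infile1 as in1
        where in1.col1 = in2.col1
          and in1.col2 = in2.col2
          and in1.col3 = in2.col3
          and in1.col4 = in2.col4
    )
""")

# Dump the result back out to csv.
with open("output.csv", "w", newline="") as f:
    csv.writer(f).writerows(cur.execute("select * from output_table"))
conn.close()

The nice part of this route is that the dataset never has to fit in RAM if you point sqlite3.connect() at a file on disk instead of ":memory:".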