Jun-08-2017, 04:09 PM
The easy way to do it, with smallish files, is to do what you're doing. Load one of the files completely into memory so you know what to ignore, and then process the other file line by line. As long as the file you hold in memory fits in however much RAM you have, that should be fine (though it may take a while to run).
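A rough sketch of that, assuming two CSVs called ignore.csv and data.csv (names made up) and that whole rows should match exactly:

import csv

with open("ignore.csv", newline="") as f:
    # Load every row of the smaller file into a set of tuples for fast lookups.
    ignore = {tuple(row) for row in csv.reader(f)}

with open("data.csv", newline="") as src, open("output.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # Keep only rows that never appear in the ignore file.
        if tuple(row) not in ignore:
            writer.writerow(row)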
Another way to do it, which would take an exceptionally long time but would work if you don't have enough ram to load the entire file, would be to go line-by-line through the file containing things you want to ignore, for EACH line in the other file. You'd only ever have two lines in memory at a time, but also it'd take hours to finish. You almost definitely don't want to do this.
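For completeness, that nested approach would look roughly like this (same placeholder file names). Only two rows are ever in memory, but the ignore file gets re-read once per row of the other file, which is why it's so slow:

import csv

with open("data.csv", newline="") as src, open("output.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # Re-scan the entire ignore file for every single row. Very slow.
        with open("ignore.csv", newline="") as ignore_file:
            if not any(row == ignore_row for ignore_row in csv.reader(ignore_file)):
                writer.writerow(row)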
ANOTHER way to do it would be to create a small SQLite database, insert all the rows of each csv into separate tables, and then run a single query to let the database engine handle pruning the duplicates. Something like:
insert into output_table
select * from infile2 as in2
where not exists (
    select 1 from infile1 as in1
    where in1.col1 = in2.col1
      and in1.col2 = in2.col2
      and in1.col3 = in2.col3
      and in1.col4 = in2.col4
)
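If you wanted to drive that from Python, a rough sketch with the standard sqlite3 and csv modules might look like this. It assumes both CSVs have the same four columns and no header row; the table, column, and file names are all made up, so adjust to match your data:

import csv
import sqlite3

con = sqlite3.connect(":memory:")  # or a file path, if the data is big
con.execute("create table infile1 (col1, col2, col3, col4)")
con.execute("create table infile2 (col1, col2, col3, col4)")
con.execute("create table output_table (col1, col2, col3, col4)")

# Bulk-load each csv into its own table.
for table, path in (("infile1", "ignore.csv"), ("infile2", "data.csv")):
    with open(path, newline="") as f:
        con.executemany(f"insert into {table} values (?, ?, ?, ?)", csv.reader(f))

# Let the database engine do the pruning.
con.execute("""
    insert into output_table
    select * from infile2 as in2
    where not exists (
        select 1 from infile1 as in1
        where in1.col1 = in2.col1 and in1.col2 = in2.col2
          and in1.col3 = in2.col3 and in1.col4 = in2.col4
    )
""")
con.commit()

From there you can either query output_table directly or dump it back out to a csv.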