Python Forum
Remove extra lines from .csv file in comparison with another
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Remove extra lines from .csv file in comparison with another
#1
Dear all,

First of all thanks for your help, I am really very new at this so I don't even know if I'll be able to do it.

I have two .csv files containing a LOT of data, which are built like:

row1, row2, row3, row4

my problem is that one of those files contains extra lines from the other, and I wish to remove them

for instance the first looks like
1, 1, 1, xx,
2, 2, 2, yy,
3, 3, 3, ab,
4, 4, 4, cd,

and the second looks like,

1, 1, 1, xx,
2, 2, 2, yy,
3, 3, 3, ab,
3.5, 3.5 ,3.5, fg
4, 4, 4, cd,

And I want to find an easy way to remove that line by comparison between the two files. I tried with excel but it takes a seriously long time, and eats up all the memory from my computer. After all we're talking about roughly 700 thousand lines.

This is what I tried:

I merged the two files, and wrote the script below, by simply going through the forums.
but it's giving lots of errors, for starters it does not recognise the " signs"
import sys, io
import gzip, zipfile
import csv, sqlite3

from sys import argv

_, input, output = argv

inFile = csv.reader(open(input, “r”))

outFile = csv.writer(open(output, “w”))



listLines = set()

for row in inFile:
   key = (row[0])
   if key in listLines:
       continue

   else:
       outFile.writerow(row)
       listLines.append(row)

outFile.close()

inFile.close()
can anyone please help?
Reply
#2
The easy way to do it, with smallish files, is to do what you're doing.  Load one of the files completely in memory so you know what to ignore, and then process the other file line by line.  As long as the larger file is less than however much ram you have, that should be fine (though it may take a while to run).

Another way to do it, which would take an exceptionally long time but would work if you don't have enough ram to load the entire file, would be to go line-by-line through the file containing things you want to ignore, for EACH line in the other file.  You'd only ever have two lines in memory at a time, but also it'd take hours to finish. You almost definitely don't want to do this.

ANOTHER way to do it, would be to create a small sqlite database, and insert all the rows of each csv into different tables, and then run a single query to let the database engine handle pruning duplicates.  Something like:
insert into output_table
select * from infile2 as in2
where not exists (
    select 1
    from infile1 as in1
    where in1.col1 = in2.col1
        and in1.col2 = in2.col2
        and in1.col3 = in2.col3
        and in1.col4 = in2.col4
)
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
Photo String comparison in a csv file in Python Pandas fleafy 2 1,168 Nov-18-2022, 09:38 PM
Last Post: fleafy
  Delete multiple lines from txt file Lky 6 2,320 Jul-10-2022, 12:09 PM
Last Post: jefsummers
  failing to print not matched lines from second file tester_V 14 6,127 Apr-05-2022, 11:56 AM
Last Post: codinglearner
  Extracting Specific Lines from text file based on content. jokerfmj 8 3,056 Mar-28-2022, 03:38 PM
Last Post: snippsat
  Importing a function from another file runs the old lines also dedesssse 6 2,577 Jul-06-2021, 07:04 PM
Last Post: deanhystad
  [Solved] Trying to read specific lines from a file Laplace12 7 3,570 Jun-21-2021, 11:15 AM
Last Post: Laplace12
  all i want to do is count the lines in each file Skaperen 13 4,847 May-23-2021, 11:24 PM
Last Post: Skaperen
  More elegant way to remove time from text lines. Pedroski55 6 3,961 Apr-25-2021, 03:18 PM
Last Post: perfringo
  Remove single and double quotes from a csv file in 3 to 4 column shantanu97 0 7,001 Mar-31-2021, 10:52 AM
Last Post: shantanu97
  Remove Blank Lines from docx table and paragraphs bsudhirk001 1 3,726 Feb-14-2021, 12:38 AM
Last Post: Larz60+

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020