Python Forum
Comparing values in large txt files
#1
Hello

I got a small assignment to write a script that runs through 2 (or 2-4 if possible) txt files that have quite a lot of lines (one of them has 1.3 million lines).

They are log files where each line is:

<value1>;<value2>;<YYYY-MM-DD-HH.MM.SS>;<value3>;<value4.1>,<value4.2>;;;;;;

There can be a hiccup in the logging tool, so there can be a partial duplicate line where value1, value2, the date and time, and value3 are the same, but value4.1 and value4.2 differ from the duplicate.

The issue is the number of lines. I could probably make a horribly unoptimized tool with some list that is populated line by line, but with this much data I'd like to do it properly.

Ideally I would make a script where I can drag and drop the txt files onto the program (I think with the use of "sys.argv"). It then runs through the files and, should there be a partial duplicate, says something like "duplicate found, txt1 line 21 and txt2 line 402" or something akin to it.

These are 2 different files; one can have 50.000 lines and the other 1.000.000, so I can't do a direct line-by-line comparison, but the values are the same.

I'm not asking for a solution. I would like to learn how to do it, so I'm just looking for some guidance as to which modules to use.

Thanks
// Steven
#2
Read your file this way. The entire file will not get loaded at once:
with open("myfile.txt") as fp:
    for line in fp:
        process(line)
You can also read the file in blocks using:
fp.read(size)
This is trickier, as you now have to split each block into lines yourself and handle both a partial line at the end of a block and a short final read,
but it can be quite fast when you get it right. If the file is opened in binary mode, read() also returns bytes rather than str, which can pose
an additional set of problems.
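
As a rough sketch of what block reading involves, the generator below reads fixed-size blocks in binary mode and carries any partial line over to the next block. The name iter_lines_in_blocks, the 1 MiB default block size, and the assumption that the logs are UTF-8 with "\n" or "\r\n" line endings are all illustrative choices, not something taken from the files in question.

```python
def iter_lines_in_blocks(path, block_size=1 << 20):
    """Yield complete text lines from path, reading block_size bytes at a time.

    Hypothetical helper: assumes UTF-8 text with "\n" or "\r\n" line endings.
    """
    leftover = b""  # partial line carried over from the previous block
    with open(path, "rb") as fp:
        while True:
            block = fp.read(block_size)
            if not block:
                break  # end of file
            parts = (leftover + block).split(b"\n")
            # The last piece has no newline yet, so it may be incomplete.
            leftover = parts.pop()
            for raw in parts:
                yield raw.decode("utf-8").rstrip("\r")
    if leftover:
        # Final line of a file that does not end with a newline.
        yield leftover.decode("utf-8").rstrip("\r")
```

Splitting on the newline byte before decoding avoids cutting a multi-byte character in half at a block boundary. Whether this is actually faster than plain line iteration would have to be measured on the real files.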

If you can afford the time, reading line by line is simpler.
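
For the partial-duplicate check itself, one possible approach (a sketch, not the only way) is to use the first four fields of each line as a dictionary key and remember where each key was seen. The function name find_partial_duplicates, the UTF-8 encoding, and the decision to skip lines with fewer than five fields are assumptions made for illustration. It also assumes all keys fit in memory at once.

```python
import sys
from collections import defaultdict


def find_partial_duplicates(paths):
    """Return keys that occur with more than one distinct value4 field.

    Key: (value1, value2, timestamp, value3), the first four ';' fields.
    Result: {key: [(path, line_number, value4_field), ...]}
    """
    seen = defaultdict(list)
    for path in paths:
        with open(path, encoding="utf-8") as fp:  # encoding is an assumption
            for line_number, line in enumerate(fp, start=1):
                fields = line.rstrip("\n").split(";")
                if len(fields) < 5:
                    continue  # blank or malformed line, skipped in this sketch
                key = tuple(fields[:4])
                # fields[4] holds "<value4.1>,<value4.2>" as one string
                seen[key].append((path, line_number, fields[4]))
    # Partial duplicate: same key, but the value4 field differs between hits.
    return {
        key: hits
        for key, hits in seen.items()
        if len({value4 for _, _, value4 in hits}) > 1
    }


if __name__ == "__main__":
    # Files dropped onto the script arrive as command-line arguments.
    for key, hits in find_partial_duplicates(sys.argv[1:]).items():
        where = " and ".join(f"{path} line {n}" for path, n, _ in hits)
        print(f"duplicate found: {where}")
```

A dictionary lookup takes roughly constant time per line, so the files are each read once and never compared line against line. Only the built-in sys and collections modules are needed.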

For dealing with large files in general, you may want to read this: http://effbot.org/zone/wide-finder.htm
especially the 'A Multi-Processor Python Solution' section.
#3
Thank you Larz60+, I will have a look at that.