 Comparing values in large txt files
#1
Hello

I got a small assignment to write a script that runs through 2 (or 2-4 if possible) txt files which have quite a lot of lines (one of them has 1.3 million lines).

They are log files where each line is:

<value1>;<value2>;<YYYY-MM-DD-HH.MM.SS>;<value3>;<value4.1>,<value4.2>;;;;;;

There can be a hiccup in the logging tool, so there can be a partial duplicate line where value1, value2, date and time, and value3 are the same, but value4.1 and value4.2 differ from the duplicate.

The issue is the number of lines. I could probably make a horribly unoptimized tool with some list that populates line by line, but with this much data I'd like to do it properly.

Ideally I would make a script where I can drag and drop the txt files onto the program (I think with the use of "sys.argv"). It then runs through the files and, should there be a partial duplicate, says something like "duplicate found, txt1 line 21 and txt2 line 402" or something akin to it.

These are 2 different files; one can have 50.000 lines and the other 1.000.000, so I can't do a direct line-by-line comparison, but the values are the same.

I'm not asking for a solution. I would like to learn how to do it, and am just looking for some guidance as to which modules to use.

Thanks
// Steven
#2
Read your file this way. The entire file will not get loaded at once:
with open("myfile.txt") as fp:
    for line in fp:
        process(line)
You can also read the file in blocks using:
fp.read(size)
This is trickier, because you now have to split the blocks into lines yourself, handle lines that are cut off at a block boundary, and take care of the shorter final read, but it can be quite fast when you get it right. If the file is opened in binary mode, read() also returns bytes rather than str, which can pose an additional set of problems.
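
As a rough illustration of the block approach, here is a minimal sketch. The function name iter_lines_in_blocks, the 1 MiB default block size, and the UTF-8 encoding are all assumptions for the example, not something from your files. It opens the file in binary mode, carries the unfinished tail of each block over to the next one, and only decodes complete lines.

```python
def iter_lines_in_blocks(path, block_size=1 << 20, encoding="utf-8"):
    """Yield complete lines from path, reading block_size bytes at a time.

    Hypothetical helper: name, block size and encoding are assumptions.
    """
    leftover = b""
    with open(path, "rb") as fp:
        while True:
            block = fp.read(block_size)
            if not block:
                break  # end of file
            # Prepend the unfinished line left over from the previous block.
            parts = (leftover + block).split(b"\n")
            # The last piece may be an incomplete line; keep it for the next round.
            leftover = parts.pop()
            for raw in parts:
                yield raw.rstrip(b"\r").decode(encoding)
    # Whatever remains is the final line if the file has no trailing newline.
    if leftover:
        yield leftover.rstrip(b"\r").decode(encoding)
```

You would use it like the plain loop, e.g. "for line in iter_lines_in_blocks("myfile.txt"): process(line)". Whether it is actually faster than iterating over the file object depends on your data, so measure before committing to it.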

If you can afford the time, reading line by line is simpler.
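
Reading line by line also fits the partial-duplicate check you describe. One possible way to do it, as a sketch rather than a finished tool: use the first four fields as a dictionary key and remember where each key was seen. The function names split_record and find_partial_duplicates are made up for this example, and it assumes the fifth field holds "value4.1,value4.2" as in your format line. It keeps one entry per line in memory, which is the main cost of this approach.

```python
import sys
from collections import defaultdict


def split_record(line):
    """Return ((value1, value2, timestamp, value3), value4), or None if malformed."""
    fields = line.rstrip("\r\n").split(";")
    if len(fields) < 5:
        return None
    return tuple(fields[:4]), fields[4]


def find_partial_duplicates(paths):
    """Map each key to the (path, line_number, value4) entries that share it."""
    seen = defaultdict(list)
    for path in paths:
        with open(path) as fp:
            for line_number, line in enumerate(fp, start=1):
                record = split_record(line)
                if record is not None:
                    key, value4 = record
                    seen[key].append((path, line_number, value4))
    # Keep only keys whose entries disagree on value4 (the partial duplicates).
    return {
        key: hits
        for key, hits in seen.items()
        if len({value4 for _, _, value4 in hits}) > 1
    }


if __name__ == "__main__":
    # Files dropped onto the script arrive as command-line arguments.
    for key, hits in find_partial_duplicates(sys.argv[1:]).items():
        places = " and ".join(f"{path} line {n}" for path, n, _ in hits)
        print(f"duplicate found, {places}")
```

Lines that are identical in all five fields are not reported here, since only differing value4 entries count as partial duplicates in this sketch. Change the final filter if you want exact duplicates listed as well.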

For dealing with large files in general, you may want to read this: http://effbot.org/zone/wide-finder.htm
especially the 'A Multi-Processor Python Solution' section.
#3
Thank you Larz60+, I will have a look at that.