Python Forum
Comparing values in large txt files

Comparing values in large txt files - StevenVF - Feb-26-2019

Hello

I have a small assignment: write a script that runs through 2 txt files (or 2-4 if possible), each with quite a lot of lines (one of them has 1.3 million).

They are log files where each line looks like this:

<value1>;<value2>;<YYYY-MM-DD-HH.MM.SS>;<value3>;<value4.1>,<value4.2>;;;;;;

The logging tool can hiccup and produce a partial duplicate line, where value1, value2, the date and time, and value3 are the same, but value4.1 and value4.2 differ from the original.
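
To make the format concrete, here is a minimal sketch of splitting one line into the part that must match (the first four fields) and the part that may differ (the value4 field). The function name split_record and the sample values are made up for illustration; only the field layout comes from the description above.

```python
def split_record(line):
    """Split one log line into (key, payload).

    key     -> (value1, value2, timestamp, value3)
    payload -> the "value4.1,value4.2" field as a single string
    """
    fields = line.rstrip("\n").split(";")
    if len(fields) < 5:
        raise ValueError(f"unexpected line format: {line!r}")
    key = tuple(fields[:4])
    payload = fields[4]
    return key, payload


# Made-up sample lines, only to show the idea.
a = "A1;B2;2019-02-26-10.15.30;C3;7,8;;;;;;\n"
b = "A1;B2;2019-02-26-10.15.30;C3;7,9;;;;;;\n"

key_a, payload_a = split_record(a)
key_b, payload_b = split_record(b)

# A partial duplicate: same key, different payload.
print(key_a == key_b and payload_a != payload_b)
```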

The issue is the number of lines. I could probably make a horribly unoptimized tool with a list that gets populated line by line, but with this much data I'd like to do it properly.

Ideally I would make a script where I can drag and drop the txt files onto the program (I think using "sys.argv"). It then runs through the files and, should there be a partial duplicate, reports something like "duplicate found, txt1 line 21 and txt2 line 402".

These are 2 different files; one can have 50,000 lines and the other 1,000,000, so I can't do a direct line-by-line comparison, even though the values in them are the same.

I'm not asking for a solution; I would like to learn how to do it, and am just looking for some guidance on which modules to use.

Thanks
// Steven


RE: Comparing values in large txt files - Larz60+ - Feb-26-2019

Read your file this way. The entire file will not get loaded at once:
with open("myfile.txt") as fp:
    for line in fp:
        process(line)
You can also read the file in blocks using:
fp.read(size)
This is trickier, as you now have to split the blocks into lines yourself and take care of the odd-sized last read,
but it can be quite fast when you get it right. If the file is opened in binary mode, read returns bytes
rather than str, which can pose an additional set of problems.
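
A rough sketch of block reading in text mode, where read(size) counts characters rather than bytes. The generator name iter_lines_chunked and the default block size are arbitrary choices for illustration. The piece after the last newline in each block is carried over, because it may be an incomplete line.

```python
def iter_lines_chunked(path, chunk_size=1024 * 1024):
    """Yield lines (without the trailing newline) by reading fixed-size blocks."""
    leftover = ""
    with open(path) as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                break
            lines = (leftover + chunk).split("\n")
            # The last piece may be an incomplete line; keep it for the next block.
            leftover = lines.pop()
            yield from lines
    if leftover:
        # The final line had no trailing newline.
        yield leftover
```

Whether this beats plain line-by-line iteration depends on the data and the machine, so it is worth timing both before committing to the more complicated version.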

If you can afford the time, reading line by line is simpler.
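
Applied to the question, line-by-line reading could be combined with a dict keyed on the first four fields, so each file is read only once and no line-by-line pairing between files is needed. This is only a sketch under some assumptions: the function name find_partial_duplicates is made up, the field layout is taken from the first post, and it assumes the keys for all lines fit in memory. It also reports partial duplicates inside a single file, not only across files.

```python
import sys
from collections import defaultdict


def find_partial_duplicates(paths):
    """Return (key, hits) for records that share a key but differ in value4."""
    seen = defaultdict(list)  # key -> [(path, line_number, payload), ...]
    for path in paths:
        with open(path) as fp:
            for line_number, line in enumerate(fp, start=1):
                fields = line.rstrip("\n").split(";")
                if len(fields) < 5:
                    continue  # skip blank or malformed lines
                key = tuple(fields[:4])  # value1, value2, timestamp, value3
                seen[key].append((path, line_number, fields[4]))

    duplicates = []
    for key, hits in seen.items():
        payloads = {payload for _, _, payload in hits}
        if len(payloads) > 1:  # same key, different value4 field
            duplicates.append((key, hits))
    return duplicates


if __name__ == "__main__":
    # Files dropped onto the script (or given on the command line) arrive in sys.argv[1:].
    for key, hits in find_partial_duplicates(sys.argv[1:]):
        places = " and ".join(f"{path} line {n}" for path, n, _ in hits)
        print(f"duplicate found, {places}")
```

If memory turns out to be a problem, one variation would be to store only the keys of the smaller file and stream the larger one against them.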

For dealing with large files in general, you may want to read this: http://effbot.org/zone/wide-finder.htm
especially the 'A Multi-Processor Python Solution' section.


RE: Comparing values in large txt files - StevenVF - Feb-28-2019

Thank you Larz60+, I will have a look at that.