Feb-26-2019, 09:28 AM
Hello
I got a small assignment to make a script that runs through 2 (or, ideally, 2-4) txt files, each with quite a lot of lines (one of them has 1.3 million lines).
They are log files where each line has the form:
<value1>;<value2>;<YYYY-MM-DD-HH.MM.SS>;<value3>;<value4.1>,<value4.2>;;;;;;
The logging tool can hiccup, producing a partial duplicate line where value1, value2, the date and time, and value3 are the same, but value4.1 and value4.2 differ from the original.
The issue is the number of lines. I could probably make a horribly unoptimized tool with a list that populates line by line, but with this much data I'd like to do it properly.
Ideally I would make a script where I can drag and drop the txt files onto the program (I think using sys.argv); it then runs through the files, and if there is a partial duplicate it reports something like "duplicate found, txt1 line 21 and txt2 line 402".
The two files are different: one can have 50,000 lines and the other 1,000,000, so I can't do a direct line-by-line comparison, but the values are the same.
I'm not asking for a full solution, since I'd like to learn how to do it myself; I'm just looking for guidance on which modules to use.
Thanks
// Steven
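For concreteness, here is a minimal sketch of the kind of approach I'm imagining: stream each file line by line and keep a dict keyed on the first four semicolon-separated fields, so value4.1/value4.2 are ignored when checking for partial duplicates. All names here are illustrative, and it assumes the exact line format shown above:

```python
import sys

def key_for(line):
    # The first four ;-separated fields (value1, value2, timestamp, value3)
    # define a partial duplicate; the value4.x fields are deliberately ignored.
    parts = line.rstrip("\n").split(";")
    return tuple(parts[:4])

def find_partial_duplicates(paths):
    # Maps each key to the (filename, line number) of its first occurrence,
    # so memory use is one dict entry per unique key, not per file pair.
    seen = {}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for lineno, line in enumerate(f, start=1):
                if not line.strip():
                    continue  # skip blank lines
                k = key_for(line)
                if k in seen:
                    first_file, first_line = seen[k]
                    print(f"duplicate found: {first_file} line {first_line} "
                          f"and {path} line {lineno}")
                else:
                    seen[k] = (path, lineno)

if __name__ == "__main__":
    # Files dragged onto the script arrive as command-line arguments.
    find_partial_duplicates(sys.argv[1:])
```

Because the files are read one line at a time, even a 1.3-million-line file never needs to be held in memory in full; only the dict of keys grows.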