Python Forum
Iterate 2 large text files across lines and replace lines in second file - Printable Version




Iterate 2 large text files across lines and replace lines in second file - medatib531 - Aug-10-2020

My problem is as follows.
Suppose I have 2 huge (around 10 GB each) text files like these:

file1:
ad
1a
2b
3c
...
file2:
10
0
2b
45
...
What I need to do is iterate over the 2 files simultaneously, line by line (i.e. line 1 of file1 together with line 1 of file2), and conditionally replace that line in file2 (e.g. if the two lines are equal, replace the line in file2 with '0'). Note that this has to be done in place because the files are huge and cannot be loaded into memory. Can someone suggest a way?

My (non-working) code below:
with open("file1", 'r') as fileA, open("file2", 'r+') as fileB:
    for line1, line2 in zip(fileA, fileB):
        if line1 == line2:
            pass  # TODO: replace line2 -> 0



RE: Iterate 2 large text files across lines and replace lines in second file - bowlofred - Aug-10-2020

You can't really do it "in place" unless you never change the length of a line. Otherwise, you have to change the position of every character later in the file.

Instead, open a third file for writing. Conditionally copy every line in the file you want to change into the third file so it's correct. When you're done, rename the file to replace the original one.
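
A minimal sketch of the third-file approach described above, assuming both files are read as bytes and that a matching line is replaced by `0`. The function name `rewrite_via_temp` and the `replacement` parameter are made up for illustration; the temporary file is created in the same directory as the target so the final rename stays on one filesystem.

```python
import os
import tempfile

def rewrite_via_temp(path_a, path_b, replacement=b"0\n"):
    """Copy path_b into a temp file, writing `replacement` wherever a line
    equals the same-numbered line of path_a, then rename over path_b."""
    dir_name = os.path.dirname(os.path.abspath(path_b))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb, \
                os.fdopen(fd, "wb") as out:
            for line_b in fb:
                line_a = fa.readline()  # b"" once path_a runs out of lines
                out.write(replacement if line_a == line_b else line_b)
        os.replace(tmp_path, path_b)  # swap the new file in under the old name
    except BaseException:
        os.unlink(tmp_path)  # do not leave a half-written temp file behind
        raise
```

Only one line from each file is held in memory at a time, and the original file stays intact until the rename at the very end.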


RE: Iterate 2 large text files across lines and replace lines in second file - medatib531 - Aug-10-2020

Writing to a third file would be very costly, as the files are huge.
So if I do not change the length of the line (e.g. replace a line with '00' in this example), how would I do it in place?


RE: Iterate 2 large text files across lines and replace lines in second file - perfringo - Aug-10-2020

Just an untested idea: mmap with readline.
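
A rough sketch of what the mmap idea could look like, assuming the replacement keeps every line the same length (each byte of a matching line is overwritten with a filler byte and the newline is left alone). The function name `blank_matching_lines_mmap` and the `filler` parameter are hypothetical, and mapping a 10 GB file assumes a 64-bit Python.

```python
import mmap

def blank_matching_lines_mmap(path_a, path_b, filler=b"0"):
    """Overwrite each line of path_b that equals the same-numbered line of
    path_a with `filler` bytes, in place and without changing its length."""
    with open(path_a, "rb") as fa, open(path_b, "r+b") as fb:
        # Length 0 maps the whole file; this raises ValueError on an empty file.
        with mmap.mmap(fb.fileno(), 0) as mm:
            for line_a in fa:
                start = mm.tell()        # offset where this line begins
                line_b = mm.readline()
                if not line_b:           # path_b has fewer lines than path_a
                    break
                body = line_b.rstrip(b"\r\n")
                if body and body == line_a.rstrip(b"\r\n"):
                    # Slice assignment must keep the length identical.
                    mm[start:start + len(body)] = filler * len(body)
            mm.flush()
```

`filler` has to be a single byte here, otherwise the slice assignment would change the length and fail.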


RE: Iterate 2 large text files across lines and replace lines in second file - bowlofred - Aug-10-2020

Let's take your example. Your second file looks like this (od output).

0000000    1   0  \n   0  \n   2   b  \n   4   5  \n   .   .   .  \n
You say you want it to look like this afterward:

0000000    1   0  \n   0  \n   0  \n   4   5  \n   .   .   .  \n
See that the shorter third line means you're going to have to rewrite every byte that follows in the file, since you've shifted the positions. You can't rewrite just the line that has changed. That's the advantage of a database with fixed-length fields: they can be rewritten in place cheaply.

Since you're already reading and writing every byte, you might as well use another file. There's no additional I/O cost, just disk space.

You could potentially rewrite every byte in the same file by reading a block and then writing the new block back. But that means the file is inconsistent during the operation. If the program were to crash, you'd have a file that was half old and half new.

If you leave the lines the same length, then it should be possible to do a replacement write. It's uncommon, but we should be able to find an example.
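
To illustrate the fixed-length point, here is a small sketch, assuming every record in the file is exactly `RECORD_SIZE` bytes including its newline. `RECORD_SIZE` and `overwrite_record` are hypothetical names, not something from the original files. With fixed-size records, the offset of record N is simply `N * RECORD_SIZE`, so a single record can be overwritten without touching anything else.

```python
RECORD_SIZE = 3  # assumption: two data bytes plus b"\n" per record

def overwrite_record(path, index, payload):
    """Overwrite record number `index` (0-based) in place with `payload`."""
    if len(payload) != RECORD_SIZE - 1:
        raise ValueError("payload must be exactly RECORD_SIZE - 1 bytes")
    with open(path, "r+b") as f:
        f.seek(index * RECORD_SIZE)  # jump straight to the record
        f.write(payload)             # the trailing newline is left untouched
```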


RE: Iterate 2 large text files across lines and replace lines in second file - medatib531 - Aug-10-2020

Yes, as I said in my previous post, it's OK with me to replace a line with a string of the same length. The files are huge, so the additional disk I/O from the third file would slow things down a lot.
Can you suggest code to replace the string in that line with another one of same length cheaply?


RE: Iterate 2 large text files across lines and replace lines in second file - bowlofred - Aug-10-2020

Yes, just taking a while. The simple way is to do a text read, but this is actually quite fragile because the length of a character in bytes is not fixed. If the file contains any non-ASCII characters, this will break: you can end up with unchanged data, or even invalid UTF-8 that can't be read as text. To be safe, you'd need to change this to a binary read and bytes operations rather than str operations. But this at least shows the principle.

modify.txt before running:
Output:
line1
line2
line3
line4
with open("modify.txt", "r+") as f:
    line_start = f.tell()
    line = f.readline()
    while line:
        if "3" in line:  # replace any line with a 3 in it
            print(f"found line3 starting on offset {line_start}")
            # Overwrite only the text of the line, never its newline,
            # so the line keeps exactly the same length.
            replace = "X" * len(line.rstrip("\n"))  # change to whatever you want to replace with
            f.seek(line_start)
            f.write(replace)
        line_start = f.tell()
        line = f.readline()
After running:
Output:
line1
line2
XXXXX
line4
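
For the binary variant mentioned above, a sketch along these lines might do, assuming a single-byte filler. The function name `overwrite_lines_containing` and its parameters are made up for illustration. Working in bytes sidesteps the character-length problem: in UTF-8, "é" is one character but two bytes, so offsets only line up reliably when everything is counted in bytes.

```python
def overwrite_lines_containing(path, needle=b"3", filler=b"X"):
    """Overwrite, in place, every line containing `needle` with `filler`
    bytes, leaving the line ending alone. Returns the offsets changed."""
    offsets = []
    with open(path, "r+b") as f:
        while True:
            line_start = f.tell()
            line = f.readline()
            if not line:
                break
            if needle in line:
                body_len = len(line.rstrip(b"\r\n"))  # bytes before the newline
                f.seek(line_start)
                f.write(filler * body_len)
                f.seek(line_start + len(line))  # continue at the next line
                offsets.append(line_start)
    return offsets
```

Run against the modify.txt from the example, this should change only the third line, just as the text-mode version does.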



RE: Iterate 2 large text files across lines and replace lines in second file - medatib531 - Aug-10-2020

So for my original example, I did the following:

with open("file1", 'r') as fileA, open("file2", 'r+') as fileB:
    line_start = fileB.tell()
    for line1, line2 in zip(fileA, fileB):
        if line1 == line2:
            replace = "00"
            fileB.seek(line_start)
            fileB.write(replace)
However, this does not work: it only appends two zeroes at the end of the file.
Also, if I invoke fileB.tell() within the for loop, I get an error:
OSError: telling position disabled by next() call
...


RE: Iterate 2 large text files across lines and replace lines in second file - bowlofred - Aug-10-2020

Yes, the for loop uses the filehandle as an iterator, and that causes problems for tell(). That's why I had to write my loop with explicit readline() calls instead.
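
Applied to the two-file case, a sketch could look like the following, assuming binary mode and a same-length replacement (a matching "2b" becomes "00"). The function name `zero_equal_lines` and the `filler` parameter are hypothetical. Both files are read with explicit readline() calls, so tell() keeps working on the file being modified.

```python
def zero_equal_lines(path_a, path_b, filler=b"0"):
    """Overwrite, in place, each line of path_b that equals the
    same-numbered line of path_a. Returns the number of lines changed."""
    changed = 0
    with open(path_a, "rb") as fa, open(path_b, "r+b") as fb:
        while True:
            line_start = fb.tell()   # remember where this line of path_b begins
            line_a = fa.readline()
            line_b = fb.readline()
            if not line_a or not line_b:  # stop at the end of the shorter file
                break
            body = line_b.rstrip(b"\r\n")
            if body and body == line_a.rstrip(b"\r\n"):
                fb.seek(line_start)
                fb.write(filler * len(body))       # same length, newline untouched
                fb.seek(line_start + len(line_b))  # resume at the next line
                changed += 1
    return changed
```

The seek back to `line_start + len(line_b)` after each write keeps the two files in step, which the zip() version could not do.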


RE: Iterate 2 large text files across lines and replace lines in second file - medatib531 - Aug-10-2020

Hmm so I'm confused. How would I do it?