Python Forum
Iterate 2 large text files across lines and replace lines in second file
#1
My problem is as follows:
Suppose I have two huge (around 10 GB) text files:

file1:
ad
1a
2b
3c
...
file2:
10
0
2b
45
...
What I need to do is iterate over the two files simultaneously, line by line (i.e. line 1 of file1 together with line 1 of file2), and conditionally replace that line in file2 (e.g. if the lines are equal, replace the line in file2 with '0'). Note that this has to be done in place because the files are huge and cannot be loaded into memory. Can someone suggest a way?

My (non-working) code below:
with open("file1", 'r') as fileA, open("file2", 'r+') as fileB:
    for line1, line2 in zip(fileA, fileB):
        if line1 == line2:
            pass  # TODO: replace line2 with '0' in file2
Reply
#2
You can't really do it "in place" unless you never change the length of a line. Otherwise, you have to change the position of every character later in the file.

Instead, open a third file for writing. Conditionally copy every line in the file you want to change into the third file so it's correct. When you're done, rename the file to replace the original one.
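
A minimal sketch of the third-file approach, assuming both files are read as bytes and the replacement line is `0`. The function name `rewrite_with_temp` and its parameters are made up for illustration. Only one line from each file is held in memory at a time, and `os.replace` swaps the finished copy over the original.

```python
import os
import tempfile


def rewrite_with_temp(path_a, path_b, replacement=b"0\n"):
    """Copy path_b to a temporary file, writing `replacement` wherever the
    same-numbered lines of path_a and path_b are equal, then rename the
    temporary file over path_b. (Hypothetical helper, not from the thread.)"""
    # Keep the temporary file in the same directory as path_b so that the
    # final rename does not have to cross filesystems.
    target_dir = os.path.dirname(os.path.abspath(path_b))
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b, \
            tempfile.NamedTemporaryFile("wb", dir=target_dir, delete=False) as out:
        for line_b in file_b:
            line_a = file_a.readline()  # b"" once file_a is exhausted
            out.write(replacement if line_a == line_b else line_b)
    # Reached only if the copy finished without an exception.
    os.replace(out.name, path_b)
```

If an exception interrupts the copy, this sketch leaves the temporary file behind and the original file2 untouched; a real script would probably want to clean that up.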
Reply
#3
Writing to a third file would be very costly as the files are huge.
So if I do not change the length of the line (e.g. replace a line with '00' in this example), how would I do it in place?
Reply
#4
Just an untested idea: mmap with readline.
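
One way the mmap idea could look, as a rough sketch rather than a tested recipe for 10 GB inputs. It assumes a 64-bit Python (so the whole file can be mapped), a non-empty file2, and a single-byte fill character so that the line length never changes. The name `blank_matching_lines` is hypothetical.

```python
import mmap


def blank_matching_lines(path_a, path_b, fill=b"0"):
    """Overwrite every line of path_b that equals the same-numbered line of
    path_a with `fill` repeated to the same length. The newline is left
    alone, so no byte in the file changes position."""
    assert len(fill) == 1, "fill must be exactly one byte"
    with open(path_a, "rb") as file_a, open(path_b, "r+b") as file_b:
        # Length 0 maps the whole file; this raises ValueError for an empty file.
        with mmap.mmap(file_b.fileno(), 0) as mm:
            while True:
                start = mm.tell()          # offset where this line begins
                line_b = mm.readline()
                line_a = file_a.readline()
                if not line_a or not line_b:
                    break                  # one of the files is exhausted
                if line_a == line_b:
                    body_len = len(line_b.rstrip(b"\r\n"))
                    # Slice assignment must keep the length identical and
                    # does not move the mmap's read position.
                    mm[start:start + body_len] = fill * body_len
```

The operating system pages the mapped file in and out as needed, so the file is not loaded into memory all at once.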
Reply
#5
Let's take your example. Your first file looks like this (od output).

0000000    a   d  \n   1   a  \n   2   b  \n   3   c  \n   .   .   .  \n
You say you want it to look like this afterward:

0000000    1   0  \n   0  \n   2   b  \n   4   5  \n   .   .   .  \n
See that the shorter second line means you're going to have to rewrite every byte that follows in the file since you've shifted the positions. You can't rewrite just the line that has changed. That's the advantage of a database with fixed-length fields. They can be rewritten in place cheaply.

Since you're already reading and writing every byte, you might as well use another file. There's no additional I/O cost, just disk space.

You could potentially rewrite every byte in the first file by reading a block, and writing the new block. But that means during the operation your file is inconsistent. If the program were to crash, you'd have a file that was half old and half new.

If you leave them the same length, then it should be possible to do a replacement write. Uncommon, but we should be able to find an example.
Reply
#6
Yes, as I said in my previous post, it's OK with me to replace a line with a string of the same length. The files are huge, so the additional disk I/O from the third file would slow things down a lot.
Can you suggest code to replace the string in that line with another one of same length cheaply?
Reply
#7
Yes, just taking a while. The simple way is to do a text read, but this is actually quite fragile because the length of a character is not fixed. Any non-ascii characters in the file and this will break. You can get unchanged data, or even invalid UTF-8 that can't be read as text. You'd need to change this instead to a binary read and bytes operations rather than str operations. But this at least shows the principle.

modify.txt before running:
Output:
line1
line2
line3
line4
with open("modify.txt", "r+") as f:
    line_start = f.tell()
    line = f.readline()
    while line:
        if "3" in line:  # replace any line with a 3 in it
            print(f"found line3 starting on offset {line_start}")
            line_end = f.tell()          # position just after this line
            # The replacement must be exactly as long as the text it overwrites.
            body = line.rstrip("\n")
            replace = "X" * len(body)    # change to whatever you want to replace with
            f.seek(line_start)
            f.write(replace)             # overwrites only the characters, not the newline
            f.seek(line_end)             # go back to the start of the next line
        line_start = f.tell()
        line = f.readline()
After running:
Output:
line1
line2
XXXXX
line4
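
For comparison, here is a sketch of the same single-file replacement done with the binary read and bytes operations recommended above. Offsets and lengths are then plain byte counts, so multi-byte characters cannot throw the positions off. The function name and its `needle`/`fill` parameters are assumptions made for this illustration.

```python
def overwrite_lines_containing(path, needle=b"3", fill=b"X"):
    """Overwrite, in place, every line of `path` that contains `needle` with
    `fill` repeated to the same byte length. The newline is kept."""
    assert len(fill) == 1, "fill must be exactly one byte"
    with open(path, "r+b") as f:
        while True:
            line_start = f.tell()
            line = f.readline()
            if not line:
                break                      # end of file
            if needle in line:
                line_end = f.tell()        # byte offset of the next line
                body_len = len(line.rstrip(b"\r\n"))
                f.seek(line_start)
                f.write(fill * body_len)   # same number of bytes as before
                f.seek(line_end)           # resume reading at the next line
```

Because the replacement is built from the byte length of the original line, a line holding non-ASCII text is overwritten with as many fill bytes as it occupied on disk, which may be more than its character count.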
Reply
#8
(Aug-10-2020, 05:22 PM)bowlofred Wrote: Yes, just taking a while. The simple way is to do a text read, but this is actually quite fragile because the length of a character is not fixed. Any non-ascii characters in the file and this will break. You can get unchanged data, or even invalid UTF-8 that can't be read as text. You'd need to change this instead to a binary read and bytes operations rather than str operations. But this at least shows the principle.

So for my original example, I did the following:

with open("file1", 'r') as fileA, open("file2", 'r+') as fileB:
    line_start = fileB.tell()
    for line1, line2 in zip(fileA, fileB):
        if line1 == line2:
            replace = "00"
            fileB.seek(line_start)
            fileB.write(replace)
However, this does not work: it only appends two zeroes at the end of the file.
Also, if I invoke fileB.tell() within the for loop, I get an error:
OSError: telling position disabled by next() call
...
Reply
#9
Yes, your loop uses the filehandle as an iterator, and that causes problems for tell(). I had to modify the loop so that explicit readline() calls were used instead.
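
Applied to the two-file case from the original question, that change might look like the sketch below. It assumes both files are opened in binary mode and that a matching line is overwritten with `0` repeated to the line's own length, so nothing after it moves. The function name `zero_matching_lines` is made up for this example.

```python
def zero_matching_lines(path_a, path_b, fill=b"0"):
    """Walk both files in lockstep with explicit readline() calls and
    overwrite each line of path_b that equals the same-numbered line of
    path_a. Returns the number of lines overwritten."""
    assert len(fill) == 1, "fill must be exactly one byte"
    replaced = 0
    with open(path_a, "rb") as file_a, open(path_b, "r+b") as file_b:
        while True:
            # tell() works here because we never iterate with next()/for.
            line_start = file_b.tell()
            line_a = file_a.readline()
            line_b = file_b.readline()
            if not line_a or not line_b:
                break                          # one of the files ended
            if line_a == line_b:
                line_end = file_b.tell()
                body_len = len(line_b.rstrip(b"\r\n"))
                file_b.seek(line_start)        # back to the start of this line
                file_b.write(fill * body_len)  # same length, newline untouched
                file_b.seek(line_end)          # continue with the next line
                replaced += 1
    return replaced
```

The attempt in post #8 recorded `line_start` only once before the loop and used `zip()` over the file objects, so the position was never updated per line. Here the offset is captured before every `readline()`, and the file position is restored after each write.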
Reply
#10
Hmm so I'm confused. How would I do it?
Reply

