Iterate 2 large text files across lines and replace lines in second file - Printable Version
Python Forum (https://python-forum.io), Forum: General Coding Help (https://python-forum.io/forum-8.html), Thread: /thread-28926.html

Iterate 2 large text files across lines and replace lines in second file - medatib531 - Aug-10-2020

My problem is as follows: suppose I have 2 huge (like 10GB) text files as follows:

file1:

    ad
    1a
    2b
    3c
    ...

file2:

    10
    0
    2b
    45
    ...

What I need to do is to iterate the 2 files simultaneously line by line (i.e. line 1 in file1 and line 1 in file2 together) and conditionally replace that line in file2 (e.g. if lines are equal, replace line2 with '0'). Note that this has to be done in place because the files are huge and cannot be loaded in memory. Can someone suggest a way? My (non-working) code below:

    with open("file1", 'r') as fileA,open("file2", 'r+') as fileB:
        for line1, line2 in zip(fileA, fileB):
            if line1 == line2:
                #replace line2 -> 0

RE: Iterate 2 large text files across lines and replace lines in second file - bowlofred - Aug-10-2020

You can't really do it "in place" unless you never change the length of a line. Otherwise, you have to change the position of every character later in the file. Instead, open a third file for writing. Conditionally copy every line in the file you want to change into the third file so it's correct. When you're done, rename the new file to replace the original one.

RE: Iterate 2 large text files across lines and replace lines in second file - medatib531 - Aug-10-2020

Writing to a third file would be very costly as the files are huge. So if I do not change the length of the line (e.g. replace a line with '00' in this example), how would I do it in place?

RE: Iterate 2 large text files across lines and replace lines in second file - perfringo - Aug-10-2020

Just an untested idea: mmap with readline.

RE: Iterate 2 large text files across lines and replace lines in second file - bowlofred - Aug-10-2020

Let's take your example. Your first file looks like this (od output).

    0000000 a d \n 1 a \n 2 b \n 3 c \n . . . \n

You say you want it to look like this afterward:

    0000000 1 0 \n 0 \n 2 b \n 4 5 \n . . . \n

See that the shorter second line means you're going to have to rewrite every byte that follows in the file since you've shifted the positions. You can't rewrite just the line that has changed. That's the advantage of a database with fixed-length fields. They can be rewritten in place cheaply.

Since you're already reading and writing every byte, you might as well use another file. There's no additional I/O cost, just disk space. You could potentially rewrite every byte in the first file by reading a block, and writing the new block. But that means during the operation your file is inconsistent. If the program were to crash, you'd have a file that was half old and half new.

If you leave them the same length, then it should be possible to do a replacement write. Uncommon, but we should be able to find an example.

RE: Iterate 2 large text files across lines and replace lines in second file - medatib531 - Aug-10-2020

Yes, as I said in my previous post, it's OK with me to replace a line with a string of the same length. The files are huge, so the additional disk I/O from the third file would slow things a lot. Can you suggest code to cheaply replace the string in that line with another one of the same length?

RE: Iterate 2 large text files across lines and replace lines in second file - bowlofred - Aug-10-2020

Yes, just taking a while. The simple way is to do a text read, but this is actually quite fragile because the length of a character is not fixed. If there are any non-ASCII characters in the file, this will break. You can get unchanged data, or even invalid UTF-8 that can't be read as text. You'd need to change this to a binary read and bytes operations rather than str operations. But this at least shows the principle.

modify.txt before running:

    with open("modify.txt", "r+") as f:
        line_start = f.tell()
        line = f.readline()
        while line:
            x = line.find("3")  # replace any line with a 3 in it.
            if x < 0:
                pass
            else:
                l = len(line)
                print(f"found line3 starting on offset {line_start}")
                # The line is exactly l characters long (with newlines)
                replace = "X" * (l-1)  # change to whatever you want to replace with.
                f.seek(line_start)
                f.write(replace)
                # f.write only overwrites the characters, not the newline
            line_start = f.tell()
            line = f.readline()

After running:

RE: Iterate 2 large text files across lines and replace lines in second file - medatib531 - Aug-10-2020

(Aug-10-2020, 05:22 PM) bowlofred wrote: Yes, just taking a while. The simple way is to do a text read, but this is actually quite fragile because the length of a character is not fixed. Any non-ascii characters in the file and this will break. You can get unchanged data, or even invalid UTF-8 that can't be read as text. You'd need to change this instead to a binary read and bytes operations rather than str operations. But this at least shows the principle.

So for my original example, I did the following:

    with open("file1", 'r') as fileA, open("file2", 'r+') as fileB:
        line_start = fileB.tell()
        for line1, line2 in zip(fileA, fileB):
            if line1 == line2:
                replace = "00"
                fileB.seek(line_start)
                fileB.write(replace)

However, this does not work, as it only appends two zeroes at the EOF. Also, if I invoke fileB.tell() within the for loop I get an error: OSError: telling position disabled by next() call...

RE: Iterate 2 large text files across lines and replace lines in second file - bowlofred - Aug-10-2020

Yes, this uses the filehandle as an iterator, and that causes problems for tell(). I had to modify the loop so that explicit readline() calls were used instead.

RE: Iterate 2 large text files across lines and replace lines in second file - medatib531 - Aug-10-2020

Hmm, so I'm confused. How would I do it?
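
One possible way to combine the points above, as a minimal sketch rather than code from the thread: read both files with explicit readline() calls (so tell() stays usable), open them in binary mode (so offsets and lengths are in bytes), and overwrite a matching line in the second file with a replacement of exactly the same length. The function name `zero_matching_lines`, the `fill` parameter, and the temporary demo files are hypothetical, and the sketch assumes that ignoring the line terminator when comparing is acceptable.

```python
import os
import tempfile


def zero_matching_lines(path_a, path_b, fill=b"0"):
    """Overwrite lines of path_b that equal the same-numbered line of path_a.

    Each matching line body is replaced by `fill` repeated to the same byte
    length, so no later byte in the file has to move. `fill` is assumed to be
    a single byte. Returns the number of lines that were overwritten.
    """
    changed = 0
    with open(path_a, "rb") as file_a, open(path_b, "r+b") as file_b:
        while True:
            # Remember where the current line of file_b starts *before* reading it.
            line_start = file_b.tell()
            line_a = file_a.readline()
            line_b = file_b.readline()
            if not line_a or not line_b:
                break  # one of the files is exhausted

            # Compare the line bodies without their terminators.
            body_a = line_a.rstrip(b"\r\n")
            body_b = line_b.rstrip(b"\r\n")
            if body_b and body_a == body_b:
                line_end = file_b.tell()
                file_b.seek(line_start)
                file_b.write(fill * len(body_b))  # same length, newline untouched
                file_b.seek(line_end)  # resume reading at the next line
                changed += 1
    return changed


if __name__ == "__main__":
    # Small demo on temporary files that mirror the example in the first post.
    with tempfile.TemporaryDirectory() as tmp:
        path_a = os.path.join(tmp, "file1")
        path_b = os.path.join(tmp, "file2")
        with open(path_a, "wb") as f:
            f.write(b"ad\n1a\n2b\n3c\n")
        with open(path_b, "wb") as f:
            f.write(b"10\n0\n2b\n45\n")
        print(zero_matching_lines(path_a, path_b))  # 1
        with open(path_b, "rb") as f:
            print(f.read())  # b'10\n0\n00\n45\n'
```

Because every write has the same byte length as the text it replaces, nothing after it in the file shifts. As noted earlier in the thread, a crash partway through would leave the second file partly updated, so keeping a backup may be worth the disk space.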