Python Forum
Iterating Large Files
#6
I appreciate your persistence in answering my questions.

I understand populating the file sequentially instead of appending to an array. Thinking further, though, I realize that I will need quick access to the (huge) data I save. Let me explain my situation thoroughly:

Consider two files (File-1.txt and File-2.txt); each file has 4 columns delimited by ',' (commas). My operations are carried out on column-2. I need to:

[a] sort each file based on column-2
[b] for each item in File-1's column-2, find all elements in File-2's column-2 such that the difference is within +/- 4000; right now I am using binary search, as you mentioned (see the sketch just after this list)
[c] save all the pairs (i.e. rows of File-1 and File-2) that satisfy the +/- 4000 condition in [b]
[d] occasionally pull up a few pairs from [c] based on some condition I set (to produce histograms or whatever); a plain text file in [c] is defeated here, as I will refer to the pairs frequently and the lookups should happen quickly!
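
To make [b] concrete, here is a minimal sketch of the window search, assuming both column-2 arrays are already sorted and small enough to sit in memory (which the real ones will not be):

from bisect import bisect_left, bisect_right

def matches_within(col1, col2, tol=4000):
    # col1 and col2 are sorted lists of column-2 values
    for v in col1:
        lo = bisect_left(col2, v - tol)   # first index with col2[lo] >= v - tol
        hi = bisect_right(col2, v + tol)  # first index with col2[hi] > v + tol
        yield v, col2[lo:hi]              # every col2 value within +/- 4000 of v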

The other catch is that the whole process should be independent of file size (even for files of tens of GBs or more). Granted, the algorithm will take more time for large files, but it should not quit with RAM errors etc.

I found that the h5py package and numpy's memmap could be a way to handle this, but it appears memmap cannot exceed 2 GB on 32-bit systems ("Memory-mapped files cannot be larger than 2GB on 32-bit systems." as it says at https://numpy.org/doc/stable/reference/g...emmap.html )

So I think h5py should be the way to go. Any suggestions on h5py, or on something else I could implement?
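
For context, here is the kind of thing I am considering with h5py, a minimal sketch with made-up file and dataset names: an extendable, chunked dataset that pair blocks get appended to, and that can be sliced later without loading the whole file.

import h5py
import numpy as np

with h5py.File('pairs.h5', 'w') as f:
    # 2-column dataset that can grow along axis 0; chunking makes slicing efficient
    dset = f.create_dataset('pairs', shape=(0, 2), maxshape=(None, 2),
                            dtype='f8', chunks=True)
    for block in (np.random.rand(5, 2) for _ in range(3)):  # stand-in for real pair blocks
        n = dset.shape[0]
        dset.resize(n + len(block), axis=0)  # grow the on-disk dataset
        dset[n:] = block                     # write only the new rows

with h5py.File('pairs.h5', 'r') as f:
    subset = f['pairs'][5:10]  # reads just these rows from disk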

Thanks again,

(Jul-15-2020, 11:01 PM)Gribouillis Wrote:
Robotguy Wrote: I see memory leaks due to the append method.
Well, the most obvious recommendation would be to write a file sequentially instead of filling a numpy array until there is a memory leak. I'm not a numpy expert, but the strategy would be:
import numpy as np

with open('export.txt', 'wb') as ofh:
    diff = []  # this is a python list, not a numpy array
    for <iteration within the input>:
        diff.extend(...)  # use list's extend() or append() methods, which are fast
        if len(diff) > 10000:  # save the diff list and reset it when it becomes too long
            np.savetxt(ofh, diff)
            diff = []
    if diff:  # write the final partial batch too
        np.savetxt(ofh, diff)
It is probably not lightspeed, but it can do the work for reasonably sized files. As an example, the following Python loop over 1 billion numbers takes less than 2 minutes on my computer:

>>> import time
>>> def f():
...     L = []
...     start = time.time()
...     for i in range(10**9):
...         L.append(i)
...         if len(L) >= 10000:
...             L = []
...     print(time.time() - start, 'seconds')
...
>>> f()
Beware of numpy.append(), which copies the whole array on every call. Benchmark your procedures.
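
As a quick illustration of that point (a sketch, with arbitrary sizes), one can benchmark list.append() against repeated numpy.append():

import timeit
import numpy as np

def with_list(n):
    L = []
    for i in range(n):
        L.append(i)  # amortized O(1) per append
    return np.array(L)

def with_np_append(n):
    a = np.empty(0, dtype=int)
    for i in range(n):
        a = np.append(a, i)  # allocates and copies the whole array each time
    return a

print('list   :', timeit.timeit(lambda: with_list(10_000), number=10))
print('ndarray:', timeit.timeit(lambda: with_np_append(10_000), number=10))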

As for the file, I sometimes use the trick of writing the file to a ramdisk, which is fast and doesn't need real disk access. By using this trick, you could perhaps write segments of numpy arrays directly to the file, such as in
np.savetxt(ofh, x[i] - y[j:k])
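
For the ramdisk part, a minimal sketch, assuming a Linux system where /dev/shm is a RAM-backed tmpfs (the path and array contents are made up):

import numpy as np

x = np.arange(10.0)
y = np.arange(10.0)
i, j, k = 0, 2, 5
# files under /dev/shm live in RAM on most Linux systems, so no real disk access
with open('/dev/shm/export.txt', 'wb') as ofh:
    np.savetxt(ofh, x[i] - y[j:k])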