Iterating Large Files
#1
Hello,

I am a beginner and would appreciate any alternatives for handling my problem. Simply put, I have two files, each containing one vector. The aim is to subtract every element of file 2 from every element of file 1, for all possible combinations. Everything is fine for small vectors, but the processing time becomes huge for larger files, such as files with a million elements each.

Given below is a minimal working example. I have heard about memory mapping and would appreciate it if you could share a modified version or any relevant pointers for handling this issue.
import matplotlib.pyplot as plt
import numpy as np

# load the two files; the event values of interest are in column 2
file1 = np.loadtxt('File_1.txt', delimiter=',')
file2 = np.loadtxt('File_2.txt', delimiter=',')
event1 = file1[:, 1]
event2 = file2[:, 1]

r1 = len(event1)
r2 = len(event2)

diff = []

# compute all pairwise differences and keep those within +/- 4000
for i in range(r1):
    for j in range(r2):
        delta = float(event1[i] - event2[j])
        if -4000 <= delta <= 4000:
            diff = np.append(diff, delta)

np.savetxt('export.txt', diff)
#2
Before going to memory mapping, which may not change much because the problem doesn't seem to be the time spent accessing the files, it seems to me that the mathematical algorithm could be improved by first sorting the arrays of numbers and using 3 pointers:
  • The first pointer is the current element x[i] in the first array of numbers.
  • The second pointer is the first element y[j] in the second array such that y[j] >= x[i] - 4000.
  • The third pointer is the first element y[k] in the second array such that y[k] > x[i] + 4000.
All the differences x[i] - y[n] must be appended to the result for n in range(j, k).

As i increases, j and k also increase; their new values can be determined by binary search in the second array.

Depending on your data, these algorithmic changes could dramatically cut the number of Python loop iterations and the number of numerical tests during the execution of the program.
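A minimal, untested sketch of this idea with numpy (assuming the same file layout as in your code; here j and k are simply recomputed by binary search with np.searchsorted for each element):
import numpy as np

x = np.sort(np.loadtxt('File_1.txt', delimiter=',')[:, 1])
y = np.sort(np.loadtxt('File_2.txt', delimiter=',')[:, 1])

chunks = []
for xi in x:
    j = np.searchsorted(y, xi - 4000, side='left')   # first index with y[j] >= xi - 4000
    k = np.searchsorted(y, xi + 4000, side='right')  # first index with y[k] > xi + 4000
    if j < k:
        chunks.append(xi - y[j:k])

diff = np.concatenate(chunks) if chunks else np.empty(0)
np.savetxt('export.txt', diff)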
#3
Hi,

So this works fine for file sizes of a few hundred MB. When my first and second arrays contain a million or more events each, I see memory leaks due to the append method. Any recommendations?

Thanks!

(Jun-26-2020, 10:00 AM)Gribouillis Wrote: Before going to memory mapping, which may not change much because the problem doesn't seem to be the time spent accessing the files, it seems to me that the mathematical algorithm could be improved by first sorting the arrays of numbers and using 3 pointers: [...]
#4
Robotguy Wrote: I see memory leaks due to the append method.
Well, the most obvious recommendation would be to write the file sequentially instead of filling a numpy array until there is a memory problem. I'm not a numpy expert, but the strategy would be:
with open('export.txt', 'wb') as ofh:
    diff = [] # this is a python list, not a numpy array
    for <iteration within the input>:
        diff.extend(...) # use list's extend() or append() methods which are fast
        if len(diff) > 10000: # save diff list and reset it when it becomes too long
            np.savetxt(ofh, diff)
            diff = []
It is probably not lightning fast, but it can do the work for reasonably sized files. As an example, the following Python loop over 1 billion numbers takes less than 2 minutes on my computer:

>>> import time
>>> def f():
...     L = []
...     start = time.time()
...     for i in range(10**9):
...         L.append(i)
...         if len(L) >= 10000:
...             L = []
...     print(time.time() - start, 'seconds')
Beware of numpy.append(), which creates a new copy of the whole array each time it is called. Benchmark your procedures.
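For instance, a small benchmark along these lines illustrates the difference (the sizes and repeat counts are arbitrary):
import timeit
import numpy as np

def with_list(n=10_000):
    L = []
    for i in range(n):
        L.append(i)              # amortized O(1): the list over-allocates
    return np.array(L)

def with_np_append(n=10_000):
    a = np.empty(0)
    for i in range(n):
        a = np.append(a, i)      # O(len(a)): allocates and copies the whole array
    return a

print('list.append:', timeit.timeit(with_list, number=10), 'seconds')
print('np.append  :', timeit.timeit(with_np_append, number=10), 'seconds')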

As for the file, I sometimes use the trick of writing the file to a ramdisk, which is fast and doesn't need real disk access. With this trick, you could perhaps write segments of numpy arrays directly to the file, such as in:
np.savetxt(ofh, x[i] - y[j:k])
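Putting the pieces together, a rough sketch of what I mean might look like this (writing to an ordinary file here; the random arrays are only stand-ins for your sorted event columns, and the buffer threshold is arbitrary):
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1_000_000, size=1000))    # stand-in for the sorted event columns
y = np.sort(rng.uniform(0, 1_000_000, size=1000))

with open('export.txt', 'wb') as ofh:
    diff = []                                        # plain Python list used as a buffer
    for xi in x:
        j = np.searchsorted(y, xi - 4000, side='left')
        k = np.searchsorted(y, xi + 4000, side='right')
        diff.extend(xi - y[j:k])                     # cheap list extend
        if len(diff) > 10000:                        # flush and reset the buffer
            np.savetxt(ofh, diff)
            diff = []
    if diff:
        np.savetxt(ofh, diff)                        # flush the remainder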
#5
It occurred to me that you can also play with BytesIO files in memory.
>>> import io
>>> import numpy as np
>>> ofh = io.BytesIO()
>>> a = np.linspace(0, 3, 100)
>>> np.savetxt(ofh, a[10:50])
>>> print(ofh.getvalue())
You could use them as temporary files with fast write access and flush them to disk regularly when they become too large.
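For example, something along these lines (the blocks are stand-in data and the flush threshold is arbitrary):
import io
import numpy as np

buf = io.BytesIO()
with open('export.txt', 'wb') as ofh:
    for block in (np.linspace(i, i + 1, 100) for i in range(50)):  # stand-in data blocks
        np.savetxt(buf, block)                 # fast in-memory writes
        if buf.tell() > 1_000_000:             # send the buffer to disk when it grows too large
            ofh.write(buf.getvalue())
            buf = io.BytesIO()
    ofh.write(buf.getvalue())                  # write whatever remains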
#6
I appreciate your persistence in answering my questions.

I understand the idea of populating the file incrementally and avoiding appending to an array. Thinking further, though, I realize that I will need quick access to the (huge) data I save. Let me explain my situation thoroughly:

Consider two files (File-1.txt and File-2.txt); each of these files has 4 columns delimited by ',' (commas). My operations are carried out on column-2, and I need to:

[a] sort each file based on column-2
[b] for each item in File-1's column-2, find all elements in File-2's column-2 such that the difference is within +/- 4000; right now I am using binary search as you mentioned
[c] save all those pairs (i.e. rows in File-1 and File-2) which satisfy the +/- 4000 condition in [b]
[d] occasionally retrieve a few pairs from [c] based on some condition I set (to produce histograms or whatever); the purpose of a text file in [c] is defeated here, as I will refer to the pairs frequently and the lookups should happen quickly!

The other catch is that this whole process should be independent of file size (even for files of tens of GBs or more). Granted, the algorithm will take more time for large files, but it should not quit with RAM errors etc.

I found that the h5py package and memmap could be a way to handle this. But it appears memmap has a 2 GB limit ("Memory-mapped files cannot be larger than 2GB on 32-bit systems.", as it says at https://numpy.org/doc/stable/reference/g...emmap.html ).

So, I think h5py should be the way to go. Any suggestions on h5py or something else I could implement?

Thanks again,

(Jul-15-2020, 11:01 PM)Gribouillis Wrote: Well, the most obvious recommendation would be to write the file sequentially instead of filling a numpy array until there is a memory problem. [...]
#7
Robotguy Wrote: I realize that I will need quick access to the (huge) data I save.
The first thing to clarify is how quick the access must be and how huge the data file is, especially the third file created from the other two. RAM is typically small: on my computer I have 16 GB of RAM, so clearly I won't be able to keep tens of GB in RAM at a time.

So the main question is: what do you want to do with the third file? You can still create histograms if the huge file is written to disk.

You could perhaps have a look at the PyTables module, based on HDF5, which was specially designed to handle large sets of data.
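If it helps, here is a minimal, untested sketch of appending chunks to an extendable PyTables array (the file name, node name and stand-in chunks are only placeholders):
import numpy as np
import tables  # PyTables

# create an extendable array of float64 on disk
with tables.open_file('pairs.h5', mode='w') as h5:
    diffs = h5.create_earray(h5.root, 'diffs', atom=tables.Float64Atom(), shape=(0,))
    # append chunks as they are produced, without holding everything in RAM
    for chunk in (np.random.rand(1000) for _ in range(10)):  # stand-in chunks
        diffs.append(chunk)

# read back a slice later without loading the whole array
with tables.open_file('pairs.h5', mode='r') as h5:
    part = h5.root.diffs[100:200]
    print(part.shape)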
#8
(Jul-17-2020, 07:41 PM)Gribouillis Wrote: based on HDF5, which was specially designed to handle large sets of data.

So, I implemented something following your suggestion of using HDF5. As a recap, here are the major steps:

[a] load the files and sort each file based on column-2
[b] for each item in File-1's column-2, find all elements in File-2's column-2 such that the difference is within +/- 4000; right now I am using binary search as you mentioned
[c] save all those pairs (i.e. rows in File-1 and File-2) which satisfy the +/- 4000 condition in [b]
[d] occasionally retrieve a few pairs from [c] based on some condition I set (to produce histograms or whatever); a text file in [c] is defeated here, as I will refer to the pairs frequently and the lookups should happen quickly!

I achieved [b]-[c] by creating an HDF5 file using the h5py package. This appears to be saving me time, thanks!
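For reference, the kind of incremental HDF5 storage I mean looks roughly like this (the dataset name, shape and stand-in blocks are just placeholders, not my actual code):
import numpy as np
import h5py

with h5py.File('pairs.h5', 'w') as h5:
    # resizable dataset holding the matched pairs, two columns per row
    pairs = h5.create_dataset('pairs', shape=(0, 2), maxshape=(None, 2),
                              chunks=True, dtype='f8')
    for block in (np.random.rand(1000, 2) for _ in range(5)):  # stand-in blocks of pairs
        n = pairs.shape[0]
        pairs.resize(n + block.shape[0], axis=0)   # grow the dataset on disk
        pairs[n:] = block                          # write the new block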

But now the issue is [a] for large files. For 1 GB files in [a], I see a memory error due to loading and sorting these large files at once. Below is the snippet that is producing the error. As I mentioned, these files could be tens of GBs.

My question now is to find an alternative way to load the file(s) and sort the data using some memory-efficient approach. (I know what to do after sorting, using h5py.) My thought is that there should be a better way to implement the snippet below without burdening the RAM; any suggestions?

import numpy as np

file1 = np.loadtxt('Device_1.txt', delimiter=',')
file2 = np.loadtxt('Device_2.txt', delimiter=',')

matrix1, matrix2 = file1[:], file2[:]

# sort the rows of matrix1 by column 2
matrix1 = matrix1[np.argsort(matrix1[:, 1])]
#9
Here is a devilish suggestion. I made the following experiment:
>>> s
array([9.94223926e-01, 7.55188959e-01, 2.87075284e-04, 6.60265593e-01,
       3.12176498e-02, 3.01580980e-01, 9.79960201e-01, 2.37826251e-01,
       1.74042656e-01, 1.39546100e-02, 2.14055048e-01, 8.73880775e-01,
       5.12656017e-01])
>>> filename = 'paillasse/foo.bin'
>>> Path(filename).write_bytes(s.tobytes())
104
>>> subprocess.call('bsort -k 8 -r 8 ' + filename, shell=True)
0
>>> ss = np.frombuffer(Path(filename).read_bytes())
>>> ss
array([2.87075284e-04, 3.12176498e-02, 9.79960201e-01, 2.37826251e-01,  # STILL WRONG, SEE BELOW
       9.94223926e-01, 5.12656017e-01, 6.60265593e-01, 8.73880775e-01,
       3.01580980e-01, 7.55188959e-01, 1.39546100e-02, 2.14055048e-01,
       1.74042656e-01])
In other words, I save the array as bytes in a file, I use the external program bsort to perform an in-place binary sort on this file,
and then I retrieve the sorted numpy array by reading the file back as bytes.

Bsort has a reputation for being extremely efficient, and it can presumably handle large files in the same way that GNU sort does.

This could be the solution to this problem.

OOPS, there is still an issue: it seems that the array is not yet sorted. It could be a byte-order problem.

YES! The following version works on my machine:
from pathlib import Path
import numpy as np
import subprocess

s = np.array([9.94223926e-01, 7.55188959e-01, 2.87075284e-04, 6.60265593e-01,
       3.12176498e-02, 3.01580980e-01, 9.79960201e-01, 2.37826251e-01,
       1.74042656e-01, 1.39546100e-02, 2.14055048e-01, 8.73880775e-01,
       5.12656017e-01])
filename = 'paillasse/foo.bin'

# byteswap to big-endian so that lexicographic byte order matches numeric
# order for these (positive) floats, then sort the raw 8-byte records with bsort
Path(filename).write_bytes(s.byteswap().tobytes())
subprocess.call(['bsort', '-k', '8', '-r', '8', filename])

# read the sorted bytes back and swap the byte order again
ss = np.frombuffer(Path(filename).read_bytes()).byteswap()
print(s)
print(ss)
s.sort()
print('Sorted ?', np.array_equal(s, ss))
Output:
[9.94223926e-01 7.55188959e-01 2.87075284e-04 6.60265593e-01
 3.12176498e-02 3.01580980e-01 9.79960201e-01 2.37826251e-01
 1.74042656e-01 1.39546100e-02 2.14055048e-01 8.73880775e-01
 5.12656017e-01]
[2.87075284e-04 1.39546100e-02 3.12176498e-02 1.74042656e-01
 2.14055048e-01 2.37826251e-01 3.01580980e-01 5.12656017e-01
 6.60265593e-01 7.55188959e-01 8.73880775e-01 9.79960201e-01
 9.94223926e-01]
Sorted ? True
#10
(Jul-22-2020, 06:09 PM)Gribouillis Wrote: s = np.array([9.94223926e-01, 7.55188959e-01, 2.87075284e-04, 6.60265593e-01,
       3.12176498e-02, 3.01580980e-01, 9.79960201e-01, 2.37826251e-01,
       1.74042656e-01, 1.39546100e-02, 2.14055048e-01, 8.73880775e-01,
       5.12656017e-01])

Correct me if I am wrong: this should run into an issue, since my array comes from a large text file (10 GB, Nx4)? The first question is how do I load my file and define s. np.loadtxt is failing with a memory error; can we use h5py (or memmap) to load all the data onto the hard disk at once and sort it based on column 2?
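To make the question concrete, something like this chunked text-to-binary conversion is what I have in mind, assuming a first pass over the file to count rows is acceptable and a single column still fits in RAM for the argsort (all names and the chunk size are placeholders):
import numpy as np

src = 'Device_1.txt'
with open(src) as fh:                                   # first pass: count the rows
    n_rows = sum(1 for _ in fh)

mm = np.memmap('device1.bin', dtype='f8', mode='w+', shape=(n_rows, 4))
with open(src) as fh:                                   # second pass: fill the memmap in chunks
    row = 0
    chunk = []
    for line in fh:
        chunk.append([float(v) for v in line.split(',')])
        if len(chunk) == 100_000:
            mm[row:row + len(chunk)] = chunk
            row += len(chunk)
            chunk = []
    if chunk:
        mm[row:row + len(chunk)] = chunk
mm.flush()

order = np.argsort(mm[:, 1])                            # argsort of column-2 only
# rows can then be read back in sorted order, e.g. mm[order[:10]]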