Jul-31-2020, 06:59 PM
Here are the answers to your questions:
You want to subtract two vectors of size N. What is the order of N: 10^4, 10^5, or 10^9? What is the type of the data to be subtracted: integer or double? How many bytes per value?
N (the number of rows) reaches 10^9 per file, and the values are float64 numbers (so about 8 GB per array).
Let's consider what is likely the most efficient way to subtract these large arrays. We store the first array (say 8 bytes per element) in a binary file, and the second array in another file. We assume the files are so large that we cannot load either of them into memory.
That's right, we can't load a whole file at once. I tried np.loadtxt and it failed!
In principle, we can write a program, e.g. in C, that reads both files in chunks (since each element is 8 bytes, we can read, say, 8*10^6 bytes at a time), does the computation on each pair of chunks, and writes the result to another binary file. That would be a very efficient approach: no Python, no heavy extra libraries (pandas, NumPy, etc.), no overkill related to the HDF format! The bottleneck would be I/O — how fast is your HDD? Is it an SSD? Finally, you can convert the output binary file into an HDF file if needed...
So, is this approach suitable for you? What about your hardware?
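For illustration, the chunked approach described above can be sketched in Python rather than C (the thread later requires Python anyway). This is a minimal sketch, assuming both binary files hold the same number of raw float64 values; the function name and paths are placeholders, not anything from the original posts:

```python
import numpy as np

def subtract_binary_files(path_a, path_b, path_out, chunk_elems=1_000_000):
    """Read two binary files of float64 values in fixed-size chunks,
    subtract them element-wise, and append each result chunk to an
    output binary file. Assumes both inputs have the same length."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb, \
         open(path_out, "wb") as fo:
        while True:
            # np.fromfile returns fewer than chunk_elems values near EOF,
            # and an empty array once the file is exhausted.
            a = np.fromfile(fa, dtype=np.float64, count=chunk_elems)
            b = np.fromfile(fb, dtype=np.float64, count=chunk_elems)
            if a.size == 0:
                break
            (a - b).tofile(fo)  # vectorized subtraction, one chunk at a time
```

Only one chunk from each file is in memory at any moment, so peak memory is bounded by `chunk_elems * 8` bytes per array regardless of the total file size.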
The subtraction has to be performed in Python. I know chunking could help. In fact, I tried reading a chunk (N=10^6) from the first file and subtracting each of the chunk's elements from the corresponding chunk of the second file. But that still takes time, because I have to grab each element of file-1's chunk in a for loop.
See my progress below; I used memory mapping as well. It is efficient if I skip the subtraction and just iterate over the chunks. The "for j in range(m):" loop is the inefficient part, and that is with me only subtracting, not even saving the differences. Any better logic/implementation you can think of?
import numpy as np
import pandas as pd

size1 = 1_000_000
size2 = 1_000_000
filename = ["file-1.txt", "file-2.txt"]
count = a = i = prog_count = 0  # progress counters (were undefined in the original snippet)
chunks1 = pd.read_csv(filename[0], chunksize=size1, names=['c1', 'c2', 'lt', 'rt'])
fp1 = np.memmap('newfile1.dat', dtype='float64', mode='w+', shape=(size1, 4))
fp2 = np.memmap('newfile2.dat', dtype='float64', mode='w+', shape=(size2, 4))
for chunk1 in chunks1:  # grab chunks from file-1
    m, _ = chunk1.shape
    fp1[0:m, :] = chunk1
    chunks2 = pd.read_csv(filename[1], chunksize=size2, names=['ch', 'tmstp', 'lt', 'rt'])
    for chunk2 in chunks2:  # grab chunks from file-2
        k, _ = chunk2.shape
        fp2[0:k, :] = chunk2
        for j in range(m):  # grabbing values from file-1's chunk -- this loop is the bottleneck
            e1 = fp1[j, 1]
            delta_mat = e1 - fp2  # just a test; actually e1 should be subtracted from col-2 of fp2, not the whole fp2
            count += 1
        fp2.flush()
        a += k
    fp1.flush()
    del chunks2
    i += m
    prog_count += m
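The slow inner loop above can be replaced by a single NumPy broadcast: instead of pulling one element of fp1 at a time in Python, subtract the whole columns at once. A minimal sketch, assuming (as the snippet's comment says) that what is wanted is every element of file-1's column 2 minus every element of file-2's column 2; the function name is mine, not from the thread:

```python
import numpy as np

def chunk_column_diffs(a_col, b_col):
    """All pairwise differences between two 1-D float64 arrays via
    broadcasting: result[j, i] = a_col[j] - b_col[i]. This one call
    replaces the per-element `for j in range(m)` Python loop."""
    return a_col[:, None] - b_col[None, :]
```

Inside the chunk loops this would be `delta_mat = chunk_column_diffs(fp1[0:m, 1], fp2[0:k, 1])`, moving the m*k subtractions from interpreted Python into compiled NumPy code.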