Fastest way to subtract elements of datasets of HDF5 file? - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Data Science (https://python-forum.io/forum-44.html)
+--- Thread: Fastest way to subtract elements of datasets of HDF5 file? (/thread-28699.html)
Fastest way to subtract elements of datasets of HDF5 file? - Robotguy - Jul-30-2020

Input: two arrays (Nx4, sorted on column 2) stored in dataset_1 and dataset_2 of an HDF5 file (input.h5). N is huge (the arrays originally come from about 10 GB of data, hence the HDF5 storage).

Output: for each column-2 element of dataset_1, subtract every column-2 element of dataset_2 whose difference (delta) lies within +/-4000, and save this information in a dataset of a new HDF5 file. I need to refer to this new file back and forth, hence HDF5 rather than a text file.

Concern: I initially used the .append method, but that crashed the execution on the 10 GB input, so I am now using dset.resize (and would prefer to stick with it). I am also using binary search, as I was told in one of my earlier posts. The script now seems to work for large (10 GB) datasets, but it is quite slow; the subtraction (for/while) loop is probably the culprit. Any suggestions on how I can make this fast? I am aiming for the fastest approach (and ideally the simplest, since I am a beginner).
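For what it's worth, the left/right pointer scan over the sorted column can also be expressed with np.searchsorted, which finds both window bounds for all rows at once. This is a minimal sketch with made-up stand-in arrays (col1 and col2 are illustrative, not the poster's data); boundary handling at exactly +/-W may differ slightly from the pointer version:

```python
import numpy as np

W = 4000  # window half-width, as in the post

# Illustrative stand-ins for column 2 of dataset_1 and dataset_2
# (dataset_2 is sorted on this column, as stated in the post).
col1 = np.array([0.0, 5000.0, 9000.0])
col2 = np.array([-3000.0, 1000.0, 4500.0, 8000.0, 20000.0])

# For every e1 in col1, locate the slice of col2 with |col2 - e1| <= W
# in one vectorized call each, instead of moving pointers row by row.
left = np.searchsorted(col2, col1 - W, side='left')
right = np.searchsorted(col2, col1 + W, side='right')

# The inner subtraction then becomes one vectorized slice per row.
all_deltas = [e1 - col2[left[j]:right[j]] for j, e1 in enumerate(col1)]
```

This replaces both while loops and the innermost for loop with array operations, which is usually where the speedup comes from.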
import numpy as np
import h5py

f_r = h5py.File('input.h5', 'r+')
dset1 = f_r.get('dataset_1')
dset2 = f_r.get('dataset_2')
r1, c1 = dset1.shape
r2, c2 = dset2.shape

left, right, count = 0, 0, 0
W = 4000  # window half-width
n = 1     # rows added per resize

# **********************************************
# HDF5 Out Creation
# **********************************************
f_w = h5py.File('data.h5', 'w')
d1 = np.zeros(shape=(0, 4))
dset = f_w.create_dataset('dataset_1', data=d1, maxshape=(None, None), chunks=True)

for j in range(r1):
    e1 = dset1[j, 1]
    # move left pointer so that it is within -W of e1
    while left < r2 and dset2[left, 1] - e1 <= -W:
        left += 1
    # move right pointer so that it is outside of +W
    while right < r2 and dset2[right, 1] - e1 <= W:
        right += 1
    for i in range(left, right):
        delta = e1 - dset2[i, 1]
        dset.resize(dset.shape[0] + n, axis=0)
        dset[count, 0:4] = [count, dset1[j, 1], dset2[i, 1], delta]
        count += 1

print("\nFinal shape of dataset created: " + str(dset.shape))
f_w.close()

RE: Fastest way to subtract elements of datasets of HDF5 file? - scidam - Jul-31-2020

You want to subtract two vectors of size N; what is the order of N: 10000, 100000, or 10^9? What is the type of the data to be subtracted: integer, double? How many bytes per value?

Let's imagine what is likely the most efficient way to subtract these large arrays. We store the first array (say, 8 bytes per element) in a binary file, and the second array in another file. We assume the files are so large that we cannot load either of them into memory. Theoretically, we could write a program, e.g. in C, that reads both files in chunks (since each element is 8 bytes, we could read, say, 8*10^6 bytes at a time), does the computation on those chunks, and puts the result into another binary file. That would be a very efficient approach: no Python, no heavy additional libraries (pandas, NumPy, etc.), no overhead from the HDF format!
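For illustration, that chunked binary-file scheme can also be sketched in plain NumPy while staying in Python; the file names and chunk length below are made up for the example, and both files are assumed to hold the same number of float64 values:

```python
import numpy as np

# Two small binary files of float64 stand in for the huge inputs
# (file names here are illustrative).
a = np.arange(10, dtype='float64')
b = np.arange(10, dtype='float64') * 2
a.tofile('arr1.bin')
b.tofile('arr2.bin')

chunk_elems = 4  # elements per chunk; in practice something like 10**6

with open('arr1.bin', 'rb') as f1, open('arr2.bin', 'rb') as f2, \
     open('diff.bin', 'wb') as out:
    while True:
        # Read one chunk from each file; np.fromfile advances the file position.
        c1 = np.fromfile(f1, dtype='float64', count=chunk_elems)
        c2 = np.fromfile(f2, dtype='float64', count=chunk_elems)
        if c1.size == 0 or c2.size == 0:
            break
        (c1 - c2).tofile(out)  # element-wise difference of the two chunks

result = np.fromfile('diff.bin', dtype='float64')
```

Only one chunk per file is ever held in memory, so the peak footprint is set by chunk_elems, not by N.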
The bottleneck would be the I/O operations: how fast is your HDD, or is it an SSD? Finally, you can convert the output binary file into an HDF file, if needed. So, is this approach appropriate for you? What about your hardware?

RE: Fastest way to subtract elements of datasets of HDF5 file? - Robotguy - Jul-31-2020

Here are the answers to your questions:

> You want to subtract two vectors of size N; what is the order of N: 10000, 100000, or 10^9? What is the type of the data to be subtracted?

N (the number of rows) reaches 10^9 for each file, and the files contain float64 numbers.

> We assume the files are so large that we cannot load either of them into memory.

That's right, we can't load a file all at once. I tried np.loadtxt and it failed!

> Theoretically, we could write a program, e.g. in C, that reads both files in chunks ... So, is this approach appropriate for you?

The subtraction has to be performed in Python. I know chunking could help. In fact, I tried reading a chunk (N = 10^6) from the first file and subtracting each of the chunk's elements from the chunk of the second file, but that still takes time, as I have to grab each element of the file-1 chunk in a for loop. See my progress below; I used memory mapping as well, which is efficient as long as I do no subtraction and just iterate through the chunks.
The "for j in range(m):" loop is the inefficient one. (This version only performs the subtraction and does not yet save the differences.) Any better logic/implementation you can think of?

import numpy as np
import pandas as pd

size1 = 1_000_000
size2 = 1_000_000
filename = ["file-1.txt", "file-2.txt"]
count, a, i, prog_count = 0, 0, 0, 0

chunks1 = pd.read_csv(filename[0], chunksize=size1, names=['c1', 'c2', 'lt', 'rt'])
fp1 = np.memmap('newfile1.dat', dtype='float64', mode='w+', shape=(size1, 4))
fp2 = np.memmap('newfile2.dat', dtype='float64', mode='w+', shape=(size2, 4))

for chunk1 in chunks1:  # grab chunks from file-1
    m, _ = chunk1.shape
    fp1[0:m, :] = chunk1
    chunks2 = pd.read_csv(filename[1], chunksize=size2, names=['ch', 'tmstp', 'lt', 'rt'])
    for chunk2 in chunks2:  # grab chunks from file-2
        k, _ = chunk2.shape
        fp2[0:k, :] = chunk2
        for j in range(m):  # grabbing values from file-1's chunk
            e1 = fp1[j, 1]
            delta_mat = e1 - fp2  # just a test; actually e1 should be subtracted from col-2 of fp2, not the whole fp2
            count += 1
        fp2.flush()
        a += k
    fp1.flush()
    del chunks2
    i += m
    prog_count += m

RE: Fastest way to subtract elements of datasets of HDF5 file? - scidam - Aug-01-2020

If you can rewrite "for j in range(m)" in numpy-vectorized form, it will work faster, e.g. something like this: delta_mat = fp1[:, 1] - fp2[:, 2]. Below is an example (not tested) where I try to compute the difference between two huge arrays given in txt/csv format, as in your case:

import numpy as np
import pandas as pd

common_size = 10 ** 6
N = 10 ** 9
filename = ["file-1.txt", "file-2.txt"]
chunks1 = pd.read_csv(filename[0], chunksize=common_size, names=['c1', 'c2', 'lt', 'rt'])
chunks2 = pd.read_csv(filename[1], chunksize=common_size, names=['ch', 'tmstp', 'lt', 'rt'])
output = np.memmap('newfile1.dat', dtype='float64', mode='w+', shape=(N, 4))

# Example: column-wise difference, i.e. ch - c1, tmstp - c2, lt - lt, rt - rt;
# the output is stored to newfile1.dat.
for ind, (chunk1, chunk2) in enumerate(zip(chunks1, chunks2)):
    output[common_size * ind : common_size * (ind + 1), :] = chunk2.values - chunk1.values
output.flush()
# It may cause an error if file-1 and file-2 have a different number of rows.
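One hedged way to guard against that different-number-of-rows error is to trim each chunk pair to its common length before subtracting. This sketch uses tiny in-memory CSVs standing in for file-1 and file-2 (the data is made up); note that zip also silently stops once the shorter file's chunks run out:

```python
import io

import numpy as np
import pandas as pd

# Tiny in-memory stand-ins for file-1 and file-2 (file-2 has one extra row).
csv1 = "1,10,0,0\n2,20,0,0\n3,30,0,0\n"
csv2 = "1,11,0,0\n2,22,0,0\n3,33,0,0\n4,44,0,0\n"

common_size = 2
chunks1 = pd.read_csv(io.StringIO(csv1), chunksize=common_size,
                      names=['c1', 'c2', 'lt', 'rt'])
chunks2 = pd.read_csv(io.StringIO(csv2), chunksize=common_size,
                      names=['ch', 'tmstp', 'lt', 'rt'])

pieces = []
for chunk1, chunk2 in zip(chunks1, chunks2):
    n = min(len(chunk1), len(chunk2))  # trim the pair to its common length
    pieces.append(chunk2.values[:n] - chunk1.values[:n])

diff = np.vstack(pieces)  # column-wise differences over the matched rows
```

Rows of the longer file with no counterpart are simply dropped; whether that is acceptable depends on what the unmatched rows mean in the data.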