Fastest way to subtract elements of datasets of HDF5 file? - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Data Science (https://python-forum.io/forum-44.html)
+--- Thread: Fastest way to subtract elements of datasets of HDF5 file? (/thread-28699.html)
Fastest way to subtract elements of datasets of HDF5 file? - Robotguy - Jul-30-2020

Input: two arrays (Nx4, sorted on column 2) stored in dataset_1 and dataset_2 of an HDF5 file (input.h5). N is huge (the arrays originally come from about 10 GB of data, hence the HDF5 storage).

Output: for each column-2 element of dataset_1, subtract every column-2 element of dataset_2 whose difference (delta) lies within +/-4000, and save this information in a dataset of a new HDF5 file. I need to refer to this new file back and forth, hence HDF5 rather than a text file.

Concern: I initially used the .append method, but that crashed the execution on the 10 GB input, so I am now using dset.resize (and would prefer to stick with it). I am also using binary search, as I was told in one of my earlier posts. The script now seems to work for large (10 GB) datasets, but it is quite slow; the subtraction (for/while) loop is probably the culprit. Any suggestions on how I can make this fast? I am aiming for the fastest approach (and ideally the simplest, since I am a beginner).
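For what it's worth, the left/right pointer scan over the sorted column can also be expressed with np.searchsorted, which finds both window bounds for all rows at once. This is a minimal sketch with made-up stand-in arrays (col1 and col2 are illustrative, not the poster's data); boundary handling at exactly +/-W may differ slightly from the pointer version:

```python
import numpy as np

W = 4000  # window half-width, as in the post

# Illustrative stand-ins for column 2 of dataset_1 and dataset_2
# (dataset_2 is sorted on this column, as stated in the post).
col1 = np.array([0.0, 5000.0, 9000.0])
col2 = np.array([-3000.0, 1000.0, 4500.0, 8000.0, 20000.0])

# For every e1 in col1, locate the slice of col2 with |col2 - e1| <= W
# in one vectorized call each, instead of moving pointers row by row.
left = np.searchsorted(col2, col1 - W, side='left')
right = np.searchsorted(col2, col1 + W, side='right')

# The inner subtraction then becomes one vectorized slice per row.
all_deltas = [e1 - col2[left[j]:right[j]] for j, e1 in enumerate(col1)]
```

This replaces both while loops and the innermost for loop with array operations, which is usually where the speedup comes from.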
import numpy as np
import h5py

f_r = h5py.File('input.h5', 'r+')
dset1 = f_r.get('dataset_1')
dset2 = f_r.get('dataset_2')
r1, c1 = dset1.shape
r2, c2 = dset2.shape

left, right, count = 0, 0, 0
W = 4000  # window half-width
n = 1     # rows added per resize

# **********************************************
# HDF5 Out Creation
# **********************************************
f_w = h5py.File('data.h5', 'w')
d1 = np.zeros(shape=(0, 4))
dset = f_w.create_dataset('dataset_1', data=d1, maxshape=(None, None), chunks=True)

for j in range(r1):
    e1 = dset1[j, 1]
    # move left pointer so that it is within -W of e1
    while left < r2 and dset2[left, 1] - e1 <= -W:
        left += 1
    # move right pointer so that it is outside of +W
    while right < r2 and dset2[right, 1] - e1 <= W:
        right += 1
    for i in range(left, right):
        delta = e1 - dset2[i, 1]
        dset.resize(dset.shape[0] + n, axis=0)
        dset[count, 0:4] = [count, dset1[j, 1], dset2[i, 1], delta]
        count += 1

print("\nFinal shape of dataset created: " + str(dset.shape))
f_w.close()

RE: Fastest way to subtract elements of datasets of HDF5 file? - scidam - Jul-31-2020

You want to subtract two vectors of size N; what is the order of N: 10000, 100000, or 10^9? What is the type of the data to be subtracted: integer, double? How many bytes per value?

Let's imagine what is likely the most efficient way to subtract these large arrays. We store the first array (say, 8 bytes per element) in a binary file, and the second array in another file. We assume the files are so large that we cannot load either of them into memory. Theoretically, we could write a program, e.g. in C, that reads both files in chunks (since each element is 8 bytes, we could read, say, 8*10^6 bytes at a time), does the computation on those chunks, and puts the result into another binary file. That would be a very efficient approach: no Python, no heavy additional libraries (pandas, NumPy, etc.), no overhead from the HDF format!
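For illustration, that chunked binary-file scheme can also be sketched in plain NumPy while staying in Python; the file names and chunk length below are made up for the example, and both files are assumed to hold the same number of float64 values:

```python
import numpy as np

# Two small binary files of float64 stand in for the huge inputs
# (file names here are illustrative).
a = np.arange(10, dtype='float64')
b = np.arange(10, dtype='float64') * 2
a.tofile('arr1.bin')
b.tofile('arr2.bin')

chunk_elems = 4  # elements per chunk; in practice something like 10**6

with open('arr1.bin', 'rb') as f1, open('arr2.bin', 'rb') as f2, \
     open('diff.bin', 'wb') as out:
    while True:
        # Read one chunk from each file; np.fromfile advances the file position.
        c1 = np.fromfile(f1, dtype='float64', count=chunk_elems)
        c2 = np.fromfile(f2, dtype='float64', count=chunk_elems)
        if c1.size == 0 or c2.size == 0:
            break
        (c1 - c2).tofile(out)  # element-wise difference of the two chunks

result = np.fromfile('diff.bin', dtype='float64')
```

Only one chunk per file is ever held in memory, so the peak footprint is set by chunk_elems, not by N.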
The bottleneck would be the I/O operations: how fast is your HDD, or is it an SSD? Finally, you can convert the output binary file into an HDF file, if needed. So, is this approach appropriate for you? What about your hardware?

RE: Fastest way to subtract elements of datasets of HDF5 file? - Robotguy - Jul-31-2020

Here are the answers to your questions:

> You want to subtract two vectors of size N; what is the order of N: 10000, 100000, or 10^9? What is the type of the data to be subtracted?

N (the number of rows) reaches 10^9 for each file, and the files contain float64 numbers.

> We assume the files are so large that we cannot load either of them into memory.

That's right, we can't load a file all at once. I tried np.loadtxt and it failed!

> Theoretically, we could write a program, e.g. in C, that reads both files in chunks ... So, is this approach appropriate for you?

The subtraction has to be performed in Python. I know chunking could help. In fact, I tried reading a chunk (N = 10^6) from the first file and subtracting each of the chunk's elements from the chunk of the second file, but that still takes time, as I have to grab each element of the file-1 chunk in a for loop. See my progress below; I used memory mapping as well, which is efficient as long as I do no subtraction and just iterate through the chunks.
The "for j in range(m):" loop is the inefficient one. (This version only performs the subtraction and does not yet save the differences.) Any better logic/implementation you can think of?

import numpy as np
import pandas as pd

size1 = 1_000_000
size2 = 1_000_000
filename = ["file-1.txt", "file-2.txt"]
count, a, i, prog_count = 0, 0, 0, 0

chunks1 = pd.read_csv(filename[0], chunksize=size1, names=['c1', 'c2', 'lt', 'rt'])
fp1 = np.memmap('newfile1.dat', dtype='float64', mode='w+', shape=(size1, 4))
fp2 = np.memmap('newfile2.dat', dtype='float64', mode='w+', shape=(size2, 4))

for chunk1 in chunks1:  # grab chunks from file-1
    m, _ = chunk1.shape
    fp1[0:m, :] = chunk1
    chunks2 = pd.read_csv(filename[1], chunksize=size2, names=['ch', 'tmstp', 'lt', 'rt'])
    for chunk2 in chunks2:  # grab chunks from file-2
        k, _ = chunk2.shape
        fp2[0:k, :] = chunk2
        for j in range(m):  # grabbing values from file-1's chunk
            e1 = fp1[j, 1]
            delta_mat = e1 - fp2  # just a test; actually e1 should be subtracted from col-2 of fp2, not the whole fp2
            count += 1
        fp2.flush()
        a += k
    fp1.flush()
    del chunks2
    i += m
    prog_count += m

RE: Fastest way to subtract elements of datasets of HDF5 file? - scidam - Aug-01-2020

If you can rewrite "for j in range(m)" in numpy-vectorized form, it will work faster, e.g. something like this: delta_mat = fp1[:, 1] - fp2[:, 2]. Below is an example (not tested) where I try to compute the difference between two huge arrays given in txt/csv format, as in your case:

import numpy as np
import pandas as pd

common_size = 10 ** 6
N = 10 ** 9
filename = ["file-1.txt", "file-2.txt"]
chunks1 = pd.read_csv(filename[0], chunksize=common_size, names=['c1', 'c2', 'lt', 'rt'])
chunks2 = pd.read_csv(filename[1], chunksize=common_size, names=['ch', 'tmstp', 'lt', 'rt'])
output = np.memmap('newfile1.dat', dtype='float64', mode='w+', shape=(N, 4))

# Example: column-wise difference, i.e. ch - c1, tmstp - c2, lt - lt, rt - rt;
# the output is stored to newfile1.dat.
for ind, (chunk1, chunk2) in enumerate(zip(chunks1, chunks2)):
    output[common_size * ind : common_size * (ind + 1), :] = chunk2.values - chunk1.values
output.flush()
# It may cause an error if file-1 and file-2 have a different number of rows.
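One hedged way to guard against that different-number-of-rows error is to trim each chunk pair to its common length before subtracting. This sketch uses tiny in-memory CSVs standing in for file-1 and file-2 (the data is made up); note that zip also silently stops once the shorter file's chunks run out:

```python
import io

import numpy as np
import pandas as pd

# Tiny in-memory stand-ins for file-1 and file-2 (file-2 has one extra row).
csv1 = "1,10,0,0\n2,20,0,0\n3,30,0,0\n"
csv2 = "1,11,0,0\n2,22,0,0\n3,33,0,0\n4,44,0,0\n"

common_size = 2
chunks1 = pd.read_csv(io.StringIO(csv1), chunksize=common_size,
                      names=['c1', 'c2', 'lt', 'rt'])
chunks2 = pd.read_csv(io.StringIO(csv2), chunksize=common_size,
                      names=['ch', 'tmstp', 'lt', 'rt'])

pieces = []
for chunk1, chunk2 in zip(chunks1, chunks2):
    n = min(len(chunk1), len(chunk2))  # trim the pair to its common length
    pieces.append(chunk2.values[:n] - chunk1.values[:n])

diff = np.vstack(pieces)  # column-wise differences over the matched rows
```

Rows of the longer file with no counterpart are simply dropped; whether that is acceptable depends on what the unmatched rows mean in the data.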