Jul-30-2020, 04:22 PM
(This post was last modified: Jul-30-2020, 04:22 PM by Robotguy. Edited 1 time in total.)

Input: Two arrays (N x 4, each sorted on column 2) stored as dataset_1 and dataset_2 in an HDF5 file (input.h5). N is huge (the data originally came from a 10 GB file, which is why it is stored in HDF5).

Output: For each column-2 element of dataset 1, subtract every column-2 element of dataset 2 whose difference (delta) lies within +/-4000, then save the results in a dataset (dset) of a new HDF5 file. I need to go back and forth to this new file, which is why it is HDF5 rather than a text file.
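To make the intended output concrete, here is a brute-force reference on tiny in-memory arrays. The sample values and the function name `pairs_bruteforce` are made up for illustration, and the window is assumed to be -W < (b - a) <= W, which is what the loop in the script does. It builds an N1 x N2 matrix, so it is only meant for checking faster versions on small inputs, never for the 10 GB data.

```python
import numpy as np

def pairs_bruteforce(a, b, W=4000):
    """Return rows [a_value, b_value, a_value - b_value] for every pair in the window.

    a, b: 1-D arrays standing in for column 2 of dataset 1 and dataset 2.
    Assumption: a pair is kept when -W < b - a <= W.
    """
    diff = b[None, :] - a[:, None]              # diff[j, i] = b[i] - a[j]
    jj, ii = np.nonzero((diff > -W) & (diff <= W))
    return np.column_stack([a[jj], b[ii], a[jj] - b[ii]])

# Hypothetical sample data (signed integers, already sorted)
a = np.array([1000, 6000, 20000])
b = np.array([500, 5000, 9000, 30000])
print(pairs_bruteforce(a, b))
```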

Concern: I initially used the .append method, but that crashed on the 10 GB input. So I am now using the dset.resize method (and would prefer to stick with it). I am also using binary search, as suggested in one of my earlier posts. The script now seems to work on the large (10 GB) datasets, but it is quite slow! The subtraction (for/while) loop is possibly the culprit. Any suggestions on how I can make this faster? I am aiming for the fastest approach (and ideally the simplest, since I am a beginner).


import numpy as np
import time
import h5py
import sys
import csv

f_r = h5py.File('input.h5', 'r+')

dset1 = f_r.get('dataset_1')
dset2 = f_r.get('dataset_2')
r1, c1 = dset1.shape
r2, c2 = dset2.shape

left, right, count = 0, 0, 0
W = 4000  # Window half-width
n = 1     # Rows added per resize

# **********************************************
#   HDF5 Out Creation
# **********************************************
f_w = h5py.File('data.h5', 'w')
d1 = np.zeros(shape=(0, 4))
dset = f_w.create_dataset('dataset_1', data=d1, maxshape=(None, None), chunks=True)

for j in range(r1):
    e1 = dset1[j, 1]

    # move the left pointer to the first element of dset2 that is above e1 - W
    while left < r2 and dset2[left, 1] - e1 <= -W:
        left += 1
    # move the right pointer to the first element of dset2 that is above e1 + W
    while right < r2 and dset2[right, 1] - e1 <= W:
        right += 1

    # every element in [left, right) is inside the window around e1
    for i in range(left, right):
        delta = e1 - dset2[i, 1]
        dset.resize(dset.shape[0] + n, axis=0)
        dset[count, 0:4] = [count, dset1[j, 1], dset2[i, 1], delta]
        count += 1

print("\nFinal shape of dataset created: " + str(dset.shape))

f_w.close()
f_r.close()
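One direction that might cut the run time is to stop reading single elements from the HDF5 datasets and stop resizing once per row. The sketch below is not the script above; it swaps the two moving pointers for `np.searchsorted` (a binary search) and produces the results in batches. It assumes column 2 of both datasets fits in memory as signed or float NumPy arrays (for example via `dset1[:, 1]`), and `window_batches` and its `batch` parameter are hypothetical names chosen for illustration.

```python
import numpy as np

def window_batches(a, b, W=4000, batch=100_000):
    """Yield (a_vals, b_vals, deltas) for all pairs with -W < b - a <= W.

    a, b: sorted 1-D arrays (column 2 of each dataset, already in memory).
    batch: how many elements of `a` to process per step (assumed tuning knob).
    Assumption: signed integer or float dtype, so `chunk - W` cannot wrap around.
    """
    for start in range(0, len(a), batch):
        chunk = a[start:start + batch]
        # lo: first index in b with b > e - W; hi: first index with b > e + W
        lo = np.searchsorted(b, chunk - W, side="right")
        hi = np.searchsorted(b, chunk + W, side="right")
        counts = hi - lo
        total = int(counts.sum())
        if total == 0:
            continue
        a_vals = np.repeat(chunk, counts)
        # Expand each range lo[k]..hi[k]-1 into explicit indices without a Python loop
        group_start = np.repeat(np.cumsum(counts) - counts, counts)
        b_idx = np.repeat(lo, counts) + (np.arange(total) - group_start)
        b_vals = b[b_idx]
        yield a_vals, b_vals, a_vals - b_vals

# Hypothetical sample data
a = np.array([1000, 6000, 20000])
b = np.array([500, 5000, 9000, 30000])
for a_vals, b_vals, deltas in window_batches(a, b, W=4000, batch=2):
    print(np.column_stack([a_vals, b_vals, deltas]))
```

Each yielded batch could then be written with a single `dset.resize(dset.shape[0] + len(a_vals), axis=0)` followed by one slice assignment, which keeps the resize approach but calls it once per batch instead of once per output row. If the window catches many matches per element, a smaller `batch` keeps the temporary arrays manageable.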