Python Forum
Fastest way to subtract elements of datasets of HDF5 file?
Input: Two arrays (N x 4, sorted by column 2) stored as dataset_1 and dataset_2 in an HDF5 file (input.h5). N is huge (the data originally came from a 10 GB file, hence the HDF5 storage).

Output: Subtract each column-2 element of dataset_2 from each column-2 element of dataset_1, keeping only the pairs whose difference (delta) lies within +/-4000. The results are then saved in a dataset of a new HDF5 file. I need to refer back to this new file repeatedly, hence HDF5 rather than a text file.
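
To make the intended output concrete, here is a minimal in-memory sketch on toy data. It is only an illustration under assumptions: the function name windowed_differences is hypothetical, both inputs are 1-D arrays already sorted in ascending order (standing in for column 2 of each dataset), and the window is treated as inclusive on both sides. It uses np.searchsorted, which performs a binary search on a sorted array.

```python
import numpy as np

def windowed_differences(a, b, w):
    """Return rows (a_value, b_value, a_value - b_value) with |a - b| <= w.

    Assumes a and b are 1-D arrays sorted in ascending order.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    # For each a[j], b[lo[j]:hi[j]] holds the values in [a[j] - w, a[j] + w].
    lo = np.searchsorted(b, a - w, side="left")
    hi = np.searchsorted(b, a + w, side="right")
    counts = hi - lo
    total = int(counts.sum())
    # Repeat each index of a once per matching element of b.
    a_idx = np.repeat(np.arange(a.size), counts)
    # Position of each match inside its own window, shifted by the window start.
    starts = np.cumsum(counts) - counts
    b_idx = np.arange(total) - np.repeat(starts, counts) + np.repeat(lo, counts)
    av = a[a_idx]
    bv = b[b_idx]
    return np.column_stack((av, bv, av - bv))

# Toy data, not taken from the real file.
pairs = windowed_differences([1000, 6000, 20000], [500, 4000, 9000, 30000], 4000)
print(pairs)
```

For real data the result can be far too large for memory, so a function like this would have to be applied to one block of dataset_1 rows at a time.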

Concern: I initially used the .append method, but that crashed on the 10 GB input. So I am now using the dset.resize method (and would prefer to stick with it). I am also using binary search, as suggested in one of my earlier posts. The script now seems to work for large (10 GB) datasets, but it is quite slow. The subtraction (for/while) loop is probably the culprit. Any suggestions on how to make this faster? I am looking for the fastest approach (and ideally the simplest, since I am a beginner).

import numpy as np
import h5py

f_r = h5py.File('input.h5', 'r')  # input is only read, so open it read-only

dset1 = f_r.get('dataset_1')
dset2 = f_r.get('dataset_2')
r1,c1 = dset1.shape
r2,c2 = dset2.shape

left, right, count = 0,0,0
W = 4000  # Window half-width
n = 1

# **********************************************
#   HDF5 Out Creation 
# **********************************************
f_w = h5py.File('data.h5', 'w')
d1 = np.zeros(shape=(0, 4))
dset = f_w.create_dataset('dataset_1', data=d1, maxshape=(None, None), chunks=True)

for j in range(r1):
    e1 = dset1[j,1]

    # advance the left pointer until dset2[left,1] is within -W of e1
    while left < r2 and dset2[left,1] - e1 <= -W:
        left += 1
    # advance the right pointer until dset2[right,1] is more than +W above e1
    while right < r2 and dset2[right,1] - e1 <= W:
        right += 1

    for i in range(left, right):
        delta = e1 - dset2[i,1]
        dset.resize(dset.shape[0] + n, axis=0)
        dset[count, 0:4] = [count, dset1[j,1], dset2[i,1], delta]
        count += 1

print("\nFinal shape of dataset created: " + str(dset.shape))

f_w.close()
f_r.close()
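
The loop above resizes the output dataset and writes one row for every match, and it reads single elements from the input datasets on every iteration. One possible way to reduce the number of HDF5 calls while still using dset.resize is to buffer rows in memory and write them in batches. The sketch below is an assumption-laden illustration, not a tested solution for the 10 GB case: BatchedAppender and batch_rows are hypothetical names, and the wrapped object is assumed to behave like an h5py dataset with an unlimited first axis (it needs .shape, .resize(size, axis=0) and slice assignment).

```python
import numpy as np

class BatchedAppender:
    """Collect rows in memory and append them to a resizable 2-D dataset in batches.

    Assumes `dset` offers .shape, .resize(size, axis=0) and slice assignment,
    as an h5py dataset created with an unlimited first axis does.
    """

    def __init__(self, dset, batch_rows=100_000):
        self.dset = dset
        self.batch_rows = batch_rows   # hypothetical tuning knob
        self.pending = []              # 2-D arrays waiting to be written
        self.pending_rows = 0

    def add(self, rows):
        """Queue one row or a 2-D block of rows; write once the buffer is full."""
        rows = np.atleast_2d(np.asarray(rows))
        if rows.size == 0:
            return
        self.pending.append(rows)
        self.pending_rows += rows.shape[0]
        if self.pending_rows >= self.batch_rows:
            self.flush()

    def flush(self):
        """Write all queued rows with a single resize and a single assignment."""
        if not self.pending:
            return
        block = np.concatenate(self.pending, axis=0)
        start = self.dset.shape[0]
        self.dset.resize(start + block.shape[0], axis=0)
        self.dset[start:start + block.shape[0], :] = block
        self.pending = []
        self.pending_rows = 0
```

In the script above, this would mean calling add([count, e1, e2, delta]) inside the inner loop in place of the per-row resize, and calling flush() once before f_w.close(). Reading column 2 of each input in larger slices (for example dset2[start:stop, 1]) instead of one element at a time should reduce the read overhead in a similar way.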


Messages In This Thread
Fastest way to subtract elements of datasets of HDF5 file? - by Robotguy - Jul-30-2020, 04:22 PM
