Aug-19-2020, 05:35 PM
I am using PyTables' append to output the processed data. It is time-efficient for large files (1-10 GB), at least better than repeatedly resizing the HDF5 dataset with the h5py module!
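For context, the append pattern I mean is roughly this (a minimal sketch, not my actual script; the file and dataset names here are made up):

```python
import numpy as np
import tables
import h5py

rows = np.random.rand(1000, 4)          # hypothetical batch of float64 rows

# PyTables: create an extendable array once, then append batches to it.
with tables.open_file("demo_pytables.h5", "w") as f:
    ea = f.create_earray(f.root, "data", atom=tables.Float64Atom(), shape=(0, 4))
    ea.append(rows)                      # grows the dataset without manual resizing

# h5py: the dataset has to be resized before every write.
with h5py.File("demo_h5py.h5", "w") as f:
    ds = f.create_dataset("data", shape=(0, 4), maxshape=(None, 4), dtype="float64")
    ds.resize(ds.shape[0] + rows.shape[0], axis=0)
    ds[-rows.shape[0]:] = rows
```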
However, in my case the output file (earray.h5) becomes huge for large inputs. Is there a way to append the data so that the output file does not grow this much? For example, in my case (see image below), a 13 GB input file (dset_1: 2.1E8 x 4 and dset_2: 2.1E8 x 4) gives a 197 GB output file with just one column (2.5E10 x 1). All elements are float64.
I want to reduce the output file size without compromising the script's execution speed, and reading the output file later should also stay efficient. Would saving the data along columns rather than rows help? Any suggestions? A MWE is given below.
![Image: HDF5 input/output file sizes](https://drive.google.com/drive/folders/1-iKixHtJITv0atK-Juwt8gSnnjEIVsfv)
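A quick back-of-the-envelope check (my own, assuming raw uncompressed float64 storage) shows the sizes above are essentially just the raw data volume:

```python
# float64 = 8 bytes per element, no compression assumed
in_bytes  = 2 * (2.1e8 * 4) * 8     # two input datasets of 2.1E8 x 4
out_bytes = (2.5e10 * 1) * 8        # output dataset of 2.5E10 x 1
print(in_bytes  / 1e9)              # ~13.4 GB -> matches the ~13 GB input file
print(out_bytes / 1e9)              # ~200 GB  -> matches the ~197 GB output file
```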
```python
import h5py
import tables

# no. of chunks from dset_1 and dset_2 in inp.h5
loop_1 = 40
loop_2 = 20
# save to disk after these many rows
app_len = 10**6

# **********************************************
# Grabbing the input file (inp.h5)
# **********************************************
filename = 'inp.h5'
f2 = h5py.File(filename, 'r')
chunks1 = f2['dset_1']
chunks2 = f2['dset_2']
shape1, shape2 = chunks1.shape[0], chunks2.shape[0]

f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float64Atom(), shape=(0, 4))

size1 = shape1 // loop_1
size2 = shape2 // loop_2

# ***************************************************
# Grabbing chunks to process and append data
# ***************************************************
for c in range(loop_1):
    h = c * size1
    # grab a chunk from dset_1 of inp.h5
    chunk1 = chunks1[h:(h + size1)]
    for d in range(loop_2):
        g = d * size2
        # grab a chunk from dset_2 of inp.h5
        chunk2 = chunks2[g:(g + size2)]
        r1 = chunk1.shape[0]
        r2 = chunk2.shape[0]
        left, right = 0, 0
        for j in range(r1):
            # grab col. 2 values from dataset-1
            e1 = chunk1[j, 1]
            # ...algebraic operations here output a row of 4 float64 values
            # ...append to a (the earray) once the no. of rows reaches a million (app_len)
        del chunk2
    del chunk1

f1.close()
f2.close()
```
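The part elided in the comments ("append to a when the no. of rows reaches a million") boils down to buffering computed rows and flushing them to the earray in batches. A rough standalone sketch of that step (the helper name `append_in_batches` and the row iterator are just for illustration, not my actual code):

```python
import numpy as np

def append_in_batches(earray, row_iter, app_len=10**6):
    """Collect 4-element float64 rows from row_iter in a buffer and flush
    them to the PyTables earray every app_len rows."""
    buf = np.empty((app_len, 4), dtype=np.float64)
    filled = 0
    for row in row_iter:
        buf[filled] = row
        filled += 1
        if filled == app_len:
            earray.append(buf)        # write a million rows in one call
            filled = 0
    if filled:
        earray.append(buf[:filled])   # flush the remainder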