Pytables: Reducing size of an appended EArray
I am using PyTables' EArray.append to write out the processed data. It is time-efficient for large files (1-10 GB), at least better than resizing an HDF5 dataset with the h5py module!
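
For context, the h5py alternative I mean is a resizable dataset that you grow before every write, roughly like this (a minimal sketch; the file name out.h5 and the block size are just for illustration):

import h5py
import numpy as np

with h5py.File('out.h5', 'w') as f:
    # resizable dataset: start empty, unlimited rows, 4 columns
    dset = f.create_dataset('data', shape=(0, 4), maxshape=(None, 4),
                            dtype='float64', chunks=True)
    block = np.random.rand(10**6, 4)            # stand-in for processed rows
    old = dset.shape[0]
    dset.resize(old + block.shape[0], axis=0)   # grow first, then write
    dset[old:] = block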

However, in my case the output file (earray.h5) becomes huge. Is there a way to append the data so that the output file stays smaller? For example, in my case (see image below) a 13 GB input file (dset_1: 2.1E8 x 4 and dset_2: 2.1E8 x 4) gives a 197 GB output file with just one column (2.5E10 x 1). All elements are float64.
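
That 197 GB is essentially just the raw float64 payload, as a quick back-of-the-envelope check shows:

rows = 2.5e10             # output rows (2.5E10 x 1)
size_gb = rows * 8 / 1e9  # float64 = 8 bytes per value
print(size_gb)            # ~200 GB, so the file is barely bigger than the data itself

So unless the values compress well, the file can hardly be smaller than this as plain float64.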

I want to reduce the output file size without compromising the execution speed of the script, and reading the output file back should also be efficient later. Could saving the data along columns instead of rows help? Any suggestions on this? Given below is an MWE.
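
One thing I have been wondering about is PyTables' built-in compression filters; a minimal sketch of what I mean, reusing the names from the MWE below (the complevel/complib values are an untuned guess):

import tables

filters = tables.Filters(complevel=5, complib='blosc')  # compressed chunks on disk
f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float64Atom(),
                     shape=(0, 4), filters=filters,
                     expectedrows=10**9)  # rough hint so PyTables picks a sensible chunkshape
f1.close()

Would something like this keep the append speed acceptable?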

[Image: 1-iKixHtJITv0atK-Juwt8gSnnjEIVsfv — input and output file sizes]

import h5py
import tables

# no. of chunks from dset_1 and dset_2 in inp.h5
loop_1 = 40
loop_2 = 20

# save to disk after this many rows
app_len = 10**6

# **********************************************
#       Grabbing the input file (inp.h5)
# **********************************************
filename = 'inp.h5'
f2 = h5py.File(filename, 'r')
chunks1 = f2['dset_1']
chunks2 = f2['dset_2']
shape1, shape2 = chunks1.shape[0], chunks2.shape[0]

f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float64Atom(), shape=(0, 4))

size1 = shape1//loop_1
size2 = shape2//loop_2

# ***************************************************
#       Grabbing chunks to process and append data
# ***************************************************
for c in range(loop_1):
    h = c*size1
    # grab chunks from dset_1 of inp.h5  
    chunk1 = chunks1[h:(h + size1)]

    for d in range(loop_2):
        g = d*size2
        chunk2 = chunks2[g:(g + size2)] # grab chunks from dset_2 of inp.h5 
        r1 = chunk1.shape[0]
        r2 = chunk2.shape[0]
        left, right = 0, 0

        for j in range(r1):  # grab col. 2 values from dataset-1
            e1 = chunk1[j, 1]
            # ...algebraic operations here produce a row of 4 float64 values
            # ...append to a (the EArray) once the row count reaches app_len
        del chunk2
    del chunk1
f2.close()
f1.close()  # flush buffered rows and close the output file