Aug-19-2020, 05:35 PM
I am using PyTables' append to output the processed data. It is time-efficient for large files (1-10 GB), at least better than repeatedly resizing the HDF5 dataset with the h5py module!
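For context, the append pattern I mean is roughly this (a minimal sketch, not my actual script; the file and dataset names here are made up):

```python
import numpy as np
import tables
import h5py

rows = np.random.rand(1000, 4)          # hypothetical batch of float64 rows

# PyTables: create an extendable array once, then append batches to it.
with tables.open_file("demo_pytables.h5", "w") as f:
    ea = f.create_earray(f.root, "data", atom=tables.Float64Atom(), shape=(0, 4))
    ea.append(rows)                      # grows the dataset without manual resizing

# h5py: the dataset has to be resized before every write.
with h5py.File("demo_h5py.h5", "w") as f:
    ds = f.create_dataset("data", shape=(0, 4), maxshape=(None, 4), dtype="float64")
    ds.resize(ds.shape[0] + rows.shape[0], axis=0)
    ds[-rows.shape[0]:] = rows
```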
However, in my case the output file (earray.h5) becomes huge for large inputs. Is there a way to append the data so that the output file does not grow this much? For example, in my case (see image below), a 13 GB input file (dset_1: 2.1E8 x 4 and dset_2: 2.1E8 x 4) gives a 197 GB output file with just one column (2.5E10 x 1). All elements are float64.
I want to reduce the output file size without compromising the script's execution speed, and reading the output file later should also stay efficient. Would saving the data along columns rather than rows help? Any suggestions? A MWE is given below.
![Image: HDF5 input/output file sizes](https://drive.google.com/drive/folders/1-iKixHtJITv0atK-Juwt8gSnnjEIVsfv)
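A quick back-of-the-envelope check (my own, assuming raw uncompressed float64 storage) shows the sizes above are essentially just the raw data volume:

```python
# float64 = 8 bytes per element, no compression assumed
in_bytes  = 2 * (2.1e8 * 4) * 8     # two input datasets of 2.1E8 x 4
out_bytes = (2.5e10 * 1) * 8        # output dataset of 2.5E10 x 1
print(in_bytes  / 1e9)              # ~13.4 GB -> matches the ~13 GB input file
print(out_bytes / 1e9)              # ~200 GB  -> matches the ~197 GB output file
```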
```python
import h5py
import tables

# no. of chunks from dset_1 and dset_2 in inp.h5
loop_1 = 40
loop_2 = 20
# save to disk after these many rows
app_len = 10**6

# **********************************************
# Grabbing the input file (inp.h5)
# **********************************************
filename = 'inp.h5'
f2 = h5py.File(filename, 'r')
chunks1 = f2['dset_1']
chunks2 = f2['dset_2']
shape1, shape2 = chunks1.shape[0], chunks2.shape[0]

f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float64Atom(), shape=(0, 4))

size1 = shape1 // loop_1
size2 = shape2 // loop_2

# ***************************************************
# Grabbing chunks to process and append data
# ***************************************************
for c in range(loop_1):
    h = c * size1
    # grab a chunk from dset_1 of inp.h5
    chunk1 = chunks1[h:(h + size1)]
    for d in range(loop_2):
        g = d * size2
        # grab a chunk from dset_2 of inp.h5
        chunk2 = chunks2[g:(g + size2)]
        r1 = chunk1.shape[0]
        r2 = chunk2.shape[0]
        left, right = 0, 0
        for j in range(r1):
            # grab col. 2 values from dataset-1
            e1 = chunk1[j, 1]
            # ...algebraic operations here output a row of 4 float64 values
            # ...append to a (the earray) once the no. of rows reaches a million (app_len)
        del chunk2
    del chunk1

f1.close()
f2.close()
```
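The part elided in the comments ("append to a when the no. of rows reaches a million") boils down to buffering computed rows and flushing them to the earray in batches. A rough standalone sketch of that step (the helper name `append_in_batches` and the row iterator are just for illustration, not my actual code):

```python
import numpy as np

def append_in_batches(earray, row_iter, app_len=10**6):
    """Collect 4-element float64 rows from row_iter in a buffer and flush
    them to the PyTables earray every app_len rows."""
    buf = np.empty((app_len, 4), dtype=np.float64)
    filled = 0
    for row in row_iter:
        buf[filled] = row
        filled += 1
        if filled == app_len:
            earray.append(buf)        # write a million rows in one call
            filled = 0
    if filled:
        earray.append(buf[:filled])   # flush the remainder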