Python Forum
Pytables: Reducing size of an appended Earray
I am using PyTables' append to write the processed data. It is time-efficient for large files (1-10 GB), at least better than resizing the HDF5 file with the h5py module!

However, in my case the output file (earray.h5) has a huge size for large files. Is there a way to append the data such that the output file is not that huge? For example, in my case (see image below) a 13 GB input file (dset_1: 2.1E8 x 4 and dset_2: 2.1E8 x 4) gives a 197 GB output file with just one column (2.5E10 x 1). All elements are float64.

I want to reduce the output file size without compromising the execution speed of the script, and the output file should still be efficient to read later. Would storing the data column-wise rather than just row-wise help? Any suggestions on this? A minimal working example (MWE) is given below.

[Image: 1-iKixHtJITv0atK-Juwt8gSnnjEIVsfv]
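One option not shown in the MWE is to pass a compression filter when creating the EArray; PyTables applies it transparently on append and read. A minimal sketch, assuming Blosc (which ships with PyTables) and illustrative file names; the zeros array is demo data chosen only because it compresses well:

```python
import os

import numpy as np
import tables

data = np.zeros((10**5, 4))  # highly compressible demo data

# Compressed EArray: complevel 1-9 trades write speed for file size
filters = tables.Filters(complevel=5, complib='blosc')
with tables.open_file("compressed.h5", "w") as f:
    a = f.create_earray(f.root, "dataset_1", atom=tables.Float64Atom(),
                        shape=(0, 4), filters=filters,
                        expectedrows=10**5)
    a.append(data)

# Same data without compression, for comparison
with tables.open_file("plain.h5", "w") as f:
    a = f.create_earray(f.root, "dataset_1", atom=tables.Float64Atom(),
                        shape=(0, 4))
    a.append(data)

comp_size = os.path.getsize("compressed.h5")
plain_size = os.path.getsize("plain.h5")
print(comp_size, plain_size)
```

On real float64 data the achievable ratio depends on the data's entropy; passing `expectedrows` also lets PyTables choose a better chunk shape, which affects both file size and read speed.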

import h5py
import tables

# no. of chunks from dset_1 and dset_2 in inp.h5
loop_1 = 40
loop_2 = 20

# save to disk after these many rows
app_len = 10**6 

# **********************************************
#       Grabbing input.h5 file
# **********************************************
filename = 'inp.h5'
f2 = h5py.File(filename, 'r')
chunks1 = f2['dset_1']
chunks2 = f2['dset_2']
shape1, shape2 = chunks1.shape[0], chunks2.shape[0]

f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float64Atom(), shape=(0, 4))

size1 = shape1//loop_1
size2 = shape2//loop_2

# ***************************************************
#       Grabbing chunks to process and append data
# ***************************************************
for c in range(loop_1):
    h = c*size1
    # grab chunks from dset_1 of inp.h5  
    chunk1 = chunks1[h:(h + size1)]

    for d in range(loop_2):
        g = d*size2
        chunk2 = chunks2[g:(g + size2)] # grab chunks from dset_2 of inp.h5 
        r1 = chunk1.shape[0]
        r2 = chunk2.shape[0]
        left, right = 0, 0

        for j in range(r1):  # grab col. 2 values from dset_1
            e1 = chunk1[j, 1]
            # ...algebraic operations here output a row of 4 float64 values
            # ...append to a (earray) when no. of rows reaches a million
        del chunk2
    del chunk1
f2.close()
f1.close()
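The "append when no. of rows reaches a million" step left as a comment in the MWE can be sketched with a pre-allocated buffer that is flushed in large appends. This is an assumed implementation, not the original script: the row computation is a placeholder, and `n_rows`/`app_len` are scaled down for illustration:

```python
import numpy as np
import tables

app_len = 10**4      # flush to disk after this many rows (illustrative)
n_rows = 25_000      # stand-in for the real number of output rows

buf = np.empty((app_len, 4), dtype=np.float64)
fill = 0

with tables.open_file("table.h5", "w") as f1:
    a = f1.create_earray(f1.root, "dataset_1",
                         atom=tables.Float64Atom(), shape=(0, 4))
    for j in range(n_rows):
        row = np.full(4, float(j))   # placeholder for the algebraic operations
        buf[fill] = row
        fill += 1
        if fill == app_len:          # buffer full: one large append, then reuse
            a.append(buf)
            fill = 0
    if fill:                         # flush whatever is left over
        a.append(buf[:fill])

with tables.open_file("table.h5", "r") as f1:
    total = f1.root.dataset_1.nrows
print(total)
```

Appending in large blocks like this keeps the per-call overhead low; combining it with a `filters=` argument on `create_earray` (as above) is what actually shrinks the file.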