Python Forum
How to sort a HDF5 file
#1
Hey everyone,

I am storing a large data file (about 10 GB, N rows and 4 columns) in an HDF5 file using the h5py package, mainly because I do not want to load all of it into RAM.

I would like to sort the rows in the file by the second column. Any suggestions on how to do that?

Thanks!
#2
Chunks | Sort | Merge

Split the data in the HDF5 file into chunks that fit into memory.
Then sort each chunk and write the sorted chunks to disk.
Then open all chunk files and merge them.
Write the output to a different file.

You need heapq.merge, which returns a generator, so the merged data is never held in memory all at once.
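
To illustrate how heapq.merge behaves, here is a minimal sketch. The two lists and their values are made up for the illustration; each one is already sorted by its second field, which is what heapq.merge expects of its inputs. The key argument selects the field to compare.

```python
import heapq

# Hypothetical sample rows; each run is already sorted by the second field.
run_a = [(7, 1.0), (2, 4.0), (9, 8.0)]
run_b = [(5, 2.0), (1, 4.0), (3, 6.0)]

# heapq.merge returns a lazy iterator: nothing is consumed until we ask for it.
merged = heapq.merge(run_a, run_b, key=lambda row: row[1])

first = next(merged)   # pulls only as many rows as needed to decide the smallest
rest = list(merged)    # drains the remaining rows in sorted order

print(first)
print(rest)
```

Because the result is an iterator, the inputs can just as well be generators that read from files, which is what keeps memory use low.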

Here is an example of how it could work. It uses random integers and plain text files instead of HDF5 to keep it short.
import heapq
import random
from contextlib import ExitStack
from pathlib import Path


def producer(size):
    """
    Return a list of `size` random integers from 0 to 1024
    (random.randint includes both end points)
    """
    return [random.randint(0, 1024) for _ in range(size)]


# you could use chunked from more_itertools
# https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.chunked
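def chunker(iterable, chunksize):
    """
    Split an iterable into lists of at most chunksize items.
    The last chunk may be shorter, so no items are dropped.
    """
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == chunksize:
            yield chunk
            chunk = []
    if chunk:
        # leftover items that did not fill a whole chunk
        yield chunk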


def sorter(iterable, filename):
    """
    Sort each chunk and save it to its own file.
    The files are named filename_0, filename_1, ...
    """
    for n, chunk in enumerate(iterable):
        # one number per line, in ascending order
        chunk = "\n".join(map(str, sorted(chunk)))
        with open(f"{filename}_{n}", "w") as fd:
            fd.write(chunk)


def merger(filename, output):
    """
    Find all chunk files named filename_<n>
    Order the files by their chunk number
    Then open the files
    Merge the sorted chunks and write the result to output
    """
    base = Path(filename)
    # the chunk number is the part after the last underscore
    key = lambda path: int(path.name.rsplit("_", 1)[1])
    # glob on the bare name so that filename may contain a directory
    files = sorted(base.parent.glob(f"{base.name}_*"), key=key)
    with ExitStack() as stack, open(output, "w") as fd_out:
        # ExitStack closes every chunk file when the block ends
        files = [stack.enter_context(file.open()) for file in files]
        # each file yields its lines lazily, converted to int
        map_to_int = [map(int, fd) for fd in files]
        for number in map(str, heapq.merge(*map_to_int)):
            fd_out.write(number)
            fd_out.write("\n")


sorter(chunker(producer(100), 10), "sorting")
merger("sorting", "result.txt")
By the way, do not try to write the sorted output back into the source HDF5 file.
Create a new HDF5 file instead.
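
The same chunk, sort and merge idea could be applied to HDF5 directly. The following is only a sketch under several assumptions that are not from the thread: the data is a 2-D numeric dataset called "data", the sorted runs go into a separate scratch HDF5 file, and the names external_sort, iter_rows, run_rows and block_rows are made up for this illustration. It needs numpy and h5py.

```python
import heapq

import h5py
import numpy as np


def iter_rows(dset, block_rows):
    """Yield the rows of a 2-D dataset, reading block_rows rows per disk access."""
    for start in range(0, dset.shape[0], block_rows):
        block = dset[start:start + block_rows]  # only this block is in RAM
        yield from block


def external_sort(src_path, dst_path, scratch_path, name="data",
                  key_col=1, run_rows=1_000_000, block_rows=10_000):
    """Sort dataset `name` by column key_col and write it to a new file."""
    with h5py.File(src_path, "r") as src, \
            h5py.File(scratch_path, "w") as scratch, \
            h5py.File(dst_path, "w") as dst:
        data = src[name]

        # Phase 1: sort run_rows rows at a time and store each sorted run.
        runs = []
        for i, start in enumerate(range(0, data.shape[0], run_rows)):
            chunk = data[start:start + run_rows]
            order = np.argsort(chunk[:, key_col], kind="stable")
            runs.append(scratch.create_dataset(f"run_{i}", data=chunk[order]))

        # Phase 2: merge all runs lazily and write the result block by block.
        out = dst.create_dataset(name, shape=data.shape, dtype=data.dtype)
        merged = heapq.merge(*(iter_rows(run, block_rows) for run in runs),
                             key=lambda row: row[key_col])
        buffer, pos = [], 0
        for row in merged:
            buffer.append(row)
            if len(buffer) == block_rows:
                out[pos:pos + block_rows] = np.asarray(buffer)
                pos += block_rows
                buffer = []
        if buffer:
            # rows left over after the last full block
            out[pos:pos + len(buffer)] = np.asarray(buffer)
```

A call might look like external_sort("input.h5", "sorted.h5", "scratch.h5"), where the file names are placeholders. run_rows controls how much data is sorted in memory at once, and block_rows controls how many rows are read or written per disk access. The merge loop runs row by row in Python, so for a file of this size it will probably be slow; it is meant to show the structure, not to be a tuned solution.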