Feb-02-2019, 02:07 AM
I have a bunch (2823) of large files, typically around 4 GB each, but none larger than 2**33-1 (8589934591) bytes. These files typically hold hundreds of MB to a few GB of actual data followed by zero padding out to the end of the file.
I need to truncate all of these files to remove the zero padding without losing any non-zero data. Because there is over 10 TB of data, I want to avoid reading the files sequentially from the start. But the data itself can contain runs of zeros followed by more non-zero data, so a binary search is ruled out. My current thought is to read each file backwards, one page-aligned page at a time, and scan each page for non-zero bytes to find the position of the last non-zero byte. I will just be recording these sizes for now and do the actual truncation at a later date (when I receive all the drives ... I just have a few samples for now).
Is this something that is easy to do in Python, or should I just do it in C?
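Something like this sketch is what I have in mind in Python (the page size constant and function name are just placeholders of mine, untested against the real files):

    import os

    PAGE = 4096  # assumed page size; mmap.PAGESIZE would give the real one

    def last_nonzero_offset(path):
        """Scan a file backwards one page-aligned page at a time and return
        the offset just past the last non-zero byte (the size to keep)."""
        size = os.path.getsize(path)
        with open(path, 'rb') as f:
            # start at the last page-aligned boundary below the end of file
            pos = (size - 1) // PAGE * PAGE if size else 0
            while pos >= 0:
                f.seek(pos)
                page = f.read(PAGE)
                # strip trailing zero bytes; anything left ends the scan
                stripped = page.rstrip(b'\x00')
                if stripped:
                    return pos + len(stripped)
                pos -= PAGE
        return 0  # file is empty or entirely zeros

The later truncation pass could then just call os.truncate(path, new_size) with the recorded sizes.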