Feb-02-2019, 02:07 AM
I have a bunch (2823) of large files, typically around 4 GB each and none larger than 2**33-1 (8589934591) bytes. These files typically hold hundreds of MB to a few GB of actual data followed by zero padding out to the end of the file.
I need to truncate all of these files to remove the zero padding without losing any non-zero data. Because there is over 10 TB of data, I want to avoid reading the files sequentially from the start. The data can still contain runs of zeros followed by more non-zero data, so a binary search is ruled out. My current thought is to read each file backwards, one page-sized, page-aligned block at a time, and scan each block for non-zero bytes to record the position of the last non-zero byte. I will just be recording these sizes for now and do the actual truncation at a later date (when I receive all the drives ... I just have a few samples for now).
Is this something that is easy to do in Python, or should I just do it in C?
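The backwards page-by-page scan described above is easy enough in pure Python. Here is a minimal sketch (the 4 KiB page size and the function name are my own assumptions, not from the original post):

```python
import os

PAGE = 4096  # assumption: scan in 4 KiB page-sized, page-aligned chunks

def last_nonzero_offset(path):
    """Return the offset just past the last non-zero byte in the file,
    i.e. the size the file should be truncated to (0 if all zeros)."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # start at the last page-aligned boundary at or below EOF
        pos = (max(size - 1, 0) // PAGE) * PAGE
        while pos >= 0:
            f.seek(pos)
            chunk = f.read(min(PAGE, size - pos))
            stripped = chunk.rstrip(b"\x00")
            if stripped:  # this page contains a non-zero byte
                return pos + len(stripped)
            pos -= PAGE  # page was all zeros; step back one page
    return 0
```

Once all the sizes have been recorded, the later truncation step can be done with `os.truncate(path, new_size)` (available since Python 3.3).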
Tradition is peer pressure from dead people
What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.