Python Forum

Full Version: how many bytes in a file before zero padding
i have a bunch (2823) of large files, typically around 4 GB each, but none larger than 2**33-1 (8589934591) bytes. each file has hundreds of MB to a few GB of actual data followed by zero padding out to the end of the file.

i need to truncate all of these files to remove the zero padding without losing any non-zero data bytes. because there is over 10 TB of data, i want to avoid reading all the files sequentially. the data itself can contain runs of zeros followed by more non-zero data, so a binary search is ruled out. my current thought is to read each file backwards, one page-aligned page at a time, and scan each page for a non-zero byte, recording the position of the last non-zero byte found. i will just be recording these sizes for now and do the actual truncation at a later date (when i receive all the drives ... i just have a few samples for now).
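the backwards page-aligned scan is easy in pure Python. a minimal sketch, assuming a 4096-byte page (any power of two works) and that rstrip'ing zeros off each block is an acceptable way to find the last non-zero byte:

```python
import os

PAGE = 4096  # assumed scan granularity; a power of two keeps reads page-aligned


def data_length(path):
    """Return the offset just past the last non-zero byte in the file.

    Scans backwards one page-aligned block at a time, so a long
    zero-padded tail is skipped without reading the whole file.
    """
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        # offset of the last (possibly partial) page-aligned block
        start = (size - 1) // PAGE * PAGE if size else 0
        while start >= 0:
            f.seek(start)
            block = f.read(PAGE)
            stripped = block.rstrip(b'\x00')
            if stripped:
                # position just past the last non-zero byte in this block
                return start + len(stripped)
            start -= PAGE
    return 0  # file is empty or all zeros
```

only the padded tail plus one block of real data is actually read, so a 4 GB file with 3 GB of padding costs roughly 3 GB of backwards reads rather than a full sequential pass; the per-file cost is proportional to the padding, not the data.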

is this something that is easy to do in Python or should i just do this in C?
actually, when reading backwards, you can truncate data at the end of the file as you go without messing up your offsets, but of course you can't do that while reading from the front of the file.
i will just read each file until i know what size the data is, and construct a script for each year that runs the truncate command on each day's file (one file per day). no file is changed until well after i have determined the size each should be.
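generating such a per-year script is a few lines once the sizes are recorded. a sketch, with a hypothetical `sizes` dict standing in for the recorded measurements and made-up daily file names:

```python
# hypothetical measurements: one file per day -> recorded data length
sizes = {
    '2019-02-02.dat': 9017139,
    '2019-02-03.dat': 8675309,
}

with open('truncate-2019.sh', 'w') as script:
    script.write('#!/bin/sh\n')
    for name, length in sorted(sizes.items()):
        # each line invokes the posix truncate command on one day's file
        script.write(f'truncate -s {length} {name}\n')
```

keeping the script as plain shell means you can eyeball every command before anything is actually truncated.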
>>> from pathlib import Path
>>> home = Path('.')
>>> Noc = home / 'NocturneinC-sharpMinorFredericChopinPiano.mp4'
>>> Noc.stat().st_size
9017139
>>>
(Feb-02-2019, 10:13 AM)Larz60+ Wrote:
that's how big the file is. now run a script that generates a random number from 64 to 1048576 (not necessarily a power of 2) and appends that many zero bytes to the file. the file is now larger, but the data is still all there. the script i need to make would determine the approximate size of the data (the exact size, if the data happens to end in a non-zero byte).
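the padding step described above is a one-liner to simulate for testing. a sketch, assuming the stated range of 64 to 1048576 bytes:

```python
import os
import random


def pad_with_zeros(path):
    """Append a random run of zero bytes (64..1048576 of them) to the file."""
    n = random.randint(64, 1048576)  # inclusive on both ends, any value
    with open(path, 'ab') as f:
        f.write(b'\x00' * n)
    return n  # returned only so a test harness can check the result
```

running this against a copy of a known file gives a padded sample to test the size-detection script against.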

suppose the original file's last byte is 0x46, and the random number happens to be 5417. then 5417 zero bytes are appended, growing the file from 9017139 to 9022556 bytes. the length of the file is now 9022556, but the length to be determined is 9017139, the length of the data. the posix command truncate -s 9017139 ... applied to the padded file would restore it to its original size, but you need to know the value 9017139. if all you have is the padded file, you won't know it until you read the file contents and find that last non-zero byte. the program i need to create will determine this length for all the files but not (yet) truncate any of them.
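for the eventual truncation step, Python's os.truncate does the same job as the posix truncate command, so the whole pipeline could stay in one script. a small self-contained sketch using a throwaway temp file (15 data bytes ending in 0x46, then 5417 bytes of padding, standing in for one day's file):

```python
import os
import tempfile

# throwaway stand-in for one padded data file
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'X' * 14 + b'\x46' + b'\x00' * 5417)
    padded = f.name

# same effect as the posix command: truncate -s 15 <file>
os.truncate(padded, 15)
print(os.path.getsize(padded))  # -> 15
```

the last byte of the truncated file is the original 0x46, so no data is lost as long as the length passed to os.truncate came from the last-non-zero-byte scan.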

if any of the original files happens to end in a zero byte (0x00), or a run of them, then the original size cannot be determined exactly. but it is known that there will be a non-zero byte somewhere in the last 64 bytes: the units of this data are no larger than 64 bytes, and each carries a non-zero length prefix (i.e. the length counts itself, so it can never be zero).