 how many bytes in a file before zero padding
#1
i have a bunch (2823) of large files, typically around 4 GB each and none larger than 2**33-1 (8589934591) bytes. each file typically holds hundreds of MB to a few GB of actual data, followed by zero padding out to the end of the file.

i need to truncate all of these files to remove the zero padding without losing any of the non-zero data. because there is over 10 TB of data, i want to avoid reading every file sequentially from the start. the data itself can contain runs of zeros followed by more non-zero data, so a binary search is ruled out. my current thought is to read each file backwards, one page-aligned page at a time, and scan each page for non-zero bytes to find the position of the last non-zero byte. for now i will only record these sizes and do the actual truncation at a later date (when i receive all the drives ... i just have a few samples for now).

is this something that is easy to do in Python or should i just do this in C?
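
a minimal sketch of the backward scan described above, in Python. the function name data_length and the block_size parameter are made up for illustration. it assumes a regular, seekable file and reads aligned blocks from the end until it finds one that is not all zeros.

```python
import os

def data_length(path, block_size=4096):
    """Return the offset just past the last non-zero byte (0 if none)."""
    with open(path, "rb") as f:
        size = f.seek(0, os.SEEK_END)
        # Start of the last block: the highest aligned offset below the size.
        pos = (size - 1) // block_size * block_size if size else 0
        end = size
        while end > 0:
            f.seek(pos)
            block = f.read(end - pos)
            # Strip trailing zeros; anything left ends at the last non-zero byte.
            stripped = block.rstrip(b"\x00")
            if stripped:
                return pos + len(stripped)
            # This block was all zeros, so step back one aligned block.
            end = pos
            pos -= block_size
    return 0
```

only the padding and the block holding the tail of the data get read, so the cost depends on the amount of padding, not the file size. a larger block_size (say 1 MiB) would probably cut per-call overhead in Python, but that is a guess to be measured on the sample files.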
What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
actually, when reading backwards you can delete data at the end of the file without messing up indexes, but of course you can't do that when reading from the front of the file.
#3
i will just read each file until i know what size it should be, then construct a script for each year that runs the truncate command on each day's file (one file per day). each file stays unchanged until well after i have determined its size.
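
a rough sketch of generating such a script from the recorded sizes. the function name truncate_script and the mapping of path to size are assumptions for illustration, and so is any file naming. shlex.quote keeps odd file names safe in the shell.

```python
import shlex

def truncate_script(sizes):
    """Build shell script text with one truncate command per recorded file.

    `sizes` maps a file path to the data length recorded for it.
    """
    lines = ["#!/bin/sh"]
    for path, size in sorted(sizes.items()):
        # Quote the path so spaces or shell metacharacters survive intact.
        lines.append(f"truncate -s {size} {shlex.quote(path)}")
    return "\n".join(lines) + "\n"
```

writing the result to one file per year gives scripts that can be reviewed before anything is actually truncated.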
#4
>>> from pathlib import Path
>>> home = Path('.')
>>> Noc = home / 'NocturneinC-sharpMinorFredericChopinPiano.mp4'
>>> Noc.stat().st_size
9017139
>>>
#5
(Feb-02-2019, 10:13 AM)Larz60+ Wrote:
>>> from pathlib import Path
>>> home = Path('.')
>>> Noc = home / 'NocturneinC-sharpMinorFredericChopinPiano.mp4'
>>> Noc.stat().st_size
9017139
>>>
that's how big the file is. now run a script that generates a random number from 64 to 1048576 (not necessarily a power of 2) and appends that many zero bytes to the file. the file is now larger but the data is still all there. the script i need to make would determine the approximate size of the data (the exact size if the data happens to end in a non-zero byte).
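
the padding step could look something like this sketch. the name append_zero_padding and its parameters are hypothetical; the defaults are the 64 and 1048576 bounds mentioned above.

```python
import random

def append_zero_padding(path, low=64, high=1048576, rng=random):
    """Append a random number of zero bytes to `path`; return the count."""
    count = rng.randint(low, high)   # inclusive on both ends
    with open(path, "ab") as f:      # append mode leaves existing data alone
        f.write(bytes(count))        # bytes(n) is n zero bytes
    return count
```

the return value is only there so a test can check the detected data size against the known original size.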

suppose the original file's last byte is 0x46 and the random number happens to be 5417. 5417 zero bytes are appended, changing the size reported above to 9022556. the length of the file is now 9022556 but the length to be determined is 9017139, the length of the data. the truncate command (truncate -s 9017139 ..., available in GNU coreutils) applied to the padded file would restore it to its original size, but you need to know the value 9017139. if all you get is the file with appended zero bytes, you won't know it until you read the file contents and look for that last non-zero byte. the program i need to create is going to determine this for all the files but not (yet) truncate any of them.
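
the same round trip on a small stand-in file, as a sketch. this one reads the whole file and strips trailing zeros, which is fine for checking a tiny sample but is not the backward scan wanted for the multi-GB files. the 1001-byte stand-in data and the file name are made up; only the 0x46 last byte and the pad length 5417 come from the example above.

```python
import os
import tempfile

original = b"\x01" * 1000 + b"\x46"   # stand-in data ending in a non-zero byte
padding = 5417                         # pad length from the example above

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    with open(path, "wb") as f:
        f.write(original + bytes(padding))
    padded_size = os.path.getsize(path)

    # Whole-file read: acceptable only because this sample is tiny.
    with open(path, "rb") as f:
        data_size = len(f.read().rstrip(b"\x00"))

    # Same effect as `truncate -s data_size` on this file.
    os.truncate(path, data_size)
    restored_size = os.path.getsize(path)

print(padded_size, data_size, restored_size)
```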

if an original file happens to end in a zero byte (0x00) or a run of them, then its exact original size cannot be determined this way. but it is known that there will be a non-zero byte in the last 64 bytes: the units of this data are no larger than 64 bytes and each has a non-zero length prefix (i.e. the length counts the prefix itself), so the size found is short by fewer than 64 bytes.