Aug-13-2023, 02:04 PM
(This post was last modified: Aug-13-2023, 02:04 PM by deanhystad.)
This is my first pass.
```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def get_int24(bytes_):
    """Convert bytes to 24bit ints."""
    # Trim to a multiple of 4 so the uint32 view is valid.
    last_byte = bytes_.shape[0] - bytes_.shape[0] % 4
    bytes_ = bytes_[:last_byte]
    count = last_byte // 3
    # Overlapping 4-byte windows that step 3 bytes at a time.
    # Note: as_strided does no bounds checking on these windows.
    int24 = as_strided(bytes_.view(np.uint32), strides=(3,), shape=(count,))
    # Keep the low 24 bits (the first 3 bytes on a little-endian machine).
    return int24 & 0x00ffffff


# Load file and convert to 24bit ints.
data = np.memmap('test.txt', dtype=np.dtype('u1'), mode='r')
int24 = get_int24(data)

# Throw away values that are not in range 0x8A0000...0x8F0000
inrange = int24[(int24 >= 0x8A0000) & (int24 < 0x8F0000)]

# Get counts for each value.  Save as tuple (count, hex value)
counts = [(count, hex(value)) for value, count in zip(*np.unique(inrange, return_counts=True))]
print(sorted(counts, reverse=True)[:10])
```
I am still unclear about the alignment of the data in the file. This code assumes the file consists of 24-bit integers, so each integer starts on a 3-byte boundary.
Bytes:  00 8C 00 8D 00 00
Offset:  0  1  2  3  4  5
My assumption is that 8C 00 8D is not a match because it does not start on a 3-byte boundary. 8D 00 00 is a match because it does.

This solution is brittle. It only works if the file length is a multiple of 3 and 4. The upcast from bytes to int requires 4 bytes. If there are only 3 bytes remaining, the upcast will fail. To solve this, I think you have to make the last 3 bytes a special case. Since your file is only a megabyte, it might not be worth the hassle. Reading the bytes and using int.from_bytes() might be fast enough. Something like this:
```python
import numpy as np


def int24(bytes_, endian='little'):
    """Convert bytes to 24bit ints.  Return numpy array of ints."""
    # Decode each consecutive 3-byte group with int.from_bytes().
    return np.array([int.from_bytes(bytes_[x:x+3], endian) for x in range(0, len(bytes_), 3)])
```
This only takes about a second to process a 1 MB file.
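If the trailing bytes turn out to matter, a possible middle ground is to reshape the byte array into rows of 3 and combine the columns with shifts. This is only a sketch: `int24_reshape` is a hypothetical helper, not something from the code above, and it simply ignores any leftover bytes that do not fill a whole 3-byte group. The sample uses big-endian order purely so that 8D 00 00 reads as 0x8D0000. That is my assumption, since the byte order of the real file is not stated.

```python
import numpy as np


def int24_reshape(data, byteorder="little"):
    """Decode 3-byte-aligned 24-bit unsigned ints from a uint8 array.

    Hypothetical helper for illustration. Trailing bytes that do not
    fill a whole 3-byte group are ignored.
    """
    data = np.asarray(data, dtype=np.uint8)
    usable = data.shape[0] - data.shape[0] % 3
    # Widen to uint32 first so the shifts below cannot overflow uint8.
    triples = data[:usable].reshape(-1, 3).astype(np.uint32)
    if byteorder == "little":
        lo, mid, hi = triples[:, 0], triples[:, 1], triples[:, 2]
    elif byteorder == "big":
        hi, mid, lo = triples[:, 0], triples[:, 1], triples[:, 2]
    else:
        raise ValueError("byteorder must be 'little' or 'big'")
    return lo | (mid << 8) | (hi << 16)


# The six example bytes from above, plus one stray trailing byte.
sample = np.frombuffer(bytes.fromhex("008C008D0000FF"), dtype=np.uint8)
values = int24_reshape(sample, byteorder="big")
print([hex(v) for v in values])  # ['0x8c00', '0x8d0000']
```

Only the second value falls in the 0x8A0000...0x8F0000 range, and it starts at offset 3, which is a 3-byte boundary. The `astype` call copies the data at four times its size, which should be fine for a file of about a megabyte.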