Python Forum - Search for multiple unknown 3 (2) Byte combinations in a file.

Hi all

I'd like to search a file for multiple unknown hex combinations and have absolutely no idea how to start.

The hex combinations are always 3 bytes long, and the second and third bytes have different, unknown values.

What I know is that the combinations to search for start with one of the following:

8c , 8d , 8e

so 8c???? or 8d???? or 8e????

Now I want to return the ???? values only if there is also one or more 8c???? or 8d???? or 8e???? or ee???? or ce???? in that file as well.

For example, if 8c21d0 is found anywhere in that file, then search for a second, third, ... 8c21d0 and also for
ee21d0 8d21d0 8e21d0 ce21d0

Once that is found, I'd like to skip returning values if the found bytes match e.g. 8d 12 d0 or 8d 22 d0 or 8d 00 04 etc.

I think storing the skip values in a list is the best solution here.

If the second and third bytes are not in that skip list, the function should return 21d0 in that case as the first found combination, but
continue searching for other possible matches and return them as well.

What would be the best solution / function to start such a project?

Are they hex strings? If so, what is the delimiter?

(Aug-12-2023, 06:00 PM)deanhystad Wrote: [ -> ]Are they hex strings? If so, what is the delimiter?

No delimiter, raw binary files, max 1 MB in size.

Probably using the range function up to 0xff 0xff and then searching with that could be something to start with?

Q: Can the 8c/8d/8e appear anywhere, or only at multiples of 3 byte offsets?

Here's a way to convert an array of bytes into 24bit ints.

https://stackoverflow.com/questions/1208...1#34128171

Depending on the answer to my question, you might need to do this 3 times to get all possible 24 bit integers. Once you have that, look for numbers in the range 0x8C0000 to 0x8F0000. This can be done quickly using numpy. Like this:

https://stackoverflow.com/questions/4503...nge-python

And you can use numpy.unique() to get a count of each unique value.

https://numpy.org/doc/stable/reference/g...nique.html
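
A minimal sketch of those two numpy steps, a boolean range mask followed by numpy.unique() with return_counts=True. The sample values are made up for illustration and do not come from a real file.

```python
import numpy as np

# Hypothetical 24 bit values, made up for illustration.
values = np.array(
    [0x8C21D0, 0x123456, 0x8D21D0, 0x8C21D0, 0x8F0000], dtype=np.uint32
)

# Keep only values in the range 0x8C0000 <= v < 0x8F0000.
in_range = values[(values >= 0x8C0000) & (values < 0x8F0000)]

# Count how often each remaining value occurs.
uniq, counts = np.unique(in_range, return_counts=True)
for value, count in zip(uniq, counts):
    print(hex(value), count)
```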

(Aug-12-2023, 08:15 PM)deanhystad Wrote: [ -> ]Can the 8c/8d/8e appear anywhere, or only at multiples of 3 byte offsets?

It can appear anywhere and also multiple times, or not even once.

This is my first pass.

import numpy as np
from numpy.lib.stride_tricks import as_strided

def get_int24(bytes_):
    """Convert bytes to 24bit ints."""
    last_byte = bytes_.shape[0] - bytes_.shape[0] % 4
    bytes_ = bytes_[:last_byte]
    count = last_byte // 3
    int24 = as_strided(bytes_.view(np.uint32), strides=(3,), shape=(count,))
    return int24 & 0x00ffffff

# Load file and convert to 24bit ints.
# (Named bytes_ so the builtin bytes type is not shadowed.)
bytes_ = np.memmap('test.txt', dtype=np.dtype('u1'), mode='r')
int24 = get_int24(bytes_)

# Throw away values that are not in range 0x8C0000...0x8F0000
inrange = int24[(int24 >= 0x8C0000) & (int24 < 0x8F0000)]

# Get counts for each value.  Save as tuple (count, hex value)
counts = [(count, hex(value)) for value, count in zip(*np.unique((inrange), return_counts=True))]

print(sorted(counts, reverse=True)[:10])

I am still unclear about the alignment of the data in the file. This code assumes the file consists of 24bit integers, so each integer starts on a 3 byte boundary.

Bytes:  00 8C 00 8D 00 00
Offset: 0  1  2  3  4  5

My assumption is that 8C 00 8D is not a match because it does not start on a 3 byte boundary. 8D 00 00 is a match because it does.
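
To make the alignment question concrete, here is a small sketch that lists every offset where a 3 byte window starts with 8C/8D/8E, and then the subset that sits on a 3 byte boundary. It uses the six example bytes above. The name PREFIXES is only for illustration, and sliding_window_view needs NumPy 1.20 or newer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# The six example bytes: 00 8C 00 8D 00 00
data = np.frombuffer(bytes.fromhex("008C008D0000"), dtype=np.uint8)

# Every 3 byte window, one per starting offset.
windows = sliding_window_view(data, 3)

PREFIXES = [0x8C, 0x8D, 0x8E]  # illustrative name, not from the thread

# Offsets where the window's first byte is one of the prefixes.
any_offset = [i for i, w in enumerate(windows) if w[0] in PREFIXES]

# The subset of those offsets that falls on a 3 byte boundary.
aligned = [i for i in any_offset if i % 3 == 0]

print(any_offset)  # [1, 3]
print(aligned)     # [3]
```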

This solution is brittle. It only works if the file length is a multiple of both 3 and 4. The upcast from bytes to int requires 4 bytes, so if only 3 bytes remain the upcast will fail. To solve this, I think you have to make the last 3 bytes a special case. Since your file is only a megabyte, it might not be worth the hassle. Reading the bytes and using int.from_bytes() might be fast enough. Something like this:

def int24(bytes_, endian='little'):
    """Convert bytes to 24bit ints.  Return numpy array of ints."""
    return np.array([int.from_bytes(bytes_[x:x+3], endian) for x in range(0, len(bytes_), 3)])

This only takes about a second to process a 1Mbyte file.
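
One thing worth checking before relying on the range test is the byte order. A short sketch, assuming the 8c in the question is the first of the three bytes in file order: the same three bytes give very different integers depending on the endian argument.

```python
raw = bytes.fromhex("8c21d0")  # the three bytes in file order

as_little = int.from_bytes(raw, "little")
as_big = int.from_bytes(raw, "big")

print(hex(as_little))  # 0xd0218c
print(hex(as_big))     # 0x8c21d0

# Only the big-endian reading puts the first file byte in the top 8 bits,
# so only that reading lands inside the 0x8C0000..0x8F0000 range.
print(0x8C0000 <= as_big < 0x8F0000)     # True
print(0x8C0000 <= as_little < 0x8F0000)  # False
```

If the prefix byte really does come first in the file, passing endian='big' would be the choice that matches the range check.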

This is quick and robust (I think). It uses numpy reshape() and concatenate() to pad the 24 bit integers to 32 bits. Probably not as quick as using as_strided(), but only takes 0.006 seconds to process a 1Mbyte file.

import numpy as np
import sys

def int24(bytes_):
    """Convert bytes to 24bit ints.  Return numpy array of ints."""
    # How many 3 byte ints are in bytes_?
    count = bytes_.shape[0] // 3

    # Reshape bytes_ into 3 byte arrays.
    bytes_ = bytes_[:count*3].reshape((count, 3))

    # Pad with zeros to make 4 byte arrays
    if sys.byteorder == "little":
        padded = np.concatenate((bytes_, np.zeros((count, 1), dtype=np.uint8)), axis=1)
    else:
        padded = np.concatenate((np.zeros((count, 1), dtype=np.uint8), bytes_), axis=1)

    # Convert 4 byte arrays to 4 byte ints
    return np.frombuffer(padded.tobytes(), dtype=np.uint32)


# Load file and convert to 24bit ints.
bytes_ = np.fromfile('test.txt', dtype=np.uint8)
asints = int24(bytes_)

# Throw away values that are not in range 0x8C0000...0x8F0000
inrange = asints[(asints >= 0x8C0000) & (asints < 0x8F0000)]
 
# Get counts for each value.  Save as tuple (count, hex value)
counts = [(count, hex(value)) for value, count in zip(*np.unique((inrange), return_counts=True))]
 
print(sorted(counts, reverse=True)[:10])

And if the 8C/8D/8E can be anywhere in the file, at any offset, just shift the bytes_ array and resample.

# Load file and convert to 24bit ints.  Shift the
# starting point to get all 24 bit ints.
bytes_ = np.fromfile('test.txt', dtype=np.uint8)
asints = np.concatenate(
    (int24(bytes_), int24(bytes_[1:]), int24(bytes_[2:]))
)
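
Putting the pieces together, here is a rough sketch of the whole search as I read the original question. It is an interpretation, not a tested solution for your files. A 2 byte value is reported if it follows 8c/8d/8e at least once, the same 2 bytes follow one of 8c/8d/8e/ee/ce at least twice in total, and the value is not in the skip list. The names START, PARTNER, SKIP and find_suffixes are made up for this example. The results come back sorted by value rather than in the order found.

```python
import numpy as np

START = (0x8C, 0x8D, 0x8E)        # first bytes that open a candidate
PARTNER = START + (0xEE, 0xCE)    # first bytes that count as a further hit
SKIP = {0x12D0, 0x22D0, 0x0004}   # 2 byte values that are never reported


def find_suffixes(data, skip=SKIP):
    """Return the 2 byte values (as hex strings) that satisfy the rule above."""
    b = np.frombuffer(data, dtype=np.uint8)
    if b.size < 3:
        return []
    first = b[:-2]
    # Second and third byte combined in file order, one value per offset.
    suffix = (b[1:-1].astype(np.uint16) << 8) | b[2:]
    partner_hits = suffix[np.isin(first, PARTNER)]
    start_hits = suffix[np.isin(first, START)]
    # Values seen at least twice behind any of the partner prefixes.
    values, counts = np.unique(partner_hits, return_counts=True)
    repeated = values[counts >= 2]
    # Keep those that also appear behind 8c/8d/8e at least once.
    found = np.intersect1d(repeated, start_hits)
    return [f"{int(v):04x}" for v in found if int(v) not in skip]


# Made-up sample: 8c21d0 + ee21d0 (reported), 8d12d0 + ce12d0 (skipped),
# 8e3344 (only one hit, so not reported).
sample = bytes.fromhex("00 8c21d0 ff ee21d0 8d12d0 ce12d0 8e3344 00")
print(find_suffixes(sample))  # ['21d0']
```

Because the prefix and the 2 byte value are taken from separate byte slices, this works at any offset and does not depend on the byte order of the machine.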