Python Forum
Search for multiple unknown 3 (2) Byte combinations in a file.
#1
Hi all

I'd like to search a file for multiple unknown hex combinations and have absolutely no idea how to start.

The hex combinations are always 3 bytes long, and the second and third bytes will have different, unknown values.

What I know is that the combinations to search for start with one of the following:

8c, 8d, 8e

so 8c???? or 8d???? or 8e????

Now I want to return the ???? values only if there are also one or more further 8c????, 8d????, 8e????, ee???? or ce???? with the same ???? in that file as well.

For example, if 8c21d0 is found anywhere in that file, then search for a second, third... 8c21d0 and also for
ee21d0 8d21d0 8e21d0 ce21d0

Once that is found, I'd like to skip returning values if the found bytes match e.g. 8d 12 d0 or 8d 22 d0 or 8d 00 04 etc.

I think storing the skip values in a list is the best solution here.

If the second and third bytes are not in that skip list, the function should return 21d0 in that case as the first found combination, but
continue searching for other possible matches and return them as well.

What would be the best solution / function to start such a project?
Reply
#2
Are they hex strings? If so, what is the delimiter?
Reply
#3
(Aug-12-2023, 06:00 PM)deanhystad Wrote: Are they hex strings? If so, what is the delimiter?

No delimiter, raw binary files, max 1 MB in size.
Reply
#4
(Aug-12-2023, 06:06 PM)lastyle Wrote:
(Aug-12-2023, 06:00 PM)deanhystad Wrote: Are they hex strings? If so, what is the delimiter?

No delimiter, raw binary files, max 1 MB in size.

Perhaps using the range function up to 0xff 0xff and then searching with that could be something to start with?
Reply
#5
Q: Can the 8c/8d/8e appear anywhere, or only at offsets that are a multiple of 3 bytes?

Here's a way to convert an array of bytes into 24bit ints.

https://stackoverflow.com/questions/1208...1#34128171

Depending on the answer to my question, you might need to do this 3 times to get all possible 24 bit integers. Once you have that, look for numbers in the range 0x8C0000 to 0x8F0000. This can be done quickly using numpy. Like this:

https://stackoverflow.com/questions/4503...nge-python

And you can use numpy.unique() to get a count of each unique value.

https://numpy.org/doc/stable/reference/g...nique.html
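
As a rough illustration of those two steps (range filter, then counting), here is a minimal sketch. The array below is made up by hand and stands in for the 24 bit values that would come out of the conversion step; the names `values` and `in_range` are just for this example.

```python
import numpy as np

# Hypothetical 24 bit values, as they might come out of the conversion step.
values = np.array([0x8C21D0, 0x123456, 0x8D21D0, 0x8C21D0, 0x8F0000], dtype=np.uint32)

# Boolean mask: keep only values whose first byte is 8C, 8D or 8E.
in_range = values[(values >= 0x8C0000) & (values < 0x8F0000)]

# Unique values and how often each one occurs.
unique, counts = np.unique(in_range, return_counts=True)
for value, count in zip(unique, counts):
    print(f"{int(value):06x}: {int(count)}")
```

Values that show up with a count of 2 or more are the candidates for "found more than once".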
Reply
#6
(Aug-12-2023, 08:15 PM)deanhystad Wrote: Can the 8c/8d/8e appear anywhere, or only at offsets that are a multiple of 3 bytes?

Can appear anywhere and also Multiple Times, or not even once.
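
To make "anywhere" concrete, here is a small pure-Python sketch that slides over the data one byte at a time and counts every 3 byte sequence that starts with one of the known bytes. The function name `count_triplets` and the sample bytes are invented for this example; it is only meant to show the idea, not to be fast.

```python
from collections import Counter

def count_triplets(data, first_bytes=(0x8C, 0x8D, 0x8E)):
    """Count every 3 byte sequence whose first byte is in first_bytes, at any offset."""
    counts = Counter()
    # Stop 2 bytes before the end so every slice is a full 3 bytes.
    for offset in range(len(data) - 2):
        if data[offset] in first_bytes:
            counts[bytes(data[offset:offset + 3])] += 1
    return counts

# Made-up sample: 8c21d0 at offsets 1 and 8, 8d21d0 at offset 5.
sample = bytes.fromhex("008c21d0ff8d21d08c21d0")
for triplet, count in count_triplets(sample).items():
    print(triplet.hex(), count)
```

For a real file, `data` could be the result of `open(path, 'rb').read()`.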
Reply
#7
This is my first pass.
import numpy as np
from numpy.lib.stride_tricks import as_strided

def get_int24(bytes_):
    """Convert bytes to 24bit ints.  The first byte of each group is the most significant."""
    # Each value is read as 4 bytes, so the last group must start at
    # least 4 bytes before the end of the buffer.
    count = (bytes_.shape[0] - 1) // 3
    if count < 1:
        return np.empty(0, dtype=np.uint32)
    # Big-endian 32 bit view that advances 3 bytes per element.
    int32 = as_strided(bytes_[:4].view('>u4'), strides=(3,), shape=(count,))
    # Drop the 4th byte, keeping the 24 bit value (8C 21 D0 -> 0x8C21D0).
    return int32 >> 8

# Load file and convert to 24bit ints.
data = np.memmap('test.txt', dtype=np.dtype('u1'), mode='r')
int24 = get_int24(data)

# Throw away values that are not in range 0x8C0000...0x8F0000
inrange = int24[(int24 >= 0x8C0000) & (int24 < 0x8F0000)]

# Get counts for each value.  Save as tuple (count, hex value)
counts = [(count, hex(value)) for value, count in zip(*np.unique(inrange, return_counts=True))]

print(sorted(counts, reverse=True)[:10])
I am still unclear about the alignment of the data in the file. This code assumes the file consists of 24bit integers, so each integer starts on a 3 byte boundary.
Output:
Bytes:  00 8C 00 8D 00 00
Offset:  0  1  2  3  4  5
My assumption is that 8C 00 8D is not a match because it does not start on a 3 byte boundary. 8D 00 00 is a match because it does.

This solution is brittle. The upcast from bytes to int reads 4 bytes at a time, so a 3 byte group at the very end of the file cannot be converted this way because there is no 4th byte to read. To solve this, I think you have to make the last 3 bytes a special case. Since your file is only a megabyte, it might not be worth the hassle. Reading the bytes and using int.from_bytes() might be fast enough. Something like this:
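def int24(bytes_, endian='big'):
    """Convert bytes to 24bit ints.  Return numpy array of ints."""
    # 'big' makes the first byte of each group the most significant, so 8C 21 D0 reads as 0x8C21D0.
    # Stop at len - 2 so that a trailing partial group is ignored.
    return np.array([int.from_bytes(bytes_[x:x+3], endian) for x in range(0, len(bytes_) - 2, 3)])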
This only takes about a second to process a 1Mbyte file.
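
One thing worth checking with int.from_bytes() is the byte order, because it decides whether the first byte in the file ends up as the high or the low byte of the integer. A tiny sketch using the 8c21d0 example from the first post:

```python
triplet = bytes.fromhex("8c21d0")

# 'big': first byte is the most significant.
as_big = int.from_bytes(triplet, "big")
# 'little': first byte is the least significant.
as_little = int.from_bytes(triplet, "little")

print(hex(as_big), hex(as_little))
```

Only the big-endian value falls inside the 0x8C0000 to 0x8F0000 range, so that is the order to use when the 8c/8d/8e is the first byte in the file.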
Reply
#8
This is quick and robust (I think). It uses numpy reshape() and concatenate() to pad the 24 bit integers to 32 bits. Probably not as quick as using as_strided(), but only takes 0.006 seconds to process a 1Mbyte file.
import numpy as np

def int24(bytes_):
    """Convert bytes to 24bit ints.  Return numpy array of ints."""
    # How many 3 byte ints are in bytes_?
    count = bytes_.shape[0] // 3

    # Reshape bytes_ into 3 byte arrays.
    bytes_ = bytes_[:count*3].reshape((count, 3))

    # Pad with a leading zero byte to make 4 byte arrays.
    padded = np.concatenate((np.zeros((count, 1), dtype=np.uint8), bytes_), axis=1)

    # Read each 4 byte array as a big-endian int, so the first byte in the
    # file is the most significant (8C 21 D0 -> 0x8C21D0) on any platform.
    return np.frombuffer(padded.tobytes(), dtype='>u4')


# Load file and convert to 24bit ints.
bytes_ = np.fromfile('test.txt', dtype=np.uint8)
asints = int24(bytes_)

# Throw away values that are not in range 0x8C0000...0x8F0000
inrange = asints[(asints >= 0x8C0000) & (asints < 0x8F0000)]

# Get counts for each value.  Save as tuple (count, hex value)
counts = [(count, hex(value)) for value, count in zip(*np.unique(inrange, return_counts=True))]

print(sorted(counts, reverse=True)[:10])
And if the 8C/8D/8E can be anywhere in the file, at any offset, just shift the bytes_ array and resample.
# Load file and convert to 24bit ints.  Shift the
# starting point to get all 24 bit ints.
bytes_ = np.fromfile('test.txt', dtype=np.uint8)
asints = np.concatenate(
    (int24(bytes_), int24(bytes_[1:]), int24(bytes_[2:]))
)
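
Putting the pieces from the first post together, a possible sketch of the whole search could look like the following. It rests on a few assumptions of mine: a 2 byte suffix is reported when it follows 8c/8d/8e and occurs at least twice in total after any of 8c/8d/8e/ee/ce, the skip list holds 2 byte suffixes, and results come back in order of first appearance. The names `START`, `RELATED`, `SKIP` and `find_suffixes` are invented for this example.

```python
import numpy as np

START = (0x8C, 0x8D, 0x8E)                # first bytes that start a search
RELATED = (0x8C, 0x8D, 0x8E, 0xEE, 0xCE)  # first bytes that count as a further hit
SKIP = {0x12D0, 0x22D0, 0x0004}           # suffixes to ignore (examples from post #1)

def find_suffixes(data):
    """Return 2 byte suffixes that follow a START byte and occur at least
    twice after any RELATED byte, in order of first appearance."""
    data = np.asarray(data, dtype=np.uint8)
    if data.size < 3:
        return []
    first = data[:-2]
    # 16 bit value built from the two bytes after each position (high byte first).
    suffix = (data[1:-1].astype(np.uint16) << 8) | data[2:]
    # Count every suffix that follows a RELATED byte, at any offset.
    values, counts = np.unique(suffix[np.isin(first, RELATED)], return_counts=True)
    repeated = set(values[counts >= 2].tolist())
    found = []
    for value in suffix[np.isin(first, START)].tolist():
        if value in repeated and value not in SKIP and value not in found:
            found.append(value)
    return found

# Made-up sample: 8c21d0 and ee21d0 share a suffix, 12d0 is on the skip
# list, and 3344 only occurs once.
sample = np.frombuffer(bytes.fromhex("8c21d000ee21d08d12d08d12d08e3344"), dtype=np.uint8)
print([f"{v:04x}" for v in find_suffixes(sample)])
```

For a real file, `np.fromfile('test.txt', dtype=np.uint8)` could be passed in instead of `sample`. Because every offset is checked directly, no shifting and resampling is needed here.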
Reply

