Python Forum
How to read a file as binary or hex "string" so that I can do regex search?
#1
For example, if I open a file in a hex editor, it shows:
3C 47 23 56 59 59 66 38 38 5C 67 3A 5A 70 29 32 47 25 28 63 5D 33 32 39 69 30 58 53 2B 4B 44 61 45 27 7A 3F 21 64 76 5B 54 5D 28 46 75 7A 52 5F

How can I get this as a string?

Should I do the following?
---------------------------------------------
hfile1 = open("example1.txt", "rb")
bfile1 = hfile1.read()
----------------------------------------------

But if I use the code above and then run a regex search with the code below:
----------------------------------------------
afile1 = re.findall('[0-9A-Fa-f]{2}', bfile1)
----------------------------------------------

I get this error:
TypeError: cannot use a string pattern on a bytes-like object

If I then change the pattern to a bytes pattern:
----------------------------------------------
afile1 = re.findall(b'[0-9A-Fa-f]{2}', bfile1)
----------------------------------------------

the regex is applied to the text part (Decoded Text) shown in the hex editor, not to the hex codes:
<G#VYYf88\g:Zp)2G%(c]329i0XS+KDaE'z?!dv[T](FuzR_

So the result is not what I expected.

Thank you
#2
import binascii

# the hex dump as bytes, with the spaces removed
hex_bytes = b"3C 47 23 56 59 59 66 38 38 5C 67 3A 5A 70 29 32 47 25 28 63 5D 33 32 39 69 30 58 53 2B 4B 44 61 45 27 7A 3F 21 64 76 5B 54 5D 28 46 75 7A 52 5F".replace(b" ", b"")

# convert the hex digits to raw bytes, then decode them to str (UTF-8 by default)
raw_bytes = binascii.unhexlify(hex_bytes)

text = raw_bytes.decode()
print(text)
Output:
<G#VYYf88\g:Zp)2G%(c]329i0XS+KDaE'z?!dv[T](FuzR_
I think you should not use regex for this task. The error from re.findall is caused by mismatched data types. If you run a regex on bytes, the pattern must be bytes. If you have a str instead, the pattern must be a str.
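A minimal sketch of that rule, using a short made-up sample (the names `data` and `text` are only for illustration): a bytes pattern works on bytes, a str pattern works on str, and mixing the two raises the TypeError from post #1.

```python
import re

data = b"<G#VYYf88"          # bytes, as returned by open(..., "rb").read()
text = data.decode("ascii")  # the same content as a str

# bytes pattern on bytes: works, and the matches are bytes
bytes_hits = re.findall(rb"[0-9A-Fa-f]{2}", data)

# str pattern on str: works, and the matches are str
str_hits = re.findall(r"[0-9A-Fa-f]{2}", text)

# str pattern on bytes: raises TypeError
try:
    re.findall(r"[0-9A-Fa-f]{2}", data)
    mixed_error = None
except TypeError as exc:
    mixed_error = str(exc)

print(bytes_hits, str_hits, mixed_error)
```

Note that both working calls match the characters "f8" inside the content itself, not the hex codes of the bytes, which is why the result in post #1 looked wrong.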

I guess it's better to understand the file format you want to read. Regex often causes new problems instead of solving the original one. Sometimes parsing with regex is just difficult, and in special cases like HTML it is impossible in general, because HTML is not a regular language.
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
#3
(In reply to DeaD_EyE, Dec-19-2024, 08:13 AM)

Then how can I do this:
I want to compare two files byte by byte (every two hex digits) and write all the hex codes that differ between the two files to a text file.
#4
Do not confuse the file's internal binary content with the hex view displayed by a hex editor.
The file's raw data is binary, and the hex editor just helps you read it.
You must convert that binary data into a hex string in Python before you can apply a [0-9A-Fa-f]{2} regex to it.
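import re  # for searching hex_string afterwards

# read the raw bytes of the file
with open("example1.txt", "rb") as hfile1:
    bfile1 = hfile1.read()

# compact lowercase hex digits, e.g. '3c4723...'
hex_string = bfile1.hex()
# uppercase, space-separated pairs, as a hex editor shows them
hex_string_space = ' '.join(f"{byte:02X}" for byte in bfile1)
print(hex_string_space)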
Output:
3C 47 23 56 59 59 66 38 38 5C 67 3A 5A 70 29 32 47 25 28 63 5D 33 32 39 69 30 58 53 2B 4B 44 61 45 27 7A 3F 21 64 76 5B 54 5D 28 46 75 7A 52 5F
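Once the data is a hex string, the regex from post #1 behaves as intended. A small sketch, assuming a short sample in place of the real file content (the sample bytes are the first 13 bytes of the example; `pairs` is an illustrative name):

```python
import re

# Sample bytes standing in for what open("example1.txt", "rb").read() returns.
bfile1 = b"<G#VYYf88\\g:Z"

# Hex digits of every byte, uppercased to match the hex editor display.
hex_string = bfile1.hex().upper()

# The hex string has exactly two digits per byte, so the matches line up with bytes.
pairs = re.findall(r"[0-9A-F]{2}", hex_string)
print(pairs)
```

The same list can be built without regex as `[f"{byte:02X}" for byte in bfile1]`, which is probably simpler if the goal is only to split the data into byte values.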
If you look at the raw data and wonder how it is converted:
>>> bfile1
b"<G#VYYf88\\g:Zp)2G%(c]329i0XS+KDaE'z?!dv[T](FuzR_"
Output:
< = 3C, G = 47, # = 23, ... etc.
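For the follow-up question in #3 (comparing two files and writing the differing hex codes to a text file), one possible approach is to compare the raw bytes offset by offset and only format them as hex for the report. This is a sketch under assumptions: the function name `diff_hex`, the file names, the report layout, and the "--" marker for a file that ended early are all made up for illustration, and "different" is taken to mean "different at the same offset".

```python
from itertools import zip_longest
from pathlib import Path
import tempfile


def diff_hex(data_a: bytes, data_b: bytes) -> list[str]:
    """Return one report line per offset where the two byte strings differ."""
    lines = []
    for offset, (a, b) in enumerate(zip_longest(data_a, data_b)):
        if a != b:
            # None means one file ended before the other; show it as "--".
            hex_a = "--" if a is None else f"{a:02X}"
            hex_b = "--" if b is None else f"{b:02X}"
            lines.append(f"{offset:08X}: {hex_a} {hex_b}")
    return lines


# Demo with two small temporary files (hypothetical names and content).
with tempfile.TemporaryDirectory() as tmp:
    file1 = Path(tmp) / "example1.txt"
    file2 = Path(tmp) / "example2.txt"
    file1.write_bytes(b"<G#VYY")
    file2.write_bytes(b"<G#XYYZ")

    report = diff_hex(file1.read_bytes(), file2.read_bytes())

    # Write the differences to a text file, one offset per line.
    out_path = Path(tmp) / "diff.txt"
    out_path.write_text("\n".join(report) + "\n")
    written = out_path.read_text()

print(written)
```

Each line holds the offset, then the byte from the first file, then the byte from the second. No regex is needed here, since iterating over a bytes object already yields one integer per byte.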