Python Forum
read a binary file to find its type
#1

Hi. I have a binary file (A) and I need to find its type. Up to now I have:
with open("A", "rb") as bf:
    data = bf.read()

# Print the first 10 bytes in binary
for d in data[0:10]:
    print(bin(d), ' ', end='')

bdata = bytearray(data)
print(bdata[2])
But I cannot work out how to continue. The first 3 bytes give me the value, but how can I get them? Once I have them, I might need to convert from hexadecimal to integer and print both.
Finally, how do I check the value, find the file type and print it out, please?
#2
Guessing a file type from its content is often more complicated than it looks. Many file formats do have a magic number at the beginning. If you want to get the first 3 bytes, then just read the first 3 bytes:

with open('YourFile.ext', 'rb') as fd:
    file_head = fd.read(3)
file_head now holds the first 3 bytes. You can build a predefined table for comparison.

If you just want to use a library/wrapper for it, look at these modules:
file-magic # official
filemagic

More about Magic numbers.

There are different ways to detect whether a given substring (bytes) is in a string.
To detect PNG, for example, you need 8 bytes.

magic_numbers = {'png': bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])}
max_read_size = max(len(m) for m in magic_numbers.values()) # get max size of magic numbers of the dict

with open('YourFile.png', 'rb') as fd:
    file_head = fd.read(max_read_size)

if file_head.startswith(magic_numbers['png']):
    print("It's a PNG File")
else:
    print("It's not a png file")
It's just an example of how you can do it.
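The single-entry table above extends to several formats without changing the approach. A rough sketch only: the names SIGNATURES, detect_type and detect_file are made up for illustration, and the table is nowhere near complete.

```python
# Hypothetical lookup table: format name -> magic number (leading bytes).
SIGNATURES = {
    'png': bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]),
    'jpeg': bytes([0xFF, 0xD8, 0xFF]),
    'gif': b'GIF8',
    'bmp': b'BM',
}
# Read only as many bytes as the longest signature needs.
MAX_READ_SIZE = max(len(sig) for sig in SIGNATURES.values())


def detect_type(file_head):
    """Return the first format whose signature starts file_head, else None."""
    for name, sig in SIGNATURES.items():
        if file_head.startswith(sig):
            return name
    return None


def detect_file(path):
    """Read the head of the file at path and look it up in the table."""
    with open(path, 'rb') as fd:
        return detect_type(fd.read(MAX_READ_SIZE))
```

Splitting the lookup (detect_type) from the file access (detect_file) makes it possible to try the table on plain bytes first, e.g. detect_type(b'GIF89a').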
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
#3
I agree with @DeaD_EyE in that the "Magic" number will probably get you the most 'positive' results. Some things to remember, though: the Magic number is not a fixed length. Some may be 3 bytes, others 8, for example. Some files have no Magic number; a good example is the "Plain Text" file (think Windows Notepad). Our beloved Python files (.py) have no Magic number, as they are, after all, plain text files. You could get some positive results if you tested for "#! /", but of course that only works if the author included the shebang line as the first line of the file.
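The shebang test could look roughly like this. It is only a sketch: has_shebang is a hypothetical helper, and it checks just the two bytes "#!" so that both "#!/..." and "#! /..." match. A False result tells you nothing, since plenty of scripts have no shebang line.

```python
def has_shebang(path):
    """Return True if the file at path starts with the two bytes '#!'."""
    with open(path, 'rb') as fd:
        return fd.read(2) == b'#!'
```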

Quote:... Finally, how do i check the value and find its type and print it out, please?

That depends on whether you are looking for one specific file type, a group of similar file types or all file types. For a specific type, you could just assign the number to a variable; for a group, you would probably want a dictionary; for all types, you would probably be best served by a database.
If it ain't broke, I just haven't gotten to it yet.
OS: Windows 10, openSuse 42.3, freeBSD 11, Raspian "Stretch"
Python 3.6.5, IDE: PyCharm 2018 Community Edition
#4
Thanks a lot for the replies. I am looking for jpeg files and the values are:
FF, D8, FF in hex and 255, 216, 255 in integers, taken from https://en.wikipedia.org/wiki/List_of_file_signatures
Now I have
b'\xff\xd8\xff'
How do I show one byte on each line, with the integer conversion next to the hex value, please?
#5
It should be enough to test for b'\xff\xd8\xff'; if it matches, then it's a jpg.
Start bytes:
$JPEG = "\xFF\xD8\xFF"
$GIF  = "GIF"
$PNG  = "\x89\x50\x4e\x47\x0d\x0a\x1a\x0a"
$BMP  = "BM"
$PSD  = "8BPS"
$SWF  = "FWS"
with open('some.gif', 'rb') as fd:
    file_head = fd.read(3)
print(file_head)
Output:
b'GIF'
atux_nul Wrote:and 255, 216, 255 in integers
>>> s
b'\xff\xd8\xff'
>>> [i for i in s]
[255, 216, 255]
You can see on the wiki page that it decodes to ÿØÿÛ using ISO 8859-1.
>>> s
b'\xff\xd8\xff'
>>> s.decode('iso-8859-1')
'ÿØÿ'
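For the "one per line, hex next to integer" part of the question, a small sketch (the two-digit uppercase hex format is just one possible choice):

```python
s = b'\xff\xd8\xff'

# Iterating over bytes yields ints, so each value can be formatted twice:
# once as two-digit uppercase hex, once as decimal.
lines = [f'{b:02X}  {b}' for b in s]
print('\n'.join(lines))
```

This prints FF 255, D8 216 and FF 255, each pair on its own line.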
#6
Hi. I think I got something wrong somewhere, so I will start over. I've been given a file named whatever, and it does not have any file extension. My task is to write a program that reads its first bytes and checks whether it is a jpeg or not. It should also print the values of the first 3 bytes in hex and in int format.
Up to now I have
Anumbers = {'jpeg': bytes([0xFF, 0xD8, 0xFF])}
max_read_size = max(len(m) for m in Anumbers.values())

# Read just enough bytes to cover the longest signature
with open('whatever', 'rb') as fd:
    file_head = fd.read(max_read_size)

# A match on the signature means the file is a JPEG
if file_head.startswith(Anumbers['jpeg']):
    print("It's a JPEG file")
else:
    print("It's not a JPEG file")
Also, I've been told that this file is probably not a jpeg, but I am not sure. Some help please?
#7
Here is a test; with this info it should be easy to write a check for whether it's a jpg.
with open('Dreamer.jpg', 'rb') as fd:
    file_head = fd.read(3)
    integer_lst = [i for i in file_head]
    hex_lst = [hex(i) for i in integer_lst]

print(file_head)
print(integer_lst)
print(hex_lst)
Output:
b'\xff\xd8\xff'
[255, 216, 255]
['0xff', '0xd8', '0xff']
Gif test:
with open('1.gif', 'rb') as fd:
    file_head = fd.read(3)
    integer_lst = [i for i in file_head]
    hex_lst = [hex(i) for i in integer_lst]

print(file_head)
print(integer_lst)
print(hex_lst)
Output:
b'GIF'
[71, 73, 70]
['0x47', '0x49', '0x46']
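Putting the two tests above together, the check for the extensionless file might be sketched like this. The names JPEG_SIGNATURE and report_jpeg are made up for illustration, and the function only looks at the first 3 bytes, so it says nothing about whether the rest of the file is a valid image.

```python
JPEG_SIGNATURE = b'\xff\xd8\xff'


def report_jpeg(path):
    """Print the first 3 bytes as hex and int, then return True for a JPEG."""
    with open(path, 'rb') as fd:
        file_head = fd.read(len(JPEG_SIGNATURE))
    # Each item of a bytes object is an int in the range 0..255.
    for value in file_head:
        print(hex(value), value)
    is_jpeg = file_head == JPEG_SIGNATURE
    print("It's a JPEG file" if is_jpeg else "It's not a JPEG file")
    return is_jpeg
```

Calling report_jpeg('whatever') would then print one hex/int pair per line, followed by the verdict.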
#8
An alternate way to get a list with different bases/representations:

signature = b'\xff\xd8\xff'
list(signature) # list of ints
list(map(hex, signature)) # list of hex values as str in their representation
list(map(oct, signature)) # list of oct values as str in their representation
list(map(bin, signature)) # list of bin values as str in their representation
To get the bytes as one hex value, you can wrap your head around it and work out how to do it mathematically:
hex(sum(c << (8*n) for n, c in enumerate(reversed(signature))))

Or you don't wrap your head around it and just use binascii.hexlify(signature).
Note that hexlify returns the hex digits as bytes without a prefix, not a str in its representation.
All values that can be written as literals in code have a representation.
You have the following prefixes:
  • 0x for hex
  • 0o for oct
  • 0b for bin
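For comparison, here is a short sketch of the built-in alternatives to the shift-and-sum expression: int.from_bytes for the integer, bytes.hex for the bare digits. The variable name as_int is only for illustration.

```python
import binascii

signature = b'\xff\xd8\xff'

# Big-endian: the first byte is the most significant one.
as_int = int.from_bytes(signature, 'big')

print(hex(as_int))                  # str with the 0x prefix
print(signature.hex())              # str of hex digits, no prefix
print(binascii.hexlify(signature))  # bytes of hex digits, no prefix
```

The three lines print 0xffd8ff, ffd8ff and b'ffd8ff' respectively.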