Python Forum
I don't understand bytes in Python.
#1
Everything is so confusing here, e.g.:
>>> z = b'\x93\x39\x02\x49\x83\x02\x82\xf3\x23\xf8\xd3\x13'
>>> list(z)
[147, 57, 2, 73, 131, 2, 130, 243, 35, 248, 211, 19]
So what EXACTLY is the '\x93', and why does it correspond to 147?

Or what, IN HUMAN LANGUAGE, does an error like this mean:
>>> str(z, 'utf-8')
Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    str(z, 'utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte
Or how am I supposed to concatenate 'bytes' class objects? '+' doesn't seem to work.
>>> z = b'\x93\x39\x02\x49\x83\x02\x82\xf3\x23\xf8\xd3\x13'
>>> x = b'\x55'
>>> z+x
b'\x939\x02I\x83\x02\x82\xf3#\xf8\xd3\x13U'
Why is it now '\x13U'?

I have completely no idea how computer sees it all. The docs give me only confusing and overwhelming details but no bigger picture, and if I can't understand something I will always keep forgetting it.
#2
blackknite Wrote:I have completely no idea how computer sees it all

The computer sees a sequence of bits such as 0110011100... These bits form groups of 8 bits called bytes. Each byte can be interpreted as an integer in the interval 0-255. So you can say as well that the computer sees a sequence of small integers. This explains the
[147, 57, 2, 73, 131, 2, 130, 243, 35, 248, 211, 19]
Every such integer can be represented in base 16, where the digits are 0, 1, ..., 9, a, ..., f. The advantage is that each of these integers is a two-digit number in base 16, in the range 00-ff. For example, 93 in base 16 is 9 * 16 + 3 = 147 in base 10. Hence the

z = b'\x93\x39\x02\x49\x83\x02\x82\xf3\x23\xf8\xd3\x13'
The \x here is purely conventional, it is not part of the data. Think of it as meaning 'hexadecimal'.
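
Here is a small sketch of that base-16 arithmetic, using only Python built-ins. The variable name first is just for illustration; it is not from the original post.

```python
# The bytes object from the original post.
z = b'\x93\x39\x02\x49\x83\x02\x82\xf3\x23\xf8\xd3\x13'

first = z[0]                  # indexing a bytes object gives an int
print(first)                  # 147
print(9 * 16 + 3)             # 147, the hex digits 9 and 3 worked out by hand
print(int('93', 16))          # 147, parsing the text '93' as base 16
print(hex(first))             # 0x93, back to hexadecimal text
print(format(first, '08b'))   # 10010011, the same byte written as 8 bits

# Every element is a small integer in the range 0-255.
print(all(0 <= n <= 255 for n in z))   # True
```

So \x93 and 147 are the same byte, written once in base 16 and once in base 10.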

That's not the whole story. A long time ago, a code was created to represent ordinary letters and a few symbols as numbers. This is the ASCII encoding. For example the code for the capital letter U is 85 or in hexadecimal 55. In the same way, the code for the typographic character 9 is 57 or hexadecimal 39. This explains the

b'\x939\x02I\x83\x02\x82\xf3#\xf8\xd3\x13'
which is just another way to describe the same array of small integers.
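
To check that the two spellings really describe the same data, something like the following should work. The name same is made up for the example.

```python
z = b'\x93\x39\x02\x49\x83\x02\x82\xf3\x23\xf8\xd3\x13'

# The same bytes as Python displays them: \x39 is '9', \x49 is 'I', \x23 is '#'.
same = b'\x939\x02I\x83\x02\x82\xf3#\xf8\xd3\x13'
print(z == same)              # True

# ord() gives the code of a character, chr() goes the other way.
print(ord('9'), ord('I'), ord('#'), ord('U'))   # 57 73 35 85
print(chr(0x55))              # U

# If the mixed display is confusing, .hex() shows every byte as two hex digits.
print(z.hex())                # 93390249830282f323f8d313
```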

Then came the Internet, and a system was created that includes the characters of all the languages around the globe: Unicode. Each character is assigned a number called a 'code point'. A code point is abstract in the sense that it says which character is meant, not how to store it as bytes. Turning code points into bytes is called an 'encoding'. It is necessary for representing text in computers because computers store only numbers.

There exist many encodings, and the most widely used one is UTF-8. With it, an array of integers can represent a sequence of Unicode characters. The catch is that most code points need more than one byte, because there are far more than 256 code points. UTF-8 uses one to four bytes per code point and follows strict rules, so not every array of integers represents a valid sequence of code points.

When you call str(z, 'utf-8'), you're telling Python to translate the sequence of integers into a sequence of Unicode code points using the UTF-8 encoding. If the sequence is not valid UTF-8, you get the error above: here the byte 0x93 is not allowed at the start of a character.

I hope it's clearer now!
#3
Thx for replying!
Okay, now it's a whole lot clearer. I still have a few small questions, though:

-How to properly concat two hex strings?

-When to use '0x80' and when 'x80'? For example, 'hex(112)' shows '0x70', but 'b"0x70" == b"x70"' shows 'False', so I guess they are not the same.

-Why the '\x93' became 'x939'? Was it only partially translated? It looks more enigmatic and confusing than the hex notation.

-When the code-points in hex string are ok, it should be always automatically translated to utf8 by just calling it?
#4
blackknite Wrote:-How to properly concat two hex strings?

The bytes type is not a 'hex string'. Think of it as a sequence of integers. You concatenate two bytes strings by adding them:

>>> b'hello' + b'world'
b'helloworld'
>>> list(b'hello')
[104, 101, 108, 108, 111]
>>> list(b'world')
[119, 111, 114, 108, 100]
>>> list(b'helloworld')
[104, 101, 108, 108, 111, 119, 111, 114, 108, 100]
As you can see, the two sequences of integers have properly been concatenated.
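
Applied to the bytes from the first post, the same check could look like this. It also shows two alternatives, bytes.join and bytearray, which are standard but were not mentioned in the thread.

```python
z = b'\x93\x39\x02\x49\x83\x02\x82\xf3\x23\xf8\xd3\x13'
x = b'\x55'

joined = z + x
print(len(z), len(x), len(joined))    # 12 1 13
print(list(joined)[-2:])              # [19, 85]: the old last byte, then the new one
print(joined[-1:] == b'U')            # True: \x55 is displayed as U

# bytes.join concatenates many pieces at once.
parts = [b'hello', b' ', b'world']
print(b''.join(parts))                # b'hello world'

# bytes objects are immutable; bytearray is the mutable counterpart.
buf = bytearray(z)
buf.append(0x55)
print(bytes(buf) == joined)           # True
```

So '+' did work in the first post: the trailing \x13U is the original last byte \x13 followed by the new byte \x55, displayed as U.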

blackknite Wrote:-When to use '0x80' and when 'x80'?

Use 0x80 to define an integer in literal Python code if you prefer base 16 to base 10:

>>> 0x80
128
Use \x80 inside a literal bytes string

>>> b'\x80'
b'\x80'
>>>
>>> list(b'x70')
[120, 55, 48]
>>> list(b'\x70')
[112]
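
The link between the two notations can be sketched like this, using only built-ins. The names n and one_byte are illustrative.

```python
n = 0x70                       # an int literal written in base 16
print(n, hex(n))               # 112 0x70

one_byte = bytes([n])          # build a bytes object from a list of ints
print(one_byte == b'\x70')     # True
print(one_byte)                # b'p', because 112 is the ASCII code of 'p'

# Without the backslash these are ordinary characters, one byte each.
print(len(b'\x70'), len(b'x70'), len(b'0x70'))   # 1 3 4

# Converting between hex text and bytes.
print(bytes.fromhex('9339'))   # b'\x939'
print(b'\x939'.hex())          # 9339
```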
blackknite Wrote:-Why the '\x93' became 'x939'?

It didn't. b'\x93\x39' became b'\x939' because \x39 became 9. That's because Python displays a byte as its ASCII character when that byte is a printable character:

>>> list(b'9')
[57]
>>> list(b'\x39')
[57]
blackknite Wrote:-When the code-points in hex string are ok, it should be always automatically translated to utf8 by just calling it?

No. The best practice is to use Unicode strings (the str type) when you're dealing with human-readable text in the usual sense, such as a name, an article, or a book, and to use bytes strings when you're dealing with binary data, such as an image or a video file. Most of the time you don't have to worry about encoding and decoding, because the libraries do most of the hard work for you.
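
As one illustration of a library doing the work, here is a sketch with the built-in open(). The temporary file and the sample word are assumptions made only to keep the example self-contained.

```python
import os
import tempfile

# A throwaway file path, used only for this example.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Text mode: you write and read str; open() encodes and decodes for you.
with open(path, 'w', encoding='utf-8') as f:
    f.write('na\u00efve')               # 5 characters

with open(path, 'r', encoding='utf-8') as f:
    text = f.read()
print(type(text).__name__, len(text))   # str 5

# Binary mode: you get the raw bytes that are actually stored on disk.
with open(path, 'rb') as f:
    raw = f.read()
print(type(raw).__name__, len(raw))     # bytes 6
print(raw)                              # b'na\xc3\xafve'
```

The file holds six bytes either way. Text mode just applies the UTF-8 decoding step before handing you the result.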