Python Forum
codec for byte transparency - Printable Version




codec for byte transparency - Skaperen - Sep-18-2020

given "random" bytes in the range from 0 to 255, such as from a file containing machine executable code, i would like to convert this "byte sequence" to str type without applying any UTF-8, or any other, interpretation to it. the following code can do this:
# convert bytes in b to str in s without any encoding
s = ''.join(chr(x) for x in b)
what codec would do the same thing in decode or the exact reverse in encode (when every str character has an ord() value of 255 or lower)?


RE: codec for byte transparency - bowlofred - Sep-18-2020

ord() is the reverse.

>>> b = [ord(x) for x in 'mystring']
>>> b
[109, 121, 115, 116, 114, 105, 110, 103]
>>> ''.join([chr(x) for x in b])
'mystring'
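
As a small extension of the example above, here is a sketch that returns an actual bytes object instead of a list of ints. The name to_bytes is made up for illustration; it is not from the thread.

```python
def to_bytes(s):
    # bytes() accepts an iterable of ints in range(256);
    # a character with ord() above 255 raises ValueError.
    return bytes(ord(c) for c in s)

b = to_bytes('mystring')
print(b)  # b'mystring'
```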



RE: codec for byte transparency - Skaperen - Sep-18-2020

i have functions to convert both ways. the reverse does use ord(). but i want to do this as a codec, now.


RE: codec for byte transparency - DeaD_EyE - Sep-18-2020

Which codec? https://docs.python.org/3/library/codecs.html#standard-encodings
Or do you want to invent a new codec? Look how many codecs are co-existing.

Decode bytes as UTF-8 (the default codec), dropping any bytes that are not valid UTF-8:
my_bytes = b"\x00\x01Hello World\n\r"
print(my_bytes.decode(errors="ignore"))
Decode bytes as ASCII (all bytes > 127 are dropped):
my_bytes = b"\x00\x01\xffHello World\n\r"
print(my_bytes.decode("ascii", errors="ignore"))
But text should always be encoded with UTF-8, which is the standard.
That is better than the latin1 vs. ascii era.
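
A quick check, assuming the same sample bytes as above, of why errors="ignore" does not give the byte transparency the original question asks for: undecodable bytes are dropped, so the result cannot be turned back into the input.

```python
my_bytes = b"\x00\x01\xffHello World\n\r"

# errors="ignore" silently drops bytes the codec cannot decode,
# so the decoded text is shorter than the input.
utf8_text = my_bytes.decode("utf-8", errors="ignore")
ascii_text = my_bytes.decode("ascii", errors="ignore")

print(len(my_bytes), len(utf8_text), len(ascii_text))  # 16 15 15

# Re-encoding does not restore the dropped 0xff byte.
print(utf8_text.encode("utf-8") == my_bytes)  # False
```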


RE: codec for byte transparency - bowlofred - Sep-18-2020

Pretty sure decoding with 'latin1' will reproduce the behavior of chr() applied to each byte.

>>> b = [97, 99, 105, 149, 240]
>>> bytes(b).decode(encoding="latin1")
'aci\x95ð'
>>> "".join([chr(x) for x in b])
'aci\x95ð'
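
To go beyond the five sample values, here is a sketch that checks all 256 byte values against the chr() expression from the first post, and the reverse direction with encode:

```python
all_bytes = bytes(range(256))

# decode: each byte value n becomes the character chr(n)
s = all_bytes.decode("latin-1")
assert s == "".join(chr(x) for x in all_bytes)

# encode: the exact reverse, valid while every ord() is 255 or lower
assert s.encode("latin-1") == all_bytes

# a character above U+00FF cannot be encoded with this codec
try:
    "\u0100".encode("latin-1")
except UnicodeEncodeError as exc:
    print("rejected:", exc.reason)
```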



RE: codec for byte transparency - Skaperen - Sep-18-2020

the data i want in bytes starts out in strings. i want it in bytes because the file object i get for a pipe is open in binary mode. the content of the string is already encoded in UTF-8 (it is not normal Unicode text) and needs to remain unchanged. i'm also doing some structure formatting in binary, such as putting a byte count in front. the content needs to already be in UTF-8 to get that length right, so converting to Unicode to work in str makes no sense.
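
To illustrate why the byte count has to come from the UTF-8 form, here is a minimal sketch. The 4-byte big-endian length prefix is an assumption made for the example; the post does not say what framing is actually used.

```python
import struct

text = "na\u00efve"              # 5 characters
payload = text.encode("utf-8")   # 6 bytes: U+00EF takes two bytes in UTF-8

# hypothetical framing: 4-byte big-endian length, then the payload
frame = struct.pack(">I", len(payload)) + payload

print(len(text), len(payload))   # 5 6
```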

i already have coding patterns of my own to do this conversion. as part of some code cleanup i am trying to make some things be more like what other coders expect, if there are no implementation penalties. i suspect that converting to a list of ints and then to another sequence type is more of a penalty, anyway.


RE: codec for byte transparency - bowlofred - Sep-18-2020

Can you give an example? When you say "the string is already encoded", I don't know how that works. I expect a Python str to be a sequence of Unicode code points. To get it out as bytes, we need some encoding for those code points. If you're using the str as something other than Unicode, you'll probably need to give details.

Your first post said the s = ''.join(chr(x) for x in b) step was sufficient. If so, I don't see how a latin1 decoding is any different.


RE: codec for byte transparency - Skaperen - Sep-25-2020

i'm working with content that is already UTF-8 encoded, which i got as bytes. i'm changing the type to str in a transparent way so that the UTF-8 encoding remains unchanged. then i work with it as type str. then i can change it back without any more encoding if i ultimately need it as binary for output (such as to a file open in binary mode). if i am the only one writing the code that works with the content, i'll leave it in type bytes. but often i need to call some other code that expects str.
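
A minimal sketch of that workflow, assuming the 'latin1' codec suggested earlier in the thread is used for the transparent conversion. text_only_helper is a hypothetical stand-in for the other code that expects str.

```python
def text_only_helper(s):
    # hypothetical stand-in for other code that insists on str
    if not isinstance(s, str):
        raise TypeError("str expected")
    return s.replace("\n", "\r\n")

# UTF-8 bytes, e.g. as read from a pipe opened in binary mode
raw = "gr\u00fc\u00df\n".encode("utf-8")

# one character per byte, no UTF-8 interpretation applied
as_str = raw.decode("latin-1")
assert len(as_str) == len(raw)

# back to bytes; the UTF-8 sequences inside are untouched
out = text_only_helper(as_str).encode("latin-1")
assert out == "gr\u00fc\u00df\r\n".encode("utf-8")
```

This only stays transparent while the str-side code leaves characters in the range 128 to 255 alone. Operations such as case conversion can change them and corrupt the UTF-8 sequences.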