Python Forum

Full Version: codec for byte transparency
Given "random" bytes in the range 0 to 255, such as from a file containing machine-executable code, I would like to convert this byte sequence to type str without applying UTF-8, or any other, interpretation to it. The following code can do this:
# convert bytes in b to str in s without any encoding
s = ''.join(chr(x) for x in b)
What codec would do the same thing in decode, or the exact reverse in encode (when every str character has an ord() value of 255 or lower)?
ord() is the reverse.

>>> b = [ord(x) for x in 'mystring']
>>> b
[109, 121, 115, 116, 114, 105, 110, 103]
>>> ''.join([chr(x) for x in b])
'mystring'
I have functions to convert both ways, and the reverse does use ord(). But now I want to do this with a codec.
Which codec? https://docs.python.org/3/library/codecs...-encodings
Or do you want to invent a new codec? Look how many codecs already coexist.

Decode bytes as UTF-8 (the default codec), silently dropping any bytes that are not valid UTF-8:
my_bytes = b"\x00\x01Hello World\n\r"
print(my_bytes.decode(errors="ignore"))
Decode bytes as ASCII (all bytes > 127 are dropped):
my_bytes = b"\x00\x01\xffHello World\n\r"
print(my_bytes.decode("ascii", errors="ignore"))
But text should always be encoded with UTF-8, which is the standard.
That is better than the latin1 vs. ASCII era.
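One thing worth checking before relying on errors="ignore" for binary data: it discards bytes, so the result cannot be turned back into the original. A small sketch using the same sample bytes as above (the variable name text is only for illustration):

```python
my_bytes = b"\x00\x01\xffHello World\n\r"

# errors="ignore" drops every byte the codec cannot decode (here: \xff).
text = my_bytes.decode("ascii", errors="ignore")

print(len(my_bytes), len(text))  # 16 15

# The dropped byte is gone for good, so encoding does not restore the input.
print(text.encode("ascii") == my_bytes)  # False
```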
Pretty sure 'latin1' will reproduce the default behavior of chr(): it maps each byte value from 0 to 255 to the code point with the same number.

>>> b = [97, 99, 105, 149, 240]
>>> bytes(b).decode(encoding="latin1")
'aci\x95ð'
>>> "".join([chr(x) for x in b])
'aci\x95ð'
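To check this for every byte value rather than a handful, here is a minimal sketch. The names all_bytes, s and encode_error are made up for the example; "latin-1" is the same codec as the 'latin1' alias used above.

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# Decoding as latin-1 maps byte value n to the code point n,
# which is exactly what the chr() loop does.
s = all_bytes.decode("latin-1")
assert s == "".join(chr(x) for x in all_bytes)

# Encoding is the exact reverse, as long as every ord() is <= 255.
assert s.encode("latin-1") == all_bytes

# A character above U+00FF cannot be encoded and raises an error.
encode_error = None
try:
    "\u0100".encode("latin-1")
except UnicodeEncodeError as exc:
    encode_error = exc
print(type(encode_error).__name__)  # UnicodeEncodeError
```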
What I want to have in bytes is initially in strings. I want it in bytes because the file I get for a pipe is open in binary mode. The content in the string is already encoded in UTF-8 (it is not normal Unicode), and it needs to remain unchanged. I'm also doing some structure formatting in binary, such as putting a byte-count length in front. The content needs to already be in UTF-8 to get the length right, so converting to Unicode to work in str makes no sense.
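To illustrate why the length has to be taken from the UTF-8 bytes and not from a Unicode str, here is a rough sketch. The 4-byte big-endian prefix, the frame() helper and the sample text are assumptions made for the example, not the actual format used in the post.

```python
import struct

# Hypothetical sample text with two non-ASCII characters.
text = "na\u00efve caf\u00e9"
payload = text.encode("utf-8")

# Character count and byte count differ once non-ASCII text is involved.
print(len(text), len(payload))  # 10 12


def frame(data: bytes) -> bytes:
    """Prefix data with its byte count as a 4-byte big-endian unsigned int.

    The prefix layout is an assumption for this sketch.
    """
    return struct.pack(">I", len(data)) + data


framed = frame(payload)
print(framed[:4])  # b'\x00\x00\x00\x0c'
```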

I already have coding patterns of my own to do this conversion. As part of some code cleanup, I am trying to make some things more like what other coders expect, if there are no implementation penalties. I suspect that converting to a list of ints and then to another sequence type is more of a penalty anyway.
Can you give an example? When you say "the string is already encoded", I don't know how that works. I expect a Python str to be a sequence of Unicode code points. To get it out as bytes, we need some encoding for those code points. If you're using the str as something other than Unicode, you'll probably need to give details.

Your first post said the s = ''.join(chr(x) for x in b) step was sufficient. If so, I don't see how a latin1 decoding is any different.
I'm working with content that is already UTF-8 encoded, which I got as bytes. I'm changing the type to str in a transparent way so that the UTF-8 encoding remains unchanged. Then I work with it as type str. Then I can change it back without any further encoding if I ultimately need it as binary to output it (such as to a file open in binary mode). If I am the only one writing the code that works with the content, I'll leave it in type bytes. But often I need to call some other code that expects str.
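The round trip described here could look roughly like the sketch below, assuming latin-1 as the transparent codec suggested earlier in the thread. first_line() is a hypothetical stand-in for "other code that expects str", and the sample data is invented.

```python
def first_line(s: str) -> str:
    """Hypothetical stand-in for third-party code that only accepts str."""
    return s.split("\n", 1)[0]


# UTF-8 encoded content, as it might arrive from a pipe opened in binary mode.
raw = "h\u00e9llo\nw\u00f6rld\n".encode("utf-8")

# One str character per byte; the UTF-8 encoding inside is left untouched.
as_str = raw.decode("latin-1")
assert len(as_str) == len(raw)

# Call the str-only code, then go back to bytes without re-encoding the content.
result = first_line(as_str).encode("latin-1")

# The bytes that come back are still valid UTF-8.
print(result.decode("utf-8"))  # héllo
```

This only stays safe while the str-only code treats the characters as opaque. Anything that interprets them as text, such as upper(), would see the individual UTF-8 bytes and not the original characters.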