codec for byte transparency

Skaperen · (This post was last modified: Sep-18-2020, 02:25 AM by Skaperen.)

given "random" bytes in the range from 0 to 255, such as from a file containing machine executable code, i would like to convert this "byte sequence" to str type without applying any UTF-8, or any other, interpretation to it. the following code can do this:

# convert bytes in b to str in s without any encoding
s = ''.join(chr(x) for x in b)

what codec would do the same thing in decode or the exact reverse in encode (when every str character has an ord() value of 255 or lower)?

bowlofred · Sep-18-2020, 03:01 AM

ord() is the reverse.

>>> b = [ord(x) for x in 'mystring']
>>> b
[109, 121, 115, 116, 114, 105, 110, 103]
>>> ''.join([chr(x) for x in b])
'mystring'

Skaperen · Sep-18-2020, 03:12 AM

i have functions to convert both ways. the reverse does use ord(). but i want to do this as a codec, now.

DeaD_EyE · (This post was last modified: Sep-18-2020, 06:56 AM by DeaD_EyE.)

Which codec? https://docs.python.org/3/library/codecs...-encodings
Or do you want to invent a new codec? Look how many codecs are co-existing.

Decode bytes as UTF-8 (the default codec), ignoring any invalid bytes:

my_bytes = b"\x00\x01Hello World\n\r"
print(my_bytes.decode(errors="ignore"))

Decode bytes as ASCII (all bytes > 127 are dropped):

my_bytes = b"\x00\x01\xffHello World\n\r"
print(my_bytes.decode("ascii", errors="ignore"))

But text should always be encoded as UTF-8, which is the standard.
That is better than the latin1 vs. ascii era.
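One caveat on the two snippets above, as a minimal sketch: errors="ignore" silently discards bytes that the codec cannot decode, so the result cannot be turned back into the original bytes. The variable name lossy is only for illustration.

```python
my_bytes = b"\x00\x01\xffHello World\n\r"

# 0xff is not valid ASCII, so errors="ignore" drops it from the result.
lossy = my_bytes.decode("ascii", errors="ignore")

# Encoding the result again does not restore the dropped byte.
round_trip = lossy.encode("ascii")
print(len(my_bytes), len(round_trip))
print(round_trip == my_bytes)
```

So this approach is fine for cleaning up text, but it is not byte transparent.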

bowlofred · Sep-18-2020, 07:23 AM

Pretty sure 'latin1' will reproduce the default behavior of chr()

>>> b = [97, 99, 105, 149, 240]
>>> bytes(b).decode(encoding="latin1")
'aci\x95ð'
>>> "".join([chr(x) for x in b])
'aci\x95ð'
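A quick way to check this over every possible byte value, as a sketch: 'latin-1' (alias 'latin1', ISO 8859-1) maps byte N to code point N, so decoding should match the chr() join and encoding should be the exact reverse while every ord() is 255 or lower.

```python
# All 256 possible byte values, in order.
b = bytes(range(256))

# Decode: each byte becomes the code point with the same value.
s = b.decode("latin-1")
assert s == "".join(chr(x) for x in b)

# Encode: the exact reverse, valid while every ord() is 255 or lower.
assert s.encode("latin-1") == b

# A character above 255 cannot be represented in latin-1 and raises.
try:
    "\u20ac".encode("latin-1")
    error = None
except UnicodeEncodeError as exc:
    error = exc
print(type(error).__name__)
```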

Skaperen · (This post was last modified: Sep-18-2020, 10:41 PM by Skaperen.)

what i want to have in bytes is initially in strings. i want to have it in bytes because the file i get for a pipe is open in binary mode. the content in the string is already encoded in UTF-8 (is not normal Unicode) which then needs to remain unchanged. i'm also doing some structure formatting in binary such as a byte count length put in front. it needs to already be in UTF-8 to get the length right, so conversion to Unicode to work in str makes no sense.

i already have coding patterns of my own to do this conversion. as part of some code cleanup i am trying to make some things be more like what other coders expect, if there are no implementation penalties. i suspect that conversions to a list of ints then to another sequence type is more of a penalty, anyway.
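The length-prefix point can be sketched like this. The 4-byte big-endian prefix and the names frame and unframe are assumptions for illustration, not the actual format described in the post. The count has to be taken from the encoded bytes, not from the number of characters.

```python
import struct


def frame(text: str) -> bytes:
    """Encode text as UTF-8 and put a 4-byte big-endian byte count in front."""
    payload = text.encode("utf-8")
    return struct.pack(">I", len(payload)) + payload


def unframe(data: bytes) -> str:
    """Read the byte count, then decode exactly that many payload bytes."""
    (length,) = struct.unpack_from(">I", data)
    return data[4:4 + length].decode("utf-8")


framed = frame("naïve")
# 5 characters, but 6 bytes of UTF-8, because "ï" takes two bytes.
print(len("naïve"), framed[:4], len(framed))
print(unframe(framed))
```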

bowlofred · Sep-18-2020, 11:06 PM

Can you give an example? When you say "the string is already encoded", I don't know how that works. I expect a python str to be a sequence of unicode code points. To get it out, we need some encoding for those code points. If you're using the str as something other than unicode, you'll probably need to give details.

Your first post said the s = ''.join(chr(x) for x in b) step was sufficient. If so, I don't see how a latin1 decoding is any different.

Skaperen · (This post was last modified: Sep-25-2020, 02:21 AM by Skaperen.)

i'm working with content that is already UTF-8 encoded, which i got as bytes. i'm changing the type to str in a transparent way so that the UTF-8 encoding remains unchanged. then i work with it in type str. then i can change it back without any more encoding if i ultimately need it as binary to output it (such as to a file open in binary mode). if i am the only one writing the code to work with the content, i'll leave it in type bytes. but often i need to call some other code that expects str.
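a minimal sketch of that workflow, assuming the 'latin-1' codec suggested earlier in the thread is the transparent conversion. the sample text is made up for illustration.

```python
# UTF-8 encoded content, as it might arrive from a pipe opened in binary mode.
raw = "héllo €".encode("utf-8")

# Transparent conversion: one str character per byte, no UTF-8 interpretation.
s = raw.decode("latin-1")
assert len(s) == len(raw)  # the str length is the byte count

# ... pass s to code that expects str ...

# Convert back without re-encoding: the original UTF-8 bytes come out unchanged.
out = s.encode("latin-1")
assert out == raw

# Only a real UTF-8 decode gives the text itself, with fewer characters than bytes.
text = raw.decode("utf-8")
print(len(raw), len(s), len(text))
```

note that any code looking at s sees the individual UTF-8 bytes as characters, so non-ASCII text will not look right until it is decoded as UTF-8.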
