Python Forum
codec for byte transparency
#1
given "random" bytes in the range from 0 to 255, such as from a file containing machine executable code, i would like to convert this "byte sequence" to str type without applying any UTF-8, or any other, interpretation to it. the following code can do this:
# convert bytes in b to str in s without any encoding
s = ''.join(chr(x) for x in b)
what codec would do the same thing in decode or the exact reverse in encode (when every str character has an ord() value of 255 or lower)?
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
ord() is the reverse.

>>> b = [ord(x) for x in 'mystring']
>>> b
[109, 121, 115, 116, 114, 105, 110, 103]
>>> ''.join([chr(x) for x in b])
'mystring'
#3
i have functions to convert both ways. the reverse does use ord(). but i want to do this as a codec, now.
#4
Which codec? https://docs.python.org/3/library/codecs...-encodings
Or do you want to invent a new codec? Look how many codecs are co-existing.

Decode bytes as UTF-8 (the default codec), silently dropping any invalid sequences:
my_bytes = b"\x00\x01Hello World\n\r"
print(my_bytes.decode(errors="ignore"))
Decode bytes as ASCII (all bytes > 127 are dropped):
my_bytes = b"\x00\x01\xffHello World\n\r"
print(my_bytes.decode("ascii", errors="ignore"))
But text should always be encoded as UTF-8, which is the standard.
That is better than the latin1 vs. ASCII era.
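A small sketch of why errors="ignore" is not byte-transparent: it discards the bytes it cannot decode, so the original data cannot be rebuilt, whereas 'latin-1' keeps one character per byte. The variable names here are only illustrative.

```python
# Same sample bytes as in the ASCII example above.
data = b"\x00\x01\xffHello World\n\r"

# ASCII with errors="ignore" drops the 0xff byte entirely.
lossy = data.decode("ascii", errors="ignore")

# latin-1 maps every byte to exactly one character, so nothing is lost.
lossless = data.decode("latin-1")

print(len(data), len(lossy), len(lossless))   # 16 15 16
print(lossy.encode("ascii") == data)          # False: the 0xff byte is gone
print(lossless.encode("latin-1") == data)     # True: exact round trip
```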
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
#5
Pretty sure 'latin1' will reproduce the behavior of chr(): it maps each byte value 0-255 to the code point with the same number.

>>> b = [97, 99, 105, 149, 240]
>>> bytes(b).decode(encoding="latin1")
'aci\x95ð'
>>> "".join([chr(x) for x in b])
'aci\x95ð'
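To check the claim over the whole byte range rather than five sample values, a quick sketch like the following compares 'latin-1' decoding against the chr() join from the first post for all 256 byte values, and confirms that encoding reverses it. The variable names are only for illustration.

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# Decode with the codec, and with the chr() join from post #1.
decoded = all_bytes.decode("latin-1")
via_chr = "".join(chr(x) for x in all_bytes)

same_as_chr = decoded == via_chr
round_trips = decoded.encode("latin-1") == all_bytes

print(same_as_chr, round_trips)   # True True
```

Encoding a str that contains any character with ord() above 255 raises UnicodeEncodeError, which matches the restriction stated in the original question.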
#6
what i want to have in bytes is initially in strings. i want it in bytes because the file object i get for a pipe is opened in binary mode. the content in the string is already UTF-8 encoded (each character holds one byte value, not a real Unicode code point) and needs to remain unchanged. i'm also doing some structure formatting in binary, such as putting a byte-count length in front. the content needs to already be in UTF-8 to get the length right, so converting to Unicode to work in str makes no sense.

i already have coding patterns of my own to do this conversion. as part of some code cleanup i am trying to make some things be more like what other coders expect, if there are no implementation penalties. i suspect that conversions to a list of ints then to another sequence type is more of a penalty, anyway.
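As a rough sketch of the length-prefix step described in this post: assuming the str holds one character per UTF-8 byte, 'latin-1' turns it back into the original bytes and len() of the str already equals the byte count. The helper name frame() and the 4-byte big-endian header are assumptions for illustration, not something stated in the thread.

```python
import struct

def frame(carrier: str) -> bytes:
    """Hypothetical helper: prefix the payload with a 4-byte big-endian length."""
    # Raises UnicodeEncodeError if any character has ord() > 255.
    payload = carrier.encode("latin-1")
    return struct.pack(">I", len(payload)) + payload

utf8_bytes = "naïve".encode("utf-8")      # 6 bytes: the ï takes two
carrier = utf8_bytes.decode("latin-1")    # 6 characters, one per byte
framed = frame(carrier)

print(len(carrier))                       # 6
print(framed)                             # b'\x00\x00\x00\x06na\xc3\xafve'
```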
#7
Can you give an example? When you say "the string is already encoded", I don't know how that works. I expect a Python str to be a sequence of Unicode code points. To get it out as bytes, we need some encoding for those code points. If you're using the str as something other than Unicode text, you'll probably need to give details.

Your first post said the s = ''.join(chr(x) for x in b) step was sufficient. If so, I don't see how a latin1 decoding is any different.
#8
i'm working with content that is already UTF-8 encoded, which i received as bytes. i'm changing the type to str in a transparent way so that the UTF-8 encoding remains unchanged. then i work with it in type str. then i can change it back without any more encoding if i ultimately need it as binary to output it (such as to a file open in binary mode). if i am the only one writing the code to work with the content, i'll leave it in type bytes. but often i need to call some other code that expects str.
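The workflow described here might look like the sketch below, assuming 'latin-1' serves as the transparent codec suggested earlier in the thread. call_str_only_code() is a made-up stand-in for the third-party code that expects str.

```python
def call_str_only_code(text: str) -> str:
    """Hypothetical stand-in for code that only accepts str."""
    return text.replace("\r\n", "\n")

# UTF-8 content arrives as bytes, e.g. from a pipe opened in binary mode.
raw = "héllo\r\nwörld\r\n".encode("utf-8")

# Change the type to str without interpreting the UTF-8: one char per byte.
carrier = raw.decode("latin-1")

# Hand it to the str-only code, then change the type back to bytes.
processed = call_str_only_code(carrier)
out = processed.encode("latin-1")

# The UTF-8 sequences passed through untouched.
print(out.decode("utf-8") == "héllo\nwörld\n")   # True
```

One caveat with this approach: the intermediate str is not real text, so any code that inspects non-ASCII characters (case folding, regex character classes, printing) sees the individual UTF-8 bytes as separate Latin-1 characters. It is only safe when the str-only code treats those characters as opaque.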