Python Forum

Full Version: decoding backslash sequences
I am getting some large strings (or bytes) with many kinds of backslash sequences, including Unicode sequences. Is there a codec that can decode all of these to make a better string (str)?
It depends on the source encoding; if it's UTF-8 it should be straightforward.
>>> s = b'hello \xf0\x9f\xa4\xa8'
>>> s.decode() # Same as s.decode('utf-8')
'hello 🤨'

>>> s = 'hello 🤨'
>>> s.encode() #Same as s.encode('utf-8')
b'hello \xf0\x9f\xa4\xa8'
From file.
with open('uni_hello.txt', encoding='utf-8') as f:
    print(f.read())
Output:
hello 🤨
Detect the encoding, or fix it if it's all messed up🧬
open("content.txt", encoding='utf-8', errors='replace') 
open("content.txt", encoding='latin-1', errors='ignore')
chardet | python-ftfy | Unidecode
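A minimal sketch of what the errors= strategies in the open() calls above do (the byte string here is made up for illustration):

```python
# A byte (\xff) that is never valid in UTF-8, to show each errors= strategy.
data = b'hello \xff world'

print(data.decode('utf-8', errors='replace'))  # invalid byte becomes U+FFFD: 'hello \ufffd world'
print(data.decode('utf-8', errors='ignore'))   # invalid byte is dropped:     'hello  world'
print(data.decode('latin-1'))                  # latin-1 maps every byte:     'hello ÿ world'
```

latin-1 never raises because it maps all 256 byte values, but it may silently produce the wrong characters if the data was not really latin-1.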
(Jan-21-2021, 10:12 AM)snippsat Wrote: [ -> ]It depends on the source encoding; if it's UTF-8 it should be straightforward.

Hey! I know you've answered this one. When I need decoding like this I just use latin. What is the difference between latin and latin-1? If you know, thank you!
(Jan-21-2021, 12:49 PM)Aspire2Inspire Wrote: [ -> ]When I need decoding like this I just use latin. What is the difference between latin and latin-1?
Standard Encodings
latin-1 is the preferred name for this Western European encoding, so the result is the same if you use latin; they are aliases for the same codec.
>>> s = '¼ cup of flour'
>>> s.encode('latin-1')
b'\xbc cup of flour'
>>> 
>>> s.encode('latin')
b'\xbc cup of flour'
>>> 
>>> s.encode('iso-8859-1')
b'\xbc cup of flour'
>>> 
>>> s.encode('L1')
b'\xbc cup of flour'
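The alias relationship can be confirmed with the stdlib codecs module, which resolves any alias to the codec's canonical name:

```python
import codecs

# codecs.lookup() resolves an alias to the codec's canonical name;
# latin, latin-1, iso-8859-1 and L1 all resolve to the same codec.
for alias in ('latin', 'latin-1', 'iso-8859-1', 'L1'):
    print(alias, '->', codecs.lookup(alias).name)  # all print 'iso8859-1'
```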
I don't know if there is any UTF-8 in it, but to be safe I should probably assume there can be (even if it comes in a str). I do know there is a variety of backslash escapes in it, including things like \n and \t, and some like \x99 and \u9999 (I'm making up the numbers here). I doubt the backslash itself, or the digits, are encoded in something else, so I think there is no need to double decode. But some unexpected encoding does appear, such as \u0025 that could have been just \x25 even though there are some \x escapes in there. The encoding sure is odd.
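For literal backslash escapes like those described above, the stdlib unicode_escape codec can interpret them; a sketch with a made-up sample string. One caveat: unicode_escape treats the input bytes as Latin-1, so any real non-ASCII text mixed into the same string will be mangled by this round trip.

```python
# A str containing literal backslash escapes, not real control characters.
s = r'tab:\t percent:\x25 also-percent:\u0025'

# Encode to bytes first (latin-1 is a no-op for ASCII text), then let
# the unicode_escape codec interpret the \t, \xNN and \uNNNN sequences.
decoded = s.encode('latin-1').decode('unicode_escape')
print(decoded)  # 'tab:\t percent:% also-percent:%'
```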

Right now, everything is in str, which suggests there could be Unicode above ASCII. But it could come as bytes in newer cases. Either way, I want to be sure it can handle UTF-8 in both cases; I don't know how well Python 3 will deal with a str that has UTF-8 byte sequences in it.
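For the str-or-bytes situation, one common pattern (my own sketch, not from the thread; the helper name to_text is hypothetical) is to normalize the input to str up front, trying UTF-8 first and falling back to latin-1, which never fails since it maps all 256 byte values:

```python
def to_text(data):
    # Accept either str or bytes and always return str.
    if isinstance(data, bytes):
        try:
            return data.decode('utf-8')    # assume UTF-8 first
        except UnicodeDecodeError:
            return data.decode('latin-1')  # latin-1 accepts any byte
    return data                            # already a str

print(to_text(b'hello \xf0\x9f\xa4\xa8'))  # 'hello 🤨'
print(to_text('already text'))             # 'already text'
```

The latin-1 fallback guarantees no exception, at the cost of possibly wrong characters when the bytes were in some other legacy encoding.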