Python Forum
decoding backslash sequences
#1
I am getting some large strings (or bytes) containing many kinds of backslash sequences, including Unicode sequences. Is there a codec that can decode all of these into a proper string (str)?
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
It depends on the source encoding; if it's UTF-8 it should be straightforward.
>>> s = b'hello \xf0\x9f\xa4\xa8'
>>> s.decode() # Same as s.decode('utf-8')
'hello 🤨'

>>> s = 'hello 🤨'
>>> s.encode() #Same as s.encode('utf-8')
b'hello \xf0\x9f\xa4\xa8'
From file.
with open('uni_hello.txt', encoding='utf-8') as f:
    print(f.read())
Output:
hello 🤨
Detect the encoding, or fix the text if it's all messed up🧬
open("content.txt", encoding='utf-8', errors='replace') 
open("content.txt", encoding='latin-1', errors='ignore')
chardet | python-ftfy | Unidecode
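
When the source encoding is not known in advance, one option is to try a short list of candidate encodings in order. The following is only a minimal sketch using the standard library; the helper name decode_with_fallback and the default candidate list are assumptions made for illustration, not something taken from chardet or ftfy.

```python
def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding in turn (hypothetical helper)."""
    for name in encodings:
        try:
            return data.decode(name)
        except UnicodeDecodeError:
            continue
    # Only reached when every candidate failed. latin-1 maps all 256
    # byte values, so with the default list this line never runs.
    return data.decode(encodings[0], errors="replace")


utf8_text = decode_with_fallback(b"hello \xf0\x9f\xa4\xa8")
latin_text = decode_with_fallback(b"\xbc cup of flour")

# errors='replace' swaps each undecodable byte for U+FFFD instead of raising.
replaced = b"\xbc cup of flour".decode("utf-8", errors="replace")
```

Because latin-1 accepts any byte sequence, it should come last in the list: it never raises, so it cannot tell you whether the guess was right.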
#3
(Jan-21-2021, 10:12 AM)snippsat Wrote: It depends on the source encoding; if it's UTF-8 it should be straightforward.

Hey! I know you've answered this one. When I need decoding like this I just use latin. What is the difference between latin and latin-1? If you know, thank you!
#4
(Jan-21-2021, 12:49 PM)Aspire2Inspire Wrote: Hey! I know you've answered this one. When I need decoding like this I just use latin. What is the difference between latin and latin-1? If you know, thank you!
See the Standard Encodings table in the Python docs.
There latin_1 is the codec name for this Western European encoding, and latin, latin1, iso-8859-1 and L1 are listed as aliases for it.
So the result is the same if you use latin.
>>> s = '¼ cup of flour'
>>> s.encode('latin-1')
b'\xbc cup of flour'
>>> 
>>> s.encode('latin')
b'\xbc cup of flour'
>>> 
>>> s.encode('iso-8859-1')
b'\xbc cup of flour'
>>> 
>>> s.encode('L1')
b'\xbc cup of flour'
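
As a quick cross-check of the alias claim, codecs.lookup() can be asked which codec each spelling resolves to. This is just a small sketch; the variable names are made up for the example.

```python
import codecs

# Spellings used in the session above.
aliases = ["latin-1", "latin", "iso-8859-1", "L1"]

# Map each spelling to the canonical codec name that Python resolves it to.
resolved = {alias: codecs.lookup(alias).name for alias in aliases}

for alias, canonical in resolved.items():
    print(alias, "->", canonical)

# All spellings should resolve to one and the same codec.
same_codec = len(set(resolved.values())) == 1
```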
#5
I don't know if there is any UTF-8 in it, but to be safe I should probably assume there can be (even if it arrives as str). I do know there is a variety of backslash escapes in it: things like \n and \t, and some like \x99 and \u9999 (I'm making up the numbers here). I doubt the backslash itself, or the digits, are encoded in something else, so I think there is no need to double decode. But some of the encoding is unexpected, such as \u0025 where \x25 would have done, even though \x escapes are used elsewhere. The encoding sure is odd.

Right now, everything is in str, which suggests there could be Unicode above ASCII. Newer cases could arrive as bytes, though, and I want to be sure UTF-8 is handled in both cases. I don't know how well Python 3 will deal with a str that has UTF-8 in it.
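
For a mix like the one described here (literal \n, \t, \xNN and \uNNNN escapes, possibly alongside real non-ASCII characters), one commonly used approach is the built-in unicode_escape codec. The sketch below is not a confirmed answer from this thread: the helper name decode_escapes is hypothetical, and it assumes that any bytes input is UTF-8 and that the escapes follow Python string-literal syntax.

```python
def decode_escapes(text):
    """Turn literal backslash escapes into the characters they name.

    Hypothetical helper: assumes bytes input is UTF-8 encoded.
    """
    if isinstance(text, (bytes, bytearray)):
        text = bytes(text).decode("utf-8")
    # unicode_escape treats non-escape bytes as Latin-1, so characters
    # above U+00FF are first turned into \u / \U escapes themselves
    # (backslashreplace) and are then restored by the decode step.
    raw = text.encode("latin-1", errors="backslashreplace")
    return raw.decode("unicode_escape")


# Raw literals, so the backslashes below are real characters in the input.
from_str = decode_escapes(r"a\tb\x41\u0025 🤨")
from_bytes = decode_escapes(b"hello \\n \xf0\x9f\xa4\xa8")
```

Note that a malformed escape, such as a truncated \x9, makes the decode step raise UnicodeDecodeError, so untrusted input may need a try/except around the call.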