I am getting some large strings (or bytes) containing many kinds of backslash escape sequences, including Unicode escapes. Is there a codec that can decode all of these and produce a proper string (str)?
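To illustrate the kind of transformation being asked about, here is a minimal sketch using Python's unicode_escape codec on a made-up sample (whether it covers every case in the real data is the open question):

    import codecs

    # a str containing literal backslash escapes, not real control characters
    s = r"hello\nworld \x25 \u0025"
    print(codecs.decode(s, "unicode_escape"))   # prints "hello", then "world % %"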
I don't know if there is any UTF-8 in it, but to be safe I should probably assume there can be (even when it arrives as str). I do know there is a variety of backslash escaping in it: things like \n and \t, plus some like \x99 and \u9999 (I'm making up the numbers here). I doubt the backslash itself or the hex digits are encoded in anything else, so there should be no need to double-decode. But some of the escaping looks inconsistent, such as \u0025 where \x25 would have done, even though \x escapes do appear elsewhere. The encoding sure is odd.
Right now everything arrives as str, which suggests there could be Unicode above ASCII, but newer cases could arrive as bytes. I want to be sure whatever I use handles UTF-8 in both cases; I don't know how well Python 3 deals with a str that carries UTF-8 byte sequences in it.
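A minimal sketch of how unicode_escape could cover both the str and the bytes case, assuming any raw (unescaped) non-ASCII content is UTF-8; the unescape helper is made up here, not part of any library:

    import codecs

    def unescape(data):
        r"""Resolve backslash escapes (\n, \t, \xNN, \uNNNN) in str or bytes.

        Assumption: any raw (unescaped) non-ASCII content is UTF-8.
        """
        if isinstance(data, bytes):
            # unicode_escape reads each input byte as latin-1, so resolve
            # the escapes first, then round-trip to recover raw UTF-8 bytes
            text = data.decode("unicode_escape")
            try:
                return text.encode("latin-1").decode("utf-8")
            except (UnicodeEncodeError, UnicodeDecodeError):
                # an escape produced a code point above U+00FF, or the
                # raw bytes were not valid UTF-8; keep the decoded text
                return text
        # str input: the codec encodes to latin-1 internally, so this
        # raises UnicodeEncodeError if the str already holds characters
        # above U+00FF
        return codecs.decode(data, "unicode_escape")

    print(unescape(r"tab\there \x25 \u0025 \u9999"))       # -> "tab here % % 香" (real tab)
    print(unescape("café ".encode("utf-8") + rb"\u0025"))  # -> "café %"

One caveat: if a single bytes value mixes raw UTF-8 with \uNNNN escapes above U+00FF, the latin-1 round-trip fails and the raw UTF-8 parts come back as mojibake; that mixed case has no clean single-codec answer.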