Posts: 4,648
Threads: 1,495
Joined: Sep 2016
i have written some code that converts
Unicode to
UTF-8 and code that converts
UTF-8 to
Unicode. i have also written code that converts bytes to
printable escape sequences and the reverse. what i am trying to decide is if i should make this into a generator, and if so, whether it should do it's thing with
yield or with
sends. this impacts how the programmer using this calls it.whether they will be providing date by making their own generator doing
yields or doing
sends]. since there is an input and an output. for the escape sequence code, a pipeline filter is another option. which way do
you think i should make it go?
Tradition is peer pressure from dead people
What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
Posts: 4,648
Threads: 1,495
Joined: Sep 2016
Sep-11-2018, 09:00 PM
(This post was last modified: Sep-11-2018, 09:00 PM by Skaperen.)
converting Unicode to UTF-8 involves taking the numeric value of a Unicode code point and performing some bit shifting to produce 1 or more output octets, numeric values within specified ranges within range(128,256). Unicode values in range(0,128) are unchanged meaning that a 7-bit ASCII stream can be used as a UTF-8 compliant stream with no changes. Unicode character values in the range (128,1114112) are converted to 2 or more octets for transmission or storage as an 8-bit stream. the exact details of the conversion is shown in the Unicode standard and Wikipedia article. i presume the Python code has this somewhere in either Python or C. my code definitely shows it and it is in Python. my code includes Modified UTF-8 as an option, as well as support for Extended UTF-8 and Ultra UTF-8 (both of which are just greater ranges that can be converted). i have no idea if the Python encoding can do MUTF-8, XUTF-8, or UUTF-8, but i very much doubt MUTF-8 since you should have a means to enable/disable it and i have seen none (but my code has an option flag). the Python code also does handle surrogates.
conversion of UTF-8 to Unicode is simply a matter of reversing the conversion process to get the Unicode values from the stream of UTF-8 octets. but it also involves validating the octets as there are many octet sequences that are not valid.
Tradition is peer pressure from dead people
What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.