Python Forum
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
maybe a generator?
#1
i have written some code that converts Unicode to UTF-8 and code that converts UTF-8 to Unicode. i have also written code that converts bytes to printable escape sequences and the reverse. what i am trying to decide is if i should make this into a generator, and if so, whether it should do it's thing with yield or with sends. this impacts how the programmer using this calls it.whether they will be providing date by making their own generator doing yields or doing sends]. since there is an input and an output. for the escape sequence code, a pipeline filter is another option. which way do you think i should make it go?
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
Reply
#2
(Sep-11-2018, 04:53 AM)Skaperen Wrote: i have written some code that converts Unicode to UTF-8 and code that converts UTF-8 to Unicode.
from wikipedia:
Quote:Unicode can be implemented by different character encodings. The Unicode standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16 and UCS-2, a precursor of UTF-16.

what exactly is conversion from Unicode to UTF-8 and vice versa? i.e. UTF-8 is one of encodings that implement Unicode standard...
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

Reply
#3
converting Unicode to UTF-8 involves taking the numeric value of a Unicode code point and performing some bit shifting to produce 1 or more output octets, numeric values within specified ranges within range(128,256). Unicode values in range(0,128) are unchanged meaning that a 7-bit ASCII stream can be used as a UTF-8 compliant stream with no changes. Unicode character values in the range (128,1114112) are converted to 2 or more octets for transmission or storage as an 8-bit stream. the exact details of the conversion is shown in the Unicode standard and Wikipedia article. i presume the Python code has this somewhere in either Python or C. my code definitely shows it and it is in Python. my code includes Modified UTF-8 as an option, as well as support for Extended UTF-8 and Ultra UTF-8 (both of which are just greater ranges that can be converted). i have no idea if the Python encoding can do MUTF-8, XUTF-8, or UUTF-8, but i very much doubt MUTF-8 since you should have a means to enable/disable it and i have seen none (but my code has an option flag). the Python code also does handle surrogates.

conversion of UTF-8 to Unicode is simply a matter of reversing the conversion process to get the Unicode values from the stream of UTF-8 octets. but it also involves validating the octets as there are many octet sequences that are not valid.
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
Reply


Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020