utf-8 - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: General (https://python-forum.io/forum-1.html)
+--- Forum: News and Discussions (https://python-forum.io/forum-31.html)
+--- Thread: utf-8 (/thread-3791.html)

utf-8 - Skaperen - Jun-25-2017

does someone want to make a very pythonic implementation for this: http://rosettacode.org/wiki/UTF-8_encode_and_decode

or any of these: http://rosettacode.org/wiki/Reports:Tasks_not_implemented_in_Python

i think i can but my code usually ends up not being very pythonic (because i did C for so long ... so maybe i should do a C implementation).

RE: utf-8 - DeaD_EyE - Jun-25-2017

I started on it, but there is still a bug in it: the last character takes up two columns.
Code is for Python 3.x

from binascii import hexlify
import unicodedata

def str_to_hex_str_with_space(char):
   hex_bytes = hexlify(char.encode())
   hex_str = hex_bytes.decode()
   hex_str = ' '.join(hex_str[n:n+2] for n in range(0, len(hex_str), 2))
   return hex_str.upper()


def get_table(code_points):
   for code in code_points:
       char = chr(code)
       name = unicodedata.name(char)
       unic = 'U+{:05X}'.format(code)
       encoded = str_to_hex_str_with_space(char)
       yield char, name, unic, encoded, char


def print_table(code_points):
   header = ['Char', 'Name', 'Unicode', 'UTF-8', 'Decoded']
   fmt_str = '{:<10s}{:<45s}{:<10s}{:<10s}'
   print(fmt_str.format(*header))
   for row in get_table(code_points):
       row = fmt_str.format(*row)
       print(row)


if __name__ == '__main__':
   code_points = (0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E)
   print_table(code_points)

Output:

Char      Name                                         Unicode   UTF-8     
A         LATIN CAPITAL LETTER A                       U+00041   41        
ö         LATIN SMALL LETTER O WITH DIAERESIS          U+000F6   C3 B6     
Ж         CYRILLIC CAPITAL LETTER ZHE                  U+00416   D0 96     
€         EURO SIGN                                    U+020AC   E2 82 AC  
?         MUSICAL SYMBOL G CLEF                        U+1D11E   F0 9D 84 9E

Someone else should improve it.
Don't post this code. What I wish for: a good replacement for str_to_hex_str_with_space
and a bug fix for the last character, which takes up two columns.
After that has been fixed, we should put get_table and print_table into one function.
It's also possible to use the Unicode characters directly instead of code points.

Edit: I'm not able to post the last sign. The forum doesn't accept it :-(
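
A minimal sketch of one possible replacement for str_to_hex_str_with_space, using only string formatting instead of binascii. The name utf8_hex is made up for this example and is not from the post above. Note that the four-byte result 'F0 9D 84 9E' is 11 characters long, one more than the 10-wide UTF-8 column in fmt_str.

```python
def utf8_hex(char):
    """Return the UTF-8 bytes of char as space-separated uppercase hex."""
    # Iterating over a bytes object in Python 3 yields integers 0..255.
    return ' '.join('{:02X}'.format(byte) for byte in char.encode('utf-8'))


print(utf8_hex('A'))  # 41
print(utf8_hex('€'))  # E2 82 AC
```

On Python 3.8 or newer, char.encode('utf-8').hex(' ').upper() should give the same string in one call.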

RE: utf-8 - Skaperen - Jun-26-2017

the site is comparing languages. i wonder if it is meant to compare implementations in the language or to compare programming with the language. the latter would mean built-in/library solutions that come with the language can be used. that would mean a task to convert a string to lower case would be done with .lower() instead of looping through the characters of a string and mapping each one through a lower case table.

RE: utf-8 - DeaD_EyE - Jun-26-2017

The example code on Rosetta Code does not show the implementation; it shows the usage of a programming language and its library functions.
If you want to implement unicode by your self, good luck :-D
We don't need this in Python 3. It's native unicode.

RE: utf-8 - Skaperen - Jun-27-2017

(Jun-26-2017, 03:50 AM)DeaD_EyE Wrote: If you want to implement unicode by your self, good luck :-D

BTDT in C.

(Jun-26-2017, 03:50 AM)DeaD_EyE Wrote: We don't need this in Python 3. It's native unicode.

right! and i think we should show that and how easily it is done.
what i don't know, or really need to know, but am curious about, is how much of this is available in Python 2.

anyone starting new projects today can likely go with Python 3, so i think that should be the focus on Rosetta Code.
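
A rough sketch of how short the encode/decode round trip can be in Python 3 when the built-in codec is used. The function names utf8_encode and utf8_decode are invented for this example; this is not the Rosetta Code solution, just one way it might look.

```python
def utf8_encode(code_point):
    """Code point (int) -> UTF-8 encoded bytes."""
    return chr(code_point).encode('utf-8')


def utf8_decode(data):
    """UTF-8 bytes holding exactly one character -> code point (int)."""
    return ord(data.decode('utf-8'))


for cp in (0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E):
    encoded = utf8_encode(cp)
    # The round trip must give back the code point we started with.
    assert utf8_decode(encoded) == cp
    print('U+{:04X}'.format(cp), encoded)
```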

RE: utf-8 - snippsat - Jun-27-2017

As mentioned and shown by @DeaD_EyE, it's more about converting stuff than showing the use of Unicode.
A little over board on name here @DeaD_EyE str_to_hex_str_with_space

Some converting :

>>> code_points = (0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E)
>>> uni = [chr(i) for i in code_points]
>>> uni
['A', 'ö', 'Ж', '€', '?']

>>> for c in ''.join(uni): print('U+{:04x}'.format(ord(c)))
U+0041
U+00f6
U+0416
U+20ac
U+1d11e

Showing the difference in Unicode handling between Python 2 and 3 is easy.
In Python 3, str represents a Unicode string.

# Python 3.6
>>> s = '200€ and a ☂'
>>> type(s)
<class 'str'>
>>> s
'200€ and a ☂'
>>> print(s)
200€ and a ☂

# Python 2.7
>>> s = '200€ and a ☂'
>>> s
'200\xe2\x82\xac and a \xe2\x98\x82'
>>> print s
200â‚¬ and a â˜‚
# Have to decode from utf-8
>>> print s.decode('utf-8')
200€ and a ☂
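
The garbled line in the Python 2.7 session can be reproduced in Python 3. This sketch assumes the terminal there interpreted the UTF-8 bytes as cp1252; that is a guess based on the characters shown, not something stated in the post.

```python
s = '200€ and a ☂'

# Encode correctly, then decode with the wrong codec: this is mojibake.
garbled = s.encode('utf-8').decode('cp1252')
print(garbled)   # 200â‚¬ and a â˜‚

# Undoing the wrong step restores the original text.
restored = garbled.encode('cp1252').decode('utf-8')
print(restored)  # 200€ and a ☂
```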

It is important to think about how Unicode gets in and out of Python 3.
Decode on the way in, encode on the way out:

with open('some_file', encoding='utf-8') as f:
   print(f.read())

open() also has an errors parameter for reading a file with malformed encoding, such as errors='ignore' or errors='replace'.

with open('some_file', encoding='utf-8', errors='ignore') as f:
   print(f.read())
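
A small sketch of what the two error handlers do, shown on an in-memory bytes object instead of a file so it runs without any setup. The sample bytes are made up: \xe9 is 'é' in Latin-1 but is not valid UTF-8 at that position.

```python
data = b'caf\xe9 \xe2\x82\xac'

# 'ignore' silently drops the byte that cannot be decoded.
ignored = data.decode('utf-8', errors='ignore')
print(ignored)         # caf €

# 'replace' puts U+FFFD (the replacement character) in its place.
replaced = data.decode('utf-8', errors='replace')
print(replaced)        # caf� €

# The default, errors='strict', raises instead.
try:
    data.decode('utf-8')
except UnicodeDecodeError as exc:
    print('strict failed:', exc.reason)
```

open() passes the same errors values to the decoder, so the behaviour on a file should match.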

Another option is to read it as bytes ('rb') and then try to convert.

>>> ch = open('chinese.txt', 'rb').read()
>>> type(ch)
<class 'bytes'>
>>> ch
b'\xef\xbb\xbfhi\xe7\x8c\xab'
>>> print(ch.decode('utf-8')) # is now a string(Unicode) in python 3
hi猫
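
One detail about the bytes above: they start with EF BB BF, the UTF-8 byte order mark. Decoding with 'utf-8' keeps it as an invisible U+FEFF at the start of the string. A short sketch, using the standard 'utf-8-sig' codec to strip it:

```python
ch = b'\xef\xbb\xbfhi\xe7\x8c\xab'

with_bom = ch.decode('utf-8')         # keeps U+FEFF as the first character
without_bom = ch.decode('utf-8-sig')  # drops a leading BOM if there is one

print(len(with_bom))     # 4
print(len(without_bom))  # 3
print(without_bom)       # hi猫
```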

There is also no need to pass 'utf-8' to s.decode() and s.encode(); they use utf-8 as the default.
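
A quick check of that default in Python 3 (in Python 2 the default codec is ASCII, so this applies to Python 3 only):

```python
text = 'Ж'

data = text.encode()   # same as text.encode('utf-8')
print(data)            # b'\xd0\x96'

back = data.decode()   # same as data.decode('utf-8')
print(back)            # Ж
```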

If all else fails, use ftfy, which fixes Unicode that's broken in various ways.

Another tip is to use Requests when reading from a website.
Requests decodes the response text for you using the encoding it detects, whereas urllib returns raw bytes that you have to decode yourself.

RE: utf-8 - DeaD_EyE - Jun-27-2017

(Jun-27-2017, 11:16 AM)snippsat Wrote: A little over board on name here @DeaD_EyE str_to_hex_str_with_space

Yes, it is too long. I use long names for better teaching. If you read only the name, you know what it does.
In real applications we don't use such long names. The biggest problem is finding a good name for something.

RE: utf-8 - Skaperen - Jun-29-2017

they want to have characters with high values (e.g. in the big unicode space) converted to utf-8. they show examples with hexadecimal encoding of integer values for the test characters. so i would just read the input lines as strings, for each line: split the string into parts, find the part that begins with 'U+'. convert what is after that to an int with int(part[2:],16), encode that into utf-8 bytes, convert the bytes into hex, print out the hex appended to the input line.

something like this untested code:

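import sys

for line in sys.stdin:
    line = line.rstrip('\n')  # drop the newline so the hex lands on the same line
    for token in line.split():
        if token[:2].lower() == 'u+':
            # code point -> character -> UTF-8 bytes
            utf8 = chr(int(token[2:], 16)).encode('utf-8')
            print(line, ' '.join('{:02X}'.format(b) for b in utf8))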

in python3, of course

RE: utf-8 - snippsat - Jun-29-2017

Post code that can be run, @Skaperen, without guessing what sys.stdin is.
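
A sketch of the same idea that runs without any input on stdin. The function name append_utf8_hex and the sample lines are made up for illustration; the parsing follows the 'U+' approach described in the previous post.

```python
def append_utf8_hex(lines):
    """Yield each line that contains a U+XXXX token with its UTF-8 hex appended."""
    for line in lines:
        line = line.rstrip('\n')
        for token in line.split():
            if token[:2].lower() == 'u+':
                # code point -> character -> UTF-8 bytes
                utf8 = chr(int(token[2:], 16)).encode('utf-8')
                hex_str = ' '.join('{:02X}'.format(b) for b in utf8)
                yield '{} {}'.format(line, hex_str)


sample = [
    'A LATIN CAPITAL LETTER A U+0041',
    '€ EURO SIGN U+20AC',
]
for out in append_utf8_hex(sample):
    print(out)
# A LATIN CAPITAL LETTER A U+0041 41
# € EURO SIGN U+20AC E2 82 AC
```

Passing sys.stdin instead of the sample list should give the behaviour of the stdin version, since a file object also yields lines when iterated.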