Python Forum
utf-8
#1
does someone want to make a very pythonic implementation for this: http://rosettacode.org/wiki/UTF-8_encode_and_decode

or any of these: http://rosettacode.org/wiki/Reports:Task..._in_Python

i think i can but my code usually ends up not being very pythonic (because i did C for so long ... so maybe i should do a C implementation).
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
I made a start on it, but there is still a bug: the last character takes up two columns.
The code is for Python 3.x.

from binascii import hexlify
import unicodedata

def str_to_hex_str_with_space(char):
   hex_bytes = hexlify(char.encode())
   hex_str = hex_bytes.decode()
   hex_str = ' '.join(hex_str[n:n+2] for n in range(0, len(hex_str), 2))
   return hex_str.upper()


def get_table(code_points):
   for code in code_points:
       char = chr(code)
       name = unicodedata.name(char)
       unic = 'U+{:05X}'.format(code)
       encoded = str_to_hex_str_with_space(char)
       yield char, name, unic, encoded, char


def print_table(code_points):
   header = ['Char', 'Name', 'Unicode', 'UTF-8', 'Decoded']
   fmt_str = '{:<10s}{:<45s}{:<10s}{:<10s}'
   print(fmt_str.format(*header))
   for row in get_table(code_points):
       row = fmt_str.format(*row)
       print(row)


if __name__ == '__main__':
   code_points = (0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E)
   print_table(code_points)
Output:
Char      Name                                         Unicode   UTF-8
A         LATIN CAPITAL LETTER A                       U+00041   41
ö         LATIN SMALL LETTER O WITH DIAERESIS          U+000F6   C3 B6
Ж         CYRILLIC CAPITAL LETTER ZHE                  U+00416   D0 96
€         EURO SIGN                                    U+020AC   E2 82 AC
?         MUSICAL SYMBOL G CLEF                        U+1D11E   F0 9D 84 9E
Someone else should improve it.
Don't post this code as it is. What I would like: a good replacement for str_to_hex_str_with_space,
and a bugfix for the last character, which takes up two columns.
Once that is fixed, we should merge get_table and print_table into one function.
It's also possible to use the Unicode characters directly instead of code points.

Edit: I'm not able to post the last character. The forum doesn't accept it :-(
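
One possible replacement for str_to_hex_str_with_space, offered as a sketch and not a definitive answer: iterating over a bytes object in Python 3 yields integers, so each byte can be formatted directly without hexlify. The names utf8_hex and decode_utf8_hex are made up for this example. Two observations about the code above that may relate to the layout problem: the format string has four fields while each row carries five values (str.format silently ignores the extra one, so the Decoded column never prints), and 'F0 9D 84 9E' is 11 characters, one more than the 10-character UTF-8 column.

```python
def utf8_hex(char):
    """Return the UTF-8 bytes of char as uppercase hex pairs separated by spaces."""
    # Iterating over bytes yields ints in Python 3, so each one can be formatted directly.
    return ' '.join('{:02X}'.format(byte) for byte in char.encode('utf-8'))


def decode_utf8_hex(hex_str):
    """Inverse of utf8_hex: turn a string like 'E2 82 AC' back into the character."""
    # bytes.fromhex accepts spaces between the byte pairs.
    return bytes.fromhex(hex_str).decode('utf-8')


for code in (0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E):
    encoded = utf8_hex(chr(code))
    # Round trip: decoding the hex string gives back the original character.
    assert decode_utf8_hex(encoded) == chr(code)
    print('U+{:05X}  {}'.format(code, encoded))
```

On Python 3.8 and later, char.encode().hex(' ') produces the same grouping in lowercase.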
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
#3
the site is comparing languages. i wonder if it is meant to compare implementations written in each language, or to compare how you would program the task with each language. the latter would mean built-in/library solutions that come with the language can be used. for example, a task to convert a string to lower case would be done with .lower() instead of looping through the characters of the string and mapping each one through a lower-case table.
#4
The example code on Rosetta Code doesn't show the implementation; it shows how to use a programming language and its library functions.
If you want to implement Unicode by yourself, good luck :-D
We don't need to do that in Python 3. Its strings are natively Unicode.
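
To make the contrast concrete, here is a rough sketch of what a hand-rolled encoder looks like next to the built-in str.encode. The function name utf8_encode is invented for this illustration, and it skips validation (surrogates and values above U+10FFFF are not rejected), so it is not a substitute for the library.

```python
def utf8_encode(code_point):
    """Encode one code point to UTF-8 by hand (illustration only, no validation)."""
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for cp in (0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E):
    # The built-in one-liner gives the same bytes as the manual bit shuffling.
    assert utf8_encode(cp) == chr(cp).encode('utf-8')
    assert utf8_encode(cp).decode('utf-8') == chr(cp)
```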
#5
(Jun-26-2017, 03:50 AM)DeaD_EyE Wrote: If you want to implement unicode by your self, good luck :-D
BTDT in C.
(Jun-26-2017, 03:50 AM)DeaD_EyE Wrote: We don't need this in Python 3. It's native unicode.
right! and i think we should show that and how easily it is done.
what i don't know, and don't really need to know, but am curious about, is how much of this is available in Python 2.

anyone starting new projects today can likely go with Python 3, so i think that should be the focus on Rosetta Code.
#6
As mentioned and shown by @DeaD_EyE, it's more about converting stuff than showing the use of Unicode.
The name str_to_hex_str_with_space is a little overboard, @DeaD_EyE.

Some conversions:
>>> code_points = (0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E)
>>> uni = [chr(i) for i in code_points]
>>> uni
['A', 'ö', 'Ж', '€', '?']

>>> for c in ''.join(uni): print('U+{:04x}'.format(ord(c)))
U+0041
U+00f6
U+0416
U+20ac
U+1d11e
Showing the difference in Unicode handling between Python 2 and 3 is easy.
In Python 3, str represents a Unicode string; in Python 2, str holds bytes.
# Python 3.6
>>> s = '200€ and a ☂'
>>> type(s)
<class 'str'>
>>> s
'200€ and a ☂'
>>> print(s)
200€ and a ☂
# Python 2.7
>>> s = '200€ and a ☂'
>>> s
'200\xe2\x82\xac and a \xe2\x98\x82'
>>> print s
200€ and a ☂
# Have to decode from UTF-8 to get a unicode string
>>> print s.decode('utf-8')
200€ and a ☂

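For comparison, a sketch of the same round trip in Python 3, where the split between text and bytes is explicit: str.encode produces the byte sequence that Python 2.7 displayed above, and bytes.decode turns it back into text.

```python
s = '200€ and a ☂'
data = s.encode('utf-8')

# The encoded form matches the escaped bytes Python 2.7 shows for the same literal.
assert data == b'200\xe2\x82\xac and a \xe2\x98\x82'

# str counts characters, bytes counts bytes: € and ☂ take three bytes each.
print(len(s), len(data))

# Decoding restores the original text.
assert data.decode('utf-8') == s
```
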
It is important to think about getting Unicode in and out of Python 3:
decode on the way in, encode on the way out.
with open('some_file', encoding='utf-8') as f:
   print(f.read())
open() also takes an errors parameter for reading files with malformed encoding, such as errors='ignore' or errors='replace'.
with open('some_file', encoding='utf-8', errors='ignore') as f:
   print(f.read())
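
The errors parameter means the same thing for bytes.decode as it does for open(), so its effect can be sketched without a file. The byte string below is made up for the example: a stray Latin-1 byte followed by a valid UTF-8 euro sign.

```python
raw = b'caf\xe9 \xe2\x82\xac'   # \xe9 is Latin-1 'é', which is not valid UTF-8 here

try:
    raw.decode('utf-8')          # the default is errors='strict'
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print('strict decoding fails:', exc.reason)

ignored = raw.decode('utf-8', errors='ignore')     # drops the bad byte
replaced = raw.decode('utf-8', errors='replace')   # substitutes U+FFFD for it

print(repr(ignored))
print(repr(replaced))
```

Both handlers lose information, so they are best kept for data that is known to be damaged.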
Another option is to read the file as bytes (mode 'rb') and then try to decode it.
>>> ch = open('chinese.txt', 'rb').read()
>>> type(ch)
<class 'bytes'>
>>> ch
b'\xef\xbb\xbfhi\xe7\x8c\xab'
>>> print(ch.decode('utf-8')) # is now a string(Unicode) in python 3
hi猫
There is also no need to pass 'utf-8' to s.decode() and s.encode() in Python 3; they use UTF-8 by default.
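
A side note on the chinese.txt bytes above: the leading b'\xef\xbb\xbf' is a UTF-8 byte order mark (BOM). Decoding with 'utf-8' keeps it as an invisible U+FEFF at the start of the string, while the 'utf-8-sig' codec strips it. A small sketch using the same bytes:

```python
import codecs

ch = b'\xef\xbb\xbfhi\xe7\x8c\xab'

# The file content starts with the UTF-8 BOM.
assert ch.startswith(codecs.BOM_UTF8)

plain = ch.decode('utf-8')         # BOM survives as '\ufeff'
stripped = ch.decode('utf-8-sig')  # BOM is removed

print(repr(plain))
print(repr(stripped))
```

The same codec name works with open('chinese.txt', encoding='utf-8-sig').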

If all else fails, use ftfy, which "fixes Unicode that's broken in various ways".
Another tip is to always use Requests when reading from a website.
Requests decodes the response for you using the encoding it detects (urllib does not do that; it hands back bytes).
#7
(Jun-27-2017, 11:16 AM)snippsat Wrote: A little over board on name here @DeaD_EyE str_to_hex_str_with_space

Yes, it is too long. I use long names for better teaching. If you read only the name, you know what it does.
In real applications we don't use such long names. The biggest problem is finding a good name for something.
#8
they want to have characters with high values (e.g. in the big unicode space) converted to utf-8.  they show examples with hexadecimal encoding of integer values for the test characters.  so i would just read the input lines as strings, for each line: split the string into parts, find the part that begins with 'U+'. convert what is after that to an int with int(part[2:],16), encode that into utf-8 bytes, convert the bytes into hex, print out the hex appended to the input line.

something like this untested code:
import sys

for line in sys.stdin:
    line = line.rstrip('\n')  # keep the hex on the same output line
    for token in line.split():
        if token[:2].lower() == 'u+':
            # hex code point -> character -> UTF-8 bytes
            utf8 = chr(int(token[2:], 16)).encode('utf-8')
            print(line, ' '.join('{:02X}'.format(b) for b in utf8))
in python3, of course
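
A self-contained variant of the same idea that runs without piping anything into stdin, as a sketch: the function name append_utf8_hex and the sample lines are made up for this example, and io.StringIO stands in for sys.stdin. Unlike the loop above, it stops at the first 'U+' token on each line.

```python
import io


def append_utf8_hex(lines):
    """Yield each line followed by the UTF-8 hex of its first 'U+XXXX' token."""
    for line in lines:
        line = line.rstrip('\n')
        for token in line.split():
            if token[:2].lower() == 'u+':
                # hex code point -> character -> UTF-8 bytes
                utf8 = chr(int(token[2:], 16)).encode('utf-8')
                yield line + ' ' + ' '.join('{:02X}'.format(b) for b in utf8)
                break


# Hypothetical input; any iterable of lines works, including sys.stdin.
sample = io.StringIO(
    'LATIN CAPITAL LETTER A U+0041\n'
    'EURO SIGN U+20AC\n'
    'MUSICAL SYMBOL G CLEF U+1D11E\n'
)

results = list(append_utf8_hex(sample))
for row in results:
    print(row)
```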
#9
Post code that can be run, @Skaperen, without having to guess what sys.stdin is.