utf-8 - Printable Version
+- Python Forum (https://python-forum.io)
+-- Forum: General (https://python-forum.io/forum-1.html)
+--- Forum: News and Discussions (https://python-forum.io/forum-31.html)
+--- Thread: utf-8 (/thread-3791.html)

utf-8 - Skaperen - Jun-25-2017

does someone want to make a very pythonic implementation for this: http://rosettacode.org/wiki/UTF-8_encode_and_decode

or any of these: http://rosettacode.org/wiki/Reports:Tasks_not_implemented_in_Python

i think i can, but my code usually ends up not being very pythonic (because i did C for so long ... so maybe i should do a C implementation).

RE: utf-8 - DeaD_EyE - Jun-25-2017

I started on it, but there is still a bug in it: the last character takes up two columns. The code is for Python 3.x.

```python
from binascii import hexlify
import unicodedata


def str_to_hex_str_with_space(char):
    # Encode the character as UTF-8 and render the bytes as spaced, upper-case hex.
    hex_bytes = hexlify(char.encode())
    hex_str = hex_bytes.decode()
    hex_str = ' '.join(hex_str[n:n+2] for n in range(0, len(hex_str), 2))
    return hex_str.upper()


def get_table(code_points):
    # Yield one table row per code point.
    for code in code_points:
        char = chr(code)
        name = unicodedata.name(char)
        unic = 'U+{:05X}'.format(code)
        encoded = str_to_hex_str_with_space(char)
        yield char, name, unic, encoded, char


def print_table(code_points):
    header = ['Char', 'Name', 'Unicode', 'UTF-8', 'Decoded']
    # One field per header column; the UTF-8 column must fit four spaced hex bytes.
    fmt_str = '{:<10s}{:<45s}{:<10s}{:<15s}{:<10s}'
    print(fmt_str.format(*header))
    for row in get_table(code_points):
        row = fmt_str.format(*row)
        print(row)


if __name__ == '__main__':
    code_points = (0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E)
    print_table(code_points)
```

Someone else should improve it. Don't post this code. What I wish for:

- a good replacement for str_to_hex_str_with_space
- a bugfix for the last character, which takes two spaces

After it has been fixed, we should put get_table and print_table in one function. It's also possible to use the Unicode characters directly instead of code points.

Edit: I'm not able to post the last character. The forum doesn't accept it :-(

RE: utf-8 - Skaperen - Jun-26-2017

the site is comparing languages. i wonder if it is meant to compare implementations in the language or to compare programming with the language. the latter would mean built-in/library solutions that come with the language can be used.
that would mean a task to convert a string to lower case would be done with .lower() instead of looping through the characters of the string and mapping each one through a lower case mapping.

RE: utf-8 - DeaD_EyE - Jun-26-2017

The example code on Rosetta Code doesn't show the implementation; it shows how to use a programming language and its library functions. If you want to implement Unicode by yourself, good luck :-D We don't need this in Python 3. It's native Unicode.

RE: utf-8 - Skaperen - Jun-27-2017

(Jun-26-2017, 03:50 AM)DeaD_EyE Wrote: If you want to implement Unicode by yourself, good luck :-D

BTDT in C.

(Jun-26-2017, 03:50 AM)DeaD_EyE Wrote: We don't need this in Python 3. It's native Unicode.

right! and i think we should show that and how easily it is done. what i don't know, and don't really need to know but am curious about, is how much of this is available in Python 2. anyone starting new projects today can likely go with Python 3, so i think that should be the focus on Rosetta Code.

RE: utf-8 - snippsat - Jun-27-2017

As mentioned and shown by @DeaD_EyE, it's more about converting stuff than showing the use of Unicode. A little overboard on the name here @DeaD_EyE: str_to_hex_str_with_space

Some converting:

```python
>>> code_points = (0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E)
>>> uni = [chr(i) for i in code_points]
>>> uni
['A', 'ö', 'Ж', '€', '𝄞']
>>> for c in ''.join(uni):
...     print('U+{:04x}'.format(ord(c)))
...
U+0041
U+00f6
U+0416
U+20ac
U+1d11e
```

Showing the difference in Unicode between Python 2 and 3 is easy.
In Python 3, str represents a Unicode string.

```python
# Python 3.6
>>> s = '200€ and a ☂'
>>> type(s)
<class 'str'>
>>> s
'200€ and a ☂'
>>> print(s)
200€ and a ☂

# Python 2.7
>>> s = '200€ and a ☂'
>>> s
'200\xe2\x82\xac and a \xe2\x98\x82'
>>> print s
200€ and a ☂
# Have to decode from utf-8 to get a unicode string
>>> print s.decode('utf-8')
200€ and a ☂
```

It's important to think about getting Unicode in and out of Python 3, decoding on the way in and encoding on the way out:

```python
with open('some_file', encoding='utf-8') as f:
    print(f.read())
```

open() also has an errors parameter for dealing with a malformed file, such as errors='ignore' or errors='replace'.

```python
with open('some_file', encoding='utf-8', errors='ignore') as f:
    print(f.read())
```

Another option is to read the file as bytes (rb) and then try to convert.

```python
>>> ch = open('chinese.txt', 'rb').read()
>>> type(ch)
<class 'bytes'>
>>> ch
b'\xef\xbb\xbfhi\xe7\x8c\xab'
>>> print(ch.decode('utf-8'))  # is now a string (Unicode) in Python 3
hi猫
```

There is also no need to pass 'utf-8' to s.decode() and s.encode() in Python 3; they use UTF-8 as the default.

If all else fails, use ftfy, which fixes Unicode that's broken in various ways.

Another tip is to always use Requests when reading from a website. Requests gives the correct encoding back (urllib does not do that).

RE: utf-8 - DeaD_EyE - Jun-27-2017

(Jun-27-2017, 11:16 AM)snippsat Wrote: A little overboard on the name here @DeaD_EyE

Yes, it is too long. I use long names for better teaching. If you read only the name, you know what it does. In real applications we don't use such long names. The biggest problem is finding a good name for something.

RE: utf-8 - Skaperen - Jun-29-2017

they want to have characters with high values (e.g. in the big unicode space) converted to utf-8. they show examples with hexadecimal encoding of integer values for the test characters. so i would just read the input lines as strings, and for each line: split the string into parts, find the part that begins with 'U+',
convert what is after that to an int with int(part[2:],16), encode that into utf-8 bytes, convert the bytes into hex, and print out the hex appended to the input line. something like this untested code:

```python
import sys

for line in sys.stdin:
    tokens = line.split()
    for token in tokens:
        if token[:2].lower() == 'u+':
            utf8 = chr(int(token[2:], 16)).encode()
            # strip the trailing newline so the hex lands on the same line
            print(line.rstrip(), ' '.join([hex(c).replace('x', '')[-2:] for c in utf8]).upper())
```

in python3, of course

RE: utf-8 - snippsat - Jun-29-2017

Post code that can be run, @Skaperen, without guessing what sys.stdin is.
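
One way to make the steps described above runnable without depending on sys.stdin is to pass the input lines in explicitly. The following is only a sketch under assumptions: the function names utf8_hex, decode_hex and annotate, and the sample lines, are made up for illustration and do not come from the thread or from Rosetta Code. It relies on the built-in str.encode() and bytes.decode() rather than implementing UTF-8 by hand.

```python
def utf8_hex(code_point):
    """Return the UTF-8 bytes of a code point as spaced, upper-case hex."""
    return ' '.join('{:02X}'.format(b) for b in chr(code_point).encode('utf-8'))


def decode_hex(hex_str):
    """Turn spaced hex such as 'E2 82 AC' back into the character it encodes."""
    return bytes.fromhex(hex_str).decode('utf-8')


def annotate(lines):
    """Yield each line that has a 'U+XXXX' token, with the UTF-8 hex appended."""
    for line in lines:
        line = line.rstrip()
        for token in line.split():
            if token[:2].lower() == 'u+':
                yield '{} {}'.format(line, utf8_hex(int(token[2:], 16)))
                break  # only the first code point on a line is used


# Hypothetical sample input standing in for sys.stdin.
sample = [
    'LATIN CAPITAL LETTER A U+0041',
    'EURO SIGN U+20AC',
    'MUSICAL SYMBOL G CLEF U+1D11E',
]

if __name__ == '__main__':
    for row in annotate(sample):
        print(row)
```

Because annotate() accepts any iterable of strings, the same function should also work on sys.stdin or an open file; lines without a 'U+' token are skipped.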