utf-8

***snippsat*** · (This post was last modified: Jun-27-2017, 11:16 AM by snippsat.)

As mentioned and shown by @DeaD_EyE, it's more about converting stuff than showing use of Unicode.
A little overboard on the name here @DeaD_EyE: str_to_hex_str_with_space

Some conversions:

>>> code_points = (0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E)
>>> uni = [chr(i) for i in code_points]
>>> uni
['A', 'ö', 'Ж', '€', '𝄞']

>>> for c in ''.join(uni): print('U+{:04x}'.format(ord(c)))
U+0041
U+00f6
U+0416
U+20ac
U+1d11e

Showing the difference in Unicode handling between Python 2 and 3 is easy.
In Python 3, str represents a Unicode string.

# Python 3.6
>>> s = '200€ and a ☂'
>>> type(s)
<class 'str'>
>>> s
'200€ and a ☂'
>>> print(s)
200€ and a ☂

# Python 2.7
>>> s = '200€ and a ☂'
>>> s
'200\xe2\x82\xac and a \xe2\x98\x82'
>>> print s
200â‚¬ and a â˜‚
# Have to decode from utf-8
>>> print s.decode('utf-8')
200€ and a ☂
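
The garbled output in the Python 2.7 session is what UTF-8 bytes look like when shown through a single-byte code page. A minimal sketch in Python 3 that reproduces it, assuming the terminal was using cp1252 (the session above doesn't say which code page it was):

```python
# Python 3: reproduce the mojibake from the Python 2.7 session.
# Assumption: the terminal interpreted the bytes as cp1252.
s = '200€ and a ☂'
raw = s.encode('utf-8')          # the bytes a Python 2 str would hold
garbled = raw.decode('cp1252')   # what a cp1252 terminal would display
print(raw)
print(garbled)

# Undoing the wrong decode step gives the original text back.
fixed = garbled.encode('cp1252').decode('utf-8')
print(fixed)
```

So the data was never damaged, it was only decoded with the wrong codec on the way out.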

It's important to think about getting Unicode in and out of Python 3:
decode on the way in, encode on the way out.

with open('some_file', encoding='utf-8') as f:
   print(f.read())

open() also has an errors parameter for handling a file with malformed encoding, like errors='ignore' or errors='replace'.

with open('some_file', encoding='utf-8', errors='ignore') as f:
   print(f.read())
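
To see what the two handlers actually do, here is a small sketch. The sample bytes and the temporary file are made up for the demo: valid UTF-8 with one stray 0xff byte in the middle.

```python
import os
import tempfile

# Hypothetical sample: valid UTF-8 text with one invalid byte (0xff).
data = b'caf\xc3\xa9 \xff ok'

results = {}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'some_file')
    with open(path, 'wb') as f:
        f.write(data)

    for mode in ('ignore', 'replace'):
        with open(path, encoding='utf-8', errors=mode) as f:
            results[mode] = f.read()

print(repr(results['ignore']))   # the bad byte is dropped silently
print(repr(results['replace']))  # the bad byte becomes U+FFFD
```

With 'ignore' you lose data without any trace, so 'replace' is often the safer choice when you want to see where the file is broken.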

Another option is to read it as bytes ('rb') and then try to convert.

>>> ch = open('chinese.txt', 'rb').read()
>>> type(ch)
<class 'bytes'>
>>> ch
b'\xef\xbb\xbfhi\xe7\x8c\xab'
>>> print(ch.decode('utf-8')) # is now a string(Unicode) in python 3
hi猫

There is also no need to pass 'utf-8' to decode() and encode() in Python 3, they use utf-8 as the default.
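
A quick check of the default, using the same bytes as in the chinese.txt example. One extra detail that is my addition, not from the session above: those bytes start with a UTF-8 BOM (\xef\xbb\xbf), which plain utf-8 keeps as an invisible '\ufeff' character, while the standard 'utf-8-sig' codec strips it.

```python
s = 'hi猫'

# encode() and decode() without arguments use utf-8 in Python 3.
b = s.encode()
same_as_explicit = (b == s.encode('utf-8'))
round_trip = b.decode()

# Bytes from the chinese.txt example, starting with a UTF-8 BOM.
ch = b'\xef\xbb\xbfhi\xe7\x8c\xab'
with_bom = ch.decode()                # BOM kept as '\ufeff'
without_bom = ch.decode('utf-8-sig')  # BOM stripped

print(same_as_explicit, round_trip)
print(repr(with_bom), repr(without_bom))
```

The same codec name works in open(), e.g. encoding='utf-8-sig', if the file may or may not have a BOM.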

If all else fails, use ftfy, which fixes Unicode that's broken in various ways.

Another tip is to always use Requests when reading from a website,
Requests gives the correct encoding back (urllib does not do that).
