As @DeaD_EyE mentioned and showed, it's more about converting stuff than showing the use of Unicode.
A little overboard on the name here, @DeaD_EyE: str_to_hex_str_with_space
Some conversions:
>>> code_points = (0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E)
>>> uni = [chr(i) for i in code_points]
>>> uni
['A', 'ö', 'Ж', '€', '𝄞']
>>> for c in ''.join(uni): print('U+{:04x}'.format(ord(c)))
U+0041
U+00f6
U+0416
U+20ac
U+1d11e
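To make the conversion a bit more readable, the standard library's unicodedata module can look up the official name of each code point. This is only a small sketch; describe is a made-up helper name, not something from the posts above.

```python
import unicodedata

code_points = (0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E)

def describe(cp):
    """Return 'U+XXXX NAME' for a code point (hypothetical helper)."""
    return 'U+{:04X} {}'.format(cp, unicodedata.name(chr(cp)))

for cp in code_points:
    print(describe(cp))
```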
Showing the difference in Unicode handling between Python 2 and 3 is easy.
In Python 3, str represents a Unicode string.
# Python 3.6
>>> s = '200€ and a ☂'
>>> type(s)
<class 'str'>
>>> s
'200€ and a ☂'
>>> print(s)
200€ and a ☂
# Python 2.7
>>> s = '200€ and a ☂'
>>> s
'200\xe2\x82\xac and a \xe2\x98\x82'
>>> print s
200€ and a ☂
# Have to decode from UTF-8 to get a Unicode string
>>> print s.decode('utf-8')
200€ and a ☂
It's important to think about getting Unicode in and out of Python 3:
decode on the way in and encode on the way out:
with open('some_file', encoding='utf-8') as f:
    print(f.read())
open() also has an errors parameter for dealing with a file that has malformed encoding, for example errors='ignore' or errors='replace'.
with open('some_file', encoding='utf-8', errors='ignore') as f:
    print(f.read())
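To see what the two error handlers actually do, here is a small sketch that uses bytes.decode(), which accepts the same errors values as open(). The sample bytes are made up for illustration: a valid UTF-8 string with one stray invalid byte tacked on.

```python
# Hypothetical sample: valid UTF-8 for '200€' plus a stray 0xff byte
bad = '200€'.encode('utf-8') + b'\xff'

ignored = bad.decode('utf-8', errors='ignore')    # silently drops the bad byte
replaced = bad.decode('utf-8', errors='replace')  # puts U+FFFD where the bad byte was

print(repr(ignored))
print(repr(replaced))
```

Note that 'ignore' throws data away without telling you, so 'replace' is usually easier to debug.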
Another option is to read it as bytes (rb) and then try to convert.
>>> ch = open('chinese.txt', 'rb').read()
>>> type(ch)
<class 'bytes'>
>>> ch
b'\xef\xbb\xbfhi\xe7\x8c\xab'
>>> print(ch.decode('utf-8')) # is now a string(Unicode) in python 3
hi猫
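The bytes above start with \xef\xbb\xbf, which is a UTF-8 byte order mark (BOM). Decoding with plain utf-8 keeps it as an invisible U+FEFF character at the start of the string. A small sketch, reusing the same bytes, of how the utf-8-sig codec strips it:

```python
raw = b'\xef\xbb\xbfhi\xe7\x8c\xab'  # same bytes as in the chinese.txt output

with_bom = raw.decode('utf-8')         # BOM survives as '\ufeff'
without_bom = raw.decode('utf-8-sig')  # BOM is removed

print(len(with_bom), len(without_bom))
print(repr(with_bom), repr(without_bom))
```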
In Python 3 there is also no need to pass utf-8 to s.decode() and s.encode(); they use UTF-8 as the default.
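A quick round trip in Python 3 to check the default, using the same sample string as earlier:

```python
s = '200€ and a ☂'

b = s.encode()     # no argument, UTF-8 is used
back = b.decode()  # same for decode

print(b)
print(back)
```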
If all else fails, use ftfy, which "fixes Unicode that's broken in various ways".
Another tip is to always use Requests when reading from a website.
Requests gives the correct encoding back (urllib does not do that).