A little more about the old friend/enemy Unicode 🧲 in Python 3.
As mentioned, Python 3 strings are sequences of Unicode code points.
Data is held either as text (str, Unicode code points) or as bytes, and the two can never be mixed 🧬.
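To make the "never mixed" point concrete, here is a minimal sketch. The helper name try_mix is made up for this illustration; it only shows that Python raises TypeError when str and bytes are combined, and that an explicit encode/decode is needed first.

```python
def try_mix(text: str, raw: bytes) -> str:
    """Attempt str + bytes and report the error Python raises (hypothetical helper)."""
    try:
        return text + raw  # str and bytes cannot be concatenated
    except TypeError as exc:
        return f"TypeError: {exc}"


greeting = "hello"
suffix = b" world"

mixed_result = try_mix(greeting, suffix)

# Convert explicitly so both operands have the same type.
joined_text = greeting + suffix.decode("utf-8")
joined_bytes = greeting.encode("utf-8") + suffix

print(mixed_result)
print(joined_text, joined_bytes)
```

Which direction to convert depends on what you are doing: decode to str for text processing, encode to bytes for files opened in binary mode or for sockets.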
Python 3 follows the Unicode standard; Python 3.9 and 3.10 both use Unicode 13.0.
>>> from unicodedata import unidata_version
>>>
>>> # Python 3.10 (Unicode 13.0 defines 143,859 characters)
>>> unidata_version
'13.0.0'
>>> # Python 3.7
>>> unidata_version
'11.0.0'
![[Image: 1*b_fkruR5t9-r_t5Tj1JocQ.png]](https://miro.medium.com/max/506/1*b_fkruR5t9-r_t5Tj1JocQ.png)
The default encoding used here is utf-8.
>>> s = 'hello🖐'
>>> s
'hello🖐'
>>>
>>> b = s.encode() # Same as s.encode('utf-8')
>>> b
b'hello\xf0\x9f\x96\x90'
>>> b.decode() # Same as b.decode('utf-8')
'hello🖐'
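A small sketch of the difference between code points and bytes, using the string from the session above (the emoji is U+1F590, which matches the bytes \xf0\x9f\x96\x90). The variable names are only for illustration.

```python
s = "hello\U0001F590"  # the emoji written as an escape, so the source stays ASCII
b = s.encode("utf-8")

# One entry per code point, formatted the way the Unicode standard writes them.
code_points = [f"U+{ord(ch):04X}" for ch in s]

print(len(s))  # number of code points in the str
print(len(b))  # number of bytes after UTF-8 encoding
print(code_points)
```

The str has 6 code points while the UTF-8 encoding takes 9 bytes, because the emoji alone needs 4 bytes.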
Since Python 3.7 there is a UTF-8 Mode (PEP 540). When it is enabled (python -X utf8, or the environment variable PYTHONUTF8=1), utf-8 is forced in several more places.
The problem it solves is that the locale is frequently misconfigured.
An obvious solution suggests itself: ignore the locale encoding and use UTF-8.
- UTF-8 is used as the filesystem encoding: sys.getfilesystemencoding() returns 'UTF-8'.
- locale.getpreferredencoding() returns 'UTF-8' (the do_setlocale argument has no effect).
- sys.stdin, sys.stdout, and sys.stderr all use UTF-8 as their text encoding.
- On Unix, os.device_encoding() returns 'UTF-8' rather than the device encoding.
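The points above can be checked on your own machine with a short sketch like the following. The function name utf8_mode_report is hypothetical; sys.flags.utf8_mode is available from Python 3.7 onwards.

```python
import locale
import sys


def utf8_mode_report() -> dict:
    """Collect the encodings that UTF-8 Mode affects (hypothetical helper)."""
    return {
        "utf8_mode": bool(sys.flags.utf8_mode),
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
        # sys.stdout may be replaced or missing in some environments.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


report = utf8_mode_report()
for name, value in report.items():
    print(f"{name}: {value}")
```

Running the script once normally and once with python -X utf8 should show which values change on a given system.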
Crazy test 🧨
VS Code, as shown, handles Unicode fine.
cmd/PowerShell cannot display this Unicode;
cmder handles it better.
s = 'Crème and Spicy jalapeño'
# Non-ASCII characters in both the file name and the content
with open('uni✨code.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(s)
with open('uni✨code.txt', encoding='utf-8') as f:
    data = f.read()
    print(data)
Output:
Crème and Spicy jalapeño
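If a terminal cannot render the characters, one possible workaround is to print an ASCII-only, escaped view of the text so you can still see what it contains. This is only a sketch; console_safe is a made-up helper name, and it relies on the standard backslashreplace error handler.

```python
def console_safe(text: str) -> str:
    """Return an ASCII-only view of text, escaping everything else (hypothetical helper)."""
    return text.encode("ascii", errors="backslashreplace").decode("ascii")


sample = "Cr\u00e8me \u2728"
print(console_safe(sample))  # prints: Cr\xe8me \u2728
```

The data itself is untouched; only what gets printed changes, so the file written above still holds the real characters.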