the encoding= option of open() applies to the file contents (according to The Python Library Reference). there appears to be no way to specify the encoding of the file name. does this mean 'utf-8' is implied? it seems to work that way.
Aren't Python strings UTF-8?
Skaperen Wrote:does this mean 'utf-8' is implied?
I would guess the implied encoding is the one returned by sys.getfilesystemencoding(). As the documentation says, use the str type for the filename for best compatibility.
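A quick way to check this on your own system is sketched below. os.fsencode() and os.fsdecode() are the standard-library helpers that convert file names between str and bytes using the filesystem encoding and its error handler. The name report.txt is only an illustrative placeholder.

```python
import os
import sys

# The encoding and error handler Python uses for file names on this system.
fs_encoding = sys.getfilesystemencoding()
fs_errors = sys.getfilesystemencodeerrors()

name = "report.txt"  # placeholder name, not from the thread

# os.fsencode() converts a str file name to bytes with that encoding,
# and os.fsdecode() converts it back.
as_bytes = os.fsencode(name)
round_trip = os.fsdecode(as_bytes)

print(fs_encoding, fs_errors)
print(as_bytes)
print(round_trip == name)
```

The values printed on the first line depend on the platform and on how Python was started, so they are not shown here.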
ndc85430 Wrote:Aren't Python strings UTF-8?
Python strings represent UNICODE strings. Unicode strings are an abstraction without an encoding, a sequence of code points. The way they are stored in memory is another matter, an implementation detail which we should not worry about.
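To make the "sequence of code points" idea concrete, here is a small sketch. It shows that the length of a str counts code points, while the number of bytes depends entirely on which encoding you pick when you convert it.

```python
# A str is a sequence of code points; bytes only appear once you encode it.
s = "hello\U0001F590"  # same text as 'hello🖐'

# One entry per code point, regardless of how it would be stored.
code_points = [hex(ord(ch)) for ch in s]

# The same six code points need a different number of bytes per encoding.
sizes = {
    "code points": len(s),
    "utf-8 bytes": len(s.encode("utf-8")),
    "utf-16-le bytes": len(s.encode("utf-16-le")),
    "utf-32-le bytes": len(s.encode("utf-32-le")),
}

print(code_points)
print(sizes)
```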
A little more about the old friend/enemy Unicode🧲 in Python 3.
As mentioned, Python 3 strings are sequences of Unicode code points.
Text can be represented either as Unicode code points (str) or as bytes (the two can never be mixed 🧬).
Python 3 follows the Unicode standard; Python 3.9 and 3.10 both use Unicode 13.0.
>>> from unicodedata import unidata_version
>>>
# Python 3.10
>>> unidata_version
'13.0.0' # 143,859 characters supported
# Python 3.7
>>> unidata_version
'11.0.0'
![[Image: 1*b_fkruR5t9-r_t5Tj1JocQ.png]](https://miro.medium.com/max/506/1*b_fkruR5t9-r_t5Tj1JocQ.png)
The default encoding used here is utf-8.
>>> s = 'hello🖐'
>>> s
'hello🖐'
>>>
>>> b = s.encode() # Same as s.encode('utf-8')
>>> b
b'hello\xf0\x9f\x96\x90'
>>> b.decode() # Same as b.decode('utf-8')
'hello🖐'
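The point that str and bytes can never be mixed can be checked directly. In this sketch, concatenating the two types raises TypeError, and the fix is to convert one side explicitly. The variable names are just for illustration.

```python
text = "hello"
raw = b" world"

# Mixing str and bytes is an error in Python 3.
mix_error = None
try:
    text + raw
except TypeError as exc:
    mix_error = str(exc)

# Convert explicitly, in whichever direction you need.
combined = text + raw.decode("utf-8")        # str + str
combined_bytes = text.encode("utf-8") + raw  # bytes + bytes

print(mix_error)
print(combined)
print(combined_bytes)
```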
Since Python 3.7 there is also UTF-8 Mode (enabled with -X utf8 or PYTHONUTF8=1), which forces UTF-8 in several more places.
The problem it solves is that the locale is frequently misconfigured.
An obvious solution suggests itself: ignore the locale encoding and use UTF-8.
- Use UTF-8 as the filesystem encoding: sys.getfilesystemencoding() returns 'UTF-8'.
- locale.getpreferredencoding() returns 'UTF-8' (the do_setlocale argument has no effect).
- sys.stdin, sys.stdout, and sys.stderr all use UTF-8 as their text encoding.
- On Unix, os.device_encoding() returns 'UTF-8' rather than the device encoding.
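To see which of the settings listed above apply to a given interpreter, something like the following sketch can help. encoding_report is a hypothetical helper name; the calls it wraps (sys.flags.utf8_mode, sys.getfilesystemencoding(), locale.getpreferredencoding()) are standard library.

```python
import locale
import sys


def encoding_report() -> dict:
    """Collect the encodings that UTF-8 Mode influences."""
    return {
        "utf8_mode": sys.flags.utf8_mode,  # 1 when UTF-8 Mode is on
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
        # stdout may be replaced by an object without an encoding attribute
        "stdout": getattr(sys.stdout, "encoding", None),
    }


report = encoding_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Running the same script once normally and once with python -X utf8 should make any differences on your system visible.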
Crazy test 🧨
VS Code, as shown, handles Unicode fine.
cmd/PowerShell cannot display ❓ this Unicode;
cmder handles it better.
s = '🖐 Crème and Spicy jalapeño ☂'
with open('🖐uni✨code⛄.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(s)

with open('🖐uni✨code⛄.txt', encoding='utf-8') as f:
    data = f.read()
    print(data)
Output:
🖐 Crème and Spicy jalapeño ☂
![[Image: SAlNfR.png]](https://imagizer.imageshack.com/v2/xq90/924/SAlNfR.png)
so, if i use an instance of str as the name of a file to open, the name is encoded to UTF-8 before being passed on to the OS/Filesystem. but, if i use an instance of bytes as the name of a file to open, it is neither encoded nor decoded since it is already UTF-8 as the OS/Filesystem expects. now where is it defined in the code (of Python) that the OS/Filesystem is UTF-8? what if that were changed to another 8-bit code?
i presume the output of functions like os.listdir(), which returns a list of str, has decoded each name from UTF-8 to Unicode.
that code, whether in Python or C, could have a pre-check to see if every character is less than 128 to skip the decoding or encoding.
like:
...
if any(ord(x) >= 128 for x in data):
    do_coding(data, de_or_en())
...
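For what it's worth, the pre-check idea above can be written as runnable Python with str.isascii() (available since Python 3.7). This is only a sketch of the idea: encode_name is a hypothetical helper, it assumes the target encoding is an ASCII superset (true for UTF-8 and the ISO 8859 family), and nothing in this thread establishes whether CPython does anything similar internally.

```python
def encode_name(name: str, encoding: str = "utf-8") -> bytes:
    """Encode a file name, with a shortcut for pure-ASCII names.

    Assumption: `encoding` is an ASCII superset, so ASCII-only names
    produce the same bytes no matter which such encoding is used.
    """
    if name.isascii():
        # Every character is below 128: no real encoding work needed.
        return name.encode("ascii")
    # Otherwise fall back to the full encoder.
    return name.encode(encoding)


print(encode_name("list_dir.py"))
print(encode_name("jalapeño"))
print(encode_name("jalapeño", "iso-8859-1"))
```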
It looks like a difficult question. You could probably start by reading this series of articles by Victor Stinner (core Python dev) to understand the ins and outs.
(Mar-15-2022, 12:23 AM)Skaperen Wrote: i presume the output of functions like os.listdir(), which returns a list of str, has decoded each name from UTF-8 to Unicode.
If you want the long story about listdir(), see:
Python 3.0 listdir() Bug on Undecodable Filenames
Victor Stinner Wrote:I wrote a big change modifying os.listdir() to ignore silently undecodable filenames, but also modify a lot of functions to also accept filenames as bytes.
I made further changes the following years to fix the full Python standard library to accept bytes.
While it "only" took 4 months to fix the os.listdir(str) issue, this kind of bugs will keep me busy the next 10 years (2008-2018)...
So all the work done on listdir() in the os module is inherited by e.g. scandir and walk, and also pathlib.
My Unicode test reads fine; it is more often the OS terminals/editors that struggle with such filenames, not Python.
import os

for file in os.listdir('.'):
    print(file)
Output:
list_dir.py
uni.py
unicode_ver.py
🖐uni✨code⛄.txt
import pathlib

for file in pathlib.Path('.').iterdir():
    if file.is_file():
        print(file)
Output:
list_dir.py
uni.py
unicode_ver.py
🖐uni✨code⛄.txt
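The "accept filenames as bytes" part of the quote can be seen with os.listdir(): the type of the path you pass in decides the type of the names you get back. A small sketch, using a temporary directory and a placeholder file name (example.txt) so that it does not touch real files:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # open() accepts the file name as bytes as well as str.
    bytes_path = os.path.join(os.fsencode(tmp), b"example.txt")
    with open(bytes_path, "w", encoding="utf-8") as f:
        f.write("hi")

    # str path in -> str names out; bytes path in -> bytes names out.
    names_as_str = os.listdir(tmp)
    names_as_bytes = os.listdir(os.fsencode(tmp))

print(names_as_str)
print(names_as_bytes)
```

With a bytes path no decoding of the names takes place at all, which is why this form still works for names that are not valid in the filesystem encoding.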
skipping the issue with sys.argv could make things difficult. what if someone using POSIX enters an undecodable filename as an argument for a command implemented in Python3?
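On POSIX, Python decodes command-line arguments and file names with the surrogateescape error handler (PEP 383), so undecodable bytes are not lost: each one becomes a lone surrogate code point that turns back into the original byte when encoded again. The sketch below imitates this with explicit UTF-8 calls so it behaves the same everywhere; the file name is a made-up example.

```python
# "café.pem" as ISO 8859-1 bytes; this is not valid UTF-8.
raw_name = b"caf\xe9.pem"

# Undecodable bytes become lone surrogates instead of raising an error.
decoded = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = decoded.encode("utf-8", "surrogateescape")

# A strict encode (which is what print() to a UTF-8 terminal needs) fails.
try:
    decoded.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# For display, escape the surrogates rather than crashing.
shown = decoded.encode("utf-8", "backslashreplace").decode("utf-8")

print(repr(decoded))
print(restored == raw_name)
print(strict_ok)
print(shown)
```

So such an argument can still be passed to open(), but printing or logging it needs some care.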
edit 1:
i've already run into this with some file names i encountered in Ubuntu (CA cert names in ISO 8859). i will try to refactor my file tree recursion generator to handle this correctly.
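One possible shape for such a generator is sketched below. It is not the poster's actual code: iter_files and printable are hypothetical names. It relies on os.scandir() returning str paths that can be passed straight back to open() even when a name was undecodable, and it only escapes names at the point where they are displayed. The demo uses a temporary directory with ordinary names.

```python
import os
import tempfile
from typing import Iterator


def iter_files(root: str) -> Iterator[str]:
    """Recursively yield the paths of all non-directory entries under root."""
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                yield from iter_files(entry.path)
            else:
                yield entry.path


def printable(path: str) -> str:
    """Return a version of path that is always safe to print."""
    return path.encode("utf-8", "backslashreplace").decode("utf-8")


# Small demo tree: a.txt and sub/b.txt inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(tmp, rel), "w", encoding="utf-8") as f:
            f.write("x")
    found = sorted(os.path.relpath(p, tmp) for p in iter_files(tmp))

for name in found:
    print(printable(name))
```

An alternative design is to pass a bytes root to os.scandir() or os.walk() and work with bytes names throughout, which avoids decoding entirely.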