Python Forum

Full Version: is there an implied encoding in file names being opened?
the encoding= option of open() applies to the file contents (according to The Python Library Reference). there appears to be no way to specify the encoding of the file name. does this mean 'utf-8' is implied? it seems to work that way.
Aren't Python strings UTF-8?
Skaperen Wrote:does this mean 'utf-8' is implied?
I would guess the implied encoding is the one returned by sys.getfilesystemencoding(). As the documentation says, use the str type for the filename for best compatibility.
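A minimal sketch of what that guess implies, assuming CPython 3.6 or later, where os.fsencode() and os.fsdecode() convert file names using the filesystem encoding and its error handler. The helper name describe_filename_encoding and the file name are made up for illustration.

```python
import os
import sys


def describe_filename_encoding(name):
    """Return the filesystem encoding and the bytes form of a str file name."""
    encoding = sys.getfilesystemencoding()
    errors = sys.getfilesystemencodeerrors()
    # os.fsencode() applies the filesystem encoding and error handler.
    as_bytes = os.fsencode(name)
    # Encoding by hand with the same settings gives the same bytes.
    assert as_bytes == name.encode(encoding, errors)
    # os.fsdecode() is the inverse, used for names coming back from the OS.
    assert os.fsdecode(as_bytes) == name
    return encoding, as_bytes


# Hypothetical file name, chosen only because it contains a non-ASCII letter.
encoding, as_bytes = describe_filename_encoding("crème.txt")
print(encoding, as_bytes)
```

The printed encoding depends on the platform and locale, so no fixed output is shown here.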

ndc85430 Wrote:Aren't Python strings UTF-8?
Python strings represent UNICODE strings. Unicode strings are an abstraction without an encoding, a sequence of code points. The way they are stored in memory is another matter, an implementation detail which we should not worry about.
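A short sketch of that distinction: the same one-character str has a single code point but different byte representations depending on the encoding chosen. The variable names are only for illustration.

```python
s = "é"

# A str is a sequence of code points: one character, code point U+00E9.
assert len(s) == 1
assert ord(s) == 0xE9

# Bytes only appear once an encoding is chosen, and they differ per encoding.
utf8 = s.encode("utf-8")         # two bytes
latin1 = s.encode("latin-1")     # one byte
utf16 = s.encode("utf-16-le")    # two bytes, different from the UTF-8 ones

print(utf8, latin1, utf16)
```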
(Mar-09-2022, 06:11 AM)ndc85430 Wrote: Aren't Python strings UTF-8?
no! they are Unicode.
(Mar-09-2022, 08:05 AM)Gribouillis Wrote: sys.getfilesystemencoding().
i missed that when reading. thanks!
A little more about the old friend/enemy Unicode🧲 in Python 3.
As mentioned, Python 3 strings are sequences of Unicode code points.
Text is represented either as Unicode code points (str) or as bytes; the two can never be mixed 🧬.
Python 3 follows the Unicode standard; Python 3.9 and 3.10 use Unicode 13.0.
>>> from unicodedata import unidata_version
>>> 
# Python 3.10
>>> unidata_version
'13.0.0' # 143,859 characters support

# Python 3.7
>>> unidata_version
'11.0.0'

[Image: 1*b_fkruR5t9-r_t5Tj1JocQ.png]
The default encoding for encode() and decode() is utf-8.
>>> s = 'hello🖐'
>>> s
'hello🖐'
>>> 
>>> b = s.encode() # Same as s.encode('utf-8') 
>>> b
b'hello\xf0\x9f\x96\x90'
>>> b.decode() # Same as b.decode('utf-8') 
'hello🖐'

Since Python 3.7 there is UTF-8 Mode (enabled with -X utf8 or PYTHONUTF8=1), which forces utf-8 in several more places.
The problem it solves is that the locale is frequently misconfigured.
An obvious solution suggests itself: ignore the locale encoding and use UTF-8.
  • Use UTF-8 as the filesystem encoding
  • sys.getfilesystemencoding() returns 'UTF-8'
  • locale.getpreferredencoding() returns 'UTF-8' (the do_setlocale argument has no effect).
  • sys.stdin, sys.stdout, and sys.stderr all use UTF-8 as their text encoding
  • On Unix, os.device_encoding() returns 'UTF-8' rather than the device encoding
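A small sketch for checking which of these settings are in effect in a given interpreter, assuming Python 3.7 or later (where sys.flags.utf8_mode exists). The helper name encoding_report is made up for this example.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings that UTF-8 Mode is documented to affect."""
    return {
        "utf8_mode": bool(sys.flags.utf8_mode),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "filesystem_errors": sys.getfilesystemencodeerrors(),
        "preferred_encoding": locale.getpreferredencoding(False),
        # stdout may have been replaced by an object without an encoding.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


report = encoding_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Running the same script once as usual and once with python -X utf8 should show which values change on a particular machine.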

Crazy test 🧨
VS Code, as shown, handles Unicode fine.
cmd/PowerShell cannot display ❓ this Unicode; cmder handles it better.
s = '🖐 Crème and Spicy jalapeño ☂'
with open('🖐uni✨code⛄.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(s)

with open('🖐uni✨code⛄.txt', encoding='utf-8') as f:
    data = f.read()
    print(data)
Output:
🖐 Crème and Spicy jalapeño ☂
[Image: SAlNfR.png]
so, if i use an instance of str as the name of a file to open, the name is encoded to UTF-8 before being passed on to the OS/Filesystem. but, if i use an instance of bytes as the name of a file to open, it is neither encoded nor decoded since it is already UTF-8 as the OS/Filesystem expects. now where is it defined in the code (of Python) that the OS/Filesystem is UTF-8? what if that were changed to another 8-bit code?

i presume the output of functions like os.listdir(), which returns a list of str, has decoded each name from UTF-8 to Unicode.

that code, whether in Python or C, could have a pre-check to see if every character is less than 128 to skip the decoding or encoding.

like:
...
if any(ord(x)>=128 for x in data):
    do_coding(data,de_or_en())
...
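A runnable sketch of that pre-check idea, assuming Python 3.7 or later (for str.isascii()) and assuming the filesystem encoding is ASCII-compatible, as UTF-8 and the ISO 8859 family are. The function name encode_filename is made up, and whether this gains anything over calling os.fsencode() directly has not been measured here.

```python
import os


def encode_filename(name):
    """Encode a str file name, skipping the real codec for pure ASCII names."""
    if name.isascii():
        # Every character is below 128, so the bytes are the same in any
        # ASCII-compatible encoding (this is the assumption stated above).
        return name.encode("ascii")
    # Otherwise fall back to the filesystem encoding and its error handler.
    return os.fsencode(name)


print(encode_filename("notes.txt"))
print(encode_filename("crème.txt"))
```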
It looks like a difficult question. You could probably start reading this series of articles by Victor Stinner (core Python dev) to understand the ins and outs.
(Mar-15-2022, 12:23 AM)Skaperen Wrote: i presume the output of functions like os.listdir(), which returns a list of str, has decoded each name from UTF-8 to Unicode.
If you want the long story about listdir():
Python 3.0 listdir() Bug on Undecodable Filenames
Victor Stinner Wrote:I wrote a big change modifying os.listdir() to ignore silently undecodable filenames,
but also modify a lot of functions to also accept filenames as bytes.
I made further changes the following years to fix the full Python standard library to accept bytes.

While it "only" took 4 months to fix the os.listdir(str) issue, this kind of bugs will keep me busy the next 10 years (2008-2018)...
So all the work done on listdir() in the os module is inherited by e.g. scandir() and walk(), and also by pathlib.
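A short sketch of the bytes side of that work: passing a bytes path to os.listdir() returns bytes names, and os.fsdecode() maps them to the str names the str call returns. The helper name list_names_both_ways and the file name are only illustrative.

```python
import os
import tempfile


def list_names_both_ways(directory):
    """List a directory once with a str path and once with a bytes path."""
    str_names = sorted(os.listdir(directory))
    # A bytes path makes os.listdir() return bytes names, with no decoding.
    bytes_names = sorted(os.listdir(os.fsencode(directory)))
    return str_names, bytes_names


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name with one non-ASCII letter.
    open(os.path.join(tmp, "crème.txt"), "w").close()
    str_names, bytes_names = list_names_both_ways(tmp)
    print(str_names)
    print(bytes_names)
```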

My Unicode test reads fine; it is more that some OS terminals/editors struggle with such filenames, not Python.
import os

for file in os.listdir('.'):
    print(file)
Output:
list_dir.py
uni.py
unicode_ver.py
🖐uni✨code⛄.txt
import pathlib

for file in pathlib.Path('.').iterdir():
    if file.is_file():
        print(file)
Output:
list_dir.py
uni.py
unicode_ver.py
🖐uni✨code⛄.txt
skipping the issue with sys.argv could make things difficult. what if someone using POSIX enters an undecodable filename as an argument for a command implemented in Python3?

edit 1:

i've already run into this with some file names i encountered in Ubuntu (CA cert names in ISO 8859). i will try to refactor my file tree recursion generator to handle this correctly.
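On POSIX, CPython decodes command-line arguments and file names with the surrogateescape error handler (PEP 383), so undecodable bytes survive as lone surrogates and can be turned back into the original bytes. The sketch below applies that handler by hand so it behaves the same on any platform; the file name is a made-up ISO 8859-1 example, not one of the real cert names.

```python
# Hypothetical ISO 8859-1 file name: 0xE9 is not valid UTF-8 on its own.
raw_name = b"ca-cert-\xe9.pem"

# Decoding with surrogateescape never fails; the bad byte becomes U+DCE9.
as_str = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = as_str.encode("utf-8", "surrogateescape")

# A strict encode (what print() to a UTF-8 terminal does) would raise.
try:
    as_str.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# For display, escape the surrogate instead of crashing.
printable = as_str.encode("utf-8", "backslashreplace").decode("utf-8")
print(printable)
```

For a file tree walker, this suggests passing such names back to os functions unchanged (or via os.fsencode()) and only escaping them when printing.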