is there an implied encoding in file names being opened?

Skaperen · Mar-07-2022, 12:09 AM

the encoding= option of open() applies to the file contents (according to The Python Library Reference). there appears to be no way to specify the encoding of the file name. does this mean 'utf-8' is implied? it seems to work that way.

ndc85430 · Mar-09-2022, 06:11 AM

Aren't Python strings UTF-8?

**Gribouillis** · (This post was last modified: Mar-09-2022, 08:09 AM by Gribouillis.)

Skaperen Wrote:does this mean 'utf-8' is implied?

I would guess the implied encoding is the one returned by sys.getfilesystemencoding(). As the documentation says, use the str type for the filename for best compatibility.
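A quick way to check this on a given machine is sketched below. The helper name filesystem_encoding_info is only for illustration; the two sys calls are the real ones. The result is often 'utf-8' on current systems, but on Linux it can depend on the locale, so treat the printed value as specific to your setup.

```python
import sys


def filesystem_encoding_info():
    """Return (encoding, error handler) Python uses for file names."""
    # Both functions are part of the sys module (the second since Python 3.6).
    return sys.getfilesystemencoding(), sys.getfilesystemencodeerrors()


encoding, errors = filesystem_encoding_info()
print(f"file names are encoded as {encoding!r} with errors={errors!r}")
```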

ndc85430 Wrote:Aren't Python strings UTF-8?

Python strings represent UNICODE strings. Unicode strings are an abstraction without an encoding, a sequence of code points. The way they are stored in memory is another matter, an implementation detail which we should not worry about.
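A small sketch of that distinction: the same one-character str produces different bytes depending on which encoding is chosen, so the encoding only exists at the moment the string is turned into bytes.

```python
s = "\u00e9"  # 'é', a single code point (U+00E9)

utf8 = s.encode("utf-8")
latin1 = s.encode("latin-1")
utf16 = s.encode("utf-16-le")

print(len(s))   # number of code points, not bytes
print(utf8)     # b'\xc3\xa9'
print(latin1)   # b'\xe9'
print(utf16)    # b'\xe9\x00'
```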

Skaperen · Mar-10-2022, 01:01 AM

(Mar-09-2022, 06:11 AM)ndc85430 Wrote: Aren't Python strings UTF-8?

no! they are Unicode.

Skaperen · Mar-10-2022, 01:10 AM

(Mar-09-2022, 08:05 AM)Gribouillis Wrote: sys.getfilesystemencoding().

i missed that when reading. thanks!

***snippsat*** · (This post was last modified: Mar-12-2022, 10:45 PM by Gribouillis.)

A little more about the old friend/enemy Unicode🧲 in Python 3.
As mentioned, Python 3 strings are sequences of Unicode code points.
Text is held either as str (Unicode code points) or as bytes, and the two can never be mixed 🧬.
Python 3 follows the Unicode standard; Python 3.9 and 3.10 use Unicode 13.0.

>>> from unicodedata import unidata_version
>>> 
# Python 3.10
>>> unidata_version
'13.0.0' # 143,859 characters supported

# Python 3.7
>>> unidata_version
'11.0.0'

The default encoding used by str.encode() and bytes.decode() is utf-8.

>>> s = 'hello🖐'
>>> s
'hello🖐'
>>> 
>>> b = s.encode() # Same as s.encode('utf-8') 
>>> b
b'hello\xf0\x9f\x96\x90'
>>> b.decode() # Same as b.decode('utf-8') 
'hello🖐'

Since Python 3.7 there is also UTF-8 Mode, which forces utf-8 in several more places.
The problem it solves is that the locale is frequently misconfigured.
An obvious solution suggests itself: ignore the locale encoding and use UTF-8.

With UTF-8 Mode enabled:
Python uses UTF-8 as the filesystem encoding.
sys.getfilesystemencoding() returns 'UTF-8'.
locale.getpreferredencoding() returns 'UTF-8' (the do_setlocale argument has no effect).
sys.stdin, sys.stdout, and sys.stderr all use UTF-8 as their text encoding.
On Unix, os.device_encoding() returns 'UTF-8' rather than the device encoding.
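To see the effect without changing your own session, one option is to start a child interpreter with the -X utf8 option (setting PYTHONUTF8=1 in the environment is the other way to turn the mode on). This is only a sketch; probe_utf8_mode is a made-up helper name, and it assumes Python 3.7 or newer.

```python
import subprocess
import sys


def probe_utf8_mode() -> str:
    """Run a child interpreter with -X utf8 and report what it sees."""
    child_code = (
        "import sys; "
        "print(sys.flags.utf8_mode, sys.getfilesystemencoding())"
    )
    result = subprocess.run(
        [sys.executable, "-X", "utf8", "-c", child_code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


report = probe_utf8_mode()
print(report)  # flag value followed by the filesystem encoding
```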

Crazy test 🧨
VS Code handles Unicode fine.
cmd/PowerShell cannot display ❓ this Unicode; cmder handles it better.

s = '🖐 Crème and Spicy jalapeño ☂'
with open('🖐uni✨code⛄.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(s)

with open('🖐uni✨code⛄.txt', encoding='utf-8') as f:
    data = f.read()
    print(data)

Output:
🖐 Crème and Spicy jalapeño ☂

Skaperen · (This post was last modified: Mar-15-2022, 12:24 AM by Skaperen.)

so, if i use an instance of str as the name of a file to open, the name is encoded to UTF-8 before being passed on to the OS/Filesystem. but, if i use an instance of bytes as the name of a file to open, it is neither encoded nor decoded since it is already UTF-8 as the OS/Filesystem expects. now where is it defined in the code (of Python) that the OS/Filesystem is UTF-8? what if that were changed to another 8-bit code?

i presume the output of functions like os.listdir(), which returns a list of str, has decoded each name from UTF-8 to Unicode.
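For reference, the os module exposes this conversion as os.fsencode() and os.fsdecode(), which use the encoding from sys.getfilesystemencoding() together with its error handler rather than a hard-coded UTF-8. A minimal sketch in a throwaway directory (the file name plain.txt is just an example) shows that os.listdir() returns str for a str path and bytes for a bytes path, and that os.fsencode() maps one onto the other:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file so there is something to list.
    open(os.path.join(tmp, "plain.txt"), "w").close()

    str_names = os.listdir(tmp)                 # str path   -> list of str
    bytes_names = os.listdir(os.fsencode(tmp))  # bytes path -> list of bytes

# Encoding the str names with the filesystem encoding gives the bytes names.
names_match = [os.fsencode(name) for name in str_names] == bytes_names
print(str_names, bytes_names, names_match)
```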

that code, whether in Python or C, could have a pre-check to see if every character is less than 128 to skip the decoding or encoding.

like:

...
if any(ord(x)>=128 for x in data):
    do_coding(data,de_or_en())
...
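A runnable version of that idea might look like the sketch below. The function name encode_name is hypothetical, str.isascii() needs Python 3.7 or newer, and the shortcut assumes the filesystem encoding is ASCII-compatible (true for UTF-8 and the ISO 8859 family). As far as I know CPython's C codecs already special-case ASCII-only strings, so this is meant to illustrate the check, not to promise a speedup.

```python
import os


def encode_name(name: str) -> bytes:
    """Encode a file name, taking a shortcut for pure-ASCII names."""
    if name.isascii():
        # Every character is below 128, so the bytes are the same in any
        # ASCII-compatible filesystem encoding.
        return name.encode("ascii")
    # General path: use the filesystem encoding and its error handler.
    return os.fsencode(name)


print(encode_name("uni.py"))
```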

**Gribouillis** · Mar-15-2022, 07:38 AM

It looks like a difficult question. You could probably start reading this series of articles by Victor Stinner (core Python dev) to understand the ins and outs.

***snippsat*** · (This post was last modified: Mar-15-2022, 03:08 PM by snippsat.)

(Mar-15-2022, 12:23 AM)Skaperen Wrote: i presume the output of functions like os.listdir(), which returns a list of str, has decoded each name from UTF-8 to Unicode.

If you want the long story about listdir():
Python 3.0 listdir() Bug on Undecodable Filenames

Victor Stinner Wrote:I wrote a big change modifying os.listdir() to ignore silently undecodable filenames,
but also modify a lot of functions to also accept filenames as bytes.
I made further changes the following years to fix the full Python standard library to accept bytes.

While it "only" took 4 months to fix the os.listdir(str) issue, this kind of bugs will keep me busy the next 10 years (2008-2018)...

So all the work done on listdir() in the os module is inherited by e.g. scandir() and walk(), and also by pathlib.

My Unicode test reads fine; it is more that some OS terminals/editors can struggle with these filenames, not Python.

import os

# List the current directory; with a str path the names come back as str
for file in os.listdir('.'):
    print(file)

Output:
list_dir.py
uni.py
unicode_ver.py
🖐uni✨code⛄.txt

import pathlib

for file in pathlib.Path('.').iterdir():
    if file.is_file():
        print(file)

Output:
list_dir.py
uni.py
unicode_ver.py
🖐uni✨code⛄.txt

Skaperen

skipping the issue with sys.argv could make things difficult. what if someone using POSIX enters an undecodable filename as an argument for a command implemented in Python3?
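On POSIX, Python decodes command-line arguments with the filesystem encoding and the surrogateescape error handler, so an undecodable byte ends up in the str as a lone surrogate and os.fsencode() can recover the original bytes. The sketch below simulates such an argument instead of reading the real sys.argv; args_as_bytes and the cert file name are made up for illustration, and the round trip is only claimed for POSIX systems.

```python
import os


def args_as_bytes(argv):
    """Convert str arguments back to the raw bytes the shell passed in."""
    return [os.fsencode(arg) for arg in argv]


# Simulate what sys.argv[1] would hold for a Latin-1 encoded file name.
raw = b"cert-\xe9.pem"
simulated_arg = os.fsdecode(raw)

recovered = args_as_bytes([simulated_arg])
print(recovered)
```

In a real script the call would be args_as_bytes(sys.argv[1:]), and the resulting bytes can be passed straight to open() or os.stat().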

edit 1:

i've already run into this with some file names i encountered in Ubuntu (CA cert names in ISO 8859). i will try to refactor my file tree recursion generator to handle this correctly.
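One possible shape for such a generator, assuming the goal is to never decode names at all: give os.scandir() a bytes path, and every entry name and path comes back as bytes. This is a sketch only, and iter_tree_bytes is a hypothetical name.

```python
import os


def iter_tree_bytes(top: bytes):
    """Yield every non-directory path under top as raw bytes."""
    with os.scandir(top) as entries:
        for entry in entries:
            # With a bytes argument, entry.name and entry.path are bytes too.
            if entry.is_dir(follow_symlinks=False):
                yield from iter_tree_bytes(entry.path)
            else:
                yield entry.path


for path in iter_tree_bytes(b"."):
    print(path)
```

Names are only decoded if and when they need to be shown to a person, for example with os.fsdecode(path) or repr(path).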
