Python Forum
is there an implied encoding in file names being opened?
#1
the encoding= option of open() applies to the file contents (according to The Python Library Reference). there appears to be no way to specify the encoding of the file name. does this mean 'utf-8' is implied? it seems to work that way.
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
Aren't Python strings UTF-8?
#3
Skaperen Wrote:does this mean 'utf-8' is implied?
I would guess the implied encoding is the one returned by sys.getfilesystemencoding(). As the documentation says, use the str type for the filename for best compatibility.
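
A quick way to check this on a given machine is sketched below. The file name 'café.txt' is only an illustrative value, and the encoding printed depends on the platform and locale, so no particular output is assumed.

```python
import os
import sys

# Encoding and error handler Python uses to convert str file names
# to the bytes handed to the operating system, and back again.
fs_encoding = sys.getfilesystemencoding()
fs_errors = sys.getfilesystemencodeerrors()
print(fs_encoding, fs_errors)

name = 'café.txt'  # illustrative file name, not taken from the thread

# os.fsencode() applies the file system encoding to a str file name.
encoded = os.fsencode(name)
print(encoded)

# os.fsdecode() is the inverse operation.
decoded = os.fsdecode(encoded)
print(decoded == name)
```

open() accepts either form, so a str name goes through the same conversion that os.fsencode() performs.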

ndc85430 Wrote:Aren't Python strings UTF-8?
Python strings represent Unicode strings. A Unicode string is an abstraction without an encoding: a sequence of code points. The way it is stored in memory is another matter, an implementation detail which we should not worry about.
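
To make the distinction concrete, here is a small sketch (the sample word and the choice of encodings are just for illustration): the same str has one length in code points, but a different byte length under each encoding.

```python
s = 'jalapeño'

# A str is a sequence of code points; len() counts code points, not bytes.
code_points = len(s)

# The same text produces different byte sequences under different encodings.
utf8 = s.encode('utf-8')        # 'ñ' takes two bytes
latin1 = s.encode('latin-1')    # 'ñ' takes one byte
utf16 = s.encode('utf-16-le')   # every character here takes two bytes

print(code_points, len(utf8), len(latin1), len(utf16))

# Decoding with the matching encoding always gives back the same str.
same = utf8.decode('utf-8') == latin1.decode('latin-1') == utf16.decode('utf-16-le')
print(same)
```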
#4
(Mar-09-2022, 06:11 AM)ndc85430 Wrote: Aren't Python strings UTF-8?
no! they are Unicode.
#5
(Mar-09-2022, 08:05 AM)Gribouillis Wrote: sys.getfilesystemencoding().
i missed that when reading. thanks!
#6
A little more about the old friend/enemy Unicode🧲 in Python 3.
As mentioned, Python 3 strings are sequences of Unicode code points.
Text is held either as str (Unicode code points) or as bytes; the two can never be mixed 🧬.
Python 3 follows the Unicode standard, so Python 3.9 and 3.10 use Unicode 13.0.
>>> from unicodedata import unidata_version
>>> # Python 3.10 (Unicode 13.0 supports 143,859 characters)
>>> unidata_version
'13.0.0'

>>> # Python 3.7
>>> unidata_version
'11.0.0'

The default encoding used by str.encode() and bytes.decode() is UTF-8.
>>> s = 'hello🖐'
>>> s
'hello🖐'
>>> 
>>> b = s.encode() # Same as s.encode('utf-8') 
>>> b
b'hello\xf0\x9f\x96\x90'
>>> b.decode() # Same as b.decode('utf-8') 
'hello🖐'
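
As a small illustration of the "can never be mixed" point (a sketch only; the variable names are made up for this example), combining str and bytes directly raises TypeError until one side is converted explicitly:

```python
s = 'hello🖐'
b = s.encode('utf-8')

# Six code points, but nine bytes in UTF-8 (the emoji takes four bytes).
print(len(s), len(b))

# Python 3 refuses to combine text and bytes implicitly.
try:
    s + b
except TypeError as exc:
    error = exc
    print('TypeError:', exc)

# Converting one side explicitly makes the intent clear.
combined = s + b.decode('utf-8')
print(combined)
```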

Since Python 3.7, UTF-8 Mode (enabled with -X utf8 or PYTHONUTF8=1) forces UTF-8 in several more places.
The problem it solves is that the locale is frequently misconfigured.
An obvious solution suggests itself: ignore the locale encoding and use UTF-8. With UTF-8 Mode enabled:
  • Use UTF-8 as the filesystem encoding
  • sys.getfilesystemencoding() returns 'UTF-8'
  • locale.getpreferredencoding() returns 'UTF-8' (the do_setlocale argument has no effect).
  • sys.stdin, sys.stdout, and sys.stderr all use UTF-8 as their text encoding
  • On Unix, os.device_encoding() returns 'UTF-8' rather than the device encoding
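
One way to see the effect of the settings listed above is to start a child interpreter with -X utf8 and ask it what it reports. This is a rough sketch; the probe string and variable names are made up for the example.

```python
import subprocess
import sys

# sys.flags.utf8_mode is 1 when UTF-8 Mode is active in this interpreter.
print('UTF-8 Mode in this process:', bool(sys.flags.utf8_mode))

# Small probe program run in a child interpreter with UTF-8 Mode forced on.
probe = 'import sys; print(sys.flags.utf8_mode, sys.getfilesystemencoding())'
result = subprocess.run(
    [sys.executable, '-X', 'utf8', '-c', probe],
    capture_output=True,
    text=True,
    check=True,
)

child_flag, child_fs_encoding = result.stdout.split()
print('child utf8_mode:', child_flag)
print('child filesystem encoding:', child_fs_encoding)
```

Setting the environment variable PYTHONUTF8=1 before starting Python should have the same effect as the -X utf8 option.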

Crazy test 🧨
VS Code, as shown, handles Unicode fine.
cmd/PowerShell cannot display ❓ this Unicode; cmder handles it better.
s = '🖐 Crème and Spicy jalapeño ☂'
with open('🖐uni✨code⛄.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(s)

with open('🖐uni✨code⛄.txt', encoding='utf-8') as f:
    data = f.read()
    print(data)
Output:
🖐 Crème and Spicy jalapeño ☂
[Image: SAlNfR.png]
#7
so, if i use an instance of str as the name of a file to open, the name is encoded to UTF-8 before being passed on to the OS/filesystem. but if i use an instance of bytes as the name of a file to open, it is neither encoded nor decoded, since it is already UTF-8 as the OS/filesystem expects. now, where is it defined in the code (of Python) that the OS/filesystem is UTF-8? what if that were changed to another 8-bit code?

i presume the output of functions like os.listdir(), which returns a list of str, has decoded each name from UTF-8 to Unicode.

that code, whether in Python or C, could have a pre-check to see if every character is less than 128 to skip the decoding or encoding.

like:
...
if any(ord(x) >= 128 for x in data):
    do_coding(data, de_or_en())
...
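
The str and bytes behaviour described in this post can be tried out with a sketch like the one below. It works in a temporary directory, and the file name 'crème.txt' is only an example. Note that the conversion uses whatever sys.getfilesystemencoding() reports on the machine, which is not necessarily UTF-8.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create a file using a str path; Python encodes the name for the OS.
    str_path = os.path.join(tmp, 'crème.txt')
    with open(str_path, 'w', encoding='utf-8') as f_out:
        f_out.write('x')

    # Open the same file through a bytes path; the bytes are passed as-is.
    bytes_path = os.fsencode(str_path)
    with open(bytes_path, encoding='utf-8') as f:
        content = f.read()

    # os.listdir() returns str names for a str argument
    # and bytes names for a bytes argument.
    str_names = os.listdir(tmp)
    bytes_names = os.listdir(os.fsencode(tmp))

print(content)
print(str_names, bytes_names)
```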
#8
It looks like a difficult question. You could probably start reading this series of articles by Victor Stinner (core Python dev) to understand the ins and outs.
#9
(Mar-15-2022, 12:23 AM)Skaperen Wrote: i presume the output of functions like os.listdir(), which returns a list of str, has decoded each name from UTF-8 to Unicode.
If you want the long story about listdir():
Python 3.0 listdir() Bug on Undecodable Filenames
Victor Stinner Wrote:I wrote a big change modifying os.listdir() to ignore silently undecodable filenames,
but also modify a lot of functions to also accept filenames as bytes.
I made further changes the following years to fix the full Python standard library to accept bytes.

While it "only" took 4 months to fix the os.listdir(str) issue, this kind of bugs will keep me busy the next 10 years (2008-2018)...
So all the work done on listdir() in the os module is inherited by e.g. scandir() and walk(), and also by pathlib.

My Unicode test reads fine; it is more often OS terminals/editors that struggle with such filenames, not Python.
import os

# List the current directory; with a str argument the names come back as str
for file in os.listdir('.'):
    print(file)
Output:
list_dir.py
uni.py
unicode_ver.py
🖐uni✨code⛄.txt
import pathlib

for file in pathlib.Path('.').iterdir():
    if file.is_file():
        print(file)
Output:
list_dir.py
uni.py
unicode_ver.py
🖐uni✨code⛄.txt
#10
skipping the issue with sys.argv could make things difficult. what if someone using POSIX enters an undecodable filename as an argument for a command implemented in Python 3?

edit 1:

i've already run into this with some file names i encountered in Ubuntu (CA cert names in ISO 8859). i will try to refactor my file tree recursion generator to handle this correctly.
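
For names like these, a sketch of how the surrogateescape error handler lets undecodable bytes survive a round trip through str may help. The name b'caf\xe9.pem' is a made-up example of an ISO 8859-1 file name, and UTF-8 with surrogateescape is written out explicitly here on the assumption that this is what a typical POSIX system with a UTF-8 locale uses for file names and sys.argv.

```python
# A file name encoded in ISO 8859-1: byte 0xE9 is 'é' there,
# but on its own it is not valid UTF-8.
raw_name = b'caf\xe9.pem'

# surrogateescape maps each undecodable byte to a lone surrogate code point.
decoded = raw_name.decode('utf-8', errors='surrogateescape')
print(ascii(decoded))

# Encoding with the same handler restores the original bytes exactly,
# so the str can still be used to open the file.
restored = decoded.encode('utf-8', errors='surrogateescape')
print(restored == raw_name)

# A strict encode fails on the lone surrogate, so for display
# it can be replaced instead.
try:
    decoded.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

printable = decoded.encode('utf-8', errors='replace').decode('utf-8')
print(strict_ok, printable)
```

A tree walker can then work with str names throughout and only needs the "replace" step when printing or logging a name.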