is there an implied encodin in file names being opened?
Python Forum (https://python-forum.io), Forum: News and Discussions, Thread: /thread-36575.html
is there an implied encodin in file names being opened? - Skaperen - Mar-07-2022

the encoding= option of open() applies to the file contents (according to The Python Library Reference). there appears to be no way to specify the encoding of the file name. does this mean 'utf-8' is implied? it seems to work that way.

RE: is there an implied encodin in file names being opened? - ndc85430 - Mar-09-2022

Aren't Python strings UTF-8?

RE: is there an implied encodin in file names being opened? - Gribouillis - Mar-09-2022

Skaperen Wrote: does this mean 'utf-8' is implied?

I would guess the implied encoding is the one returned by sys.getfilesystemencoding(). As the documentation says, use the str type for the filename for best compatibility.

ndc85430 Wrote: Aren't Python strings UTF-8?

Python strings represent Unicode strings. Unicode strings are an abstraction without an encoding: a sequence of code points. The way they are stored in memory is another matter, an implementation detail that we should not worry about.

RE: is there an implied encodin in file names being opened? - Skaperen - Mar-10-2022

(Mar-09-2022, 06:11 AM)ndc85430 Wrote: Aren't Python strings UTF-8?

no! they are Unicode.

RE: is there an implied encodin in file names being opened? - Skaperen - Mar-10-2022

(Mar-09-2022, 08:05 AM)Gribouillis Wrote: sys.getfilesystemencoding().

i missed that when reading. thanks!

RE: is there an implied encodin in file names being opened? - snippsat - Mar-12-2022

A little more about the old friend/enemy Unicode🧲 in Python 3. As mentioned, Python 3 strings are sequences of Unicode code points. Text can be held either as str (Unicode code points) or as bytes, and the two can never be mixed 🧬. Python 3 follows the Unicode standard; Python 3.9 and 3.10 both use Unicode 13.0.
    >>> from unicodedata import unidata_version
    >>> # Python 3.10 (Unicode 13.0 defines 143,859 characters)
    >>> unidata_version
    '13.0.0'
    >>> # Python 3.7
    >>> unidata_version
    '11.0.0'

The default encoding used here is utf-8.

    >>> s = 'hello🖐'
    >>> s
    'hello🖐'
    >>>
    >>> b = s.encode()  # Same as s.encode('utf-8')
    >>> b
    b'hello\xf0\x9f\x96\x90'
    >>> b.decode()  # Same as b.decode('utf-8')
    'hello🖐'

Since Python 3.7 there is also UTF-8 Mode, which forces utf-8 in several more places when it is enabled (python -X utf8 or PYTHONUTF8=1). The problem it solves is that the locale is frequently misconfigured. An obvious solution suggests itself: ignore the locale encoding and use UTF-8.
Crazy test 🧨 VS Code handles Unicode fine. cmd/PowerShell can not display this Unicode ❓, cmder handles it better.

    s = '🖐 Crème and Spicy jalapeño ☂'

    # Write the text to a file whose name itself contains non-ASCII characters
    with open('🖐uni✨code⛄.txt', 'w', encoding='utf-8') as f_out:
        f_out.write(s)

    # Read it back; encoding= applies to the contents, not to the name
    with open('🖐uni✨code⛄.txt', encoding='utf-8') as f:
        data = f.read()
        print(data)
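To see what Gribouillis's sys.getfilesystemencoding() suggestion reports on a particular machine, something like the following minimal sketch may help. The helper name describe_filename_encoding and the sample file name are made up for illustration; the result depends on the platform and locale, so no specific output is promised here.

```python
import os
import sys


def describe_filename_encoding():
    """Report how this interpreter converts str file names to bytes.

    Hypothetical helper, not part of any library.
    """
    return {
        "encoding": sys.getfilesystemencoding(),
        "errors": sys.getfilesystemencodeerrors(),
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


info = describe_filename_encoding()
print(info)

# os.fsencode() / os.fsdecode() apply exactly that encoding and error handler.
name = "jalapeño.txt"  # sample name, assumed encodable in the local encoding
raw = os.fsencode(name)
print(raw)
print(os.fsdecode(raw) == name)
```

So the encoding= argument of open() never touches the name; the name goes through the filesystem encoding reported above.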
RE: is there an implied encodin in file names being opened? - Skaperen - Mar-15-2022

so, if i use an instance of str as the name of a file to open, the name is encoded to UTF-8 before being passed on to the OS/Filesystem. but, if i use an instance of bytes as the name of a file to open, it is neither encoded nor decoded since it is already UTF-8 as the OS/Filesystem expects. now where is it defined in the code (of Python) that the OS/Filesystem is UTF-8? what if that were changed to another 8-bit code?

i presume the output of functions like os.listdir(), which returns a list of str, has decoded each name from UTF-8 to Unicode. that code, whether in Python or C, could have a pre-check to see if every character is less than 128 to skip the decoding or encoding. like:

    ...
    if any(ord(x) >= 128 for x in data):
        do_coding(data, de_or_en())
    ...

RE: is there an implied encodin in file names being opened? - Gribouillis - Mar-15-2022

It looks like a difficult question. You could probably start by reading this series of articles by Victor Stinner (core Python dev) to understand the ins and outs.

RE: is there an implied encodin in file names being opened? - snippsat - Mar-15-2022

(Mar-15-2022, 12:23 AM)Skaperen Wrote: i presume the output of functions like os.listdir(), which returns a list of str, has decoded each name from UTF-8 to Unicode.

If you want the long story about listdir(): Python 3.0 listdir() Bug on Undecodable Filenames

Victor Stinner Wrote: I wrote a big change modifying

So all the work done on listdir() in the os module is inherited by e.g. scandir and walk, and also by pathlib. My Unicode test reads fine; it is more that some OS terminals/editors can struggle with such filenames, not Python.

    import os

    # List every entry in the current directory (names come back as str)
    for file in os.listdir('.'):
        print(file)
    import pathlib

    # Same listing with pathlib, restricted to regular files
    for file in pathlib.Path('.').iterdir():
        if file.is_file():
            print(file)
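On the "undecodable filenames" point: on POSIX, Python keeps bytes it cannot decode by mapping them to lone surrogates (the surrogateescape error handler), so nothing is lost on the way back. A small sketch, with the codec named explicitly so it does not depend on the local filesystem encoding; the file names are invented for illustration.

```python
import os
import tempfile

# 0xE9 is "é" in ISO 8859-1 but is not valid UTF-8 on its own.
raw_name = b"caf\xe9.txt"

# Decode the way a UTF-8 POSIX system would: the bad byte becomes U+DCE9.
as_str = raw_name.decode("utf-8", errors="surrogateescape")
print(ascii(as_str))  # 'caf\udce9.txt'

# Encoding with the same handler restores the original bytes exactly.
round_trip = as_str.encode("utf-8", errors="surrogateescape")
print(round_trip == raw_name)  # True

# os.listdir() returns the type it was given: str in, str out; bytes in, bytes out.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "plain.txt"), "w").close()
    str_names = os.listdir(tmp)
    bytes_names = os.listdir(os.fsencode(tmp))

print(str_names)    # ['plain.txt']
print(bytes_names)  # [b'plain.txt']
```

Passing bytes to os.listdir() is one way to see the names exactly as the filesystem stores them, with no decoding step at all.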
RE: is there an implied encodin in file names being opened? - Skaperen - Mar-15-2022

skipping the issue with sys.argv could make things difficult. what if someone using POSIX enters an undecodable filename as an argument for a command implemented in Python3?

edit 1: i've already run into this with some file names i encountered in Ubuntu (CA cert names in ISO 8859). i will try to refactor my file tree recursion generator to handle this correctly.
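One possible shape for such a generator is to work in bytes throughout, so no name ever has to be decoded to be opened. This is only a sketch under assumptions: walk_files and display are hypothetical names, and display assumes UTF-8 is what you want to show on screen.

```python
import os
import sys
import tempfile


def walk_files(top):
    """Yield the path of every regular file under top, as bytes.

    Hypothetical helper: takes str or bytes and works in bytes internally.
    """
    stack = [os.fsencode(top)]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path


def display(path):
    """Render a bytes path for printing; bad bytes show up as \\xNN escapes."""
    return path.decode("utf-8", errors="backslashreplace")


# Command-line arguments arrive as str; os.fsencode() gives back the raw
# bytes the user typed, including any that were not decodable.
raw_args = [os.fsencode(arg) for arg in sys.argv[1:]]

with tempfile.TemporaryDirectory() as tmp:
    sub = os.path.join(tmp, "sub")
    os.mkdir(sub)
    open(os.path.join(sub, "a.txt"), "w").close()
    found = sorted(walk_files(tmp))

for path in found:
    print(display(path))

print(display(b"caf\xe9.txt"))  # caf\xe9.txt
```

Decoding only at the point of display keeps an ISO 8859 name from raising anywhere in the traversal itself.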