Python Forum

Full Version: examples using os.walk()
does anyone have, or know of a url for, example code that uses os.walk() and produces a list of all file system objects in the order you'd get by running the list through a sort program that treats the sep character ('/' or '\\') as lower than all other printable characters?
google: "os.walk example python 3"
i tried the first example found, at https://www.tutorialspoint.com/python3/os_walk.htm and it crashes after 24152 lines of output.

Output:
Traceback (most recent call last):
  File "walk1.py", line 7, in <module>
    print(os.path.join(root, name))
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce4' in position 88: surrogates not allowed
so i just took the first 24100 lines from running the example and sorted them. the sorted output was ordered very differently from the input. any idea which example shows the correct sorting?
Advise sending bug report to python.org with complete code that crashed.
you do understand that this is a case of UTF-16 like codes that UTF-8 is not supposed to handle? this is more a case of using the wrong encoding. maybe i could have tweaked the code but the sorting issue still exists in os.walk() and that's what i cared about.
This example?

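#!/usr/bin/python3
import os

os.chdir("d:\\tmp")
# topdown=False visits the deepest directories first (bottom-up)
for root, dirs, files in os.walk(".", topdown=False):
    # print the files of this directory, then its subdirectories
    for name in files:
        print(os.path.join(root, name))
    for name in dirs:
        print(os.path.join(root, name))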
I guess a filename or path on your filesystem is using invalid Unicode.
The error happens in the print function. You can use a kind of hack to get around this:

import os

def encdec_hack(s):
    # Lossy: silently drops every character that cannot be encoded as UTF-8
    return s.encode("utf8", errors="ignore").decode()

for root, dirs, files in os.walk("."):
    for f in files:
        try:
            print("File [ OK  ]:", f)
        except UnicodeEncodeError:
            print()
            print("File [ ERR ]:", encdec_hack(f))
The best solution is: Fix your filesystem. Delete or rename the files with broken encoding.

It can also cause trouble with other applications such as file explorers, backup tools, etc.

BTW: pathlib.Path has the same problem. The representation of the Path is OK, though.
To print the representation of something:
print(repr(something))
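To find the names with broken encoding mentioned above, something like the following might help. It is only a sketch: the helper names has_broken_encoding and find_broken_names are made up for this post, and it assumes the broken names show up as lone surrogates (like '\udce4' in the traceback), which is what makes the UTF-8 encode fail.

```python
import os

def has_broken_encoding(name):
    """Return True if name cannot be encoded as UTF-8 (e.g. it holds lone surrogates)."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False

def find_broken_names(top):
    """Yield the path of every entry under top whose own name fails the check above."""
    for root, dirs, files in os.walk(top):
        for name in dirs + files:
            if has_broken_encoding(name):
                yield os.path.join(root, name)

# Hypothetical usage: repr() escapes the surrogate, so print() does not raise.
#     for path in find_broken_names("."):
#         print(repr(path))
```

Once the paths are listed you can decide whether to rename or delete them.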
Frankly, it's unclear what "sorting" issue you have, but I am inclined to put my money on it not being a Python bug.
It's not a Python bug.

It's a bug in the application that produced files or directories with illegal encoding.
Often it's just a file downloaded somewhere from the internet.

I had this issue often with files from my Chinese coworkers.
the 24152 files encountered before the file with the name that was not valid Unicode (but is valid POSIX) were sorted wrong. since the problem happened in print() then clearly os.walk delivered that name, or one of the iterators that followed it did.

maybe i can use encode('latin1'). or i can use an encoder i implemented that does not do surrogates
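A note on the encode('latin1') idea: on POSIX, Python decodes file names with the surrogateescape error handler, so a byte like 0xE4 that is not valid UTF-8 arrives as the lone surrogate '\udce4'. A plain encode('latin1') would raise the same kind of UnicodeEncodeError, because the surrogate is not a Latin-1 character. One possible way around it is to get the original bytes back first and then decode them. The sketch below assumes the filesystem encoding is UTF-8 and that the old names are ISO-8859-1 (that second part is a guess; the thread only says "an ISO code"). recover_legacy_name is a made-up name; os.fsencode() does the first step using whatever the real filesystem encoding is.

```python
def recover_legacy_name(name, legacy_encoding="latin-1"):
    """Return a printable version of a surrogate-escaped file name.

    Assumption: name was decoded as UTF-8 with surrogateescape, and the
    bytes on disk were originally written in legacy_encoding.
    """
    try:
        name.encode("utf-8")
        return name  # already valid Unicode, leave it alone
    except UnicodeEncodeError:
        pass
    # surrogateescape turns each lone surrogate back into its original byte
    raw = name.encode("utf-8", errors="surrogateescape")
    return raw.decode(legacy_encoding)
```

This only changes what gets printed; the file on disk keeps its original bytes.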

IMHO, the ultimate fix is to Unicode itself: remove UTF-16 and the surrogates it requires. there is virtually no need for UTF-16 and no need for surrogates without UTF-16.

but that's not the issue i raise. i will try to come up with another way to show the issue, one that does not involve print(), or just run this on a file tree without these names (old music files with names in an ISO code). once you get the list of files, sort them using an order that collates the os.sep character lower than all others. you might want to just keep the list in memory and sort it there.
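For the in-memory sort described here, one possible key is the path split into its components. Comparing the component lists gives the same order as comparing the strings with the separator ranked below every other character. A minimal sketch (sep_lowest_key and the sample paths are invented for illustration):

```python
import os

def sep_lowest_key(path, sep=os.sep):
    """Sort key that makes sep collate lower than every other character."""
    # Lists compare element by element, and a shorter list that is a
    # prefix of a longer one sorts first, which is the desired rule.
    return path.split(sep)

# Hypothetical sample; "/" is passed explicitly so the result is the same on any OS.
sample = ["a.b", "a/c", "a", "a-b"]
plain_order = sorted(sample)
sep_order = sorted(sample, key=lambda p: sep_lowest_key(p, "/"))
print(plain_order)  # ['a', 'a-b', 'a.b', 'a/c']
print(sep_order)    # ['a', 'a/c', 'a-b', 'a.b']
```

The plain sort puts 'a/c' last because '/' has a higher code than '-' and '.', while the keyed sort puts everything under 'a' right after 'a' itself.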
it's not a bug in Python or any implementation. it's a design goal issue. os.walk() was designed for speedy delivery of file names in all the directories, not for sorting. my file tree recursion generator was designed for sorting and does accomplish it correctly. below is an example of a script that produces directory names which get stored into a file named "0". then the sort command sorts it with output to a file named "1". finally the command "head 0 1" shows the first ten lines of each file. the source code is output by my script named "box".
Output:
lt2a/forums /home/forums 37> box oswalk.py
+----<oswalk.py>------------------------------+
| import os,sys                               |
| t = sys.argv[1] if len(sys.argv)>1 else '.' |
| for d,ds,fs in os.walk(t):                  |
|     print(d)                                |
+---------------------------------------------+
lt2a/forums /home/forums 38> py oswalk.py /home/forums >0
lt2a/forums /home/forums 39> sort <0 >1
lt2a/forums /home/forums 40> head 0 1
==> 0 <==
/home/forums
/home/forums/requests
/home/forums/requests/files.pythonhosted.org
/home/forums/requests/files.pythonhosted.org/packages
/home/forums/requests/files.pythonhosted.org/packages/01
/home/forums/requests/files.pythonhosted.org/packages/01/62
/home/forums/requests/files.pythonhosted.org/packages/01/62/ddcf76d1d19885e8579acb1b1df26a852b03472c0e46d2b959a714c90608
/home/forums/requests/src
/home/forums/requests/src/requests-2.22.0
/home/forums/requests/src/requests-2.22.0/requests

==> 1 <==
/home/forums
/home/forums/.audacity-data
/home/forums/.audacity-data/AutoSave
/home/forums/.audacity-data/Plug-Ins
/home/forums/.bash_history.d
/home/forums/.cache
/home/forums/.cache/fontconfig
/home/forums/.cache/gstreamer-1.0
/home/forums/.cache/mesa_shader_cache
/home/forums/.cache/mozilla
lt2a/forums /home/forums 41>
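For comparison, here is a rough sketch of a walker that yields names already in that order, so no separate sort step is needed. This is not the generator mentioned above (its source was not posted); sorted_walk is a made-up name, and the sketch assumes plain code-point ordering of names is what you want. It sorts the entries of each directory and descends into a directory right after yielding it.

```python
import os

def sorted_walk(top):
    """Yield every path under top, parents first, siblings sorted by name."""
    with os.scandir(top) as it:
        # os.scandir() returns entries in arbitrary order, so sort them here
        entries = sorted(it, key=lambda e: e.name)
    for entry in entries:
        yield entry.path
        # Descend immediately so a directory's contents follow the directory
        # itself; symlinked directories are listed but not followed.
        if entry.is_dir(follow_symlinks=False):
            yield from sorted_walk(entry.path)
```

Unreadable directories are not handled and will raise. If only directory names are needed, as in oswalk.py, calling ds.sort() inside the os.walk() loop (with the default topdown=True) should give the same ordering for directories, since os.walk() visits subdirectories in whatever order that list is left in.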