Python Forum
short version of os.listdir()
#1
is there a way to shorten os.listdir() such as to have it only read several (like maybe 32 to 256) names at a time? i need to scan through a massively huge directory and it is having trouble with it being so big. the directory has well over 70 million files.
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
Don’t use os.listdir(). Use pathlib.Path.iterdir().
#3
Fun with generators!

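from itertools import islice
from pathlib import Path
import sys

mydir = Path('/home/pedro')

# generator expression: nothing is scanned until it is consumed
filelist = (filename for filename in mydir.rglob("*") if filename.is_file())
type(filelist) # generator
sys.getsizeof(filelist) # 104 here: the generator stays tiny regardless of file count
total = sum(1 for f in filelist) # took a couple of seconds here, then returned 193820

# show some of the files (the first generator is exhausted, so build a new one)
filelist = (filename for filename in mydir.rglob("*") if filename.is_file())
for f in islice(filelist, 32):  # stops early instead of raising StopIteration
    print(f)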
Apparently, since Python 3.12, pathlib has Path.walk(), which works like os.walk() (I don't have that version!)

import pathlib
path = pathlib.Path(r"E:\folder")
for root, dirs, files in path.walk():
    print("Root: ")
    print(root)
    print("Dirs: ")
    print(dirs)
    print("Files: ")
    print(files)
    print("")
What do you want to do with 70 million files??
#4
In addition to pathlib.Path.iterdir(), you can use more_itertools.chunked() to group the names into batches.
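more_itertools is a third-party package, so here is a minimal sketch of the same batching idea using only the standard library. The helper name batched_names and the batch size are illustrative assumptions, not part of pathlib or more_itertools. (Python 3.12 and later also provide itertools.batched.)

```python
from itertools import islice
from pathlib import Path


def batched_names(directory, size=256):
    """Yield lists of up to `size` entry names from `directory`.

    Hypothetical helper: it groups whatever Path.iterdir() yields,
    so only one batch of names is held by this function at a time.
    """
    entries = Path(directory).iterdir()
    while True:
        batch = [p.name for p in islice(entries, size)]
        if not batch:
            return  # iterator exhausted
        yield batch


if __name__ == "__main__":
    # Print the size and first name of each batch in the current directory.
    for batch in batched_names(".", size=256):
        print(len(batch), "names, first:", batch[0])
```

Note that batching only changes how the names are handed to your loop; how the directory is actually read from disk depends on how iterdir() is implemented in your Python version.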
« We can solve any problem by introducing an extra level of indirection »
#5
You can also use itertools.islice to slice into a generator.
This way only the selected files, e.g. items 5-10 or 32-256, are loaded into memory.
from pathlib import Path
from itertools import islice

def generate_paths(directory):
    for path in Path(directory).rglob('*'):
        if path.is_file():
            yield path

if __name__ == '__main__':
    dest = r'C:\Test'
    # Slice into the generator to get files in range specified
    selected_files = islice(generate_paths(dest), 5, 11)
    for path in selected_files:
        print(path)
#6
(May-12-2024, 05:14 AM)Pedroski55 Wrote: What do you want to do with 70 million files??
reduce it to about 700 files or maybe even fewer.
#7
(May-12-2024, 03:15 AM)deanhystad Wrote: Don’t use os.listdir. Use pathlib.iterdir
it gives me only ONE (1) at a time. i guess that's what "iter" implies. this is going to take "forever". is there a way to get like 256 at a time, or at least do one input from the directory per block that the names are stored on?
#8
the desire to get 32 to 256 at a time is not so i can have a loop do one at a time. it's so i can get all the names from a directory block with a single physical read operation. i created a test directory and was able to put 243 files into a single block of a directory.

re-phrased: my goal is to read the entire directory as fast as possible to acquire the list of names and write that list into a file. then i will run things to filter that huge list down to the few files i actually need, based only on the particular name fitting a collection of patterns. i don't need to open any of these files, yet.
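A minimal sketch of that two-step plan, assuming os.scandir() is acceptable. It yields entries lazily, and on POSIX systems CPython implements it on top of opendir()/readdir(), so the C library fetches directory entries in buffered chunks rather than doing one physical read per name. The function names, the listing file, and the patterns are hypothetical, chosen only for illustration.

```python
import fnmatch
import os


def dump_names(directory, listing_path):
    """Stream every entry name in `directory` into `listing_path`, one per line.

    Only entry.name is used, so no per-file stat() call is made.
    Assumes names contain no newline characters; write the listing
    outside `directory` so it does not list itself.
    """
    count = 0
    with os.scandir(directory) as entries:
        with open(listing_path, "w", encoding="utf-8",
                  errors="surrogateescape") as out:
            for entry in entries:
                out.write(entry.name + "\n")
                count += 1
    return count


def filter_names(listing_path, patterns):
    """Yield names from the listing that match any shell-style pattern."""
    with open(listing_path, encoding="utf-8",
              errors="surrogateescape") as listing:
        for line in listing:
            name = line.rstrip("\n")
            if any(fnmatch.fnmatchcase(name, pat) for pat in patterns):
                yield name
```

With this split, the slow directory scan happens once, and the pattern filters can be rerun against the listing file as often as needed without touching the directory again.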

hmmm, how to open a directory as a file in Python? trivial in C. never done this in Python. maybe os.open() and os.read().
#9
(May-15-2024, 02:22 AM)Skaperen Wrote: how to open a directory as a file in Python? trivial in C.
How do you do that in C? Isn't it a call to opendir() and a loop of calls to readdir() ?

Hm, ChatGPT told me one can read a directory in C with scandir(). In your case however it would use malloc to allocate 70 million character strings. I don't see how you could get only chunks of 256 entries, for example.
#10
(May-15-2024, 05:25 PM)Gribouillis Wrote:
(May-15-2024, 02:22 AM)Skaperen Wrote: how to open a directory as a file in Python? trivial in C.
How do you do that in C? Isn't it a call to opendir() and a loop of calls to readdir() ?
to open a directory as a file you simply do the steps you would do if it were a regular file: open() and read(). doing it "the file way" in Python, on a directory, raises IsADirectoryError.
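The same error can be reproduced at the file-descriptor level with os.open() and os.read(). A small sketch, with the hypothetical helper name try_raw_read: on Linux, os.open() on a directory succeeds, but os.read() fails with EISDIR, which Python reports as IsADirectoryError. That error code comes from the operating system rather than from Python, and other platforms may behave differently (the open itself may fail, for example).

```python
import os
import tempfile


def try_raw_read(directory, nbytes=4096):
    """Try to read a directory like a regular file.

    Returns the bytes read if the platform allows it, otherwise the
    OSError instance that was raised (hypothetical helper for illustration).
    """
    try:
        fd = os.open(directory, os.O_RDONLY)
    except OSError as exc:  # some platforms refuse to open a directory this way
        return exc
    try:
        return os.read(fd, nbytes)
    except OSError as exc:  # IsADirectoryError (EISDIR) on Linux
        return exc
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Show what this platform does when a directory is read as a file.
    print(repr(try_raw_read(tempfile.gettempdir())))
```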
(May-15-2024, 05:25 PM)Gribouillis Wrote: Hm, ChatGPT told me one can read a directory in C with scandir(). In your case however it would use malloc to allocate 70 million character strings. I don't see how you could get only chunks of 256 entries, for example.
i would not use scandir() (i have never used it for any purpose). if i were doing this in C, i would use read(), if i opened it with open(). i think readdir() buffers whatever it gets when it does read() instead of the whole directory all at once, which could make it usable (instead of duplicating code to slice up a directory). i need to try more things with Python, first, before i drop back to C to do this. i have zero experience mixing C and Python.