Posts: 4,653
Threads: 1,496
Joined: Sep 2016
is there a way to limit os.listdir() so that it only reads several (maybe 32 to 256) names at a time? i need to scan through a massively huge directory and it is having trouble with it being so big. the directory has well over 70 million files.
Tradition is peer pressure from dead people
What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
Posts: 6,809
Threads: 20
Joined: Feb 2020
Don’t use os.listdir(). Use pathlib's Path.iterdir().
Posts: 1,094
Threads: 143
Joined: Jul 2017
May-12-2024, 05:14 AM
(This post was last modified: May-12-2024, 05:14 AM by Pedroski55.)
Fun with generators!
from pathlib import Path
import sys
mydir = Path('/home/pedro')
filelist = (filename for filename in mydir.rglob("*") if filename.is_file())
type(filelist) # generator
sys.getsizeof(filelist) # returns 104
total = sum(1 for f in filelist) # takes a couple of seconds then returns 193820
# show some of the files
filelist = (filename for filename in mydir.rglob("*") if filename.is_file())
for _ in range(32):
    print(next(filelist))
Apparently, in the latest Python (3.12+), pathlib has .walk() just like os (I don't have the latest Python!)
import pathlib
path = pathlib.Path(r"E:\folder")
for root, dirs, files in path.walk():
    print("Root: ")
    print(root)
    print("Dirs: ")
    print(dirs)
    print("Files: ")
    print(files)
    print("")
What do you want to do with 70 million files??
Posts: 4,801
Threads: 77
Joined: Jan 2018
In addition to Path.iterdir(), you can use more_itertools.chunked() to group the names into batches.
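A minimal sketch of the chunking idea using only the standard library: itertools.islice pulls up to batch_size entries at a time from os.scandir(), which the Python docs describe as returning an iterator of entries rather than a full list. The function name iter_name_batches and the default of 256 are illustrative assumptions, not something from the thread; more_itertools.chunked() applied to the same iterator should give similar batches.

```python
import os
from itertools import islice


def iter_name_batches(directory, batch_size=256):
    """Yield lists of up to batch_size entry names from directory.

    Hypothetical helper: only one batch of names is held in memory at a time.
    """
    with os.scandir(directory) as entries:
        while True:
            # islice consumes at most batch_size items from the shared iterator
            batch = [entry.name for entry in islice(entries, batch_size)]
            if not batch:
                return
            yield batch
```

Usage would look like `for names in iter_name_batches('/some/huge/dir'): ...`, where each `names` is a plain list of at most 256 strings.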
« We can solve any problem by introducing an extra level of indirection »
Posts: 7,324
Threads: 123
Joined: Sep 2016
May-12-2024, 09:11 AM
(This post was last modified: May-12-2024, 09:11 AM by snippsat.)
Can also use itertools.islice to slice into a generator.
So here you load only a range of files, e.g. 5-10 or 32-256, into memory.
from pathlib import Path
from itertools import islice
def generate_paths(directory):
    for path in Path(directory).rglob('*'):
        if path.is_file():
            yield path

if __name__ == '__main__':
    dest = r'C:\Test'
    # Slice into the generator to get files in range specified
    selected_files = islice(generate_paths(dest), 5, 11)
    for path in selected_files:
        print(path)
Posts: 4,653
Threads: 1,496
Joined: Sep 2016
(May-12-2024, 05:14 AM)Pedroski55 Wrote: What do you want to do with 70 million files??
reduce it to about 700 files or maybe even fewer.
Posts: 4,653
Threads: 1,496
Joined: Sep 2016
(May-12-2024, 03:15 AM)deanhystad Wrote: Don’t use os.listdir. Use pathlib.iterdir
it gives me only ONE (1) at a time. i guess that's what "iter" implies. this is going to take "forever". is there a way to get like 256 at a time, or at least do one input from the directory per block that the names are stored on?
Posts: 4,653
Threads: 1,496
Joined: Sep 2016
the desire to get 32 to 256 at a time is not so i can have a loop do one at a time. it's so i can get all the names from a directory block with a single physical read operation. i created a test directory and was able to put 243 files into a single block of a directory.
re-phrased: my goal is to read the entire directory as fast as possible to acquire the list of names and write that list into a file. then i will run things to filter that huge list down to the few files i actually need, based only on the particular name fitting a collection of patterns. i don't need to open any of these files, yet.
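A rough sketch of that two-pass plan, under some assumptions: names are streamed from os.scandir() straight into a listing file, and a second pass filters the listing with shell-style patterns via fnmatch. The helper names dump_names and filter_names, the one-name-per-line format, and the use of fnmatch patterns are all illustrative choices, not something stated in the thread.

```python
import fnmatch
import os


def dump_names(directory, listing_path):
    """Write every entry name in directory to listing_path, one per line.

    Returns the number of names written. Assumes no name contains a newline
    and that listing_path lives outside the directory being scanned.
    """
    count = 0
    # surrogateescape lets undecodable POSIX filenames round-trip through text
    with os.scandir(directory) as entries, \
            open(listing_path, "w", encoding="utf-8", errors="surrogateescape") as out:
        for entry in entries:
            out.write(entry.name + "\n")
            count += 1
    return count


def filter_names(listing_path, patterns):
    """Yield names from the listing that match any of the shell-style patterns."""
    with open(listing_path, encoding="utf-8", errors="surrogateescape") as listing:
        for line in listing:
            name = line.rstrip("\n")
            if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns):
                yield name
```

Neither pass opens or stats the files themselves, and neither holds the full list of names in memory; only the listing file grows with the size of the directory.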
hmmm, how to open a directory as a file in Python? trivial in C. never done this in Python. maybe os.open() and os.read().
Posts: 4,801
Threads: 77
Joined: Jan 2018
May-15-2024, 05:25 PM
(This post was last modified: May-15-2024, 05:25 PM by Gribouillis.)
(May-15-2024, 02:22 AM)Skaperen Wrote: how to open a directory as a file in Python? trivial in C.
How do you do that in C? Isn't it a call to opendir() and a loop of calls to readdir()?
Hm, ChatGPT told me one can read a directory in C with scandir(). In your case, however, it would use malloc to allocate 70 million character strings. I don't see how you could get only chunks of 256 entries, for example.
Posts: 4,653
Threads: 1,496
Joined: Sep 2016
(May-15-2024, 05:25 PM)Gribouillis Wrote: (May-15-2024, 02:22 AM)Skaperen Wrote: how to open a directory as a file in Python? trivial in C. How do you do that in C? Isn't it a call to opendir() and a loop of calls to readdir()?
to open a directory as a file you simply do the steps you would do if it is a regular file, open() and read(). doing "the file way" in Python, on a directory, raises IsADirectoryError.
(May-15-2024, 05:25 PM)Gribouillis Wrote: Hm, ChatGPT told me one can read a directory in C with scandir(). In your case, however, it would use malloc to allocate 70 million character strings. I don't see how you could get only chunks of 256 entries, for example.
i would not use scandir() (i have never used it for any purpose). if i were doing this in C, i would use read(), if i opened it with open(). i think readdir() buffers whatever it gets when it does read() instead of reading the whole directory all at once, which could make it usable (instead of duplicating code to slice up a directory). i need to try more things with Python first, before i drop back to C to do this. i have zero experience mixing C and Python.
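For anyone who wants to check the open()/read() behaviour from Python, here is a small sketch. The helper names try_raw_directory_read and count_entries are made up for illustration. On Linux the os.read() call is expected to fail with IsADirectoryError, and other platforms may fail differently or earlier, so the function simply returns whatever OSError it gets. The second helper walks the same directory through os.scandir(), which the Python docs say uses opendir() and readdir() on Unix-like systems, so any buffering readdir() does (as guessed above) would apply even though Python hands back one entry at a time.

```python
import os


def try_raw_directory_read(directory, nbytes=4096):
    """Try os.open()/os.read() on a directory.

    Returns the bytes read if the platform allows it, otherwise the
    OSError instance that was raised (hypothetical helper for illustration).
    """
    fd = None
    try:
        fd = os.open(directory, os.O_RDONLY)
        return os.read(fd, nbytes)
    except OSError as exc:
        return exc
    finally:
        if fd is not None:
            os.close(fd)


def count_entries(directory):
    """Count directory entries lazily via os.scandir(), without building a list."""
    with os.scandir(directory) as entries:
        return sum(1 for _ in entries)
```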