Posts: 4,648
Threads: 1,495
Joined: Sep 2016
has anyone ever written a recursive file scanner based on os.scandir() that can recurse a file tree ? what about a scanner that can scan 2 or more file trees in parallel (for example, given a list, set or tuple of file tree starting paths), making it easy to verify if each file tree is like the others (has the same set of names).
i am also interested if they made the file scanner as a generator class (one could instantiate it for each file tree to be able to do parallel comparison).
Tradition is peer pressure from dead people
What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
Posts: 7,320
Threads: 123
Joined: Sep 2016
Jan-02-2018, 04:09 AM
(This post was last modified: Jan-02-2018, 04:09 AM by snippsat.)
Skaperen Wrote:has anyone ever written a recursive file scanner based on os.scandir() that can recurse a file tree ?
os.walk() (which is recursive) has been built on os.scandir() since PEP 471.
pep 471 Wrote:As part of this proposal, os.walk() will also be modified to use scandir() rather than listdir() and os.path.isdir().
This will increase the speed of os.walk() very significantly (as mentioned above, by 2-20 times, depending on the system).
Also, glob.glob() (in the glob module, not os) is now based on scandir as well, and they have made it recursive.
Quote:Changed in version 3.5: Support for recursive globs using “**”.
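For anyone who wants the scandir-based generator asked about above rather than os.walk(), here is a minimal sketch. The name scan_tree and the choice to yield paths relative to the starting directory are assumptions made for illustration, not anything from the standard library. Names are sorted within each directory so that two trees with the same contents produce the same sequence.

```python
import os

def scan_tree(top, rel=""):
    """Yield every entry under top as a path relative to top, one at a time."""
    # Read and sort a single directory; only one directory per nesting
    # level is held in memory at any moment.
    with os.scandir(os.path.join(top, rel)) as it:
        entries = sorted(it, key=lambda e: e.name)
    for entry in entries:
        path = os.path.join(rel, entry.name)
        yield path
        # Descend into real directories only; symlinks are not followed.
        if entry.is_dir(follow_symlinks=False):
            yield from scan_tree(top, path)
```

Because it is a generator, one instance can be created per tree and each can be advanced independently with next(), for example `a = scan_tree(tree_a); b = scan_tree(tree_b); next(a); next(b)`.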
Posts: 4,648
Threads: 1,495
Joined: Sep 2016
Jan-03-2018, 02:24 AM
(This post was last modified: Jan-03-2018, 02:24 AM by Skaperen.)
ok, now, i would like to see example code using os.walk that walks through 2 file trees (considered to be nearly identical) in parallel and reports the differences in the names, what is missing and what is extra.
in particular i want to see the code that gets the next file from the walk.
consider the case of 2 huge file trees so large that the list of names in just 1 of them exceeds all available memory.
Posts: 4,648
Threads: 1,495
Joined: Sep 2016
something i often do is to end the file tree scan early, such as when a desired file is found (that is not just a simple name match). so i need to be able to step through the file tree, one name at a time, and i need to be able to bring the scan to a graceful end and release all resources bound by the scan, such as releasing memory and restoring the current working directory (only if changed). i implemented such a thing in C.
imagine doing such a thing on a backup server which has always-mounted spinning replicas of dozens of machines and you need to search for a file which some given Python module can indicate is the desired file, and you have no idea which server it originated from.
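A rough sketch of the stop-early pattern described above, using a generator whose cleanup lives in a finally clause. The names walk_files and find_first, and the on_close hook, are hypothetical and only there to make the cleanup visible. os.walk() itself does not change the current working directory, so there is nothing to restore in this version.

```python
import os
from contextlib import closing

def walk_files(top, on_close=None):
    """Lazily yield file paths under top; on_close is a hypothetical cleanup hook."""
    try:
        for root, dirs, files in os.walk(top):
            dirs.sort()  # in-place sort makes the descent order deterministic
            for name in sorted(files):
                yield os.path.join(root, name)
    finally:
        # Reached when the walk is exhausted or the generator is closed early.
        if on_close is not None:
            on_close()

def find_first(tops, is_wanted, on_close=None):
    """Return the first path accepted by is_wanted, ending the scan right there."""
    for top in tops:
        # closing() calls scan.close() on the way out, which triggers the
        # finally clause inside walk_files even when we return mid-scan.
        with closing(walk_files(top, on_close)) as scan:
            for path in scan:
                if is_wanted(path):
                    return path
    return None
```

The is_wanted argument can be any callable, so the test for the desired file is free to open the file or hand it to another module instead of just matching the name.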
Posts: 3,458
Threads: 101
Joined: Sep 2016
It sort of sounds like you're describing rsync. Are you comparing file contents, or just names and paths?
Posts: 4,648
Threads: 1,495
Joined: Sep 2016
Jan-04-2018, 02:57 AM
(This post was last modified: Jan-04-2018, 02:57 AM by Skaperen.)
it is close to rsync, but will not be doing any copying. for my initial thing i am just wanting to look for differences in the names and paths. i do know i can walk each tree and get a list, repeat for 2nd tree, and compare, but that only scales so far. so i want to do it where i can get each name one at a time. i will be comparing up to 5 filesystems that are 8 Exabytes (8388608 Terabytes) in size, each.
the documentation says os.walk is a generator. shouldn't there be a way to get the next entry in the walk because it is a generator? can i treat it as an iterator?
Posts: 2,953
Threads: 48
Joined: Sep 2016
(Jan-04-2018, 02:57 AM)Skaperen Wrote: the documentation says os.walk is a generator. shouldn't there be a way to get the next entry in the walk because it is a generator? can i treat it as an iterator?
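import os
walk = os.walk('/home/')  # os.walk() returns a generator, so next() works on it
root, dirs, files = next(walk)  # one directory per step: its path, subdirectory names, file names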
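Building on next(), here is one possible way to step two os.walk() generators in lockstep and report names that are missing or extra, holding only one directory per tree in memory at a time. This is a sketch under assumptions: compare_names is a made-up name, both trees are assumed to be readable (os.walk() silently skips directories it cannot list, which would break the lockstep), and a name that is a directory in one tree but a file in the other is not reported.

```python
import os

def compare_names(left, right):
    """Yield (relative_dir, only_in_left, only_in_right) where the names differ."""
    # zip() calls next() on each walk once per loop iteration.
    for (root_l, dirs_l, files_l), (root_r, dirs_r, files_r) in zip(
            os.walk(left), os.walk(right)):
        rel = os.path.relpath(root_l, left)
        if rel != os.path.relpath(root_r, right):
            raise RuntimeError("walks fell out of step at " + rel)
        names_l = set(dirs_l) | set(files_l)
        names_r = set(dirs_r) | set(files_r)
        if names_l != names_r:
            yield rel, sorted(names_l - names_r), sorted(names_r - names_l)
        # Prune both walks in place to the shared real subdirectories, in the
        # same order, so the next step visits the same directory in each tree.
        common = sorted(
            d for d in set(dirs_l) & set(dirs_r)
            if not os.path.islink(os.path.join(root_l, d))
            and not os.path.islink(os.path.join(root_r, d)))
        dirs_l[:] = common
        dirs_r[:] = common
```

A directory that exists in only one tree is reported once by name and then not descended into, which keeps the output short when a whole subtree is missing.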
Posts: 3,458
Threads: 101
Joined: Sep 2016
Since the filesystems are massive, it'll probably need to write to stdout as it goes, instead of building a list of all differences, as there might not be enough ram to actually store all the diffs. And since there's multiple filesystems, I think I'd organize it as a client-server, with each client just connecting to the server, and emitting paths/filenames, then the server keeps track of what it's seen from where, removing entries once they're seen from all clients. I don't know if os.walk returns directories/files in any sort of order. If it's arbitrary, you might need to do some sorting client side first to try to make it faster.
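To illustrate the "sort first, then emit as you go" idea, here is a single-process sketch rather than the client-server layout described above: each tree is turned into a sorted stream of paths, the streams are merged with heapq.merge(), and a difference is yielded as soon as a path turns out to be missing somewhere. The names sorted_paths, tagged and diff_trees are invented for this example. Paths are compared as tuples of components, since that is the order a sorted depth-first scan produces.

```python
import heapq
import os
from itertools import groupby

def sorted_paths(top, rel=()):
    """Yield each entry under top as a tuple of path components, in sorted order."""
    with os.scandir(os.path.join(top, *rel)) as it:
        entries = sorted(it, key=lambda e: e.name)
    for entry in entries:
        parts = rel + (entry.name,)
        yield parts
        if entry.is_dir(follow_symlinks=False):
            yield from sorted_paths(top, parts)

def tagged(index, top):
    """Pair every path from one tree with that tree's position in the list."""
    for parts in sorted_paths(top):
        yield parts, index

def diff_trees(tops):
    """Yield (relative_path, trees_missing_it) as soon as a difference is known."""
    merged = heapq.merge(*(tagged(i, top) for i, top in enumerate(tops)))
    # Equal paths from different trees arrive next to each other in the merge.
    for parts, group in groupby(merged, key=lambda item: item[0]):
        seen = {index for _, index in group}
        missing = [top for i, top in enumerate(tops) if i not in seen]
        if missing:
            yield os.path.join(*parts), missing
```

Printing each result inside `for path, missing in diff_trees(tops):` writes the differences out as they are found, so memory use depends on the size of individual directories and the nesting depth, not on the total number of names or differences. When a whole directory is absent from one tree, every entry beneath it is reported as well.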
Posts: 4,648
Threads: 1,495
Joined: Sep 2016
it could start building a list and if it gets too large, output it, dereference it, and run the code that outputs as it goes.