recursive file scan

Skaperen · Jan-02-2018, 03:17 AM

has anyone ever written a recursive file scanner based on os.scandir() that can recurse a file tree ? what about a scanner that can scan 2 or more file trees in parallel (for example, given a list, set or tuple of file tree starting paths), making it easy to verify if each file tree is like the others (has the same set of names).

i am also interested if they made the file scanner as a generator class (one could instantiate it for each file tree to be able to do parallel comparison).

***snippsat*** · (This post was last modified: Jan-02-2018, 04:09 AM by snippsat.)

Skaperen Wrote:has anyone ever written a recursive file scanner based on os.scandir() that can recurse a file tree ?

os.walk() (which is recursive) has been built on os.scandir() since PEP 471.

pep 471 Wrote:As part of this proposal, os.walk() will also be modified to use scandir() rather than listdir() and os.path.isdir().
This will increase the speed of os.walk() very significantly (as mentioned above, by 2-20 times, depending on the system).
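For anyone who wants the os.scandir() version directly rather than going through os.walk(), a minimal sketch of a recursive generator could look like this. The name `scan_tree` is made up for illustration, and the sketch assumes every directory is readable and that symlinked directories should not be followed.

```python
import os

def scan_tree(top):
    """Lazily yield an os.DirEntry for everything below `top` (hypothetical helper)."""
    # The `with` block closes the directory handle when the generator
    # finishes or is closed early.
    with os.scandir(top) as entries:
        for entry in entries:
            yield entry
            # Descend into real directories only, not symlinks to directories.
            if entry.is_dir(follow_symlinks=False):
                yield from scan_tree(entry.path)
```

Because it yields one entry at a time, it holds one open directory handle per level of depth rather than a list of the whole tree.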

glob.glob() is now also based on scandir.
They have made it recursive as well.

Quote:Changed in version 3.5: Support for recursive globs using “**”.
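A small sketch of the recursive glob, using glob.iglob() so the matches arrive one at a time instead of as a full list. The helper name `iter_matches` and the default pattern are only illustrative.

```python
import glob
import os

def iter_matches(top, pattern="*.txt"):
    """Lazily yield paths under `top` whose name matches `pattern` (hypothetical helper)."""
    # glob.escape() keeps special characters in `top` from being read as wildcards.
    full_pattern = os.path.join(glob.escape(top), "**", pattern)
    # With recursive=True, "**" matches zero or more directory levels.
    yield from glob.iglob(full_pattern, recursive=True)
```

Note that by default glob skips names that start with a dot, so this is not a complete listing of a tree that contains hidden files.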

Skaperen · (This post was last modified: Jan-03-2018, 02:24 AM by Skaperen.)

ok, now, i would like to see example code using os.walk that walks through 2 file trees (considered to be nearly identical) in parallel and reports the differences in the names, what is missing and what is extra.

in particular i want to see the code that gets the next file from the walk.

consider the case of 2 huge file trees so large that the list of names in just 1 of them exceeds all available memory.

Skaperen · Jan-03-2018, 03:29 AM

something i often do is to end the file tree scan early, such as when a desired file is found (that is not just a simple name match). so i need to be able to step through the file tree, one name at a time, and i need to be able to bring the scan to a graceful end and release all resources bound by the scan, such as releasing memory and restoring the current working directory (only if changed). i implemented such a thing in C.
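In Python, one way to sketch that early exit is to wrap the os.walk() generator in contextlib.closing(), so the generator is closed as soon as the search returns. `find_first` and the `is_wanted` callback are hypothetical names, standing in for whatever test decides that a file is the desired one. os.walk() never changes the current working directory, so there is nothing to restore on that front.

```python
import os
from contextlib import closing

def find_first(top, is_wanted):
    """Return the first file path under `top` accepted by `is_wanted`, else None."""
    # closing() calls walker.close() on exit, even on an early return,
    # which discards the generator's pending state.
    with closing(os.walk(top)) as walker:
        for root, dirs, files in walker:
            for name in files:
                path = os.path.join(root, name)
                if is_wanted(path):  # hypothetical predicate supplied by the caller
                    return path
    return None
```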

imagine doing such a thing on a backup server which has always-mounted spinning replicas of dozens of machines and you need to search for a file which some given Python module can indicate is the desired file, and you have no idea which server it originated from.

**nilamo** · Jan-03-2018, 04:08 AM

It sort of sounds like you're describing rsync. Are you comparing file contents, or just names and paths?

Skaperen · (This post was last modified: Jan-04-2018, 02:57 AM by Skaperen.)

it is close to rsync, but will not be doing any copying. for my initial thing i am just wanting to look for differences in the names and paths. i do know i can walk each tree and get a list, repeat for 2nd tree, and compare, but that only scales so far. so i want to do it where i can get each name one at a time. i will be comparing up to 5 filesystems that are 8 Exabytes (8388608 Terabytes) in size, each.

the documentation says os.walk is a generator. shouldn't there be a way to get the next entry in the walk because it is a generator? can i treat it as an iterator?

wavic · Jan-04-2018, 07:37 AM

(Jan-04-2018, 02:57 AM)Skaperen Wrote: the documentation says os.walk is a generator. shouldn't there be a way to get the next entry in the walk because it is a generator? can i treat it as an iterator?

import os

walk = os.walk('/home/')

# os.walk() returns a generator, so next() fetches one directory at a time
root, dirs, files = next(walk)
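Building on next(), here is a rough sketch of walking two trees in lockstep and reporting names found in only one of them. It relies on the documented behaviour that, in top-down mode, os.walk() only descends into whatever is left in `dirs` when it resumes, so pruning both lists to the shared directory names keeps the two walks aligned. The function name and the "only in A" / "only in B" labels are invented for this example. It assumes every directory is readable in both trees and that no name is a symlinked directory in one tree and a real directory in the other, since either case would put the walks out of step.

```python
import os

def compare_trees(top_a, top_b):
    """Yield (relative_path, label) for names present in only one tree."""
    walk_a, walk_b = os.walk(top_a), os.walk(top_b)
    # zip() calls next() on each walk in turn, one directory per step.
    for (root_a, dirs_a, files_a), (root_b, dirs_b, files_b) in zip(walk_a, walk_b):
        rel = os.path.relpath(root_a, top_a)
        names_a = set(dirs_a) | set(files_a)
        names_b = set(dirs_b) | set(files_b)
        for name in sorted(names_a - names_b):
            yield os.path.normpath(os.path.join(rel, name)), "only in A"
        for name in sorted(names_b - names_a):
            yield os.path.normpath(os.path.join(rel, name)), "only in B"
        # Descend only into directories both trees share, in the same order.
        common = sorted(set(dirs_a) & set(dirs_b))
        dirs_a[:] = common
        dirs_b[:] = common
```

Only one directory listing per tree is compared at a time, so memory use depends on the largest directory and the depth, not on the total number of names. A name that is a file in one tree and a directory in the other is not reported by this sketch.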

**nilamo** · Jan-04-2018, 08:11 AM

Since the filesystems are massive, it'll probably need to write to stdout as it goes, instead of building a list of all differences, as there might not be enough RAM to actually store all the diffs. And since there are multiple filesystems, I think I'd organize it as a client-server, with each client just connecting to the server and emitting paths/filenames, then the server keeps track of what it's seen from where, removing entries once they're seen from all clients. I don't know if os.walk returns directories/files in any sort of order. If it's arbitrary, you might need to do some sorting client side first to try to make it faster.
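As a single-process stand-in for that idea, the sketch below sorts names within each directory so that every tree produces its paths in the same order, then merges the streams with heapq.merge() and reports each difference as soon as it is known. All function names here are hypothetical, and the sketch assumes the trees are locally readable; a real client-server version would replace the local generators with whatever each client sends.

```python
import heapq
import os
from itertools import groupby

def sorted_paths(top, rel=()):
    """Yield paths under `top` as tuples of components, in sorted order."""
    with os.scandir(os.path.join(top, *rel)) as it:
        entries = sorted(it, key=lambda e: e.name)  # one directory in memory at a time
    for entry in entries:
        path = rel + (entry.name,)
        yield path
        if entry.is_dir(follow_symlinks=False):
            yield from sorted_paths(top, path)

def tag(stream, index):
    """Pair every path with the index of the tree it came from."""
    for path in stream:
        yield path, index

def diff_trees(tops):
    """Yield (relative_path, indexes_of_trees_missing_it) as differences are found."""
    streams = [tag(sorted_paths(top), i) for i, top in enumerate(tops)]
    everyone = set(range(len(tops)))
    # heapq.merge() lazily interleaves the already-sorted streams.
    merged = heapq.merge(*streams)
    for path, group in groupby(merged, key=lambda item: item[0]):
        seen = {index for _, index in group}
        if seen != everyone:
            yield os.path.join(*path), sorted(everyone - seen)
```

A caller can print each result as it arrives, for example `for path, missing in diff_trees(tops): print(path, missing)`, so no list of differences is ever built up.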

Skaperen · Jan-05-2018, 04:49 AM

it could start building a list and if it gets too large, output it, dereference it, and run the code that outputs as it goes.
