Python Forum
recursive file scan
#1
has anyone ever written a recursive file scanner based on os.scandir() that can recurse a file tree ? what about a scanner that can scan 2 or more file trees in parallel (for example, given a list, set or tuple of file tree starting paths), making it easy to verify if each file tree is like the others (has the same set of names).

i am also interested if they made the file scanner as a generator class (one could instantiate it for each file tree to be able to do parallel comparison).
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
Skaperen Wrote: has anyone ever written a recursive file scanner based on os.scandir() that can recurse a file tree ?
os.walk() (which is recursive) has been built on os.scandir() since PEP 471 was implemented in Python 3.5.
PEP 471 Wrote: As part of this proposal, os.walk() will also be modified to use scandir() rather than listdir() and os.path.isdir().
This will increase the speed of os.walk() very significantly (as mentioned above, by 2-20 times, depending on the system).
glob.glob() (in the glob module; there is no os.glob()) is now based on scandir as well.
It can also recurse when called with recursive=True.
Quote: Changed in version 3.5: Support for recursive globs using “**”.
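If you want a scanner written directly on os.scandir(), a minimal sketch could look like the one below. The name scan_tree, the per-directory sorting, and the choice not to follow symlinked directories are my own assumptions for illustration, not anything from the standard library. Because it is a generator, you get one entry at a time, and each directory handle is closed as soon as its entries have been read, so stopping early (break, or calling close() on the generator) leaves nothing open.

```python
import os

def scan_tree(top):
    """Yield os.DirEntry objects for everything under top, depth first.

    Hypothetical helper: names within each directory are sorted so that
    two similar trees are visited in the same order.
    """
    try:
        with os.scandir(top) as it:      # context manager needs Python 3.6+
            entries = sorted(it, key=lambda e: e.name)
    except OSError:
        return                           # unreadable directory: skip it
    for entry in entries:
        yield entry
        # Assumption: do not follow symlinks, to avoid cycles.
        if entry.is_dir(follow_symlinks=False):
            yield from scan_tree(entry.path)
```

The sort means one directory's listing is held in memory at a time, not the whole tree.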
#3
ok, now, i would like to see example code using os.walk that walks through 2 file trees (considered to be nearly identical) in parallel and reports the differences in the names, what is missing and what is extra.

in particular i want to see the code that gets the next file from the walk.

consider the case of 2 huge file trees so large that the list of names in just 1 of them exceeds all available memory.
#4
something i often do is to end the file tree scan early, such as when a desired file is found (that is not just a simple name match). so i need to be able to step through the file tree, one name at a time, and i need to be able to bring the scan to a graceful end and release all resources bound by the scan, such as releasing memory and restoring the current working directory (only if changed). i implemented such a thing in C.

imagine doing such a thing on a backup server which has always-mounted spinning replicas of dozens of machines and you need to search for a file which some given Python module can indicate is the desired file, and you have no idea which server it originated from.
#5
It sort of sounds like you're describing rsync. Are you comparing file contents, or just names and paths?
#6
it is close to rsync, but will not be doing any copying. for my initial thing i am just wanting to look for differences in the names and paths. i do know i can walk each tree and get a list, repeat for 2nd tree, and compare, but that only scales so far. so i want to do it where i can get each name one at a time. i will be comparing up to 5 filesystems that are 8 Exabytes (8388608 Terabytes) in size, each.

the documentation says os.walk is a generator. shouldn't there be a way to get the next entry in the walk because it is a generator? can i treat it as an iterator?
#7
(Jan-04-2018, 02:57 AM)Skaperen Wrote: the documentation says os.walk is a generator. shouldn't there be a way to get the next entry in the walk because it is a generator? can i treat it as an iterator?


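import os

# os.walk() returns a generator, so it can be used as an iterator
walk = os.walk('/home/')

# each next() call advances the walk by one directory
root, dirs, files = next(walk)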
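Building on next(), here is a rough sketch of comparing two trees in lockstep, one directory at a time, so that only one directory's names per tree are in memory. walk_dirs and compare_trees are made-up names. The sketch assumes the trees do not change during the scan and it only compares names, not file types or contents.

```python
import os

def walk_dirs(top):
    """Yield (relative_dir_parts, names) one directory at a time, in sorted order."""
    for root, dirs, files in os.walk(top):
        dirs.sort()                  # in-place sort controls the order os.walk descends
        rel = os.path.relpath(root, top)
        key = () if rel == os.curdir else tuple(rel.split(os.sep))
        yield key, set(dirs) | set(files)

def compare_trees(top_a, top_b):
    """Yield (relative_path, where), where is 'only in A' or 'only in B'."""
    walk_a, walk_b = walk_dirs(top_a), walk_dirs(top_b)
    a, b = next(walk_a, None), next(walk_b, None)
    while a is not None and b is not None:
        if a[0] == b[0]:
            # Same directory on both sides: report the names that differ.
            for name in sorted(a[1] - b[1]):
                yield os.path.join(*a[0], name), 'only in A'
            for name in sorted(b[1] - a[1]):
                yield os.path.join(*b[0], name), 'only in B'
            a, b = next(walk_a, None), next(walk_b, None)
        elif a[0] < b[0]:
            a = next(walk_a, None)   # directory only in A, already reported via its parent
        else:
            b = next(walk_b, None)   # directory only in B, already reported via its parent
```

A directory that exists on one side only is reported once, by name, and its contents are skipped. Since compare_trees is itself a generator, the differences can be printed as they are found.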
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#8
Since the filesystems are massive, it'll probably need to write to stdout as it goes, instead of building a list of all differences, as there might not be enough RAM to actually store all the diffs. And since there are multiple filesystems, I think I'd organize it as a client-server setup, with each client just connecting to the server and emitting paths/filenames. The server then keeps track of what it's seen from where, removing entries once they've been seen from all clients. I don't know if os.walk returns directories/files in any sort of order. If it's arbitrary, you might need to do some sorting client side first, so that matching entries arrive close together and the server can drop them sooner.
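To make the server-side bookkeeping concrete, here is a small in-memory sketch with the networking left out entirely. SeenTracker, report and differences are hypothetical names, and the source labels are whatever identifies each client.

```python
class SeenTracker:
    """Track which sources have reported each path (hypothetical helper)."""

    def __init__(self, sources):
        self.sources = frozenset(sources)
        self.pending = {}                 # path -> set of sources that reported it

    def report(self, source, path):
        seen = self.pending.setdefault(path, set())
        seen.add(source)
        if seen == self.sources:
            del self.pending[path]        # seen everywhere, so not a difference

    def differences(self):
        """Yield (path, sources_missing_it) for paths not seen from every source."""
        for path, seen in sorted(self.pending.items()):
            yield path, sorted(self.sources - seen)
```

The pending dict only stays small if the clients emit paths in roughly the same order, which is where the client-side sorting comes in.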
#9
it could start building a list and, if the list gets too large, output it, dereference it, and switch to the code that outputs as it goes.