Mar-11-2021, 04:52 AM
I would like to compare two photo collections on two different hard drives so that I can locate any images on drive 2 that are not already on drive 1 and copy them to drive 1. I already have two MongoDB collections, and each document holds a hash value computed from its image, so I can compare by this hash (the hashing was done earlier). My code seems very slow: maybe 40 minutes to check 40,000 documents against 40,000 others. So I am wondering how to do this faster, ideally in about a minute. Anyway, I just coded what seemed like a simple way to do it, and now I would like to know how to see which operation is consuming the most time. I also suspect I need a whole new approach here. Maybe MongoDB can do this super quickly on its own?
```python
z2 = coll2.find({})
count = 0
for i in z2:
    somehash = i["hashvalue"]   # hash value from the drive 2 collection, to be compared against drive 1
    # can we find the hash from the drive 2 collection in the drive 1 collection?
    matches = coll.count_documents({"hashvalue": somehash})
    print('query count is', matches)
    if matches > 0:
        is_dupe = True    # we did get a match, so there is a duplicate
    else:
        is_dupe = False   # the image on drive 2 does not have a match on drive 1 (so inspect it later)
    coll2.update_one({'_id': i['_id']}, {'$set': {'isdupe': is_dupe}})
```
I was wondering if I should have just built an array of hashes from each collection and then worked with that in Python alone (maybe just using sets), and then, once I had found the non-duplicate hashes, gone back and looked those documents up in the database (there is a rough sketch of what I mean below). Also, I did not sort anything, but I thought the MongoDB search would not be hampered by that. Maybe I'm wrong about that. Thanks.
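This is roughly what I meant by the set idea (untested sketch; it pulls every hash value from the drive 1 collection into memory with distinct(), which I assume is fine for 40,000 documents):

```python
# all hash values already present on drive 1
drive1_hashes = set(coll.distinct("hashvalue"))

# walk drive 2 and flag each document according to whether its hash exists on drive 1
for doc in coll2.find({}, {"hashvalue": 1}):
    is_dupe = doc["hashvalue"] in drive1_hashes
    coll2.update_one({"_id": doc["_id"]}, {"$set": {"isdupe": is_dupe}})
```

I guess the per-document update_one calls could also be batched with bulk_write if they turn out to be the slow part, but I have not tried that.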
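On the sorting point, from what I have read it is an index on the queried field that matters rather than any sort order, so maybe all that is missing from my original loop is something like:

```python
# index the hash field so each lookup does not scan the whole drive 1 collection
coll.create_index("hashvalue")
```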
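For seeing which operation is eating the time, I thought I could just wrap the two database calls with time.perf_counter, something like this (untested sketch, same coll / coll2 variables as above):

```python
import time

find_total = 0.0
update_total = 0.0

for i in coll2.find({}):
    t0 = time.perf_counter()
    matches = coll.count_documents({"hashvalue": i["hashvalue"]})  # lookup on drive 1
    find_total += time.perf_counter() - t0

    t0 = time.perf_counter()
    coll2.update_one({"_id": i["_id"]}, {"$set": {"isdupe": matches > 0}})  # flag on drive 2
    update_total += time.perf_counter() - t0

print("time spent in lookups:", find_total)
print("time spent in updates:", update_total)
```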
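And for doing the whole comparison inside MongoDB itself, I assume an aggregation with $lookup on the hash field would work, as long as both collections live in the same database (again untested; "drive1_collection" is just a placeholder for the real name of my drive 1 collection):

```python
pipeline = [
    {
        "$lookup": {
            "from": "drive1_collection",   # placeholder: name of the drive 1 collection
            "localField": "hashvalue",
            "foreignField": "hashvalue",
            "as": "matches",
        }
    },
    {"$match": {"matches": {"$size": 0}}},   # keep only drive 2 docs with no match on drive 1
    {"$project": {"hashvalue": 1}},
]

# each result is an image on drive 2 that does not exist on drive 1
for doc in coll2.aggregate(pipeline):
    print("not on drive 1:", doc["_id"], doc["hashvalue"])
```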