Python Forum
pymongo diff type problem to find images on two drives - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Data Science (https://python-forum.io/forum-44.html)
+--- Thread: pymongo diff type problem to find images on two drives (/thread-32859.html)

pymongo diff type problem to find images on two drives - darter - Mar-11-2021

I would like to compare two photo collections on two different hard drives so that I can locate any images on drive 2 that are not already on drive 1 and copy them to drive 1. I already have two MongoDB collections, and each document holds a hash value computed earlier from its image, so I can compare by this hash. My code seems very slow (maybe 40 minutes to check 40,000 documents against 40,000 others), so I am wondering how to do this faster. Like one minute? Anyway, I just coded what seemed like a simple way to do this, and now I would like to know how to see which operation is consuming the most time. I also think I need a whole new approach here. Maybe MongoDB can do this super quickly on its own?

z2 = coll2.find({})
for i in z2:
    somehash = i["hashvalue"]  # hash from the drive-2 collection, to check against drive 1

    # count_documents replaces the deprecated cursor.count() and runs the query once
    # instead of twice (the old code called z1.count() for the print and again for the if)
    match_count = coll.count_documents({"hashvalue": somehash})
    print('query count is', match_count)

    # True means the drive-2 image already exists on drive 1; False means inspect it later.
    # Use a descriptive name rather than shadowing the builtin `bool`.
    is_dupe = match_count > 0

    coll2.update_one(
        {'_id': i['_id']},
        {'$set': {'isdupe': is_dupe}}
    )
I was wondering if I should have just built an array of hashes from each collection and then worked with that in Python alone (maybe just using sets), and then, once I had found the non-duplicate hashes, gone back and found those documents in the data. Also, I did not sort anything, but I thought the MongoDB search would not be hampered by that. Maybe I'm wrong about that. Thanks.
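For reference, here is a minimal sketch of that set-based idea: pull all drive-1 hashes into a Python set with distinct(), test membership in memory, and queue the flag updates into a single bulk_write() instead of 40,000 individual update_one() calls. The collection and field names (coll, coll2, hashvalue, isdupe) come from the snippet above; the connection and database names are assumptions.

from pymongo import MongoClient, UpdateOne

client = MongoClient()                    # assumed local default connection
coll = client["photos"]["drive1"]         # hypothetical database/collection names
coll2 = client["photos"]["drive2"]

# One round trip: every distinct hash on drive 1, held in memory as a set.
drive1_hashes = set(coll.distinct("hashvalue"))

# Queue one UpdateOne per drive-2 document; membership tests on a set are O(1).
ops = [
    UpdateOne({"_id": doc["_id"]},
              {"$set": {"isdupe": doc["hashvalue"] in drive1_hashes}})
    for doc in coll2.find({}, {"hashvalue": 1})
]
if ops:
    coll2.bulk_write(ops, ordered=False)  # unordered lets the server batch freely

With 40,000 short hash strings the set comfortably fits in memory, and the whole job becomes two reads plus one bulk write rather than 80,000 round trips.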
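As for whether MongoDB can do this on its own: the aggregation pipeline's $lookup stage can do the cross-collection match server-side. A sketch, assuming both collections live in the same database and that the drive-1 collection is literally named "drive1" (swap in the real name); this lists the drive-2 documents with no match on drive 1 rather than writing isdupe flags back:

# Each drive-2 document gets a "matches" array of drive-1 documents sharing
# its hash; keeping only the empty-array cases yields the images to copy over.
missing = coll2.aggregate([
    {"$lookup": {
        "from": "drive1",            # hypothetical drive-1 collection name
        "localField": "hashvalue",
        "foreignField": "hashvalue",
        "as": "matches",
    }},
    {"$match": {"matches": {"$size": 0}}},
    {"$project": {"hashvalue": 1}},  # project a path/filename field too if one is stored
])
for doc in missing:
    print(doc["_id"], doc["hashvalue"])

The $lookup stage also benefits from the hashvalue index on the drive-1 collection, so it pairs well with the create_index calls above.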