Why doesn't multithreading reduce the execution time?
#1
Hello everyone,
Let me explain my situation; maybe someone can clear things up and fill a gap in my knowledge.
I had to recover photos from a broken HDD, so I recovered the undamaged sectors and transferred them to an .img file.
Using Scalpel (on Linux Mint), I extracted more than 350 GB of images, including many that were useless because they were too small.
Realizing how much material that was, I knew I couldn't check 350 GB of images by hand...
So I wrote a script, very quickly and admittedly badly, but it works and does what I need.
In essence, I have 304 folders, each containing many jpg files.
For each folder, the program looks for "large", working jpg images.
Here is the first version of the code:
import os, time, shutil
from PIL import Image

sext = []
print(time.strftime("%H|%M|%S"))

# Ask the user which extensions to search for; in my case I only want jpg,
# so I enter: .jpg
# After choosing the extensions, enter 0 to finish
while True:
    sext.append(input("search: "))
    if sext[-1] == "0":
        del sext[-1]
        break
sext = tuple(sext)

# Walk the recovered-files folders and find every file ending with one of
# the extensions in "sext" (a tuple; here I only care about ".jpg")
for parent, directories, filenames in os.walk("/media/mionomeutente/PENNA USB/Files recuperati"):
    for x in filenames:
        if x.endswith(sext):
            fileDaAnalizzare = os.path.join(parent, x)

            # Open the file and check that the image is large enough
            try:
                im = Image.open(fileDaAnalizzare)
                width, height = im.size
                if width > 350 and height > 350:
                    print(fileDaAnalizzare)

                    # Simply copy the image that meets my needs into the "grandi" folder
                    shutil.copy(fileDaAnalizzare, '/media/mionomeutente/PENNA USB/grandi')
            except Exception:
                pass

print(time.strftime("%H|%M|%S"))
I could have stopped here; the program is very basic, but it works and does what I need. Then I asked myself:
if several threads did this instead of one, would the time be significantly reduced?
(Remember, I have 350 GB to go through...)
So, again quickly (and obviously badly written), I modified the same program to run in threads.
The code:
import os, threading, shutil, random
from PIL import Image

nThreadz = 0
threadz = []

while nThreadz <= 0:
    nThreadz = int(input("number of threads: "))

# Same extension prompt as before: enter the extensions, then 0 to finish
sext = []
while True:
    sext.append(input("search: "))
    if sext[-1] == "0":
        del sext[-1]
        break
sext = tuple(sext)

# Collect the folders once, up front; the threads will split this list
listMatch = []
for parent, directories, filenames in os.walk("/media/mionomeutente/PENNA USB/Files recuperati"):
    listMatch.append(parent)

print("currently, %d folders" % len(listMatch))


class scan(threading.Thread):

    def __init__(self, index):
        threading.Thread.__init__(self)
        self.index = index  # which slice of listMatch this thread handles

    def run(self):
        global nThreadz, sext
        # nThreadz may not divide len(listMatch) evenly,
        # so the last thread takes whatever is left over
        chunk = len(listMatch) // nThreadz
        vadoDa = self.index * chunk
        if self.index != nThreadz - 1:
            vadoA = (self.index + 1) * chunk
        else:
            vadoA = len(listMatch)

        # Each thread searches its own share of the 304 folders
        for percorso in listMatch[vadoDa:vadoA]:
            for parent, directories, filenames in os.walk(percorso):
                for x in filenames:
                    if x.endswith(sext):
                        fileDaAnalizzare = os.path.join(parent, x)
                        # Open the file and check that the image is large enough
                        try:
                            im = Image.open(fileDaAnalizzare)
                            width, height = im.size
                            if width > 350 and height > 350:
                                if not os.path.exists('/media/mionomeutente/PENNA USB/grandi/' + x):
                                    shutil.copy(fileDaAnalizzare, '/media/mionomeutente/PENNA USB/grandi')
                                else:
                                    # Name already taken: prefix a random number to avoid overwriting
                                    nomeRandom = str(random.randint(0, 1000000000)) + x
                                    shutil.copy(fileDaAnalizzare, '/media/mionomeutente/PENNA USB/grandi/' + nomeRandom)
                        except Exception:
                            pass


for x in range(nThreadz):
    threadz.append(scan(x))

for x in range(nThreadz):
    threadz[x].start()
The operation is identical to the previous script, except that I can choose the number of threads to "divide" the work among... or so I thought.
In fact, I got very strange results:
as the number of threads increased, the number of files sifted did not increase, it actually decreased:
1 thread = 3181 elements in 30 seconds
2 threads = 648 elements in 30 seconds
3 threads = 764 elements in 30 seconds
304 threads = 166 elements in 30 seconds
I hope someone finds my topic interesting and can clear up this "mystery" for me ;)
(I certainly don't rule out having made some mistakes.)
#2
For the main program, try running:
import cProfile
import os, time, shutil
from PIL import Image

def main():
    sext = []
    print(time.strftime("%H|%M|%S"))

    # Ask the user which extensions to search for; enter 0 to finish
    while True:
        sext.append(input("search: "))
        if sext[-1] == "0":
            del sext[-1]
            break
    sext = tuple(sext)

    # Walk the recovered-files folders and find every file ending with
    # one of the extensions in "sext"
    for parent, directories, filenames in os.walk("/media/mionomeutente/PENNA USB/Files recuperati"):
        for x in filenames:
            if x.endswith(sext):
                fileDaAnalizzare = os.path.join(parent, x)

                # Open the file and check that the image is large enough
                try:
                    im = Image.open(fileDaAnalizzare)
                    width, height = im.size
                    if width > 350 and height > 350:
                        print(fileDaAnalizzare)

                        # Copy the image that meets my needs into the "grandi" folder
                        shutil.copy(fileDaAnalizzare, '/media/mionomeutente/PENNA USB/grandi')
                except Exception:
                    pass

    print(time.strftime("%H|%M|%S"))


if __name__ == '__main__':
    cProfile.run('main()')
And see if you can identify the bottleneck.
Then try the same with the second script.
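If the output is long, sorting it helps; cProfile.run accepts a sort argument, so a variant like this (reusing the same main() as above) puts the most expensive call chains at the top:

import cProfile

# Sort the profile report by cumulative time so the slowest call
# chains (likely Image.open or the disk copies) appear first
cProfile.run('main()', sort='cumtime')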
#3
I understood the reason: I didn't know about the concept of the GIL in Python, which is also really well explained by this GIF:
[Image: 1*wd0z1C75VsxD42QdKqCjpA.gif]
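The effect is easy to reproduce with a minimal sketch (a toy CPU-bound countdown; exact timings depend on the machine): two threads take about as long as, or longer than, doing the same work sequentially.

import threading, time

def count(n):
    # Pure-Python CPU-bound loop: only one thread can hold the GIL at a time
    while n > 0:
        n -= 1

N = 10_000_000

start = time.perf_counter()
count(N); count(N)
print("sequential: %.2fs" % (time.perf_counter() - start))

start = time.perf_counter()
t1 = threading.Thread(target=count, args=(N,))
t2 = threading.Thread(target=count, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
print("two threads: %.2fs" % (time.perf_counter() - start))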
The question then is: does a simple language like Python, but without the GIL limitation, exist?
(I'm not talking about multiprocessing, which in Python should work very well and bypasses the GIL limitation.)
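For completeness, here is a minimal multiprocessing sketch of my task (the scan_folder helper, the pool size, and the module-level constants are illustrative choices, not tested on the real data):

import os, shutil
from multiprocessing import Pool
from PIL import Image

SRC = "/media/mionomeutente/PENNA USB/Files recuperati"
DEST = "/media/mionomeutente/PENNA USB/grandi"

def scan_folder(folder):
    # Check every jpg in one folder; each call runs in a separate
    # process, so each worker has its own independent GIL
    found = 0
    for name in os.listdir(folder):
        if not name.endswith(".jpg"):
            continue
        path = os.path.join(folder, name)
        try:
            with Image.open(path) as im:
                width, height = im.size
            if width > 350 and height > 350:
                shutil.copy(path, DEST)
                found += 1
        except Exception:
            pass  # truncated or corrupt image
    return found

if __name__ == "__main__":
    folders = [parent for parent, directories, filenames in os.walk(SRC)]
    with Pool(processes=4) as pool:
        counts = pool.map(scan_folder, folders)
    print("copied:", sum(counts))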
I wrote the same script in C, and the times are obviously far shorter; C supports multithreaded programming well, but does a higher-level alternative exist?
For example, is Go a valid language when I want to do asynchronous programming?

