Python Forum
Why the multithread does not reduce the execution time? - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Why the multithread does not reduce the execution time? (/thread-22684.html)



Why the multithread does not reduce the execution time? - Nicely - Nov-22-2019

Hello everyone,
Let me explain my situation; maybe someone can clear things up and fill a gap in my knowledge Confused .
I had to recover photos from a broken HDD, so I recovered the undamaged sectors and transferred them to an .img file.
Using Scalpel (on Linux Mint), I extracted more than 350GB of images, including many that were useless because they were too small.
Realizing the sheer amount of work, I figured I couldn't check 350GB of images by hand...
So I wrote a script, very quickly and quite badly, but it works and meets my needs:
in essence, I have 304 folders, each containing many jpg files.
The program looks for "large", working jpg images in each folder.
I am attaching the first code:
import os, time, shutil
from PIL import Image

all_files=[]
sext = []
temp = ""
print(time.strftime("%H|%M|%S"))

#ask the user which extensions to search for; in my case I only care about jpg, so I enter: .jpg
#after choosing the extensions, I type 0 and press enter
while True:
    sext.append(input("search for: "))
    if sext[-1] == "0":
        del sext[-1]
        break
sext = tuple(sext)

#walk the "Files recuperati" tree and find every file that ends with "sext", a tuple of extensions; in this case I only care about ".jpg"
for parent, directories, filenames in os.walk("/media/mionomeutente/PENNA USB/Files recuperati"):
    for x in filenames:
        if x.endswith(sext):
            fileDaAnalizzare = parent+'/'+x

            #open the file and check that the image is at least a certain size
            try:
                im = Image.open(fileDaAnalizzare)
                width, height = im.size
                if width > 350 and height > 350:
                    document_path = os.path.join(parent, x)
                    print(document_path)

                    #simply copy the image that meets my needs into the "grandi" folder
                    shutil.copy(document_path, '/media/mionomeutente/PENNA USB/grandi')

            except Exception:
                pass
            

print(time.strftime("%H|%M|%S"))
I could have stopped here: the program is very basic, it works, and it meets my needs. But then I asked myself:
if several threads were doing this instead of one, would the time drop significantly?
(remember, I have 350GB to go through...)
So, again quickly (and obviously badly written), I wrote the same program, modified to work with threads.
The code:
import os, threading, shutil, random
from PIL import Image

nThreadz = 0
threadz = []

while (nThreadz <= 0):
    nThreadz = int(input("number of threads: "))
all_files=[]
sext = []
temp = ""

while True:
    sext.append(input("search for: "))
    if sext[-1] == "0":
        del sext[-1]
        break
sext = tuple(sext)    



listMatch = []

for parent, directories, filenames in os.walk("/media/mionomeutente/PENNA USB/Files recuperati"):
    listMatch.append(parent)


print("currently, %d folders" % len(listMatch))

class scan(threading.Thread):

    def __init__(self, group=None, target=None, name=None, args=(), kwargs=None, daemon=None):
        threading.Thread.__init__(self, group=group, target=target, name=name, daemon=daemon)
        self.args = args
        self.kwargs = kwargs
        return
    def run(self):
        #print(self.args)
        global nThreadz, sext
        #the number of threads may not divide len(listMatch) evenly
        if self.args != (nThreadz-1):
            vadoDa = self.args*(len(listMatch)//nThreadz)
            vadoA = (self.args+1)*(len(listMatch)//nThreadz)
        else:
            #the last thread takes everything up to the end of the list
            #(slice ends are exclusive, so len(listMatch) includes the final folder)
            vadoDa = self.args*(len(listMatch)//nThreadz)
            vadoA = len(listMatch)
        #each thread searches its own share of the folders (304 in total)
        
        for percorso in listMatch[vadoDa:vadoA]:
            for parent, directories, filenames in os.walk(percorso):
                for x in filenames:
                    if x.endswith(sext):
                        fileDaAnalizzare = parent+'/'+x
                        #open the file and check that the image is at least a certain size
                        try:
                            im = Image.open(fileDaAnalizzare)
                            width, height = im.size

                            if width > 350 and height > 350:
                                document_path = os.path.join(parent, x)
                                if not os.path.exists('/media/mionomeutente/PENNA USB/grandi/'+x):
                                    shutil.copy(document_path, '/media/mionomeutente/PENNA USB/grandi')
                                else:
                                    #name clash: prefix a random number so nothing gets overwritten
                                    nomeRandom = str(random.randint(0, 1000000000)) + x
                                    shutil.copy(document_path, '/media/mionomeutente/PENNA USB/grandi/'+nomeRandom)

                        except Exception:
                            pass


for x in range(nThreadz):
    #args here is just the thread index (an int)
    threadz.append(scan(args=x))


for x in range(nThreadz):
    threadz[x].start()
The operation is identical to the previous script, except that I can choose the number of threads to "divide" the work between... or so I thought.
Actually I got very strange results:
as the number of threads increased, the number of processed files did not increase significantly, but... decreased: Dodgy
1 thread = 3181 elements in 30 seconds
2 threads = 648 elements in 30 seconds
3 threads = 764 elements in 30 seconds
304 threads = 166 elements in 30 seconds
I hope someone finds my topic interesting, and that this "mystery" can be cleared up for me ;)
(I certainly don't rule out the possibility that I made some mistakes)


RE: Why the multithread does not reduce the execution time? - Larz60+ - Nov-22-2019

For the main program, try running:
import cProfile
import os, time, shutil
from PIL import Image
 
def main():
    all_files=[]
    sext = []
    temp = ""
    print(time.strftime("%H|%M|%S"))
    
    #ask the user which extensions to search for; in my case I only care about jpg, so I enter: .jpg
    #after choosing the extensions, I type 0 and press enter
    while True:
        sext.append(input("search for: "))
        if sext[-1] == "0":
            del sext[-1]
            break
    sext = tuple(sext)
    
    #walk the "Files recuperati" tree and find every file that ends with "sext", a tuple of extensions; in this case I only care about ".jpg"
    for parent, directories, filenames in os.walk("/media/mionomeutente/PENNA USB/Files recuperati"):
        for x in filenames:
            if x.endswith(sext):
                fileDaAnalizzare = parent+'/'+x
    
                #open the file and check that the image is at least a certain size
                try:
                    im = Image.open(fileDaAnalizzare)
                    width, height = im.size
                    if width > 350 and height > 350:
                        document_path = os.path.join(parent, x)
                        print(document_path)

                        #simply copy the image that meets my needs into the "grandi" folder
                        shutil.copy(document_path, '/media/mionomeutente/PENNA USB/grandi')

                except Exception:
                    pass
                
    
    print(time.strftime("%H|%M|%S"))


if __name__ == '__main__':
    cProfile.run('main()')
And see if you can identify the bottleneck, then try the same with the second script.
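If the profile output is long, it helps to sort it; here is a minimal sketch with the stdlib pstats module, using a trivial stand-in for main():

```python
import cProfile
import io
import pstats

def work():
    # trivial CPU-bound stand-in for main()
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# sort by cumulative time and print only the top 10 entries
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```

For the scripts above, sorting by cumulative time should show quickly whether the time goes into Image.open, shutil.copy, or the directory walk.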


RE: Why the multithread does not reduce the execution time? - Nicely - Nov-23-2019

I understood the reason: I didn't know about the GIL in Python, which is really well explained by this GIF:
[Image: 1*wd0z1C75VsxD42QdKqCjpA.gif]
The question then is: does a language as simple as Python, but without the GIL limitation, exist?
(I'm not talking about multiprocessing, which in Python should work very well and bypasses the GIL limitation.)
I wrote the same script in C, and the times are obviously far lower; C supports multithreaded programming well, but does a higher-level alternative exist?
For example, is Go a valid language when I want to do asynchronous programming?