Hello everyone,
Let me explain my situation; maybe someone can clear things up and fill a gap in my knowledge.

I had to recover photos from a broken HDD, so I copied the undamaged sectors into an .img file.
Using Scalpel (on Linux Mint), I carved out more than 350 GB of images, including many I didn't need because they were too small.
Realizing the sheer amount of work, I figured I couldn't go through 350 GB of images by hand...
So I quickly wrote a script, quite badly, but it works and does what I need.
In essence, I have 304 folders, each containing many jpg files.
The program looks for "large", working jpg images in each folder.
I am attaching the first version of the code:
```python
import os, time, shutil
from PIL import Image

all_files = []
sext = []
temp = ""

print(time.strftime("%H|%M|%S"))

# Ask the user which extensions to search for; in my case I only care
# about jpgs, so I enter: .jpg
# After entering the extensions, type 0 and press Enter to stop
while True:
    sext.append(input("cerca: "))
    if sext[-1] == "0":
        del sext[-1]
        break
sext = tuple(sext)

# Walk the recovered-files folders and look at every file whose name
# ends with one of the extensions in "sext" (here, just ".jpg")
for parent, directories, filenames in os.walk("/media/mionomeutente/PENNA USB/Files recuperati"):
    for x in filenames:
        if x.endswith(sext):
            fileDaAnalizzare = parent + '/' + x
            # Open the file and check that the image is big enough
            try:
                im = Image.open(fileDaAnalizzare)
                width, height = im.size
                if width > 350 and height > 350:
                    document_path = os.path.join(parent, x)
                    print(document_path)
                    # Simply copy every image that matches my criteria
                    # into the "grandi" folder
                    shutil.copy(document_path, '/media/mionomeutente/PENNA USB/grandi')
            except Exception:
                # Corrupt/truncated files just get skipped
                pass

print(time.strftime("%H|%M|%S"))
```

I could have stopped here; the program is very basic, but it works and does what I need. Then I asked myself:
If several threads did this instead of one, would the time be significantly reduced?
(Remember, I have 350 GB to go through...)
So, again quickly (and obviously badly written), I rewrote the same program to work with threads.
The code:
```python
import os, threading, shutil, random
from PIL import Image

nThreadz = 0
threadz = []
while nThreadz <= 0:
    nThreadz = int(input("numero di thread: "))

sext = []
while True:
    sext.append(input("cerca: "))
    if sext[-1] == "0":
        del sext[-1]
        break
sext = tuple(sext)

# Collect every folder under the recovered-files directory
listMatch = []
for parent, directories, filenames in os.walk("/media/mionomeutente/PENNA USB/Files recuperati"):
    listMatch.append(parent)
print("attualmente, %d siti" % len(listMatch))

class scan(threading.Thread):
    def __init__(self, group=None, target=None, name=None, args=(), kwargs=None, daemon=None):
        threading.Thread.__init__(self, group=group, target=target, name=name, daemon=daemon)
        self.args = args
        self.kwargs = kwargs

    def run(self):
        global nThreadz, sext
        # The number of folders is not necessarily divisible by the number
        # of threads, so the last thread also takes the remainder
        if self.args != (nThreadz - 1):
            vadoDa = self.args * (len(listMatch) // nThreadz)
            vadoA = (self.args + 1) * (len(listMatch) // nThreadz)
        else:
            vadoDa = self.args * (len(listMatch) // nThreadz)
            vadoA = len(listMatch)  # go all the way to the end, last folder included
        # Each thread works on its own slice of the 304 folders
        for percorso in listMatch[vadoDa:vadoA]:
            for parent, directories, filenames in os.walk(percorso):
                for x in filenames:
                    if x.endswith(sext):
                        fileDaAnalizzare = parent + '/' + x
                        # Open the file and check the image size
                        try:
                            im = Image.open(fileDaAnalizzare)
                            width, height = im.size
                            if width > 350 and height > 350:
                                document_path = os.path.join(parent, x)
                                if not os.path.exists('/media/mionomeutente/PENNA USB/grandi/' + x):
                                    shutil.copy(document_path, '/media/mionomeutente/PENNA USB/grandi')
                                else:
                                    # Name collision: prefix a random number
                                    nomeRandom = str(random.randint(0, 1000000000)) + x
                                    shutil.copy(document_path, '/media/mionomeutente/PENNA USB/grandi/' + nomeRandom)
                        except Exception:
                            pass

for x in range(0, nThreadz):
    threadz.append(scan(args=x))
for x in range(0, nThreadz):
    threadz[x].start()
```

It works just like the previous script, except that I can choose the number of threads to "divide up" the work... or so I thought.
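To make the work-splitting step clearer, the index arithmetic from `run()` can be sketched on its own: each thread takes a contiguous slice of the folder list, and the last thread absorbs whatever is left over when the count doesn't divide evenly. (The function name below is mine, just for illustration.)

```python
def folder_slice(i, n_threads, n_folders):
    """Return (start, stop) slice bounds for thread i of n_threads.

    Threads 0 .. n_threads-2 each get n_folders // n_threads folders;
    the last thread takes its chunk plus the remainder.
    """
    chunk = n_folders // n_threads
    start = i * chunk
    # The last thread runs to the end of the list, remainder included
    stop = n_folders if i == n_threads - 1 else (i + 1) * chunk
    return start, stop
```

With 304 folders and 3 threads, for example, the chunk size is 101, so the slices are (0, 101), (101, 202), and (202, 304): every folder is covered exactly once.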
Actually, I got very strange results:
as the number of threads increased, the number of sifted files did not increase; it actually decreased:

1 thread = 3181 elements in 30 seconds
2 threads = 648 elements in 30 seconds
3 threads = 764 elements in 30 seconds
304 threads = 166 elements in 30 seconds
I hope someone finds my topic interesting and can clear up this "mystery" for me ;)
(I certainly don't rule out having made some mistakes.)