Python Forum
How to run multiple threads properly
#1
I am trying to run a piece of code that converts csv files to xlsx, adds headers and auto-fits columns. Each "block" checks the file name and then does the above for each file. There are 23 files with sizes from 14 kB to 437 MB. The process is sequential and takes 1449 seconds.

To speed things up, I wanted to use multithreading to process all files at the same time and, when they are done, proceed with the rest of the code. So far I have tried three approaches, but none of them gives the results I aimed for.

1.) The first approach uses Process. It doesn't wait for all processes to finish and just runs on; it basically runs over itself and doesn't work at all for what I had intended.

if __name__=='__main__':   

    p01 = Process(target = PR01)
    .
    .
    .

    p01.start()
    .
    .
    .

    p21.join()
    .
    .
    .
2.) The second approach uses Thread, gives error
Error:
CoInitialize has not been called., None, None)
so I added pythoncom.CoInitialize() to the PR functions [and also tried it in combination with pythoncom.CoUninitialize()]. Then it runs, but it doesn't seem to run simultaneously, although it does seem to wait for the threads to finish. This method, however, is about 3 times slower than running the code sequentially without functions:

p01 = Thread(name='PR01', target=PR01)
.
.
.

p01.start()
.
.
.

p01.join()
.
.
.
3.) The last approach was simply appending each to a list and then using a loop to start and join the threads either using Process or Thread :

Threads = []

Threads.append(Thread(name='PR01', target=PR01))
.
.
.

for x in Threads:
    x.start()

for x in Threads:
    x.join()
How can I get my 23 functions to run simultaneously, wait for all processes to finish (in essence, wait for the longest process to complete) ?

1631 seconds = sequentially with or without functions

6903 seconds = with Thread instead of Process
Reply
#2
import threading

def csv2xlsx(file_name):
    # process the csv
    # save the xlsx
    pass  # placeholder so the function body is valid

_ = [threading.Thread(target=csv2xlsx, args=(file_name,)).start() for file_name in files_list]
Something like that? I've never used the threading module, but I put this together after looking at the docs a minute ago. Hope this will work.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#3
(Dec-22-2017, 07:38 AM)wavic Wrote:
import threading

def csv2xlsx(file_name):
    # process the csv
    # save the xlsx
    pass  # placeholder so the function body is valid

_ = [threading.Thread(target=csv2xlsx, args=(file_name,)).start() for file_name in files_list]
Something like that? I've never used the threading module, but I put this together after looking at the docs a minute ago. Hope this will work.
Thanks for your reply. My functions run like this

def PR01():
    # process file

PR01()
PR02()
PR03()
PR04()
The reason why I am using this approach is that each function is somewhat unique, and each function gives me feedback on how long it took to run. The largest file takes 11 minutes to generate, for example. To clarify, in each function:

1.) the file might be renamed
2.) the file might be converted to XLSX
3.) the file might receive a header
4.) the file might be autofit
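The per-function timing feedback described above could also come from a small decorator rather than from code repeated in each function. This is only a minimal sketch: `timed` is a hypothetical helper, and the body of `PR01` is a stand-in for the real rename/convert/header/autofit work.

```python
import functools
import time


def timed(func):
    """Print how long each call to func takes (hypothetical helper)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so a failed file is still timed.
            elapsed = time.perf_counter() - start
            print("%s took %.2f seconds" % (func.__name__, elapsed))
    return wrapper


@timed
def PR01():
    # Stand-in for the real work (rename, convert, add header, autofit).
    time.sleep(0.1)
    return "PR01 done"


if __name__ == "__main__":
    PR01()
```

Because the wrapper keeps the original function name, the decorated functions can still be passed as a Thread target and will report under their own names.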
Reply
#4
(Dec-22-2017, 08:20 AM)cyberion1985 Wrote: Thanks for your reply. My functions run like this

def PR01():
    # process file

PR01()
PR02()
PR03()
PR04()
The reason why I am using this approach is that each function is somewhat unique, and each function gives me feedback on how long it took to run. The largest file takes 11 minutes to generate, for example. To clarify, in each function:

1.) the file might be renamed
2.) the file might be converted to XLSX
3.) the file might receive a header
4.) the file might be autofit

So each file gets processed with PR01 then passed on to PR02, then to PR03, etc.? If so, you have yourself a pipeline. (https://www.cise.ufl.edu/research/Parall...peline.htm)

If you're not doing a pipeline and your processing doesn't rely on global state of the program, you might consider using the subprocess module and do the work in individual processes instead of threads. You might also consider the multiprocessing module (https://docs.python.org/3/library/multiprocessing.html). I have never used it and have no idea if it would suit your needs.
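As a rough sketch of the pool-based route, here is one way it might look using the standard library's `concurrent.futures` (swapped in here for `multiprocessing` because both thread and process pools share one interface). The names `convert` and `run_all` and the file names are assumptions for illustration, not part of the original code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def convert(file_name):
    # Placeholder for the real CSV -> XLSX work; returns the output name.
    return file_name.replace(".csv", ".xlsx")


def run_all(file_names, executor_cls=ThreadPoolExecutor, max_workers=4):
    """Submit one job per file and block until every job has finished."""
    results = {}
    with executor_cls(max_workers=max_workers) as pool:
        futures = {pool.submit(convert, name): name for name in file_names}
        for future in as_completed(futures):
            # result() re-raises any exception that occurred in the worker.
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    print(run_all(["a.csv", "b.csv", "c.csv"]))
```

Passing `executor_cls=ProcessPoolExecutor` (also from `concurrent.futures`) would run the jobs in separate processes instead. On Windows that requires the `if __name__ == "__main__":` guard and worker functions defined at module top level.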
Reply
#5
You may have many functions which are doing something but... You somehow determine what to do with the file, right? Why not use one function and according to some conditions call the other functions to do the job. Also, you may use logging or just another file to store the execution time.

The structure of your program is important. It lets you make changes easily when necessary. Use one function to determine the conditions and call whatever function is needed from within it, then pass that function as the Thread's target argument.
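A minimal sketch of that structure might look like the following. The rules inside `process_file` and the helpers `rename` and `convert` are hypothetical; the real conditions would depend on the actual files.

```python
import logging
import threading
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")


def rename(file_name):
    # Hypothetical step: normalise the name to lower case.
    return file_name.lower()


def convert(file_name):
    # Hypothetical step: stand-in for the CSV -> XLSX conversion.
    return file_name.replace(".csv", ".xlsx")


def process_file(file_name, results):
    """Decide which steps a file needs, run them, and log the duration."""
    start = time.perf_counter()
    out = file_name
    if out != out.lower():      # assumed rule: rename upper-case names
        out = rename(out)
    if out.endswith(".csv"):    # assumed rule: only CSV files are converted
        out = convert(out)
    results[file_name] = out
    logging.info("%s -> %s in %.3f s", file_name, out,
                 time.perf_counter() - start)


def process_all(files_list):
    results = {}
    threads = [threading.Thread(target=process_file, args=(name, results))
               for name in files_list]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                # wait for every file before returning
    return results


if __name__ == "__main__":
    print(process_all(["A.CSV", "notes.txt"]))
```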
Reply
#6
Okay, so after spending some time adopting the above approach, I can finally reply.

This doesn't work:

Quote:
_ = [threading.Thread(target=csv2xlsx, args=(file_name,)).start() for file_name in files_list]

It might work as 

    for file_name in CSVFILES:
        threading.Thread(target=MAIN_CSV, args=(file_name,)).start()
...which I will go and test now :

main_INIT.py (main init and core script - will run the below scripts as threads)
main_process_PDF.py
main_process_CSV.py (processes the CSV files simultaneously with threads) 

So in essence :

#this is main_INIT.py
#do something

#run function from main_process_PDF.py
#run function from main_process_CSV.py
#is function from main_process_PDF.py done ? is function from main_process_CSV.py done ? If both = yes, continue, else = wait .

#do something
This doesn't work. In both cases, running it alone or inside a function, the script does not wait until all the CSV threads are done. It simply runs on to the next part of the script.

...then I also tried

    for file_name in CSVFILES:
        threading.Thread(target=MAIN_CSV, args=(file_name,)).start()
        threading.Thread(target=MAIN_CSV, args=(file_name,)).join()
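In the last attempt, each `threading.Thread(...)` call builds a new object, so `join()` is called on a thread that was never started (which raises `RuntimeError`), and the thread that was started is never waited on. A minimal sketch of the "wait for both, then continue" flow keeps the Thread objects and joins those same objects. Here `process_pdf` and `process_csv` are hypothetical stand-ins for the functions in main_process_PDF.py and main_process_CSV.py.

```python
import threading
import time

done = []


def process_pdf():
    time.sleep(0.1)          # stand-in for the main_process_PDF work
    done.append("pdf")


def process_csv():
    time.sleep(0.2)          # stand-in for the main_process_CSV work
    done.append("csv")


def main():
    # Keep references so the same objects can be started and joined.
    workers = [threading.Thread(target=process_pdf),
               threading.Thread(target=process_csv)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()             # blocks until that worker has finished
    # Only reached once both workers are done.
    return sorted(done)


if __name__ == "__main__":
    print(main())
```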
Reply
#7
from threading import Thread

import time

def PR01():
    k = 0
    while (k < 25000000):
        k = k + 1
        
def PR02():
    k = 0
    while (k < 25000000):
        k = k + 1        
    
# defining the time now in epoch format
def thetime():
    return int(time.time())
    
print "only start here <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"   

task_start = thetime()

k = 0
while (k < 25000000):
    k = k + 1
    
k = 0
while (k < 25000000):
    k = k + 1


task_end = "%05d" % (thetime() - task_start,)
print("original = " + str(task_end))            

task_start = thetime()
p01 = Thread(name='PR01', target=PR01)
p02 = Thread(name='PR02', target=PR02)

p01.start()
p01.join()

p02.start()
p02.join()

task_end = "%05d" % (thetime() - task_start,)
print("processed 1 = " + str(task_end))    

print "only done here <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
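In the test above, each thread is joined before the next one is started, so the two threads run one after the other. The sketch below contrasts that with starting every thread first and joining afterwards. It uses `time.sleep` as a stand-in for work that waits on I/O, and the function names are hypothetical. Note that CPU-bound pure-Python loops such as the counting loops above are serialized by CPython's global interpreter lock, so threads alone would not be expected to shorten them.

```python
import time
from threading import Thread


def wait_task():
    time.sleep(0.2)   # stand-in for I/O-bound work


def run_one_by_one(n):
    start = time.perf_counter()
    for _ in range(n):
        t = Thread(target=wait_task)
        t.start()
        t.join()      # waits before the next thread is even created
    return time.perf_counter() - start


def run_together(n):
    start = time.perf_counter()
    threads = [Thread(target=wait_task) for _ in range(n)]
    for t in threads:
        t.start()     # all threads are running before any join
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print("one by one: %.2f s" % run_one_by_one(3))
    print("together:   %.2f s" % run_together(3))
```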
Reply

