
Multithreading issue with output
My script works correctly when I run it with only one thread.
When I use 50 threads, the output in 'table' seems to have missing entries in each row. For example, the 'Average Volume' column sometimes has missing results for some index values, and the issue is sporadic!
I am sure that each thread writes to a different index, i.e. the threads don't overwrite each other!

table = pd.DataFrame(index=tickers, columns=some_columns)

def processData(q, table):
    while not q.empty():
        ticker = q.get()
        try:
            # ... a lot of code here ...
            table.loc[ticker, 'Shares Outstanding'] = sharesOutstanding
            table.loc[ticker, 'Average Volume'] = averageVolume
        except urllib.error.HTTPError:
            print(ticker, "doesn't exist on yahoo finance")
        except urllib.error.URLError:
            print(ticker, 'yahoo finance has issue')
    return True

num_threads = 50
q = Queue(maxsize=0)
for ticker in table.index:
    q.put(ticker)
for i in range(num_threads):
    worker = Thread(target=processData, args=(q, table))
    worker.start()
Is there any regularity in the missing data points? Are they dispersed sporadically across the row or are they clustered toward the end of the row?

I'm not certain, but I doubt DataFrames are threadsafe, which means one thread can interrupt another mid-write. That could be the issue. Based on the snippet I'm seeing, I imagine that interruption would affect the later fields that get filled in, such as Average Volume, which is the last one assigned in processData().
I think you are correct about DataFrame thread safety.


They stated:
Quote:As of pandas 0.11, pandas is not 100% thread safe. The known issues relate to the copy() method. If you are doing a lot of copying of DataFrame objects shared among threads, we recommend holding locks inside the threads where the data copying occurs.
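Following that advice, a minimal sketch of holding a lock inside each thread around the shared-frame writes might look like this (the tickers, values, and worker function here are illustrative, not from the original script):

```python
import threading
import pandas as pd

# Shared frame plus one lock that guards every write to it.
table = pd.DataFrame(index=['AAPL', 'MSFT'], columns=['Average Volume'])
table_lock = threading.Lock()

def worker(ticker, volume):
    # Fetching/processing can run concurrently; only the write is serialized.
    with table_lock:
        table.loc[ticker, 'Average Volume'] = volume

threads = [threading.Thread(target=worker, args=(t, v))
           for t, v in [('AAPL', 100), ('MSFT', 200)]]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The lock makes each `table.loc[...] = ...` assignment atomic with respect to the other threads, so a write in one thread can no longer be interrupted by a write in another.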

I can see that I don't use copy()! I only write to a specific location, and no other thread accesses that place.
Is there any way to fix this issue?
One way to fix it would be to wrap the DataFrame in a threadsafe wrapper. Look into threading.Lock objects in the standard library. You should be able to implement something along the lines of:

class LockingFrame(DataFrame):
    lock = threading.Lock()

    def access(self):
        with self.lock:
            [do stuff]
I'm sure someone else has encountered this issue too so there may be a "LockingFrame" out there already.
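I haven't found a ready-made LockingFrame, but a minimal runnable version of the idea, using composition instead of subclassing (the class and method names here are my own invention), could be:

```python
import threading
import pandas as pd

class LockingFrame:
    """Wrap a DataFrame so every write goes through one lock."""
    def __init__(self, frame):
        self.frame = frame
        self.lock = threading.Lock()

    def set_value(self, row, col, value):
        # Serialize all mutations of the wrapped frame.
        with self.lock:
            self.frame.loc[row, col] = value

table = LockingFrame(pd.DataFrame(index=['AAPL'], columns=['Average Volume']))
table.set_value('AAPL', 'Average Volume', 123)
```

Composition avoids the pitfalls of subclassing DataFrame (pandas operations often return plain DataFrames, which would silently drop the lock).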

I have helped someone previously divide a DataFrame into several smaller DataFrames for processing and then bring them back together later. It's been a while and I do not have that code readily available.
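That split-and-recombine approach can be sketched roughly as follows: each thread fills its own private DataFrame, so no locking is needed at all, and the pieces are joined with pd.concat() in the main thread. The fetch logic here is a stand-in for the real per-ticker work:

```python
import threading
import pandas as pd

def process_chunk(tickers, results, slot):
    # Each worker owns a private frame -- no shared writes, no lock needed.
    frame = pd.DataFrame(index=tickers, columns=['Average Volume'])
    for ticker in tickers:
        frame.loc[ticker, 'Average Volume'] = len(ticker)  # stand-in for the real fetch
    results[slot] = frame  # each thread writes to its own slot

tickers = ['AAPL', 'MSFT', 'GOOG', 'TSLA']
chunks = [tickers[:2], tickers[2:]]
results = [None] * len(chunks)
threads = [threading.Thread(target=process_chunk, args=(c, results, i))
           for i, c in enumerate(chunks)]
for t in threads:
    t.start()
for t in threads:
    t.join()
table = pd.concat(results)  # recombine in the main thread
```

Since the threads never touch a shared DataFrame, this sidesteps the thread-safety question entirely rather than serializing around it.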

How many rows are in the data set? Is it feasible to perform the operation without multithreading? Or perhaps describe the project in more detail and provide the full code of processData().


