multiprocessing - Printable Version

multiprocessing - srik - Apr-07-2018

I have a data set and need to compute statistics for one column, grouped by the unique values of other columns. I can do this with pandas groupby, but I want to use multiprocessing Pool.map instead. An example is below.

A B C
2 3 4
2 5 3
2 3 5
2 7 9
2 3 10
3 4 23
2 7 4

For each unique combination of A and B values, I need the mean of column C.

df.groupby(['A', 'B'])['C'].mean()

This works, but I want to use Pool.map() instead, and I can't work out how to approach it. Please give me some pointers.
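For reference, here is a minimal sketch of the groupby call applied to the sample table above. The DataFrame construction is an assumption about how the data is loaded; only the groupby line comes from the post.

```python
import pandas as pd

# Sample data copied from the table in the post.
df = pd.DataFrame({
    "A": [2, 2, 2, 2, 2, 3, 2],
    "B": [3, 5, 3, 7, 3, 4, 7],
    "C": [4, 3, 5, 9, 10, 23, 4],
})

# Mean of C for every unique (A, B) pair. The result is a Series
# indexed by the (A, B) pairs, sorted by key.
group_means = df.groupby(["A", "B"])["C"].mean()
print(group_means)
```

The sample has four unique (A, B) pairs, so the result has four rows; for example, the pair (2, 7) covers the C values 9 and 4, giving a mean of 6.5.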

RE: multiprocessing - woooee - Apr-07-2018

What exactly don't you understand, and what have you tried that didn't work?

RE: multiprocessing - srik - Apr-09-2018

Here is what I tried:

1) Get the unique combinations of A and B

ls1 = df[['A', 'B']].drop_duplicates().values.tolist()

2) Get the C values for every combination, as a list of lists

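# One boolean mask per unique (A, B) pair, built by comparing every row against that pair
s = [np.array([tuple(l) == tuple(t) for t in df[['A', 'B']].itertuples(index=False)]) for l in ls1]
# The C values selected by each mask
val_C = [list(df.loc[a1, 'C'].values) for a1 in s]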

3) Apply Pool.map to the list of lists.

My questions are:
1) Is this a good approach to the problem?
2) Step 2 takes a very long time. How can I optimize it?
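One possible way to speed up step 2 is to let groupby collect the C values per group in a single pass, instead of comparing every row against every unique pair, and then hand the resulting list of lists to Pool.map. The sketch below assumes the column names A, B and C from the example; the helper names collect_groups, mean_of and parallel_group_means are hypothetical, not part of pandas or multiprocessing.

```python
from multiprocessing import Pool

import pandas as pd


def collect_groups(df):
    """Return (keys, values): one (A, B) key and one list of C values per group."""
    grouped = df.groupby(["A", "B"])["C"].apply(list)
    return list(grouped.index), list(grouped.values)


def mean_of(values):
    """Worker run in each process: mean of one group's C values."""
    return sum(values) / len(values)


def parallel_group_means(df, processes=2):
    """Map mean_of over the per-group lists with Pool.map."""
    keys, values = collect_groups(df)
    with Pool(processes=processes) as pool:
        means = pool.map(mean_of, values)
    return dict(zip(keys, means))


if __name__ == "__main__":
    # Sample data copied from the first post.
    df = pd.DataFrame({
        "A": [2, 2, 2, 2, 2, 3, 2],
        "B": [3, 5, 3, 7, 3, 4, 7],
        "C": [4, 3, 5, 9, 10, 23, 4],
    })
    print(parallel_group_means(df))
```

The worker is defined at module level so that it can be pickled and sent to the worker processes, and the Pool is created under the `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that spawn new interpreters. For a statistic as cheap as a mean, the cost of starting processes and pickling the lists will likely outweigh any gain over plain groupby; Pool.map is more likely to pay off when the per-group function is expensive.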