Python Forum

Full Version: multiprocessing
I have a data set and need to compute statistics for one column, grouped by the unique values of other columns. I can do this with pandas groupby, but I want to use multiprocessing's Pool.map instead. An example is below.

A B C
2 3 4
2 5 3
2 3 5
2 7 9
2 3 10
3 4 23
2 7 4

For each unique combination of A and B, I need the mean of column C.
df.groupby(['A', 'B'])['C'].mean()
works, but I want to use Pool.map() instead. I can't work out how to approach it. Please give me some pointers.
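For reference, here is a small sketch of the groupby baseline on the sample data above, so that any Pool.map version has a known result to compare against. The variable names are only illustrative.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "A": [2, 2, 2, 2, 2, 3, 2],
    "B": [3, 5, 3, 7, 3, 4, 7],
    "C": [4, 3, 5, 9, 10, 23, 4],
})

# Mean of C for every unique (A, B) combination.
means = df.groupby(["A", "B"])["C"].mean()
print(means)
# Groups: (2, 3) -> 19/3, (2, 5) -> 3.0, (2, 7) -> 6.5, (3, 4) -> 23.0
```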
What exactly don't you understand, and what have you tried that didn't work?
Here is what I tried:

1) Get the unique combinations of A and B:
ls1 = df[['A', 'B']].drop_duplicates().values.tolist()
2) Collect the C values for each combination in a list:
s = [np.array([tuple(l) == tuple(t) for t in df[['A', 'B']].itertuples(index = False)]) for l in ls1] 
val_C = [list(df.loc[a1, 'C'].values) for a1 in s]
3) Apply Pool.map to the resulting list of lists.

My questions are:
1) Is this a good approach to solving it?
2) Step 2 takes a very long time. How can I optimize it?
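One possible way to speed up step 2 is sketched below. It is not the only approach, and it assumes the sample data from the first post. The helper names split_groups, mean_of_group and grouped_means are hypothetical. Instead of building one boolean mask per combination, which rescans the whole frame each time, it lets groupby split column C in a single pass and hands only the per-group lists to Pool.map.

```python
from multiprocessing import Pool

import pandas as pd


def split_groups(df):
    """Return [((a, b), [C values]), ...] using one groupby pass (replaces steps 1 and 2)."""
    return [(key, grp.tolist()) for key, grp in df.groupby(["A", "B"])["C"]]


def mean_of_group(item):
    """Worker for Pool.map: takes one (key, values) pair and returns (key, mean)."""
    key, values = item
    return key, sum(values) / len(values)


def grouped_means(df, processes=2):
    """Step 3: compute each group's mean in a worker process."""
    with Pool(processes=processes) as pool:
        return dict(pool.map(mean_of_group, split_groups(df)))


if __name__ == "__main__":
    # Sample data copied from the table in the question.
    df = pd.DataFrame({
        "A": [2, 2, 2, 2, 2, 3, 2],
        "B": [3, 5, 3, 7, 3, 4, 7],
        "C": [4, 3, 5, 9, 10, 23, 4],
    })
    print(grouped_means(df))
```

The worker is defined at module level so it can be pickled, and the Pool is created under the `__main__` guard. For a statistic as cheap as a mean, the cost of starting processes and copying the lists will probably outweigh any gain, so this layout is more likely to pay off when the per-group function is expensive.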