multiprocessing

srik · (This post was last modified: Apr-07-2018, 12:29 PM by srik.)

I have a data set and need to compute statistics of one column within groups formed by the unique values of other columns. I can do this with pandas groupby, but I want to use multiprocessing's Pool.map instead. An example is below.

A B C
2 3 4
2 5 3
2 3 5
2 7 9
2 3 10
3 4 23
2 7 4

For each unique combination of A and B, I need the mean of column C.

df.groupby(['A', 'B'])['C'].mean()

worked, but I want to use Pool.map(). I can't work out how to approach it. Any pointers would be appreciated.

woooee · (This post was last modified: Apr-07-2018, 06:30 PM by woooee.)

What exactly don't you understand, and what have you tried that didn't work?

srik · Apr-09-2018, 07:06 AM

The way I tried is as follows:

1) get unique combinations of A and B

ls1 = df[['A', 'B']].drop_duplicates().values.tolist()

2) get C values for every combination in a list

# one boolean row mask per unique (A, B) pair
s = [np.array([tuple(l) == tuple(t) for t in df[['A', 'B']].itertuples(index=False)]) for l in ls1]
# C values selected by each mask
val_C = [list(df.loc[a1, 'C'].values) for a1 in s]

3) Apply Pool.map to the list of lists.
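
Step 3 could look roughly like the sketch below. It is only an illustration, not a tested recommendation: ls1 and val_C are written out by hand from the example table instead of being built by steps 1 and 2, parallel_means is a made-up helper name, and statistics.fmean stands in for whatever per-group function is really needed.

```python
from multiprocessing import Pool
from statistics import fmean

# Unique (A, B) pairs and the matching C values, written out by hand
# from the example table; in the real script these come from steps 1 and 2.
ls1 = [[2, 3], [2, 5], [2, 7], [3, 4]]
val_C = [[4, 5, 10], [3], [9, 4], [23]]


def parallel_means(groups, processes=2):
    """Return the mean of each inner list, computed by a pool of workers."""
    with Pool(processes=processes) as pool:
        # map() keeps the input order, so result i belongs to groups[i].
        return pool.map(fmean, groups)


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this part
    # when they import the main module.
    for key, mean_c in zip(ls1, parallel_means(val_C)):
        print(key, mean_c)
```

For something as cheap as a mean, starting the workers and pickling the lists will probably cost more than the calculation itself, so Pool.map is more likely to pay off when the per-group function is expensive.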

My questions are:
1) Is this a good approach to the problem?
2) Step 2 takes a very long time. How can I optimize it?
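
On the second question, a likely reason step 2 is slow is that it walks every row of the frame once per unique pair and compares tuples in Python. One possible alternative, shown here as a sketch on the example table and not as a measured fix, is to let a single groupby pass collect the C values. The df below is rebuilt from the example, and sort=False is used so the groups come out in order of first appearance, the same order drop_duplicates gives for ls1.

```python
import pandas as pd

# The example table from the first post.
df = pd.DataFrame({
    "A": [2, 2, 2, 2, 2, 3, 2],
    "B": [3, 5, 3, 7, 3, 4, 7],
    "C": [4, 3, 5, 9, 10, 23, 4],
})

# One pass over the frame; sort=False keeps groups in first-appearance order.
grouped = df.groupby(["A", "B"], sort=False)["C"]

ls1 = []    # unique (A, B) pairs, same role as in step 1
val_C = []  # C values for each pair, same role as in step 2
for key, values in grouped:
    ls1.append(list(key))
    val_C.append(values.tolist())
```

The resulting val_C is a plain list of lists, so it can be handed to Pool.map in the same way as before.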
