Python Forum
multiprocessing
#1
I have a data set and need to compute statistics for one column, grouped by the unique values of other columns. I can do this with pandas groupby, but I want to use multiprocessing's Pool.map instead. An example is below.

A B C
2 3 4
2 5 3
2 3 5
2 7 9
2 3 10
3 4 23
2 7 4

For each unique combination of A and B, I need the mean of column C.
df.groupby(['A', 'B'])['C'].mean()
works, but I want to use Pool.map(). I can't figure out how to approach it. Any pointers would be appreciated.
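For reference, the groupby baseline can be reproduced on the sample data above. This is a minimal sketch, assuming pandas is installed and the columns are named A, B and C as in the table; the variable name `baseline` is only illustrative.

```python
import pandas as pd

# Sample data copied from the table in the post.
df = pd.DataFrame({
    "A": [2, 2, 2, 2, 2, 3, 2],
    "B": [3, 5, 3, 7, 3, 4, 7],
    "C": [4, 3, 5, 9, 10, 23, 4],
})

# Mean of C for every unique (A, B) combination.
baseline = df.groupby(["A", "B"])["C"].mean()
print(baseline)
```

Any Pool.map version should produce the same per-group means as `baseline`, which makes it a convenient correctness check.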
#2
What exactly don't you understand, and what have you tried that didn't work?
#3
This is what I tried:

1) Get the unique combinations of A and B:
ls1 = df[['A', 'B']].drop_duplicates().values.tolist()
2) Build a list of the C values for each combination:
s = [np.array([tuple(l) == tuple(t) for t in df[['A', 'B']].itertuples(index=False)]) for l in ls1]
val_C = [list(df.loc[a1, 'C'].values) for a1 in s]
3) Apply Pool.map to the resulting list of lists.

My questions are:
1) Is this a good approach?
2) Step 2 takes a very long time. How can I optimize it?
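One possible way to speed up step 2 is to let groupby do the splitting in a single pass, rather than rescanning every row once per (A, B) combination, and then hand the per-group value lists to Pool.map. The following is only a sketch under assumptions: the helper names `split_by_group` and `mean_of_group` are hypothetical, the sample table stands in for the real data, and two worker processes are chosen arbitrarily.

```python
from multiprocessing import Pool

import pandas as pd


def split_by_group(df, keys, value):
    """Return [(group_key, list_of_values), ...] using one groupby pass."""
    return [(key, grp.tolist()) for key, grp in df.groupby(keys)[value]]


def mean_of_group(item):
    """Worker function; defined at module level so Pool can pickle it."""
    key, values = item
    return key, sum(values) / len(values)


if __name__ == "__main__":
    # Sample data from the first post (stand-in for the real data set).
    df = pd.DataFrame({
        "A": [2, 2, 2, 2, 2, 3, 2],
        "B": [3, 5, 3, 7, 3, 4, 7],
        "C": [4, 3, 5, 9, 10, 23, 4],
    })

    tasks = split_by_group(df, ["A", "B"], "C")

    # Each task is one (key, values) pair; map returns (key, mean) pairs.
    with Pool(processes=2) as pool:
        results = dict(pool.map(mean_of_group, tasks))

    print(results)
```

For something as cheap as a mean, the cost of starting processes and pickling the data will probably outweigh any gain over plain groupby. A pool is more likely to pay off when the per-group computation is expensive.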