Python Forum

How can I multithread to optimize a groupby task: - davisc4468 - Jun-30-2023

I wrote a program that aggregates a large DataFrame by a single variable, userid. It runs a groupby to calculate the mean, min, and max of 10 variables for each userid. Below is a stand-in for that code: it first builds the DataFrame, then aggregates by userid. The code ran in 20 minutes. I would like to speed it up with multithreading.

from datetime import datetime

import numpy as np
import random
import pandas as pd
 
print('Initial time:',datetime.now().strftime("%H:%M:%S"))

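def fun_user_id(start, end, step):
    # Evenly spaced values from start to end (spacing = step), rounded to
    # whole numbers so that neighbouring rows share a userid.
    num = np.linspace(start, end,(end-start)
                      *int(1/step)+1).tolist()
    return [round(i, 0) for i in num]

def fun_rand_num():
    # 100,000,000 random integers between 300 and 800 (both inclusive).
    return list(map(lambda x: random.randint(300,800), range(1, 100000001)))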

userid=fun_user_id(1,100000001,.5)
var1=fun_rand_num()
var2=fun_rand_num()
var3=fun_rand_num()
var4=fun_rand_num()
var5=fun_rand_num()
var6=fun_rand_num()
var7=fun_rand_num()
var8=fun_rand_num()
var9=fun_rand_num()
var10=fun_rand_num()


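# zip() stops at the shortest input: userid holds 200,000,001 values and each
# var list holds 100,000,000, so only the first 100,000,000 userids are used.
df = pd.DataFrame(list(zip(userid,var1, var2,var3,var4,var5,var6,var7,var8,var9,var10)),
               columns =['userid','var1', 'var2','var3','var4','var5','var6', 'var7','var8','var9','var10'])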

varlistdic= {"var1" : ["mean","max","min"], 
             "var2" : ["mean","max","min"],
             "var3" : ["mean","max","min"],
             "var4" : ["mean","max","min"],
             "var5" : ["mean","max","min"],
             "var6" : ["mean","max","min"], 
             "var7" : ["mean","max","min"],
             "var8" : ["mean","max","min"],
             "var9" : ["mean","max","min"],
             "var10" : ["mean","max","min"], 
             }

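# Aggregate every variable per userid.
gr=df.groupby(['userid'])
df_sum=gr.agg(varlistdic)
# Flatten the (variable, statistic) column MultiIndex into names like var1_mean.
df_sum=df_sum.pipe(lambda x: x.set_axis(x.columns.map('_'.join),axis=1))
# Turn userid from the index back into a regular column.
df_sum.reset_index(inplace=True)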

print('End Time:',datetime.now().strftime("%H:%M:%S"))
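
One possible way to spread the aggregation step across threads is sketched below. It is only an illustration under stated assumptions, not a tested fix for the 20-minute run: the rows are split so that every userid falls into exactly one part, each part goes through the same groupby/agg in a thread pool, and the partial results are concatenated. The helper names `aggregate_part` and `parallel_aggregate`, the `n_parts` parameter, and the small synthetic frame are all hypothetical and stand in for the real data. Threads only help where pandas releases the GIL during the aggregation, so any speedup should be measured; swapping in `ProcessPoolExecutor` is the usual alternative if threads bring no gain.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

# Hypothetical small stand-in for the 100-million-row frame in the post.
rng = np.random.default_rng(0)
n_rows = 10_000
df = pd.DataFrame({"userid": rng.integers(1, 501, size=n_rows)})
var_cols = [f"var{i}" for i in range(1, 11)]
for col in var_cols:
    df[col] = rng.integers(300, 801, size=n_rows)

# Same statistics as varlistdic in the post.
agg_spec = {col: ["mean", "max", "min"] for col in var_cols}


def aggregate_part(part):
    """Run the post's groupby/agg on one slice of the rows."""
    out = part.groupby("userid").agg(agg_spec)
    # Flatten (variable, statistic) column pairs into names like var1_mean.
    out.columns = ["_".join(pair) for pair in out.columns]
    return out


def parallel_aggregate(frame, n_parts=4):
    """Partition rows by userid, aggregate the parts in threads, then combine."""
    # userid modulo n_parts sends all rows of one user to the same part,
    # so no group is ever split between two workers.
    bucket = frame["userid"].to_numpy() % n_parts
    parts = [frame[bucket == k] for k in range(n_parts)]
    parts = [p for p in parts if not p.empty]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        results = list(pool.map(aggregate_part, parts))
    return pd.concat(results).sort_index().reset_index()


df_sum = parallel_aggregate(df)
print(df_sum.head())
```

Because the split is by userid rather than by row position, the partial results never overlap and a plain `pd.concat` is enough; splitting by row ranges would need a second pass to merge users that straddle two chunks.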