Jun-30-2023, 02:45 PM
I created a program to aggregate a large dataframe over one variable, userid. The program runs a groupby to calculate the mean, min, and max of 10 variables for each userid. I've enclosed a proxy for this code below. First, the code creates the dataframe; second, it aggregates over userid. The code ran in 20 minutes. I would like to optimize it by multithreading.
from datetime import datetime
import numpy as np
import random
import pandas as pd

print('Initial time:', datetime.now().strftime("%H:%M:%S"))

# Userid values: evenly spaced from start to end, rounded to whole numbers so ids repeat.
def fun_user_id(start, end, step):
    num = np.linspace(start, end, (end - start) * int(1 / step) + 1).tolist()
    return [round(i, 0) for i in num]

# 100 million random integers between 300 and 800.
def fun_rand_num():
    return list(map(lambda x: random.randint(300, 800), range(1, 100000001)))

userid = fun_user_id(1, 100000001, .5)
var1 = fun_rand_num()
var2 = fun_rand_num()
var3 = fun_rand_num()
var4 = fun_rand_num()
var5 = fun_rand_num()
var6 = fun_rand_num()
var7 = fun_rand_num()
var8 = fun_rand_num()
var9 = fun_rand_num()
var10 = fun_rand_num()

df = pd.DataFrame(list(zip(userid, var1, var2, var3, var4, var5, var6, var7, var8, var9, var10)),
                  columns=['userid', 'var1', 'var2', 'var3', 'var4', 'var5', 'var6',
                           'var7', 'var8', 'var9', 'var10'])

# Aggregation spec: mean, max and min of each variable per userid.
varlistdic = {"var1": ["mean", "max", "min"],
              "var2": ["mean", "max", "min"],
              "var3": ["mean", "max", "min"],
              "var4": ["mean", "max", "min"],
              "var5": ["mean", "max", "min"],
              "var6": ["mean", "max", "min"],
              "var7": ["mean", "max", "min"],
              "var8": ["mean", "max", "min"],
              "var9": ["mean", "max", "min"],
              "var10": ["mean", "max", "min"],
              }

gr = df.groupby(['userid'])
df_sum = gr.agg(varlistdic)
df_sum = df_sum.pipe(lambda x: x.set_axis(x.columns.map('_'.join), axis=1))
df_sum.reset_index(inplace=True)

print('End Time:', datetime.now().strftime("%H:%M:%S"))
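For reference, one direction I was considering is a process-based split rather than plain threads, since the groupby/agg work is CPU-bound in pandas and Python threads are limited by the GIL. This is only a rough sketch built on the df and the mean/max/min spec from the code above; the helper names (agg_chunk, parallel_agg), the bucketing by userid modulo n_workers, and the choice of 4 workers are illustrative and not tested at this scale.

from concurrent.futures import ProcessPoolExecutor
import pandas as pd

def agg_chunk(chunk):
    # Same aggregation as above, applied to one chunk of rows.
    spec = {f"var{i}": ["mean", "max", "min"] for i in range(1, 11)}
    out = chunk.groupby("userid").agg(spec)
    out.columns = out.columns.map("_".join)
    return out.reset_index()

def parallel_agg(df, n_workers=4):
    # Put every row for a given userid into the same bucket so each id is
    # aggregated by exactly one worker, then combine the partial results.
    buckets = df["userid"].astype(int) % n_workers
    chunks = [df[buckets == i] for i in range(n_workers)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(agg_chunk, chunks))
    return pd.concat(parts, ignore_index=True)

# Usage (assuming df from the code above):
# df_sum_parallel = parallel_agg(df, n_workers=4)

Two caveats I'm aware of: on platforms that start workers with "spawn" (Windows, newer macOS), the script body would need an if __name__ == "__main__": guard so the dataframe construction isn't re-run in every worker, and shipping chunks of a 100-million-row frame between processes has real pickling overhead, so this split is not guaranteed to beat the single-process groupby.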