Python Forum
How can I multithread to optimize a groupby task:
#1
I wrote a program that aggregates a large dataframe by a single variable, userid. It runs a groupby to calculate the mean, min and max of 10 variables for each userid. I've enclosed a proxy for this code: first it creates the dataframe, then it aggregates over userid. The code ran in 20 minutes. I would like to speed it up with multithreading.

from datetime import datetime

import numpy as np
import random
import pandas as pd
 
print('Initial time:',datetime.now().strftime("%H:%M:%S"))

def fun_user_id(start, end, step):
    num = np.linspace(start, end,(end-start)
                      *int(1/step)+1).tolist()
    return [round(i, 0) for i in num]

def fun_rand_num():
    return list(map(lambda x: random.randint(300,800), range(1, 100000001)))

userid=fun_user_id(1,100000001,.5)
var1=fun_rand_num()
var2=fun_rand_num()
var3=fun_rand_num()
var4=fun_rand_num()
var5=fun_rand_num()
var6=fun_rand_num()
var7=fun_rand_num()
var8=fun_rand_num()
var9=fun_rand_num()
var10=fun_rand_num()


df = pd.DataFrame(list(zip(userid,var1, var2,var3,var4,var5,var6,var7,var8,var9,var10)),
               columns =['userid','var1', 'var2','var3','var4','var5','var6', 'var7','var8','var9','var10'])

varlistdic= {"var1" : ["mean","max","min"], 
             "var2" : ["mean","max","min"],
             "var3" : ["mean","max","min"],
             "var4" : ["mean","max","min"],
             "var5" : ["mean","max","min"],
             "var6" : ["mean","max","min"], 
             "var7" : ["mean","max","min"],
             "var8" : ["mean","max","min"],
             "var9" : ["mean","max","min"],
             "var10" : ["mean","max","min"], 
             }

gr=df.groupby(['userid'])
df_sum=gr.agg(varlistdic)
df_sum=df_sum.pipe(lambda x: x.set_axis(x.columns.map('_'.join),axis=1))
df_sum.reset_index(inplace=True)

print('End Time:',datetime.now().strftime("%H:%M:%S"))
Gribouillis wrote Jun-30-2023, 03:43 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
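One possible way to split the groupby in the post across threads is sketched below. It is a sketch under assumptions, not a tested answer for the full-size data: the frame is a small NumPy-generated stand-in for the 100-million-row proxy, and `aggregate`, `aggregate_in_threads` and `n_workers` are hypothetical names introduced here. Rows are bucketed by `userid % n_workers`, so every userid lands in exactly one bucket and the per-bucket results can simply be concatenated. Whether threads give any real speedup depends on how much of the groupby work runs outside the GIL, so it should be timed against the single-pass version; swapping in a process pool is the usual alternative if threads do not help.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

# Small stand-in for the large frame in the post (sizes are assumptions).
rng = np.random.default_rng(0)
n_rows = 10_000
value_cols = [f"var{i}" for i in range(1, 11)]
df = pd.DataFrame(rng.integers(300, 801, size=(n_rows, 10)), columns=value_cols)
df.insert(0, "userid", rng.integers(1, 501, size=n_rows))

# Same statistics as varlistdic in the post, built with a comprehension.
agg_spec = {col: ["mean", "max", "min"] for col in value_cols}


def aggregate(frame):
    """Group one frame by userid and flatten the (column, stat) header."""
    out = frame.groupby("userid").agg(agg_spec)
    out.columns = ["_".join(pair) for pair in out.columns]
    return out


def aggregate_in_threads(frame, n_workers=4):
    """Hypothetical helper: bucket rows by userid, aggregate buckets in threads."""
    # userid % n_workers keeps all rows of one user in the same bucket,
    # so the partial results never need a second merge step.
    buckets = [part for _, part in frame.groupby(frame["userid"] % n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(aggregate, buckets))
    return pd.concat(parts).sort_index().reset_index()


df_sum = aggregate_in_threads(df)
```

Comparing `df_sum` with `aggregate(df).reset_index()` on a small sample is a cheap way to confirm that the split did not change the results before trying it on the full data.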
Messages In This Thread
How can I multithread to optimize a groupby task: - by davisc4468 - Jun-30-2023, 02:45 PM
