Python Forum
How can I multithread to optimize a groupby task?
#1
I wrote a program that aggregates a large dataframe by one variable, userid. It runs a groupby to compute the mean, min and max of 10 variables for each userid. I've included a proxy for the code below: it first creates the dataframe, then aggregates over userid. The code takes 20 minutes to run. I would like to speed it up with multithreading.

from datetime import datetime

import numpy as np
import random
import pandas as pd
 
print('Initial time:',datetime.now().strftime("%H:%M:%S"))

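def fun_user_id(start, end, step):
    # Evenly spaced values from start to end (spacing = step when 1/step is an integer),
    # each rounded to a whole number, so ids repeat across rows.
    num = np.linspace(start, end, (end - start) * int(1 / step) + 1).tolist()
    return [round(i, 0) for i in num]

def fun_rand_num():
    # 100,000,000 random integers in [300, 800], built one at a time in Python.
    return [random.randint(300, 800) for _ in range(100000000)]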

userid=fun_user_id(1,100000001,.5)
var1=fun_rand_num()
var2=fun_rand_num()
var3=fun_rand_num()
var4=fun_rand_num()
var5=fun_rand_num()
var6=fun_rand_num()
var7=fun_rand_num()
var8=fun_rand_num()
var9=fun_rand_num()
var10=fun_rand_num()


df = pd.DataFrame(list(zip(userid,var1, var2,var3,var4,var5,var6,var7,var8,var9,var10)),
               columns =['userid','var1', 'var2','var3','var4','var5','var6', 'var7','var8','var9','var10'])

varlistdic= {"var1" : ["mean","max","min"], 
             "var2" : ["mean","max","min"],
             "var3" : ["mean","max","min"],
             "var4" : ["mean","max","min"],
             "var5" : ["mean","max","min"],
             "var6" : ["mean","max","min"], 
             "var7" : ["mean","max","min"],
             "var8" : ["mean","max","min"],
             "var9" : ["mean","max","min"],
             "var10" : ["mean","max","min"], 
             }

gr=df.groupby(['userid'])
df_sum=gr.agg(varlistdic)
df_sum=df_sum.pipe(lambda x: x.set_axis(x.columns.map('_'.join),axis=1))
df_sum.reset_index(inplace=True)

print('End Time:',datetime.now().strftime("%H:%M:%S"))
Gribouillis wrote Jun-30-2023, 03:43 PM:
Please post all code, output and errors (in its entirety) between their respective tags. Refer to BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
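
One possible direction, given only as a rough sketch under stated assumptions: the ten variables are aggregated independently, so the column list can be split into chunks, each chunk aggregated in its own thread, and the partial results joined on userid. The helper names (make_frame, flatten, agg_single, agg_threaded), the worker count and the small data sizes below are all made up for illustration and are not from the original post. Whether threads help at all depends on how much of the groupby work pandas does outside the GIL, so this would need to be timed on the real data. The sketch also builds the test data with NumPy instead of Python lists, since building the lists may account for a large share of the 20 minutes.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

AGGS = ["mean", "max", "min"]


def make_frame(n_rows=100_000, n_users=5_000, n_vars=10, seed=0):
    """Small stand-in for the frame in the post (sizes are illustrative)."""
    rng = np.random.default_rng(seed)
    data = {"userid": rng.integers(1, n_users + 1, size=n_rows)}
    for i in range(1, n_vars + 1):
        # Upper bound is exclusive, so 801 gives values in [300, 800].
        data[f"var{i}"] = rng.integers(300, 801, size=n_rows)
    return pd.DataFrame(data)


def flatten(df_agg):
    """Turn ('var1', 'mean') column tuples into 'var1_mean'."""
    out = df_agg.copy()
    out.columns = ["_".join(col) for col in out.columns]
    return out


def agg_single(df, value_cols):
    """Baseline: one groupby over all value columns."""
    return flatten(df.groupby("userid")[value_cols].agg(AGGS)).reset_index()


def agg_threaded(df, value_cols, n_workers=4):
    """Aggregate contiguous chunks of columns in separate threads."""
    size = -(-len(value_cols) // n_workers)  # ceiling division
    chunks = [value_cols[i:i + size] for i in range(0, len(value_cols), size)]

    def work(cols):
        # Each thread only reads df; the userid grouping is recomputed per chunk.
        return flatten(df.groupby("userid")[cols].agg(AGGS))

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(work, chunks))

    # Every part is indexed by the same sorted userid values, so a
    # column-wise concat lines the rows up.
    return pd.concat(parts, axis=1).reset_index()


if __name__ == "__main__":
    from time import perf_counter

    df = make_frame()
    cols = [f"var{i}" for i in range(1, 11)]
    for fn in (agg_single, agg_threaded):
        t0 = perf_counter()
        result = fn(df, cols)
        print(fn.__name__, result.shape, f"{perf_counter() - t0:.3f}s")
```

Each thread repeats the grouping on userid, so with few columns the duplicated work can cancel out any gain. If threads do not help, the same chunking could be tried with a process pool, at the cost of copying the data to each worker.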