Python Forum
Suggestion on how to speed up this code? - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Suggestion on how to speed up this code? (/thread-42072.html)



Suggestion on how to speed up this code? - sawtooth500 - May-04-2024

tester_list = []
def day_test(df_row):
    day_start_time = time.time()
    tester = completedf[
        (completedf['date'] == unique_dates[df_row['b_index'].iloc[0]])
        & (completedf['volwagroup'] == df_row.index[0][0])
        & (completedf['int_dur'] == df_row.index[0][1])
        & (completedf['int_calc'] == df_row.index[0][2])
        & (completedf['stopgroup'] == df_row.index[0][3])
        & (completedf['daylossgrp'] == df_row.index[0][4])
        & (completedf['timegroup'] == df_row.index[0][5])
    ]
    if tester.empty:
        ror = 0
    else:
        ror = tester['%ROR'].iloc[0]
    print('Finished day testing ' + unique_dates[df_row['b_index'].iloc[0]] + ' sample_dur: ' + str(df_row['sample_dur'].iloc[0]) + ' @ ' + datetime.now().strftime(progressformat) + ' - Execution time (HH:MM:SS.xx) : ' + timer(day_start_time))
    tester_list.append((df_row['sample_dur'].iloc[0], ror))


backtestdf_complete.groupby(['b_index', 'sample_dur']).apply(day_test)
test_df = pd.DataFrame(tester_list, columns=['day_dur', '%ROR'])
So in the above code I do a groupby operation on backtestdf_complete (a pandas dataframe) and use .apply() to run the function day_test on each group. The day_test function runs a boolean filtering expression on the dataframe completedf, and I append two values from the filtered dataframe to a list, which I later convert into another dataframe called test_df.

Right now, for my testing sample, completedf has about 1.2 million rows and this function runs about 10,000 times in the .apply() - it takes about an hour and 15 minutes. FYI, for my boolean indexing each filter returns only a single, non-sequential row from completedf. Basically I'm picking out roughly 10,000 rows from a 1.2-million-row dataframe, and the rows are neither sequential nor at any fixed interval.

In another version of this I've tried parallelizing it using joblib, but that's actually a bit slower - each call executes fast, but it gets made about 10,000 times, so with the joblib overhead it ends up slightly slower than running it single-threaded. Single-threaded takes about 0.2-0.25 seconds per call; with joblib it's more like 0.25-0.3 seconds.
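The joblib version isn't shown in the post; a rough sketch of what a per-group dispatch might look like is below (day_test_ret is a hypothetical variant that returns its result, since joblib worker processes don't share the module-level tester_list). With each task taking only ~0.2 s, the scheduling and pickling overhead never gets amortized, which matches the timings above.

from joblib import Parallel, delayed

def day_test_ret(df_row):
    # Hypothetical variant of day_test that returns its result instead of
    # appending to tester_list (joblib workers run in separate processes,
    # so appends to a module-level list would be lost).
    key = df_row.index[0]
    tester = completedf[
        (completedf['date'] == unique_dates[df_row['b_index'].iloc[0]])
        & (completedf['volwagroup'] == key[0]) & (completedf['int_dur'] == key[1])
        & (completedf['int_calc'] == key[2]) & (completedf['stopgroup'] == key[3])
        & (completedf['daylossgrp'] == key[4]) & (completedf['timegroup'] == key[5])
    ]
    ror = 0 if tester.empty else tester['%ROR'].iloc[0]
    return df_row['sample_dur'].iloc[0], ror

groups = [grp for _, grp in backtestdf_complete.groupby(['b_index', 'sample_dur'])]
tester_list = Parallel(n_jobs=-1)(delayed(day_test_ret)(grp) for grp in groups)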

I've also considered getting rid of the function altogether and just using a for loop to drop the function-call overhead, but I know how slow explicit loops can be compared to using .apply() on a dataframe...

So, suggestions? Is there any better way to pick out about 10,000 rows from 1.2 million? I will have bigger sample sizes in the future, so I want to make this efficient.


RE: Suggestion on how to speed up this code? - sawtooth500 - May-04-2024

Actually, I solved my own issue - I converted the relevant columns in completedf to a MultiIndex, and now the operation is virtually instantaneous.
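The conversion step itself isn't shown in the post; presumably it is something along these lines, with the index level order inferred from the .loc tuple in day_test below (the sort_index() call is an addition here - tuple lookups on an unsorted MultiIndex fall back to a much slower path):

# Assumed one-time conversion of completedf (level order inferred from
# the lookup tuple used in day_test below); sort_index() keeps .loc fast.
completedf = completedf.set_index(
    ['volwagroup', 'int_dur', 'int_calc', 'stopgroup',
     'daylossgrp', 'timegroup', 'date']
).sort_index()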

def day_test(df_row):
    day_start_time = time.time()
    try:
        # Tuple lookup on the MultiIndex: (volwagroup, int_dur, int_calc,
        # stopgroup, daylossgrp, timegroup, date)
        tester = completedf.loc[(df_row.index[0][0], df_row.index[0][1],
                                 df_row.index[0][2], df_row.index[0][3],
                                 df_row.index[0][4], df_row.index[0][5],
                                 unique_dates[df_row['b_index'].iloc[0]])]
        ror = tester['%ROR']
    except KeyError:  # no matching row in completedf for this key
        ror = 0
    #print('Finished day testing ' + unique_dates[df_row['b_index'].iloc[0]] + ' sample_dur: ' + str(df_row['sample_dur'].iloc[0]) + ' @ ' + datetime.now().strftime(progressformat) + ' - Execution time (HH:MM:SS.xx) : ' + timer(day_start_time))
    tester_list.append((df_row['sample_dur'].iloc[0], ror))
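
The driver lines from the first post presumably stay the same with this version - only the lookup inside day_test changes:

tester_list = []
backtestdf_complete.groupby(['b_index', 'sample_dur']).apply(day_test)
test_df = pd.DataFrame(tester_list, columns=['day_dur', '%ROR'])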