Apr-01-2024, 04:41 AM
So here is the crux of my issue:
I have a dataframe with 1.8 million input rows. However, the overlapping subsets I build from it add up to 62.1 million rows, and I need a weighted average for each subset. So really I'm crunching 62.1 million rows, and that takes roughly 23 minutes.
As a simplification, my dataframe df has two columns, 'A' and 'B', filled with numbers, and I need to calculate the average of the numbers in column A weighted by the numbers in column B.
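Over the whole frame that calculation would be a one-liner, something like this (toy data just to make it concrete):

    import pandas as pd
    import numpy as np

    # toy stand-in for my real data: two numeric columns 'A' and 'B'
    df = pd.DataFrame({'A': np.random.rand(1_800_000),
                       'B': np.random.rand(1_800_000)})

    # average of A weighted by B - same formula I use per subset below
    weighted_avg = (df['A'] * df['B']).sum() / df['B'].sum()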
But... I don't need to find the weighted average of the original 1.8 million rows. I need to find the weighted averages of overlapping sets of rows, which in this example total 62.1 million rows.
For example, I would take rows like these (see the sketch just after this list):
0-2832
672-3293
1189-4102
1382-4204
2902-4680
and so on....
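In code terms, each of those subsets is just a contiguous slice of df. A minimal sketch, assuming the end row is inclusive and using a made-up list of (start, end) pairs in place of my algorithm's real output:

    # hypothetical (start, end) row-number pairs like the ones above
    ranges = [(0, 2832), (672, 3293), (1189, 4102), (1382, 4204), (2902, 4680)]

    start, end = ranges[0]
    tempdf = df.iloc[start:end + 1]   # one overlapping subset of rows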
I have an algorithm that determines the relevant row numbers for each subset. There is no consistent pattern to the row numbers - no fixed increment or anything like that.
So my code starts with a while loop (which is probably why the processing takes so long):
    while condition:
        # boolean indexing to create a new dataframe that contains
        # my relevant subset; the mask comes from my algorithm
        tempdf = df[mask]
        weighted_avg = (tempdf['A'] * tempdf['B']).sum() / tempdf['B'].sum()
        # append the weighted average as a new row of a results dataframe
        resultsdf.loc[len(resultsdf)] = weighted_avg

So, as you see, this is how I turn 1.8 million rows into 62.1 million rows to process: there is a lot of overlap between the subsets. Because of the math involved in calculating a weighted average, and because there is no simple incremental pattern in the indices used to build each subset, I don't know how to do this any way other than with a while loop that takes a very long time to cycle through.
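Put together, a self-contained toy version of what I'm doing looks roughly like this (small sizes and the made-up ranges from above, just so it runs; I've used .iloc slicing here for brevity, whereas my real code builds each subset with boolean indexing):

    import pandas as pd
    import numpy as np

    # toy data standing in for my 1.8-million-row frame
    df = pd.DataFrame({'A': np.random.rand(5_000), 'B': np.random.rand(5_000)})

    # stand-in for my algorithm's output: overlapping (start, end) row ranges
    ranges = [(0, 2832), (672, 3293), (1189, 4102), (1382, 4204), (2902, 4680)]

    rows = []
    for start, end in ranges:
        tempdf = df.iloc[start:end + 1]  # the relevant subset
        rows.append((tempdf['A'] * tempdf['B']).sum() / tempdf['B'].sum())
    resultsdf = pd.DataFrame({'weighted_avg': rows})

    # total rows touched across all subsets - the overlap is how
    # 1.8 million input rows turn into 62.1 million rows of work
    print(sum(end - start + 1 for start, end in ranges))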
So, any better way to do it?