Apr-02-2024, 03:05 PM
Well now I've discovered rolling on polars dataframes - could I use this to eliminate my while loop altogether?
As practice: ts_event in my CSV file holds nanosecond Unix timestamps, which I first convert to datetime objects. The column 'size' is just a column of integers, typically ranging from 1 to 500.
import sys
import polars as pl

df = pl.read_csv(sys.argv[1])
print(df)
# ts_event holds nanoseconds, so name the unit explicitly
# (pl.Datetime on its own defaults to microseconds)
df = df.with_columns(pl.col("ts_event").cast(pl.Datetime("ns")))
print(df)

# Define your rolling window in time
window_duration = "35s"  # 35 seconds window
every_duration = "1s"  # Shift the window every 1 second

out = df.rolling(index_column='ts_event', period='30s', offset='1s').agg(pl.col('size').sum())
print(out)
I want to take the first 30s worth of rows and find the sum of the size col over that 30s. Then I want to step forward by one second and find the next sum.
For example, let's say my data starts at 08:00:00.
So first I want to find the sum of the size col between
8:00:00-8:00:30
then
8:00:01-8:00:31
8:00:02-8:00:32
and so on...
And for now we can just deposit these results into another dataframe.
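To pin down what those windows should produce, here is a minimal plain-Python sketch that could serve as a reference when checking a polars result. It assumes half-open windows [start, start + 30s), a first window that starts at the first timestamp, and timestamps that are already sorted; the function name sliding_sums and the sample rows are made up for illustration.

```python
from bisect import bisect_left
from datetime import datetime, timedelta
from itertools import accumulate


def sliding_sums(times, sizes, period=timedelta(seconds=30), step=timedelta(seconds=1)):
    """Sum sizes over [start, start + period), moving start forward by step.

    times must be sorted ascending. Windows are half-open (an assumption).
    """
    prefix = [0, *accumulate(sizes)]  # prefix[i] == sum(sizes[:i])
    n_windows = (times[-1] - times[0]) // step + 1
    out = []
    for k in range(n_windows):
        start = times[0] + k * step
        lo = bisect_left(times, start)           # first row with time >= start
        hi = bisect_left(times, start + period)  # first row with time >= window end
        out.append((start, prefix[hi] - prefix[lo]))
    return out


# Hypothetical sample rows at 08:00:00, 08:00:01, 08:00:02 and 08:00:31
t0 = datetime(2024, 4, 2, 8, 0, 0)
times = [t0 + timedelta(seconds=s) for s in (0, 1, 2, 31)]
sizes = [10, 20, 30, 40]

results = sliding_sums(times, sizes)
for start, total in results[:3]:
    print(start.time(), total)
```

Note that this yields one sum per second whether or not a row falls exactly on that second, which is not necessarily what a per-row rolling window gives.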
So with the above code I posted I get the error
Error:- If your data is ALREADY sorted, set the sorted flag with: '.set_sorted()'.
My ts_event col is already sorted chronologically. But I can't figure out the syntax for how to set that flag in my code - no matter where I try it, it keeps throwing me another error. How do I set the '.set_sorted()' flag?