Multiprocessing on python

sawtooth500 · Apr-02-2024, 03:05 PM

Well now I've discovered rolling on polars dataframes - could I use this to eliminate my while loop altogether?

import sys

import polars as pl

# Read the CSV named on the command line and convert ts_event to datetimes
df = pl.read_csv(sys.argv[1])
print(df)
df = df.with_columns(pl.col("ts_event").cast(pl.Datetime))
print(df)

# Define your rolling window in time
window_duration = "35s"  # 35 seconds window
every_duration = "1s"    # Shift the window every 1 second

# Sum 'size' over a time-based rolling window keyed on ts_event
out = df.rolling(index_column='ts_event', period='30s', offset='1s').agg(pl.col('size').sum())

print(out)

For context, ts_event in my CSV file holds nanosecond Unix timestamps. I first convert this column to datetime objects. The column 'size' is just a column of integers, typically ranging from 1 to 500.
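Since the raw values are nanoseconds, the time unit assumed during the conversion matters. Here is a minimal stdlib-only sketch of how the same integer reads under two different units. The timestamp is a hypothetical value chosen for illustration, not taken from the CSV. As far as I know, pl.Datetime defaults to microseconds and accepts an explicit unit such as pl.Datetime("ns"), but treat that as something to check against your polars version.

```python
from datetime import datetime, timezone

# Hypothetical nanosecond Unix timestamp, chosen only for illustration.
ts_ns = 1_709_303_400_002_663_539

# Read as nanoseconds: split into whole seconds and leftover nanoseconds
# with integer arithmetic so no precision is lost to floating point.
seconds, nanos = divmod(ts_ns, 1_000_000_000)
as_ns = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(as_ns.isoformat(), f"+ {nanos} ns")

# Read (incorrectly) as microseconds: the same integer lands tens of
# thousands of years in the future. datetime cannot represent that, so
# estimate the year using the mean Gregorian year of 31_556_952 seconds.
approx_year_if_us = 1970 + (ts_ns // 1_000_000) // 31_556_952
print(approx_year_if_us)
```

If the converted column shows five-digit years, a unit mismatch like this is one likely cause.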

I want to take the first 30s worth of rows and find the sum of the size col in that 30s. Then I want to move the window up by one second and find the next sum.

For example, let's say my data starts at 08:00:00 -

So first I want to find the sum of the size col between

8:00:00-8:00:30
then
8:00:01-8:00:31
8:00:02-8:00:32

and so on...

And for now we can just deposit these results into another dataframe.
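To pin down the computation being described, here is a rough pure-Python sketch that does not use polars at all. The function name window_sums, the sample data, and the boundary handling (half-open windows [start, start + period), full windows only) are all assumptions made for illustration. Timestamps are plain seconds and must already be sorted.

```python
from bisect import bisect_left
from itertools import accumulate


def window_sums(times, sizes, period=30, step=1):
    """Sum sizes over [start, start + period) for start = times[0], times[0] + step, ...

    times must be sorted ascending. Only windows that end at or before
    the last timestamp are produced (an assumption of this sketch).
    """
    prefix = [0, *accumulate(sizes)]  # prefix[i] == sum(sizes[:i])
    results = []
    start = times[0]
    while start + period <= times[-1]:
        lo = bisect_left(times, start)            # first row with time >= start
        hi = bisect_left(times, start + period)   # first row with time >= window end
        results.append((start, prefix[hi] - prefix[lo]))
        start += step
    return results


# Hypothetical sample: seconds since the first event, and trade sizes.
times = [0, 10, 20, 29, 30, 31, 32]
sizes = [5, 1, 2, 3, 4, 6, 7]
for start, total in window_sums(times, sizes):
    print(f"{start:>3}s to {start + 30:>3}s: {total}")
```

The prefix sums plus binary search avoid rescanning every row for each window, which is roughly what a vectorised rolling aggregation replaces the while loop with.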

So with the above code I posted I get the error

Error:
- If your data is ALREADY sorted, set the sorted flag with: '.set_sorted()'.

My ts_event col is already sorted chronologically. But I can't figure out the syntax for how to set that flag in my code - no matter where I try it, it keeps throwing me another error.

How do I set the '.set_sorted()' flag?

deanhystad · Apr-02-2024, 04:14 PM

Post entire error message, including the traceback.

sawtooth500 · Apr-02-2024, 06:03 PM

Error:
shape: (614_076, 5)
┌──────────────────────────────┬───────────────────────────────┬──────┬────────┬──────┐
│ ts_event                     ┆ eastern_time                  ┆ side ┆ price  ┆ size │
│ ---                          ┆ ---                           ┆ ---  ┆ ---    ┆ ---  │
│ datetime[μs]                 ┆ str                           ┆ str  ┆ f64    ┆ i64  │
╞══════════════════════════════╪═══════════════════════════════╪══════╪════════╪══════╡
│ +56135-10-13 20:00:02.663539 ┆ 03-01-2024 09:30:00.002663424 ┆ N    ┆ 198.05 ┆ 34   │
│ +56135-10-13 20:00:02.663539 ┆ 03-01-2024 09:30:00.002663424 ┆ N    ┆ 198.05 ┆ 4    │
│ +56135-10-13 20:00:03.314087 ┆ 03-01-2024 09:30:00.003314176 ┆ N    ┆ 198.05 ┆ 46   │
│ +56135-10-13 20:00:03.314087 ┆ 03-01-2024 09:30:00.003314176 ┆ N    ┆ 198.05 ┆ 34   │
│ +56135-10-13 20:00:03.314087 ┆ 03-01-2024 09:30:00.003314176 ┆ N    ┆ 198.06 ┆ 5    │
│ …                            ┆ …                             ┆ …    ┆ …      ┆ …    │
│ +56209-08-25 23:57:44.164197 ┆ 03-28-2024 09:59:59.864164096 ┆ N    ┆ 180.47 ┆ 5    │
│ +56209-08-25 23:57:44.164197 ┆ 03-28-2024 09:59:59.864164096 ┆ B    ┆ 180.48 ┆ 95   │
│ +56209-08-25 23:57:44.210570 ┆ 03-28-2024 09:59:59.864210688 ┆ B    ┆ 180.48 ┆ 5    │
│ +56209-08-25 23:57:44.341235 ┆ 03-28-2024 09:59:59.864341248 ┆ B    ┆ 180.49 ┆ 5    │
│ +56209-08-25 23:57:44.810835 ┆ 03-28-2024 09:59:59.864810752 ┆ N    ┆ 180.49 ┆ 7    │
└──────────────────────────────┴───────────────────────────────┴──────┴────────┴──────┘
Traceback (most recent call last):
  File "C:\Users\thpfs\Documents\Python\volwa.py", line 40, in <module>
    out = df.rolling(index_column = 'ts_event', period = '35s', offset = '1s').agg(pl.col("size").sum())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\thpfs\AppData\Local\Programs\Python\Python312\Lib\site-packages\polars\dataframe\group_by.py", line 894, in agg
    .collect(no_optimization=True)
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\thpfs\AppData\Local\Programs\Python\Python312\Lib\site-packages\polars\lazyframe\frame.py", line 1943, in collect
    return wrap_df(ldf.collect())
                   ^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: argument in operation 'rolling' is not explicitly sorted

- If your data is ALREADY sorted, set the sorted flag with: '.set_sorted()'.
- If your data is NOT sorted, sort the 'expr/series/column' first.
