Python Forum
Advice needed on how to approach this problem...
#1
I need advice on how to approach this problem.

I'm analyzing historical stock tick data. If I take a symbol, say AAPL, years worth of tick data will easily be in the hundreds of millions of rows.

The data comes with a nanosecond timestamp for each tick, along with the price and size; those three fields are what I need to track.

Right now, I've got a really ugly solution that involves pandas dataframes with loops, and it takes about 10 hours to process one year of data, so it's not really practical to scale up. I'm doing this so that I can backtest various day-trading strategies.

I'm working with tick data because it's the only option for getting sub-minute resolution for day trading.

I feel like I really need to go with polars for performance reasons.

So I can get a single CSV file that contains an entire year's worth of ticks for a symbol like AAPL, and I dump it into a dataframe.

All times that I mention are Eastern time zone. The data for each day spans from 4 AM - 8 PM - those are the full extended market hours. The normal market hours are from 9:30 AM to 4 PM. When analyzing the data, I need to take a "slice" from each day. The slice could be the entire day, 4 AM - 8 PM, but in reality the pre/post market data isn't really good to analyze. So common time slices would be the normal market (9:30 AM - 4 PM), first hour (9:30 AM - 10:30 AM), or first 30 min (9:30 AM - 10:00 AM).

The first thing I do right now is preprocess all my data: basically, I create a new CSV file that covers the entire time span (let's say all of 2023) but contains ONLY the applicable time slice for each day.
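In Polars terms, that preprocessing amounts to something like this. A minimal sketch (the file names are placeholders, and I'm assuming a "timestamp" column in ISO format):

import datetime

import polars as pl

# Minimal sketch of the preprocessing step; file and column names are placeholders.
df = pl.read_csv("aapl_2023_ticks.csv").with_columns(
    pl.col("timestamp").str.strptime(pl.Datetime, "%Y-%m-%dT%H:%M:%S%.f")
)

# Keep only the chosen slice of each trading day, e.g. the first hour.
first_hour = df.filter(
    pl.col("timestamp").dt.time().is_between(
        datetime.time(9, 30), datetime.time(10, 30), closed="left"
    )
)

# Persist the preprocessed slice so later runs can skip the full file.
first_hour.write_csv("aapl_2023_first_hour.csv")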

Then, from within each time slice, I need to analyze the data in rolling snapshots, not fixed time intervals. For example, if I'm looking at the 9:30-10:30 timeframe with 30-second blocks at 1-second intervals, the data I'm examining gets sliced into overlapping sub-blocks like this (see the sketch just after the list):

9:30:00-9:30:30
9:30:01-9:30:31
9:30:02-9:30:32
and so on until 10:30:00
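From what I can tell, this "hop forward one second, look back 30 seconds" pattern maps onto Polars' group_by_dynamic. A minimal sketch of what I mean, continuing from the first_hour frame above (the aggregations are just examples):

import polars as pl

# A new 30-second window starts every 1 second: `every` is the hop
# between window starts, `period` is the window length.
snapshots = (
    first_hour.sort("timestamp")
    .group_by_dynamic("timestamp", every="1s", period="30s")
    .agg(
        pl.col("price").mean().alias("mean_price"),
        pl.col("price").last().alias("last_price"),
        pl.col("size").sum().alias("total_size"),
    )
)

Each output row is labeled with its window start, so the 9:30:00-9:30:30, 9:30:01-9:30:31, ... sub-blocks come out as ordinary rows with no Python loop.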

I'm doing all of the slicing using boolean indexing. I'm also trying to use vector operations as much as possible. It would be much simpler to analyze fixed times, like 9:30:00-9:30:29, 9:30:30-9:30:59, etc... but the data is too rough then. Having a rolling window really smooths out the data.

I really, really want to avoid using all of these loops. I also would really like to avoid the slicing, and then the sub-slicing of the slices, that I'm doing. But I'm at a loss as to how to approach this problem in a better fashion.

The best idea I've had thus far is to use polars rolling (https://docs.pola.rs/py-polars/html/refe...me.rolling). My understanding of polars rolling isn't the best, but from what I do understand, rolling goes through a dataframe row by row, and I can specify a time to "look back", say 30 seconds. But each row of my data is a tick, not a second, so I don't need it to roll row by row; I need it to roll forward second by second, jumping to the next appropriate row (or whatever interval I specify), and I don't think that's possible using rolling.

There are other issues too if I want to avoid the slicing I'm doing: the roll is irregular. The stock market is typically open M-F, but it's closed on holidays, so I can't use a regular day interval. I also need to correct for daylight saving time when the data spans a DST boundary. So right now, in my slicing with loops, I'm applying all of those corrections manually.
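On the DST point specifically: if the raw timestamps can be parsed as UTC (an assumption about the data feed), it seems Polars' time zone support could replace the manual correction. A sketch, applied to the parsed frame before any time-of-day slicing:

import polars as pl

# Assumes the feed's timestamps are UTC; adjust if they are already Eastern.
# After converting, all time-of-day comparisons use New York wall-clock time,
# so DST boundaries need no manual correction.
df = df.with_columns(
    pl.col("timestamp")
    .dt.replace_time_zone("UTC")
    .dt.convert_time_zone("America/New_York")
)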

How can I code this more efficiently? There has to be a better way...
#2
(Apr-04-2024, 02:37 AM)sawtooth500 Wrote: How can I code this more efficiently? There has to be a better way...
You should post the code you have done so far, or at least a sample of the stock dataset input and the wanted output.
(Apr-04-2024, 02:37 AM)sawtooth500 Wrote: I really, really want to avoid using all of these loops.
Yes, loops are usually really bad for performance when working with dataframes.
Here's a rough sketch with some ideas for how you might start implementing a solution using Polars,
and as you can see, no loops 🧬
import polars as pl

df = pl.read_csv("stock.csv")

# Convert the timestamp column to datetime
df = df.with_columns(pl.col("timestamp").str.strptime(pl.Datetime, "%Y-%m-%dT%H:%M:%S%.f"))

# Filter for market hours (9 AM to 4 PM here; tighten to 9:30 if needed)
market_hours = df.filter(
    (pl.col("timestamp").dt.hour() >= 9) &
    (pl.col("timestamp").dt.hour() < 16)
)

# Sort by timestamp to ensure proper rolling window calculation
market_hours = market_hours.sort("timestamp")

# Calculate the rolling mean of price over a 30-second time window keyed on timestamp
rolling_mean = market_hours.select([
    pl.col("timestamp"),
    pl.col("price").rolling_mean(window_size="30s", by="timestamp").alias("sma_price")
])

print(rolling_mean)
Output:
┌─────────────────────┬────────────┐
│ timestamp           ┆ sma_price  │
│ ---                 ┆ ---        │
│ datetime[ns]        ┆ f64        │
╞═════════════════════╪════════════╡
│ 2023-01-02 09:00:00 ┆ 181.40271  │
│ 2023-01-02 09:00:01 ┆ 183.268213 │
│ 2023-01-02 09:00:02 ┆ 169.883361 │
│ 2023-01-02 09:00:03 ┆ 159.998933 │
│ 2023-01-02 09:00:04 ┆ 157.532322 │
│ …                   ┆ …          │
│ 2023-01-02 15:59:55 ┆ 147.84868  │
│ 2023-01-02 15:59:56 ┆ 147.572575 │
│ 2023-01-02 15:59:57 ┆ 147.820979 │
│ 2023-01-02 15:59:58 ┆ 148.183034 │
│ 2023-01-02 15:59:59 ┆ 149.068343 │
└─────────────────────┴────────────┘
Stock sample input data:
timestamp,price,size
2023-01-02T04:00:00.000000,154.8813503927325,12
2023-01-02T04:00:01.000000,171.51893663724195,67
2023-01-02T04:00:02.000000,160.27633760716438,20
2023-01-02T04:00:03.000000,154.4883182996897,9
2023-01-02T04:00:04.000000,142.36547993389047,79
......
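If the real file has hundreds of millions of rows, eager read_csv may not fit in memory. The same idea with the lazy API, scan_csv, lets Polars optimize the whole query before running it (same placeholder file name as above):

import polars as pl

# Lazy variant: Polars plans the whole query first and avoids materializing
# intermediate frames for a full year of ticks.
rolling_mean = (
    pl.scan_csv("stock.csv")
    .with_columns(pl.col("timestamp").str.strptime(pl.Datetime, "%Y-%m-%dT%H:%M:%S%.f"))
    .filter(pl.col("timestamp").dt.hour().is_between(9, 15))
    .sort("timestamp")
    # Newer Polars spells this rolling_mean_by("timestamp", window_size="30s")
    .select(
        pl.col("timestamp"),
        pl.col("price").rolling_mean(window_size="30s", by="timestamp").alias("sma_price"),
    )
    .collect()
)

print(rolling_mean)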