Python Forum
Advice needed on how to approach this problem...
#1
I need advice on how to approach this problem.

I'm analyzing historical stock tick data. If I take a symbol, say AAPL, years worth of tick data will easily be in the hundreds of millions of rows.

The data comes with a nanosecond timestamp for each tick, along with the price and size; those three fields are what I need to track.

Right now, I've got a really ugly solution that involves pandas dataframes with loops, and it takes about 10 hours to process one year of data, so it's not really practical to scale up. I'm doing this so that I can backtest various day-trading strategies.

I'm working with tick data because it's the only option for getting sub-minute resolution for day trading.

I feel like I really need to go with polars for performance reasons.

So I can get a single CSV file that contains an entire year's worth of ticks for a symbol like AAPL, and I dump it into a dataframe.

All times that I mention are Eastern time zone. The data for each day spans from 4 AM - 8 PM - those are the full extended market hours. The normal market hours are from 9:30 AM to 4 PM. When analyzing the data, I need to take a "slice" from each day. The slice could be the entire day, 4 AM - 8 PM, but in reality the pre/post market data isn't really good to analyze. So common time slices would be the normal market (9:30 AM - 4 PM), first hour (9:30 AM - 10:30 AM), or first 30 min (9:30 AM - 10:00 AM).

The first thing I do right now is preprocess all my data: basically, I create a new CSV file that covers the entire time span (let's say all of 2023) but contains ONLY the applicable time slice for each day.
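In Polars terms, that preprocessing amounts to something like this. A minimal sketch (the file names are placeholders, and I'm assuming a "timestamp" column in ISO format):

import datetime

import polars as pl

# Minimal sketch of the preprocessing step; file and column names are placeholders.
df = pl.read_csv("aapl_2023_ticks.csv").with_columns(
    pl.col("timestamp").str.strptime(pl.Datetime, "%Y-%m-%dT%H:%M:%S%.f")
)

# Keep only the chosen slice of each trading day, e.g. the first hour.
first_hour = df.filter(
    pl.col("timestamp").dt.time().is_between(
        datetime.time(9, 30), datetime.time(10, 30), closed="left"
    )
)

# Persist the preprocessed slice so later runs can skip the full file.
first_hour.write_csv("aapl_2023_first_hour.csv")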

Then, from within each time slice, I need to analyze the data in rolling snapshots, not fixed time intervals. For example, if I'm looking at the 9:30-10:30 timeframe with 30-second blocks at 1-second intervals, the data I'm examining gets sliced into overlapping sub-blocks like this (see the sketch just after the list):

9:30:00-9:30:30
9:30:01-9:30:31
9:30:02-9:30:32
and so on until 10:30:00
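From what I can tell, this "hop forward one second, look back 30 seconds" pattern maps onto Polars' group_by_dynamic. A minimal sketch of what I mean, continuing from the first_hour frame above (the aggregations are just examples):

import polars as pl

# A new 30-second window starts every 1 second: `every` is the hop
# between window starts, `period` is the window length.
snapshots = (
    first_hour.sort("timestamp")
    .group_by_dynamic("timestamp", every="1s", period="30s")
    .agg(
        pl.col("price").mean().alias("mean_price"),
        pl.col("price").last().alias("last_price"),
        pl.col("size").sum().alias("total_size"),
    )
)

Each output row is labeled with its window start, so the 9:30:00-9:30:30, 9:30:01-9:30:31, ... sub-blocks come out as ordinary rows with no Python loop.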

I'm doing all of the slicing using boolean indexing. I'm also trying to use vector operations as much as possible. It would be much simpler to analyze fixed times, like 9:30:00-9:30:29, 9:30:30-9:30:59, etc... but the data is too rough then. Having a rolling window really smooths out the data.

I really, really want to avoid using all of these loops. I also would really like to avoid the slicing, and then the sub-slicing of the slices, that I'm doing. But I'm at a loss as to how to approach this problem in a better fashion.

The best idea I've had thus far is to use polars rolling (https://docs.pola.rs/py-polars/html/refe...me.rolling). My understanding of polars rolling isn't the best, but from what I do understand, rolling goes through a dataframe row by row, and I can specify a time to "look back", say 30 seconds. But each row of my data is a tick, not a second, so I don't need it to roll row by row; I need it to roll forward second by second, jumping to the next appropriate row (or whatever interval I specify), and I don't think that's possible using rolling.

There are other issues too if I want to avoid the slicing I'm doing: the roll is irregular. The stock market is typically open M-F, but it's closed on holidays, so I can't use a regular day interval. I also need to correct for daylight saving time when the data spans a DST boundary. So right now, in my slicing with loops, I'm applying all of those corrections manually.
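On the DST point specifically: if the raw timestamps can be parsed as UTC (an assumption about the data feed), it seems Polars' time zone support could replace the manual correction. A sketch, applied to the parsed frame before any time-of-day slicing:

import polars as pl

# Assumes the feed's timestamps are UTC; adjust if they are already Eastern.
# After converting, all time-of-day comparisons use New York wall-clock time,
# so DST boundaries need no manual correction.
df = df.with_columns(
    pl.col("timestamp")
    .dt.replace_time_zone("UTC")
    .dt.convert_time_zone("America/New_York")
)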

How can I code this more efficiently? There has to be a better way...
#2
(Apr-04-2024, 02:37 AM)sawtooth500 Wrote: How can I code this more efficiently? There has to be a better way...
You should post the code you have done so far, or at least a sample of the stock dataset input and the wanted output.
(Apr-04-2024, 02:37 AM)sawtooth500 Wrote: I really, really want to avoid using all of these loops.
Yes, loops are usually really bad for performance when working with dataframes.
Here's a rough sketch with some ideas for how you might start implementing a solution using Polars,
and as you can see, no loops 🧬
import polars as pl

df = pl.read_csv("stock.csv")

# Convert the timestamp column to datetime
df = df.with_columns(pl.col("timestamp").str.strptime(pl.Datetime, "%Y-%m-%dT%H:%M:%S%.f"))

# Filter for market hours (9 AM to 4 PM here; tighten to 9:30 if needed)
market_hours = df.filter(
    (pl.col("timestamp").dt.hour() >= 9) &
    (pl.col("timestamp").dt.hour() < 16)
)

# Sort by timestamp to ensure proper rolling window calculation
market_hours = market_hours.sort("timestamp")

# Calculate the rolling mean of price over a 30-second time window keyed on timestamp
rolling_mean = market_hours.select([
    pl.col("timestamp"),
    pl.col("price").rolling_mean(window_size="30s", by="timestamp").alias("sma_price")
])

print(rolling_mean)
Output:
┌─────────────────────┬────────────┐
│ timestamp           ┆ sma_price  │
│ ---                 ┆ ---        │
│ datetime[ns]        ┆ f64        │
╞═════════════════════╪════════════╡
│ 2023-01-02 09:00:00 ┆ 181.40271  │
│ 2023-01-02 09:00:01 ┆ 183.268213 │
│ 2023-01-02 09:00:02 ┆ 169.883361 │
│ 2023-01-02 09:00:03 ┆ 159.998933 │
│ 2023-01-02 09:00:04 ┆ 157.532322 │
│ …                   ┆ …          │
│ 2023-01-02 15:59:55 ┆ 147.84868  │
│ 2023-01-02 15:59:56 ┆ 147.572575 │
│ 2023-01-02 15:59:57 ┆ 147.820979 │
│ 2023-01-02 15:59:58 ┆ 148.183034 │
│ 2023-01-02 15:59:59 ┆ 149.068343 │
└─────────────────────┴────────────┘
Stock sample input data:
timestamp,price,size
2023-01-02T04:00:00.000000,154.8813503927325,12
2023-01-02T04:00:01.000000,171.51893663724195,67
2023-01-02T04:00:02.000000,160.27633760716438,20
2023-01-02T04:00:03.000000,154.4883182996897,9
2023-01-02T04:00:04.000000,142.36547993389047,79
......
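If the real file has hundreds of millions of rows, eager read_csv may not fit in memory. The same idea with the lazy API, scan_csv, lets Polars optimize the whole query before running it (same placeholder file name as above):

import polars as pl

# Lazy variant: Polars plans the whole query first and avoids materializing
# intermediate frames for a full year of ticks.
rolling_mean = (
    pl.scan_csv("stock.csv")
    .with_columns(pl.col("timestamp").str.strptime(pl.Datetime, "%Y-%m-%dT%H:%M:%S%.f"))
    .filter(pl.col("timestamp").dt.hour().is_between(9, 15))
    .sort("timestamp")
    # Newer Polars spells this rolling_mean_by("timestamp", window_size="30s")
    .select(
        pl.col("timestamp"),
        pl.col("price").rolling_mean(window_size="30s", by="timestamp").alias("sma_price"),
    )
    .collect()
)

print(rolling_mean)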