Python Forum
Advice needed on how to approach this problem...
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Advice needed on how to approach this problem...
#1
I need advice on how to approach this problem.

I'm analyzing historical stock tick data. If I take a symbol, say AAPL, years worth of tick data will easily be in the hundreds of millions of rows.

The data comes with a nanosecond timestamp for each ticket, along with the price and size, those three things are the items I need to track.

Right now, I've got a really ugly solution that involves pandas dataframes with loops and takes about 10 hours to run 1 year. So that's not really practical to scale up. I am doing this so that I can backtest various trading strategies for day trading.

I'm working with tick data because in order to get sub-minute resolution for day trading, that is the only option.

I feel like I really need to go with polars for performance reasons.

So I can get a single CSV file that contains an entire years worth of tickers for a symbol like AAPL, and I dump it into a dataframe.

All times that I mention are Eastern time zone. The data for each day spans from 4 AM - 8 PM - those are the full extended market hours. The normal market hours are from 9:30 AM to 4 PM. When analyzing the data, I need to take a "slice" from each day. The slice could be the entire day, 4 AM - 8 PM, but in reality the pre/post market data isn't really good to analyze. So common time slices would be the normal market (9:30 AM - 4 PM), first hour (9:30 AM - 10:30 AM), or first 30 min (9:30 AM - 10:00 AM).

The first thing I do right now is preprocess all my data, and basically I create a new CSV file that has the entire time span (let's say all of 2023), but ONLY the applicable time slice for each day.

Then, from within each time slice, I need to analyze the data in rolling snapshots not fixed time intervals. For example, if I'm looking at 9:30-10:30 timeframe, if I'm looking at 30 second blocks at 1 second intervals then the data I'm examining will be sliced into more sub-blocks in this case:

9:30:00-9:30:30
9:30:01-9:30:31
9:30:02-9:30:32
and so on until 10:30:00

I'm doing all of the slicing using boolean indexing. I'm also trying to use vector operations as much as possible. It would be much simpler to analyze fixed times, like 9:30:00-9:30:29, 9:30:30-9:30:59, etc... but the data is too rough then. Having a rolling window really smooths out the data.

I really, really want to avoid using all of these loops. I also would realy like to avoid the slicing, and then sub-slicing of the slices that I'm doing. But I'm really at a loss of how to approach this problem in a better fashion. The best idea I had thus far was to use polars rolling (https://docs.pola.rs/py-polars/html/refe...me.rolling). My understanding of polars rolling isn't the best, but from what I do understand is that in a dataframe using rolling it goes through row by row and I can specify a time to "look back" - say 30 seconds. But each row of my data is a tick - it's not a second - so I don't need it to roll row by row, I need it to roll forward second by second and jump forward to the next appropriate row (or whatever interval I specify) and I don't think this is possible using rolling. Not to mention other issues, such as if I want to avoid the slicing I'm doing - the roll is irregular - the stock market is typically open M-F, but it's closed on holidays, so I can't use a regular day interval. Not to mention I need to correct for daylight time when the data spans a DST boundary. So right now, in my slicing using loops, I'm specifying all of those corrections manually.

How can I code this more efficiently? There has to be a better way...
Reply


Messages In This Thread
Advice needed on how to approach this problem... - by sawtooth500 - Apr-04-2024, 02:37 AM

Possibly Related Threads…
Thread Author Replies Views Last Post
  Sound Approach for Running a Shell Command? matt_the_hall 8 3,483 Dec-14-2020, 02:52 PM
Last Post: matt_the_hall
  Need feedback on my approach for python dashboard for Asana pashtett 0 1,360 Nov-24-2020, 11:51 AM
Last Post: pashtett
  Approach to creating Audit table pynewbie 4 3,933 Feb-24-2020, 06:12 PM
Last Post: pynewbie
  list approach due nested order 3Pinter 6 2,913 Oct-07-2019, 01:49 PM
Last Post: 3Pinter
  Whats a good design/approach? hshivaraj 1 1,823 Sep-16-2019, 01:44 AM
Last Post: snippsat
  Recommendation after running regression (approach) danishzmalik 0 1,704 Jul-17-2018, 09:31 AM
Last Post: danishzmalik
  Which approach is better to copy a list? nexusfactor 6 4,809 Oct-15-2017, 10:45 PM
Last Post: ichabod801
  Randomise network while maintaining topology of nodes using edge-swap approach Tom1988 3 4,185 May-25-2017, 10:59 PM
Last Post: ichabod801

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020