So Dean was previously kind enough to tell me about vectorized operations on dataframes instead of iterating through each row with loops, and it has made my code so much better...
So now I have a situation where I have a dataframe with about 930,000 rows. I need to reduce that dataframe by kicking out select rows, which I plan to do via boolean indexing.
However, I also have an array with n values inside of it (my test case has 5 values in the array, but I need the code to work with n values in the array).
For each value in my array, I need to run a function that will output start and stop row numbers of rows that I want to keep in my dataframe.
So the way I know how to do this is to use a loop to run through each value in my array, get the starting and ending rows on each pass, and then use boolean indexing to reduce the dataframe.
However, I was wondering, is there a more elegant way to do this? Or do I just need to loop it? Thanks!
You should provide a short example of the dataframe.
Here is a printout of the dataframe. This is stock tick data, 5 days' worth of data in a single dataframe. This includes the full extended market hours, which are from 4 AM to 8 PM every day. I need to reduce this data to only the data during normal market hours each day, which is 9:30 AM to 4 PM. At the top of the screenshot you see my dates array. I already wrote the code that parses the entire dataset, determines all the unique dates (in this case it's 5 days' worth of data but could be any n days' worth), and outputs an array of all the dates. ts_event values are nanosecond unix timestamps.
So my idea was to write the code that determines the relevant timestamps (9:30 AM and 4 PM for each day) and then use boolean indexing to extract the rows for each day and append them into a new dataframe. I was planning on using a loop to cycle through each day. But is there a more elegant way to do it?
![[Image: dataframe-example.png]](https://i.ibb.co/YWCMYhz/dataframe-example.png)
So as you see, the original dataframe has 5 days' worth of data, 3-11-2024 through 3-15-2024, and each day ranges from 04:00 to 20:00. I need to reduce this by extracting only a certain time range for EACH day (for example, 9:30 through 16:00) and recombining the results into a single dataframe. I know I can do this with a for loop over the start/end times for each day, using boolean indexing to extract the rows for each day and then reassembling.
timestarter = 34200000000000 #9h 30m in nanoseconds
time_ender = 57600000000000 #16h 0m in nanoseconds
kept_frames = []
for value in dates_array:
    datetime_object = datetime.strptime(value, day_format)
    day_unix = datetime_object.timestamp() * 1e9
    daystarter = day_unix + timestarter
    dayender = day_unix + time_ender
    kept_frames.append(df[(df['ts_event'] >= daystarter) & (df['ts_event'] < dayender)])
finaldf = pd.concat(kept_frames, ignore_index=True)
Example of dates_array:
['03-11-2024' '03-12-2024' '03-13-2024' '03-14-2024' '03-15-2024']
In this code, daystarter and dayender are my unix timestamps for the start/end of the range during each day that I want to keep. dates_array is an array of my dates in MM-DD-YYYY format, df is my original dataframe, and finaldf is my new, reduced dataframe. You can see how I loop through each individual day here.
I'm just wondering: is there a better way to do this than using a loop to go through each day? When I have 5 days of data like in this example, it's not a big deal. But when I'm doing 1000 days of data in one go, it may be onerous unless it's the only way.
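One loop-free way to express the same filter, sketched under the assumption that ts_event holds UTC nanosecond unix timestamps and the 09:30-16:00 window applies to every trading day, is to convert the column to datetimes once and let pandas' between_time select the session window across all days at once (the tiny dataframe here is made-up demo data):

```python
import pandas as pd

# Hypothetical mini version of the tick dataframe: ts_event is a
# nanosecond unix timestamp, as in the original data.
times = pd.to_datetime(
    ["2024-03-11 04:00", "2024-03-11 09:30", "2024-03-11 15:59",
     "2024-03-11 16:00", "2024-03-12 09:45", "2024-03-12 20:00"]
)
df = pd.DataFrame({"ts_event": times.astype("int64"), "price": [1, 2, 3, 4, 5, 6]})

# Convert once, index by time, and take the 09:30-16:00 window for
# every day in one call -- no per-day loop.  inclusive="left" matches
# the >= start, < end comparison (needs pandas >= 1.4).
df["dt"] = pd.to_datetime(df["ts_event"], unit="ns")
finaldf = (
    df.set_index("dt")
      .between_time("09:30", "16:00", inclusive="left")
      .reset_index()
)
print(finaldf)
```

This scales with the number of rows rather than the number of days, so 1000 days is no more code than 5.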
This gives you the first and last time for each day. Does it solve your problem?
import pandas as pd
from datetime import datetime, timedelta
from time import time
now = datetime.now()
df = pd.DataFrame({"time": [now + timedelta(seconds=x) for x in range(930000)]})
start = time()
df["day"] = df.time.dt.day
df2 = df[df.day.shift(1) != df.day.shift(-1)]
print(time() - start)
print(df2)
Output:
0.0659632682800293
time day
0 2024-03-28 16:26:10.637194 28
27229 2024-03-28 23:59:59.637194 28
27230 2024-03-29 00:00:00.637194 29
113629 2024-03-29 23:59:59.637194 29
113630 2024-03-30 00:00:00.637194 30
200029 2024-03-30 23:59:59.637194 30
200030 2024-03-31 00:00:00.637194 31
286429 2024-03-31 23:59:59.637194 31
286430 2024-04-01 00:00:00.637194 1
372829 2024-04-01 23:59:59.637194 1
372830 2024-04-02 00:00:00.637194 2
459229 2024-04-02 23:59:59.637194 2
459230 2024-04-03 00:00:00.637194 3
545629 2024-04-03 23:59:59.637194 3
Another approach is to extract the day as above, then group the dataframe by day. You could compute the high, low, mean, open, and close for each day.
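That grouping step could look something like this sketch, with made-up tick prices (the column names are assumptions, not the original data's):

```python
import pandas as pd

# Fake tick data: two days, a few ticks each (prices are made up).
df = pd.DataFrame({
    "time": pd.to_datetime(
        ["2024-03-11 09:30", "2024-03-11 12:00", "2024-03-11 15:59",
         "2024-03-12 09:30", "2024-03-12 16:00"]
    ),
    "price": [100.0, 105.0, 102.0, 101.0, 99.0],
})

# Group by calendar day and aggregate: first tick = open, max = high,
# min = low, last tick = close, plus the mean.
daily = df.groupby(df["time"].dt.date)["price"].agg(
    open="first", high="max", low="min", close="last", mean="mean"
)
print(daily)
```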
Yes thanks that did help, I didn't know about timedelta but that makes things easier for me.
You wouldn't need to use timedelta; I only used it to make some fake data to demonstrate the filter. Your data already carries its own timestamps.
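Since the real ts_event column stores nanosecond unix timestamps rather than datetime objects, getting to a datetime column is one call (the timestamp value below is a made-up example):

```python
import pandas as pd

# A single nanosecond unix timestamp (made-up example value).
df = pd.DataFrame({"ts_event": [1710164345000000000]})

# unit="ns" interprets the integers as nanoseconds since the epoch (UTC),
# after which the .dt accessor tricks above (e.g. .dt.day) apply.
df["time"] = pd.to_datetime(df["ts_event"], unit="ns")
print(df["time"])
```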