Python Forum
FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries
#1
This is with pandas dataframes. The code works, but I am getting a warning.

Error:
C:\Users\thpfs\documents\python\cleaner.py:223: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  (entrance_pricediff < 0 and pre_removal_df['price'].iloc[-1] + clean_baseprice < pre_removal_df['price_next'].iloc[-1]): removal_df = removal_df._append(pre_removal_df)
So I think I know why I'm getting that warning.

if (entrance_pricediff > 0 and pre_removal_df['price'].iloc[-1] - clean_baseprice > pre_removal_df['price_next'].iloc[-1]) or \
   (entrance_pricediff < 0 and pre_removal_df['price'].iloc[-1] + clean_baseprice < pre_removal_df['price_next'].iloc[-1]):
    removal_df = removal_df._append(pre_removal_df)
I have the above code inside of a for loop. If the condition for my if statement is met, then I want to append pre_removal_df to removal_df. At the start of the loop removal_df is empty; however, I need to declare removal_df ahead of time because I can't call ._append() on a dataframe that hasn't been declared yet. removal_df is declared as an empty dataframe with all of the same columns as pre_removal_df.

So because on the first go of my for loop I am appending to an empty dataframe, I am getting that warning.

I am wondering if there is a more "proper" way to do this, instead of declaring an empty dataframe and then just appending to it... by "do this" I mean having a loop where each iteration may potentially append one dataframe to another.
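For reference, the usual alternative (and essentially what the warning's own advice amounts to) is to collect the qualifying frames in a plain Python list and call pd.concat() once after the loop, so no empty placeholder frame is ever involved. A minimal sketch with made-up data and a stand-in condition:

```python
import pandas as pd

chunks = []  # plain list instead of an empty placeholder DataFrame
for i in range(3):
    # Stand-in for the real pre_removal_df built on each iteration.
    pre_removal_df = pd.DataFrame({"price": [100.0 + i], "price_next": [100.5 + i]})
    if i != 1:  # stand-in for the real late-print condition
        chunks.append(pre_removal_df)

# One concat at the end; nothing empty or all-NA ever enters it,
# so the FutureWarning never fires.
removal_df = pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()
print(len(removal_df))
```

This also tends to be faster than appending inside the loop, since each ._append()/concat copies the whole accumulated frame.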

Thank you.
Reply
#2
If speed is important, don’t loop and really don’t build data frames a row at a time. This code is what I expect to see when someone has a speed complaint
Reply
#3
(Apr-22-2024, 03:18 AM)deanhystad Wrote: If speed is important, don’t loop and really don’t build data frames a row at a time. This code is what I expect to see when someone has a speed complaint

Yes, I completely agree with you - my first pass at the dataframe actually uses boolean indexing to get rid of erroneous values, but then it gets complicated when I need to get rid of edge cases. Unfortunately, this loop deals with an edge case that is simply impossible to get rid of with boolean indexing. Fortunately, the loop only has to iterate maybe 50-100 times over a dataset of about 1 million rows, because that's how often the edge case occurs, so it's not too bad of a performance hit.

https://www.elitetrader.com/et/threads/p...ket.52398/

What I'm doing is identifying "late prints" in stock market tape and removing them, as they completely mess with stop losses in backtesting. Late prints are themselves an edge case, but a single late print is easy enough to identify with a boolean expression. The first edge case to the edge case is multiple late prints in a row at the same price - and you don't know how many rows of them are coming. The edge case to the edge case to the edge case is multiple late prints in a row, for an unknown number of rows, where the price is not the same for all of them (though it could be the same for some).
Reply
#4
Quote: simply impossible to get rid of with boolean indexing
Oh really? Edge conditions can often be detected by shifting the column and comparing the shifted and unshifted columns.
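For example, a single late print can be flagged by comparing each row against both of its neighbours via shift() (toy data, with one late print planted at index 2):

```python
import pandas as pd

# Toy tape: one late print at index 2 (made-up prices, not the OP's data).
df = pd.DataFrame({"price": [181.80, 181.81, 180.80, 181.80, 181.79]})
threshold = 0.18  # max "normal" single-tick move

delta_prev = df["price"] - df["price"].shift(1)   # move into this row
delta_next = df["price"].shift(-1) - df["price"]  # move out of this row
single_late = (
    (delta_prev.abs() > threshold)
    & (delta_next.abs() > threshold)
    & (delta_prev * delta_next < 0)  # jumps away, then jumps back
)
print(df.index[single_late].tolist())  # [2]
```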
Reply
#5
See the below image. That's an example of the edge case, which I have highlighted. As you can see, the price for AMD at this point in time is hovering around $181.80 - then you see these three prints around $180.80 - so they are late prints that need to be removed.

I actually do just use a shift to compare columns, and that works great where the late print is a singular print. For each stock, I define a value X such that if the price moves by more than X in a single tick, that's a possible late print. So in this case, let's say a movement greater than $0.18 in a single tick is a red flag.

I need to check:

1. Whether the abnormal movement is positive or negative - because if the initial movement is positive, for example, the corresponding movement after the abnormality should be negative. If we have a positive flag followed by another positive flag, that could indicate a very aggressive, but legit, price move, so we don't want to filter that out. In the screenshot I provided, the initial movement is negative, so the return move was positive.

2. The price of the late prints could be the same, or it could be different.

3. If I start using shift, by how many rows do I shift? This is an example of 3 ticks of late prints, but the actual count is unknown. I've seen as many as 13 late ticks in a row, but that doesn't mean 13 is an upper limit; that's just what I have seen.

It's easy to create a filter that filters out just a specific case, but I need to filter out all cases while keeping the false positives as low as possible.
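One way to sketch the unknown-run-length case without a Python-level loop, under the assumptions above (toy data; only downward runs are handled here, and upward runs would need the mirrored expression): treat a jump down beyond the threshold as entering a late-print run and the jump back up as leaving it, then flag everything in between with cumulative sums.

```python
import pandas as pd

# Toy tape: a run of three downward late prints at indices 3-5, with
# slightly different prices and a length that isn't known in advance.
df = pd.DataFrame(
    {"price": [181.80, 181.81, 181.80, 180.80, 180.81, 180.80, 181.80, 181.79]}
)
threshold = 0.18

delta = df["price"].diff()
entries = (delta < -threshold).cumsum()  # abnormal jump down: a run starts
exits = (delta > threshold).cumsum()     # jump back up: the run is over
late = (entries - exits) > 0             # rows between entry and exit
print(df.index[late].tolist())  # [3, 4, 5]
```

A real filter would also need the sign check from point 1, so that two legitimate moves in the same direction don't open a "run" that never closes - which is exactly where the false-positive tuning comes in.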

[Image: AMD-late-prints.png]
Reply
#6
I honestly have no idea why the circled entries are circled.
Reply
#7
Look at the price column. Then look at the price in the 3 rows that I have circled - those do not belong. They are cases of "late prints" and need to be filtered out.
Reply
#8
I know I am supposed to look at the price column, but I don't see why those are "late prints". They are not passing the "one of those things is not like the other" test for me. Is there some max delta allowed between entries?
Reply
#9
Look at the entire price column - most of the ticks are around the $181.80 level, and then we have these 3 ticks that are around $180.80.

From the timestamps, you can tell that all of these ticks are within fractions of a second of each other. So if this price move were legit, you'd have a stock like AMD trading at $181.80 all of a sudden making a $1 move to $180.80, and then within another fraction of a second moving back to $181.80 - price doesn't move like that in reality. These late prints are trades that occurred at some earlier point but are only getting reported now. So they don't belong here.

Yes, you would use a price delta to identify them. My rule of thumb is a sudden movement greater than 0.001 of the opening price for the day - let's just use $180.00 as that price, so $0.18 would be the price delta we use.

The price can jump up or down for late print(s), the actual price for the late print(s) could be the same, or different, and the number of late prints is an unknown.
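As a quick sketch of that rule of thumb (the opening price and tick values are assumed from this thread, not real data):

```python
import pandas as pd

open_price = 180.00             # assumed opening price for the day
threshold = 0.001 * open_price  # red-flag single-tick move: $0.18

ticks = pd.Series([181.80, 181.81, 180.80, 180.81, 181.80])
jump = ticks.diff().abs() > threshold
# Both the move into the late prints and the move back out get flagged,
# which is why the sign of the move has to be checked afterwards.
print(ticks.index[jump].tolist())  # [2, 4]
```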
Reply
#10
This provides more info on late prints if you are curious.

https://www.youtube.com/watch?v=OZrMMOHiUeo
Reply

