Python Forum
Filling NaNs in a financial dataset
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Filling NaNs in a financial dataset
#1
I have a financial dataframe consisting of 500 different columns. I have categorical and informational columns (in this case, the year, the quarter and the CIK code are the most important) and I have items from their financial report. Some of these columns have NaN values I want to replace. I thought the most reliable way to deal with these NaNs would be to take the average of the value reported the quarter before and the quarter after. To do this I wrote the following code:
def average_col(merged_df_nan):
    #Filling NaN values for columns with less than 20k NaNs with the average of the period before and after
    na_counts = merged_df_nan.isna().sum()
    columns_to_fill2 = na_counts[(na_counts < 20000) & (~na_counts.index.isin(['Text', 'gvkey', 'datadate', 'prccq', 'naics', 'sic', 'processed_text', 'stemmed_text', 'sentiment_score', 'fyr', 'indfmt','consol','popsrc','tic','conm','apdedateq','fdateq','rdq']))].index
    
    #Iterate over the filtered columns and fill NA values with the average of previous and next non-NA values
    for column in tqdm(columns_to_fill2, desc = 'Columns done'):
        na_mask = merged_df_nan[column].isna()
        for index, row in merged_df_nan.iterrows():
            if na_mask[index]:
                fyearq = row['fyearq']
                fqtr = row['fqtr']
                cik = row['cik']
                   
                #Looking for the values and taking the average of them
                prev_value = merged_df_nan.loc[(merged_df_nan['cik'] == cik) & ((merged_df_nan['fyearq'] < fyearq) | ((merged_df_nan['fyearq'] == fyearq) & (merged_df_nan['fqtr'] < fqtr)))].sort_values(['fyearq', 'fqtr'], ascending=[False, False]).head(1)[column].item()
                next_value = merged_df_nan.loc[(merged_df_nan['cik'] == cik) & ((merged_df_nan['fyearq'] > fyearq) | ((merged_df_nan['fyearq'] == fyearq) & (merged_df_nan['fqtr'] > fqtr)))].sort_values(['fyearq', 'fqtr'], ascending=[True, True]).head(1)[column].item()
                average_value = np.mean([prev_value, next_value])
                merged_df_nan.loc[index, column] = average_value
    return merged_df_nan
In the first rows I am seperating the columns I want to fill with this method from the columns I do not want to alter at all. Then I first iterate over the columns and then over the rows. This is, however, incredibly slow and I am not even sure if this would work. Does anybody else have some suggestions?
Reply


Messages In This Thread
Filling NaNs in a financial dataset - by larzz - Jun-01-2023, 09:40 PM
RE: Filling NaNs in a financial dataset - by larzz - Jun-03-2023, 10:51 AM
RE: Filling NaNs in a financial dataset - by larzz - Jun-04-2023, 07:56 AM
RE: Filling NaNs in a financial dataset - by larzz - Jun-04-2023, 06:11 PM
RE: Filling NaNs in a financial dataset - by larzz - Jun-05-2023, 12:09 PM
RE: Filling NaNs in a financial dataset - by larzz - Jun-06-2023, 08:31 PM

Possibly Related Threads…
Thread Author Replies Views Last Post
  how to extract financial data from photocopy of document angela1 6 3,794 Feb-15-2020, 05:50 PM
Last Post: jim2007
  Financial Modeling MarkHaversham 2 4,818 Feb-11-2020, 10:55 AM
Last Post: Mikhail_Shi
  Google Financial Client ian 7 6,436 Sep-21-2017, 07:23 PM
Last Post: Larz60+
  Filling in missing values melm0 4 5,329 Aug-09-2017, 03:59 PM
Last Post: radioactive9

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020