Filling NaNs in a financial dataset

larzz · Jun-01-2023, 09:40 PM

I have a financial dataframe consisting of 500 different columns. I have categorical and informational columns (in this case, the year, the quarter and the CIK code are the most important) and I have items from their financial report. Some of these columns have NaN values I want to replace. I thought the most reliable way to deal with these NaNs would be to take the average of the value reported the quarter before and the quarter after. To do this I wrote the following code:

def average_col(merged_df_nan):
    #Filling NaN values for columns with less than 20k NaNs with the average of the period before and after
    na_counts = merged_df_nan.isna().sum()
    columns_to_fill2 = na_counts[(na_counts < 20000) & (~na_counts.index.isin(['Text', 'gvkey', 'datadate', 'prccq', 'naics', 'sic', 'processed_text', 'stemmed_text', 'sentiment_score', 'fyr', 'indfmt','consol','popsrc','tic','conm','apdedateq','fdateq','rdq']))].index
    
    #Iterate over the filtered columns and fill NA values with the average of previous and next non-NA values
    for column in tqdm(columns_to_fill2, desc = 'Columns done'):
        na_mask = merged_df_nan[column].isna()
        for index, row in merged_df_nan.iterrows():
            if na_mask[index]:
                fyearq = row['fyearq']
                fqtr = row['fqtr']
                cik = row['cik']
                   
                #Looking for the values and taking the average of them
                prev_value = merged_df_nan.loc[(merged_df_nan['cik'] == cik) & ((merged_df_nan['fyearq'] < fyearq) | ((merged_df_nan['fyearq'] == fyearq) & (merged_df_nan['fqtr'] < fqtr)))].sort_values(['fyearq', 'fqtr'], ascending=[False, False]).head(1)[column].item()
                next_value = merged_df_nan.loc[(merged_df_nan['cik'] == cik) & ((merged_df_nan['fyearq'] > fyearq) | ((merged_df_nan['fyearq'] == fyearq) & (merged_df_nan['fqtr'] > fqtr)))].sort_values(['fyearq', 'fqtr'], ascending=[True, True]).head(1)[column].item()
                average_value = np.mean([prev_value, next_value])
                merged_df_nan.loc[index, column] = average_value
    return merged_df_nan

In the first rows I am seperating the columns I want to fill with this method from the columns I do not want to alter at all. Then I first iterate over the columns and then over the rows. This is, however, incredibly slow and I am not even sure if this would work. Does anybody else have some suggestions?

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	how to extract financial data from photocopy of document	angela1	6	3,794	Feb-15-2020, 05:50 PM Last Post: jim2007
	Financial Modeling	MarkHaversham	2	4,818	Feb-11-2020, 10:55 AM Last Post: Mikhail_Shi
	Google Financial Client	ian	7	6,436	Sep-21-2017, 07:23 PM Last Post: Larz60+
	Filling in missing values	melm0	4	5,329	Aug-09-2017, 03:59 PM Last Post: radioactive9

Filling NaNs in a financial dataset

User Panel Messages

Announcements