Jun-01-2023, 09:40 PM
I have a financial dataframe consisting of 500 different columns. I have categorical and informational columns (in this case, the year, the quarter and the CIK code are the most important) and I have items from their financial report. Some of these columns have NaN values I want to replace. I thought the most reliable way to deal with these NaNs would be to take the average of the value reported the quarter before and the quarter after. To do this I wrote the following code:
def average_col(merged_df_nan): #Filling NaN values for columns with less than 20k NaNs with the average of the period before and after na_counts = merged_df_nan.isna().sum() columns_to_fill2 = na_counts[(na_counts < 20000) & (~na_counts.index.isin(['Text', 'gvkey', 'datadate', 'prccq', 'naics', 'sic', 'processed_text', 'stemmed_text', 'sentiment_score', 'fyr', 'indfmt','consol','popsrc','tic','conm','apdedateq','fdateq','rdq']))].index #Iterate over the filtered columns and fill NA values with the average of previous and next non-NA values for column in tqdm(columns_to_fill2, desc = 'Columns done'): na_mask = merged_df_nan[column].isna() for index, row in merged_df_nan.iterrows(): if na_mask[index]: fyearq = row['fyearq'] fqtr = row['fqtr'] cik = row['cik'] #Looking for the values and taking the average of them prev_value = merged_df_nan.loc[(merged_df_nan['cik'] == cik) & ((merged_df_nan['fyearq'] < fyearq) | ((merged_df_nan['fyearq'] == fyearq) & (merged_df_nan['fqtr'] < fqtr)))].sort_values(['fyearq', 'fqtr'], ascending=[False, False]).head(1)[column].item() next_value = merged_df_nan.loc[(merged_df_nan['cik'] == cik) & ((merged_df_nan['fyearq'] > fyearq) | ((merged_df_nan['fyearq'] == fyearq) & (merged_df_nan['fqtr'] > fqtr)))].sort_values(['fyearq', 'fqtr'], ascending=[True, True]).head(1)[column].item() average_value = np.mean([prev_value, next_value]) merged_df_nan.loc[index, column] = average_value return merged_df_nanIn the first rows I am seperating the columns I want to fill with this method from the columns I do not want to alter at all. Then I first iterate over the columns and then over the rows. This is, however, incredibly slow and I am not even sure if this would work. Does anybody else have some suggestions?