Python Forum - Filling NaNs in a financial dataset

Pages: 1 2

I have a financial dataframe consisting of 500 different columns. I have categorical and informational columns (in this case, the year, the quarter and the CIK code are the most important) and I have items from their financial report. Some of these columns have NaN values I want to replace. I thought the most reliable way to deal with these NaNs would be to take the average of the value reported the quarter before and the quarter after. To do this I wrote the following code:

def average_col(merged_df_nan):
    #Filling NaN values for columns with less than 20k NaNs with the average of the period before and after
    na_counts = merged_df_nan.isna().sum()
    columns_to_fill2 = na_counts[(na_counts < 20000) & (~na_counts.index.isin(['Text', 'gvkey', 'datadate', 'prccq', 'naics', 'sic', 'processed_text', 'stemmed_text', 'sentiment_score', 'fyr', 'indfmt','consol','popsrc','tic','conm','apdedateq','fdateq','rdq']))].index
    
    #Iterate over the filtered columns and fill NA values with the average of previous and next non-NA values
    for column in tqdm(columns_to_fill2, desc = 'Columns done'):
        na_mask = merged_df_nan[column].isna()
        for index, row in merged_df_nan.iterrows():
            if na_mask[index]:
                fyearq = row['fyearq']
                fqtr = row['fqtr']
                cik = row['cik']
                   
                #Looking for the values and taking the average of them
                prev_value = merged_df_nan.loc[(merged_df_nan['cik'] == cik) & ((merged_df_nan['fyearq'] < fyearq) | ((merged_df_nan['fyearq'] == fyearq) & (merged_df_nan['fqtr'] < fqtr)))].sort_values(['fyearq', 'fqtr'], ascending=[False, False]).head(1)[column].item()
                next_value = merged_df_nan.loc[(merged_df_nan['cik'] == cik) & ((merged_df_nan['fyearq'] > fyearq) | ((merged_df_nan['fyearq'] == fyearq) & (merged_df_nan['fqtr'] > fqtr)))].sort_values(['fyearq', 'fqtr'], ascending=[True, True]).head(1)[column].item()
                average_value = np.mean([prev_value, next_value])
                merged_df_nan.loc[index, column] = average_value
    return merged_df_nan

In the first rows I am seperating the columns I want to fill with this method from the columns I do not want to alter at all. Then I first iterate over the columns and then over the rows. This is, however, incredibly slow and I am not even sure if this would work. Does anybody else have some suggestions?

(Jun-01-2023, 09:40 PM)larzz Wrote: [ -> ]Then I first iterate over the columns and then over the rows. This is, however, incredibly slow and I am not even sure if this would work. Does anybody else have some suggestions?

Yes,doing this way will be slow,DataFrame is vector based,so using vectorized operations with Pandas build in methods,will speed it up a lot.
Take a look at this article.
So it's hard without a sample of merged_df_nan to make changes and test,so with a little AI help here is two versions.

def average_col(merged_df_nan):
    # Filling NaN values for columns with less than 20k NaNs with the average of the period before and after
    na_counts = merged_df_nan.isna().sum()
    columns_to_fill2 = na_counts[(na_counts < 20000) & (~na_counts.index.isin(
        ['Text', 'gvkey', 'datadate', 'prccq', 'naics', 'sic', 'processed_text',
         'stemmed_text', 'sentiment_score', 'fyr', 'indfmt', 'consol', 'popsrc', 'tic', 'conm', 'apdedateq', 'fdateq', 'rdq']
    ))].index

    for column in tqdm(columns_to_fill2, desc='Columns done'):
        na_mask = merged_df_nan[column].isna()
        previous_values = merged_df_nan.loc[~na_mask, [
            'cik', 'fyearq', 'fqtr', column]].set_index(['cik', 'fyearq', 'fqtr'])
        next_values = merged_df_nan.loc[~na_mask, [
            'cik', 'fyearq', 'fqtr', column]].set_index(['cik', 'fyearq', 'fqtr'])

        # Shift the index and values to find the previous and next non-NA values
        previous_values = previous_values.groupby('cik').shift(
            1).bfill().reset_index(level=[0, 1, 2], drop=True)
        next_values = next_values.groupby(
            'cik').shift(-1).ffill().reset_index(level=[0, 1, 2], drop=True)

        # Update NaN values with the average of previous and next values
        merged_df_nan.loc[na_mask, column] = (previous_values + next_values) / 2

    return merged_df_nan

Here to show a fully vectorized version,see that there is no loops.

def average_col(merged_df_nan):
    # Filling NaN values for columns with less than 20k NaNs with the average of the period before and after
    na_counts = merged_df_nan.isna().sum()
    columns_to_fill2 = na_counts[(na_counts < 20000) & (~na_counts.index.isin(
        ['Text', 'gvkey', 'datadate', 'prccq', 'naics', 'sic', 'processed_text',
         'stemmed_text', 'sentiment_score', 'fyr', 'indfmt', 'consol', 'popsrc', 'tic', 'conm', 'apdedateq', 'fdateq', 'rdq']
    ))].index

    na_mask = merged_df_nan[columns_to_fill2].isna()
    fyearq = merged_df_nan['fyearq']
    fqtr = merged_df_nan['fqtr']
    cik = merged_df_nan['cik']

    # Find the previous and next non-NA values for each row using shift() and bfill()/ffill()
    prev_values = merged_df_nan[~na_mask].groupby(['cik', fyearq, fqtr])[
        columns_to_fill2].shift(1).bfill()
    next_values = merged_df_nan[~na_mask].groupby(['cik', fyearq, fqtr])[
        columns_to_fill2].shift(-1).ffill()

    # Fill NaN values with the average of previous and next values
    merged_df_nan.loc[na_mask] = (prev_values + next_values) / 2
    return merged_df_nan

(Jun-02-2023, 04:47 PM)snippsat Wrote: [ -> ]

(Jun-01-2023, 09:40 PM)larzz Wrote: [ -> ]Then I first iterate over the columns and then over the rows. This is, however, incredibly slow and I am not even sure if this would work. Does anybody else have some suggestions?

def average_col(merged_df_nan):
    # Filling NaN values for columns with less than 20k NaNs with the average of the period before and after
    na_counts = merged_df_nan.isna().sum()
    columns_to_fill2 = na_counts[(na_counts < 20000) & (~na_counts.index.isin(
        ['Text', 'gvkey', 'datadate', 'prccq', 'naics', 'sic', 'processed_text',
         'stemmed_text', 'sentiment_score', 'fyr', 'indfmt', 'consol', 'popsrc', 'tic', 'conm', 'apdedateq', 'fdateq', 'rdq']
    ))].index

    for column in tqdm(columns_to_fill2, desc='Columns done'):
        na_mask = merged_df_nan[column].isna()
        previous_values = merged_df_nan.loc[~na_mask, [
            'cik', 'fyearq', 'fqtr', column]].set_index(['cik', 'fyearq', 'fqtr'])
        next_values = merged_df_nan.loc[~na_mask, [
            'cik', 'fyearq', 'fqtr', column]].set_index(['cik', 'fyearq', 'fqtr'])

        # Shift the index and values to find the previous and next non-NA values
        previous_values = previous_values.groupby('cik').shift(
            1).bfill().reset_index(level=[0, 1, 2], drop=True)
        next_values = next_values.groupby(
            'cik').shift(-1).ffill().reset_index(level=[0, 1, 2], drop=True)

        # Update NaN values with the average of previous and next values
        merged_df_nan.loc[na_mask, column] = (previous_values + next_values) / 2

    return merged_df_nan

Here to show a fully vectorized version,see that there is no loops.

def average_col(merged_df_nan):
    # Filling NaN values for columns with less than 20k NaNs with the average of the period before and after
    na_counts = merged_df_nan.isna().sum()
    columns_to_fill2 = na_counts[(na_counts < 20000) & (~na_counts.index.isin(
        ['Text', 'gvkey', 'datadate', 'prccq', 'naics', 'sic', 'processed_text',
         'stemmed_text', 'sentiment_score', 'fyr', 'indfmt', 'consol', 'popsrc', 'tic', 'conm', 'apdedateq', 'fdateq', 'rdq']
    ))].index

    na_mask = merged_df_nan[columns_to_fill2].isna()
    fyearq = merged_df_nan['fyearq']
    fqtr = merged_df_nan['fqtr']
    cik = merged_df_nan['cik']

    # Find the previous and next non-NA values for each row using shift() and bfill()/ffill()
    prev_values = merged_df_nan[~na_mask].groupby(['cik', fyearq, fqtr])[
        columns_to_fill2].shift(1).bfill()
    next_values = merged_df_nan[~na_mask].groupby(['cik', fyearq, fqtr])[
        columns_to_fill2].shift(-1).ffill()

    # Fill NaN values with the average of previous and next values
    merged_df_nan.loc[na_mask] = (prev_values + next_values) / 2
    return merged_df_nan

Both solutions aren't really working. The first solution gives this error:

NotImplementedError: > 1 ndim Categorical are not supported at this time

The second one returns this error:

TypeError: unsupported operand type(s) for /: 'str' and 'int'

Would it be helpful if I provided the dataset? I don't know if it is allowed to upload it and post the downloadlink here...

(Jun-03-2023, 10:51 AM)larzz Wrote: [ -> ]Would it be helpful if I provided the dataset? I don't know if it is allowed to upload it and post the downloadlink here...

Yes,if big should post just a sample.

(Jun-03-2023, 12:56 PM)snippsat Wrote: [ -> ]
(Jun-03-2023, 10:51 AM)larzz Wrote: [ -> ]Would it be helpful if I provided the dataset? I don't know if it is allowed to upload it and post the downloadlink here...
Yes,if big should post just a sample.

[attachment=2389]

Here are the first 150 rows.

Do you do something to create merged_df_nan i mean for me it dos not work if i just read sample.xlsx?
Ohh 500 columns is hell of DataFrame,not gone look for error in that.

import pandas as pd
from tqdm import tqdm

def average_col(merged_df_nan):
    #Filling NaN values for columns with less than 20k NaNs with the average of the period before and after
    na_counts = merged_df_nan.isna().sum()
    columns_to_fill2 = na_counts[(na_counts < 20000) & (~na_counts.index.isin(['Text', 'gvkey', 'datadate', 'prccq', 'naics', 'sic', 'processed_text', 'stemmed_text', 'sentiment_score', 'fyr', 'indfmt','consol','popsrc','tic','conm','apdedateq','fdateq','rdq']))].index

    #Iterate over the filtered columns and fill NA values with the average of previous and next non-NA values
    for column in tqdm(columns_to_fill2, desc = 'Columns done'):
        na_mask = merged_df_nan[column].isna()
        for index, row in merged_df_nan.iterrows():
            if na_mask[index]:
                fyearq = row['fyearq']
                fqtr = row['fqtr']
                cik = row['cik']

                #Looking for the values and taking the average of them
                prev_value = merged_df_nan.loc[(merged_df_nan['cik'] == cik) & ((merged_df_nan['fyearq'] < fyearq) | ((merged_df_nan['fyearq'] == fyearq) & (merged_df_nan['fqtr'] < fqtr)))].sort_values(['fyearq', 'fqtr'], ascending=[False, False]).head(1)[column].item()
                next_value = merged_df_nan.loc[(merged_df_nan['cik'] == cik) & ((merged_df_nan['fyearq'] > fyearq) | ((merged_df_nan['fyearq'] == fyearq) & (merged_df_nan['fqtr'] > fqtr)))].sort_values(['fyearq', 'fqtr'], ascending=[True, True]).head(1)[column].item()
                average_value = np.mean([prev_value, next_value])
                merged_df_nan.loc[index, column] = average_value
    return merged_df_nan

if __name__ == '__main__':
    merged_df_nan = pd.read_excel('sample.xlsx')
    res =  average_col(merged_df_nan)

Error:Traceback (most recent call last):
  File "G:\div_code\hex\game\av.py", line 28, in <module>
    res =  average_col(merged_df_nan)
  File "G:\div_code\hex\game\av.py", line 20, in average_col
    prev_value = merged_df_nan.loc[(merged_df_nan['cik'] == cik) & ((merged_df_nan['fyearq'] < fyearq) | ((merged_df_nan['fyearq'] == fyearq) & (merged_df_nan['fqtr'] < fqtr)))].sort_values(['fyearq', 'fqtr'], ascending=[False, False]).head(1)[column].item()
  File "C:\Python310\lib\site-packages\pandas\core\base.py", line 349, in item
    raise ValueError("can only convert an array of size 1 to a Python scalar")
ValueError: can only convert an array of size 1 to a Python scalar

(Jun-04-2023, 03:56 PM)snippsat Wrote: [ -> ]Do you do something to create merged_df_nan i mean for me it dos not work if i just read sample.xlsx?
Ohh 500 columns is hell of DataFrame,not gone look for error in that.

import pandas as pd
from tqdm import tqdm

def average_col(merged_df_nan):
    #Filling NaN values for columns with less than 20k NaNs with the average of the period before and after
    na_counts = merged_df_nan.isna().sum()
    columns_to_fill2 = na_counts[(na_counts < 20000) & (~na_counts.index.isin(['Text', 'gvkey', 'datadate', 'prccq', 'naics', 'sic', 'processed_text', 'stemmed_text', 'sentiment_score', 'fyr', 'indfmt','consol','popsrc','tic','conm','apdedateq','fdateq','rdq']))].index

    #Iterate over the filtered columns and fill NA values with the average of previous and next non-NA values
    for column in tqdm(columns_to_fill2, desc = 'Columns done'):
        na_mask = merged_df_nan[column].isna()
        for index, row in merged_df_nan.iterrows():
            if na_mask[index]:
                fyearq = row['fyearq']
                fqtr = row['fqtr']
                cik = row['cik']

                #Looking for the values and taking the average of them
                prev_value = merged_df_nan.loc[(merged_df_nan['cik'] == cik) & ((merged_df_nan['fyearq'] < fyearq) | ((merged_df_nan['fyearq'] == fyearq) & (merged_df_nan['fqtr'] < fqtr)))].sort_values(['fyearq', 'fqtr'], ascending=[False, False]).head(1)[column].item()
                next_value = merged_df_nan.loc[(merged_df_nan['cik'] == cik) & ((merged_df_nan['fyearq'] > fyearq) | ((merged_df_nan['fyearq'] == fyearq) & (merged_df_nan['fqtr'] > fqtr)))].sort_values(['fyearq', 'fqtr'], ascending=[True, True]).head(1)[column].item()
                average_value = np.mean([prev_value, next_value])
                merged_df_nan.loc[index, column] = average_value
    return merged_df_nan

if __name__ == '__main__':
    merged_df_nan = pd.read_excel('sample.xlsx')
    res =  average_col(merged_df_nan)

Error:Traceback (most recent call last):
  File "G:\div_code\hex\game\av.py", line 28, in <module>
    res =  average_col(merged_df_nan)
  File "G:\div_code\hex\game\av.py", line 20, in average_col
    prev_value = merged_df_nan.loc[(merged_df_nan['cik'] == cik) & ((merged_df_nan['fyearq'] < fyearq) | ((merged_df_nan['fyearq'] == fyearq) & (merged_df_nan['fqtr'] < fqtr)))].sort_values(['fyearq', 'fqtr'], ascending=[False, False]).head(1)[column].item()
  File "C:\Python310\lib\site-packages\pandas\core\base.py", line 349, in item
    raise ValueError("can only convert an array of size 1 to a Python scalar")
ValueError: can only convert an array of size 1 to a Python scalar

Yeah I did some additions. Do you think you can help if I upload that one?

(Jun-04-2023, 06:11 PM)larzz Wrote: [ -> ]Yeah I did some additions. Do you think you can help if I upload that one?

Maybe,but not sure as it hard a task when DataFrame is 500 columns wide.
Than at least i or someone else can may take a look at look it.
If you could make the task wanted in smaller scale,it would be fine.
A example of this could be like this.

import pandas as pd

df = pd.read_clipboard()
'''
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3       NaN -2.027325  1.533582
4       NaN       NaN  0.461821
5 -0.788073       NaN       NaN
6 -0.916080 -0.612343       NaN
7 -0.887858  1.033826       NaN
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431
'''

So here have DataFrame with some NaN values,an want to fill in with average values that columns has.

>>> df
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3       NaN -2.027325  1.533582
4       NaN       NaN  0.461821
5 -0.788073       NaN       NaN
6 -0.916080 -0.612343       NaN
7 -0.887858  1.033826       NaN
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431
>>> df.mean()
A   -0.151121
B   -0.231291
C   -0.530307
dtype: float64
>>> df.fillna(df.mean())
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325  1.533582
4 -0.151121 -0.231291  0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858  1.033826 -0.530307
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

(Jun-05-2023, 09:14 AM)snippsat Wrote: [ -> ]

(Jun-04-2023, 06:11 PM)larzz Wrote: [ -> ]Yeah I did some additions. Do you think you can help if I upload that one?

import pandas as pd

df = pd.read_clipboard()
'''
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3       NaN -2.027325  1.533582
4       NaN       NaN  0.461821
5 -0.788073       NaN       NaN
6 -0.916080 -0.612343       NaN
7 -0.887858  1.033826       NaN
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431
'''

So here have DataFrame with some NaN values,an want to fill in with average values that columns has.

>>> df
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3       NaN -2.027325  1.533582
4       NaN       NaN  0.461821
5 -0.788073       NaN       NaN
6 -0.916080 -0.612343       NaN
7 -0.887858  1.033826       NaN
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431
>>> df.mean()
A   -0.151121
B   -0.231291
C   -0.530307
dtype: float64
>>> df.fillna(df.mean())
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325  1.533582
4 -0.151121 -0.231291  0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858  1.033826 -0.530307
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

I have a bigger file that isn't allowed to be uploaded through this site so I did it like this:

https://file.io/cNtj51B5Tswo

If it isn't allowed you can just remove it. This contains 1000 rows of the dataframe I am currently using. I understand that making the task on a smaller scale would be easier, but it shouldn't be that hard to upscale right?

(Jun-05-2023, 12:09 PM)larzz Wrote: [ -> ]I have a bigger file that isn't allowed to be uploaded through this site so I did it like this:

You missed the point,you should post code that you use to create merged_df_nan.
Also working code for function average_col(merged_df_nan) using sample.xlsx that you have posted.

(Jun-05-2023, 12:09 PM)larzz Wrote: [ -> ]https://file.io/cNtj51B5Tswo

Link dos not work.

Pages: 1 2