Python Forum

Hello All,
I am new to Python programming and am working through the Titanic data set from Kaggle for self-study.
The columns of train_df are ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
'Ticket', 'Fare', 'Cabin', 'Embarked']

Out of these, only Age has some missing values. I have computed the median age for each combination of passenger class and sex and stored the result as temp_df:
temp_df = train_df[['Pclass', 'Sex', 'Age']].groupby(['Pclass', 'Sex']).median().reset_index()

Pclass  Sex     Age
1       female  35
1       male    40
2       female  28
2       male    30
3       female  21.5
3       male    25

I have tried several approaches but have not managed to write Python code that fills each missing Age value in train_df with the median for the matching Pclass and Sex.
Could you please help me with code for this?
Thank you in advance for your time and reply.

Regards,
Parth

You are probably looking for this:

train_df['Age'] = train_df['Age'].fillna(train_df.groupby(['Sex', 'Pclass'])['Age'].transform('median'))

# from now on, train_df.Age doesn't contain NaNs
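
To see what the groupby/transform step does, here is a minimal sketch on a tiny made-up frame (the values are invented for illustration, not taken from the Kaggle data). transform('median') returns one value per row, aligned with the original index, so it can be passed straight to fillna:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df (made-up values, not the real Kaggle data).
train_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 24.0],
})

# Median Age of each (Sex, Pclass) group, broadcast back to every row.
group_median = train_df.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Fill only the missing ages; existing values are left untouched.
train_df["Age"] = train_df["Age"].fillna(group_median)

print(train_df)
```

Assigning the result back to the column avoids relying on inplace=True on a column accessor, which recent pandas versions may not propagate to the parent frame.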

I would also suggest taking the passenger's title (part of the Name column) into account, e.g. passengers titled 'Master' are young boys.
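
A rough sketch of the title idea, assuming names follow the "Surname, Title. Given names" pattern. The names and ages below are hypothetical, and the Title column and the regular expression are my own choices for illustration, not something from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical passengers in the "Surname, Title. Given names" format.
train_df = pd.DataFrame({
    "Name": [
        "Doe, Mr. John",
        "Doe, Master. Tim",
        "Roe, Master. Sam",
        "Poe, Miss. Ann",
        "Roe, Mr. Bob",
    ],
    "Age": [30.0, 4.0, np.nan, 26.0, np.nan],
})

# The title is the text between the comma and the first period.
train_df["Title"] = (
    train_df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
)

# Fill each missing age with the median age of passengers sharing the title.
title_median = train_df.groupby("Title")["Age"].transform("median")
train_df["Age"] = train_df["Age"].fillna(title_median)

print(train_df[["Title", "Age"]])
```

On the real data you could group by Title together with Pclass, but rare titles may then form groups with no known age at all, so check for remaining NaNs afterwards.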
Another suggestion is to estimate the medians from the combined dataset (train and test together), i.e.
something like this (assuming pandas is imported as pd and test_df is already loaded):

train_df['Age'] = train_df['Age'].fillna(pd.concat([train_df, test_df]).groupby(['Sex', 'Pclass'])['Age'].transform('median').iloc[:len(train_df)])
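
If you would rather reuse the temp_df lookup table from the question, a left merge is one possible alternative. This is only a sketch on made-up values; the MedianAge column name is my own choice to avoid clashing with Age:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df (made-up values, not the real Kaggle data).
train_df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 2],
    "Sex": ["female", "female", "male", "male", "male"],
    "Age": [34.0, np.nan, 20.0, 40.0, np.nan],
})

# Lookup table as in the question, with the median column renamed.
temp_df = (
    train_df.groupby(["Pclass", "Sex"])["Age"]
    .median()
    .reset_index()
    .rename(columns={"Age": "MedianAge"})
)

# The left merge keeps every row of train_df, in its original order,
# and attaches the median of the matching (Pclass, Sex) group.
merged = train_df.merge(temp_df, on=["Pclass", "Sex"], how="left")

# Assign by position so this also works when train_df has a custom index.
train_df["Age"] = merged["Age"].fillna(merged["MedianAge"]).to_numpy()

print(train_df)
```

The result matches the transform approach; the merge version is just more explicit about where each fill value comes from.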

(Nov-21-2018, 01:30 AM) scidam wrote:
You are probably looking for this:

train_df['Age'] = train_df['Age'].fillna(train_df.groupby(['Sex', 'Pclass'])['Age'].transform('median'))

train_df['Age'] = train_df['Age'].fillna(pd.concat([train_df, test_df]).groupby(['Sex', 'Pclass'])['Age'].transform('median').iloc[:len(train_df)])

Thank You
