Python Forum

Help for Kaggle Titanic set: fill the missing Age with the median age by Pclass and Sex
Hello all,
I am new to Python programming and am working through the Titanic data set from Kaggle for self-learning.
The columns of train_df are ['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
'Ticket' 'Fare' 'Cabin' 'Embarked']

Of these, Age has some missing values. I have computed the median age for each combination of passenger class and sex and stored the result as temp_df:
temp_df=train_df[['Pclass', 'Sex','Age']].groupby(['Pclass','Sex']).median().reset_index()

Pclass  Sex     Age
1       female  35
1       male    40
2       female  28
2       male    30
3       female  21.5
3       male    25

I have tried many approaches but have not been able to write Python code that fills in the missing Age values in train_df with the median for the matching Pclass and Sex.
Could you please help me with code for this?
Thank you in advance for your time and reply.

Regards,
Parth
You are probably looking for this:

train_df['Age'] = train_df['Age'].fillna(train_df.groupby(['Sex', 'Pclass'])['Age'].transform('median'))
# from now on train_df.Age doesn't contain NaNs (as long as every Sex/Pclass group has at least one known age)
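
To see what the groupby/transform fill does, here is a minimal sketch on a toy frame. The name toy_df and all of its values are made up for illustration; they are not the real Titanic data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Median Age of each (Sex, Pclass) group, repeated on every row of that group.
# The result has the same index as toy_df, so it lines up row by row.
group_median = toy_df.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Only the missing ages are replaced; known ages stay as they are.
toy_df["Age"] = toy_df["Age"].fillna(group_median)

print(toy_df)
```

The missing age of the first-class male becomes 45.0 (median of 40 and 50) and the missing age of the third-class female becomes 25.0 (median of 20 and 30). Assigning the result back to the column, instead of calling fillna with inplace=True on train_df.Age, avoids the chained-assignment warnings that newer pandas versions raise.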

I would also suggest taking the passenger's title (which can be extracted from Name) into account, e.g. passengers titled 'Master' are young boys.
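
A rough sketch of the title idea, assuming names follow the "Surname, Title. Given names" pattern. The Title column, the regular expression, and the toy names and ages are my own illustration, not something taken from the data set.

```python
import numpy as np
import pandas as pd

# Toy frame with invented names and ages in the "Surname, Title. Given" pattern.
toy_df = pd.DataFrame({
    "Name": [
        "Smith, Mr. John",
        "Jones, Mr. Peter",
        "Brown, Mr. Alan",
        "Smith, Master. Tom",
        "Jones, Master. Sam",
        "Brown, Master. Ben",
    ],
    "Age": [30.0, 40.0, np.nan, 4.0, np.nan, 6.0],
})

# Capture the text between the comma and the first period, e.g. "Mr" or "Master".
toy_df["Title"] = toy_df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Fill each missing age with the median age of passengers sharing the same title.
title_median = toy_df.groupby("Title")["Age"].transform("median")
toy_df["Age"] = toy_df["Age"].fillna(title_median)

print(toy_df[["Title", "Age"]])
```

On the real data some titles are rare, so a group may have no known age at all; in that case the NaN stays and you would need a fallback such as the Sex/Pclass median.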
Another suggestion is to use the combined dataset (train plus test) to estimate the medians, i.e. something like this:
combined_median = pd.concat([train_df, test_df]).groupby(['Sex', 'Pclass'])['Age'].transform('median')
train_df['Age'] = train_df['Age'].fillna(combined_median.iloc[:len(train_df)])
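
Here is a small sketch of the combined-median idea on toy frames. The names toy_train and toy_test and their values are invented for illustration, and filling the test set the same way is my own addition.

```python
import numpy as np
import pandas as pd

# Invented stand-ins for train_df and test_df.
toy_train = pd.DataFrame({
    "Pclass": [1, 1, 1],
    "Sex": ["male", "male", "male"],
    "Age": [30.0, np.nan, 40.0],
})
toy_test = pd.DataFrame({
    "Pclass": [1, 1],
    "Sex": ["male", "male"],
    "Age": [50.0, np.nan],
})

# Stack both sets; each row keeps its original index label.
combined = pd.concat([toy_train, toy_test])

# Group medians computed over all rows of train and test together.
combined_median = combined.groupby(["Sex", "Pclass"])["Age"].transform("median")

# The first len(toy_train) rows belong to train, the rest to test.
# Their index labels match the source frames, so fillna aligns correctly.
n_train = len(toy_train)
toy_train["Age"] = toy_train["Age"].fillna(combined_median.iloc[:n_train])
toy_test["Age"] = toy_test["Age"].fillna(combined_median.iloc[n_train:])

print(toy_train)
print(toy_test)
```

With the train rows alone the median would be 35.0; with both sets combined it is 40.0, which is the value used to fill the gaps here.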
(Nov-21-2018, 01:30 AM) scidam wrote: You are probably looking for this:

train_df['Age'] = train_df['Age'].fillna(train_df.groupby(['Sex', 'Pclass'])['Age'].transform('median'))
combined_median = pd.concat([train_df, test_df]).groupby(['Sex', 'Pclass'])['Age'].transform('median')
train_df['Age'] = train_df['Age'].fillna(combined_median.iloc[:len(train_df)])
Thank you!