Kaggle Titanic - new category placement
I'm pretty new to Python and even newer to machine learning, so apologies if the answer to my question is obvious.

I've performed the common step of combining SibSp and Parch into a single 'family_size' column, and it helped a lot. However, where I place the code for this step has a big effect on my final result, and I can't see why. Can anyone help me understand? I also don't understand why my final Kaggle score drops significantly when I drop the PassengerId column, but I'm not concerned about that right now.

I won't paste all my code, but essentially everything I do in this 'test_data' section I also do to 'train_data'.

If I place
test_data["family_size"] = test_data["SibSp"] + test_data["Parch"]
at the top of my code, I get a final Kaggle score of 0.767..., whereas if I place it at the bottom I get 0.789...
In the code below, I've commented out the lower-performing placement.
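
For context, here's roughly what the corresponding 'train_data' section does. This is a minimal sketch rather than my exact code: I'm assuming scikit-learn's SimpleImputer, OneHotEncoder and StandardScaler, and the file name and imputer strategy below are stand-ins.

# Minimal sketch of the train-data side (hypothetical details, see above)
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train_data = pd.read_csv("train.csv")
train_data_labels = train_data["Survived"]
train_data = train_data.drop(["Survived", "Name", "Ticket", "Cabin"], axis=1)

# Fit the imputer on the training data; "most_frequent" is a stand-in strategy
# that works for both numeric and string columns
imputer = SimpleImputer(strategy="most_frequent")
x = imputer.fit_transform(train_data)
train_data_transformed = pd.DataFrame(x, columns=train_data.columns, index=train_data.index)

# Fit the encoder on the training data; dense output so pd.DataFrame() works
# (the parameter is named sparse= in older scikit-learn versions)
encoder = OneHotEncoder(sparse_output=False)
train_data_encoded = pd.DataFrame(encoder.fit_transform(train_data_transformed[["Pclass", "Sex", "Embarked"]]))
train_data_transformed_encoded = train_data_transformed.join(train_data_encoded)
train_data_transformed_encoded = train_data_transformed_encoded.drop(["Pclass", "Sex", "Embarked"], axis=1)

# Create new categories ("bottom" placement, mirroring the test section)
train_data_transformed_encoded["family_size"] = train_data_transformed_encoded["SibSp"] + train_data_transformed_encoded["Parch"]
train_data_transformed_encoded = train_data_transformed_encoded.drop(["SibSp", "Parch"], axis=1)

# Fit the scaler on the training data
scaler = StandardScaler()
prepared_train_data_predictors = scaler.fit_transform(train_data_transformed_encoded.astype(np.float64))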

########## Calibrate Test Data ##########
##########################################

# Create new categories ("top" placement; currently commented out)
#test_data["family_size"] = test_data["SibSp"] + test_data["Parch"]

# Drop irrelevant categories
test_data = test_data.drop(["Name", "Ticket", "Cabin"], axis=1)

# Use the SimpleImputer (fitted earlier) to fill missing data
x = imputer.transform(test_data)
test_data_transformed = pd.DataFrame(x, columns=test_data.columns, index=test_data.index)

# Use OneHotEncoder to encode the text columns, plus numeric columns with more
# than three possible values (note that fit_transform refits the encoder here)
test_data_encoded = pd.DataFrame(encoder.fit_transform(test_data_transformed[["Pclass", "Sex", "Embarked"]]))
test_data_transformed_encoded = test_data_transformed.join(test_data_encoded)
test_data_transformed_encoded = test_data_transformed_encoded.drop(["Pclass", "Sex", "Embarked"], axis=1)

# Create new categories ("bottom" placement; the better-scoring one)
test_data_transformed_encoded["family_size"] = test_data_transformed_encoded["SibSp"] + test_data_transformed_encoded["Parch"]
test_data_transformed_encoded = test_data_transformed_encoded.drop(["SibSp", "Parch"], axis=1)

# Scale the data (fit_transform also refits the scaler here)
prepared_test_data = scaler.fit_transform(test_data_transformed_encoded.astype(np.float64))

########## FINAL MODEL ##########
#################################

forest_clf = RandomForestClassifier(random_state=42)

# Fine-tune using GridSearchCV
param_grid = [{'n_estimators': [10, 30, 60], 'max_features': [1, 2, 4, 8, 14]},]
grid_search = GridSearchCV(forest_clf, param_grid, cv=3)
grid_search.fit(prepared_train_data_predictors, train_data_labels)

final_model = grid_search.best_estimator_
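
The Kaggle score then comes from predicting on the prepared test data and uploading a CSV. Roughly, again as a sketch (the output file name is a stand-in):

# Predict with the tuned model and write the submission file
predictions = final_model.predict(prepared_test_data)

submission = pd.DataFrame({
    "PassengerId": test_data["PassengerId"].astype(int),
    "Survived": predictions.astype(int),
})
submission.to_csv("submission.csv", index=False)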
Thank you!