I'm pretty new to Python and even newer to machine learning, so apologies if the answer to my question is obvious.
I've performed the common step of combining SibSp and Parch into a single 'family_size' column, and it helped a lot. However, where I place the code for this step seems to have a big effect on my final result, and I can't see why. Can anyone help me understand? I also don't understand why my final Kaggle score drops significantly when I drop the PassengerId column, but I'm not concerned about that right now.
I won't paste in all my code, but essentially whatever I do in this 'test_data' section I also do in the 'train_data' section.
If I place
```
test_data["family_size"] = test_data["SibSp"] + test_data["Parch"]
```

at the top of my code I get a final score (on Kaggle, that is) of 0.767..., and if I place it at the bottom I get a final score of 0.789...
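For reference, the combining step itself is just element-wise column addition; here's a self-contained toy example (made-up numbers, not the real Titanic data):

```python
import pandas as pd

# Toy stand-in for the Titanic test set (made-up values)
test_data = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 2, 1],
})

# Element-wise addition of the two columns
test_data["family_size"] = test_data["SibSp"] + test_data["Parch"]

print(test_data["family_size"].tolist())  # [1, 2, 4]
```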
I've commented out the lower performing line.
```
########## Calibrate Test Data ##########
##########################################

# Create new categories
#test_data["family_size"] = test_data["SibSp"] + test_data["Parch"]

# Drop irrelevant categories
test_data = test_data.drop(["Name", "Ticket", "Cabin"], axis=1)

# Use SimpleImputer to fill missing data
x = imputer.transform(test_data)
test_data_transformed = pd.DataFrame(x, columns=test_data.columns, index=test_data.index)

# Use OneHotEncoder to encode alpha data, and numerical with more than three possibilities
test_data_encoded = pd.DataFrame(encoder.fit_transform(test_data_transformed[["Pclass", "Sex", "Embarked"]]))
test_data_transformed_encoded = test_data_transformed.join(test_data_encoded)
test_data_transformed_encoded = test_data_transformed_encoded.drop(["Pclass", "Sex", "Embarked"], axis=1)

# Create new categories
test_data_transformed_encoded["family_size"] = test_data_transformed_encoded["SibSp"] + test_data_transformed_encoded["Parch"]
test_data_transformed_encoded = test_data_transformed_encoded.drop(["SibSp", "Parch"], axis=1)

# Scale the data
prepared_test_data = scaler.fit_transform(test_data_transformed_encoded.astype(np.float64))

########## FINAL MODEL ##########
#################################

forest_clf = RandomForestClassifier(random_state=42)

# Fine Tune using GridSearch
param_grid = [{'n_estimators': [10, 30, 60], 'max_features': [1, 2, 4, 8, 14]},]
grid_search = GridSearchCV(forest_clf, param_grid, cv=3)
grid_search.fit(prepared_train_data_predictors, train_data_labels)
final_model = grid_search.best_estimator_
```

Thank you!
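To check my understanding of the mechanics, here's a standalone toy example (not my actual pipeline, and not necessarily the cause of my score difference) showing that a fitted SimpleImputer is sensitive to which columns exist at transform time, which is one reason the placement of a column-creating line could change behavior:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the training predictors (made-up values)
train = pd.DataFrame({"SibSp": [1.0, np.nan, 3.0],
                      "Parch": [0.0, 2.0, np.nan]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train)  # the imputer now expects exactly these two columns

# Adding a new column *before* transform changes the shape the imputer sees
test = train.copy()
test["family_size"] = test["SibSp"] + test["Parch"]

try:
    imputer.transform(test)
    mismatch = False
except ValueError:
    mismatch = True  # raised: 3 columns passed, but 2 were seen at fit time

print("column mismatch raised:", mismatch)  # True
```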