I'm pretty new to Python and even newer to machine learning, so apologies if the answer to my question is obvious.
I've performed the common step of combining SibSp and Parch into a single 'family_size' column, and it helped a lot. However, where I place the code for this step seems to have a big effect on my final result, and I can't see why. Can anyone help me understand? I also don't understand why my final Kaggle score drops significantly when I drop the PassengerId column, but I'm not concerned about that right now.
I won't paste in all my code, but essentially whatever I do in this 'test_data' section I also do in the 'train_data' section.
If I place
```
test_data["family_size"] = test_data["SibSp"] + test_data["Parch"]
```

at the top of my code I get a final score (on Kaggle, that is) of 0.767..., and if I place it at the bottom I get a final score of 0.789...
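For reference, the combining step itself is just element-wise column addition; here's a self-contained toy example (made-up numbers, not the real Titanic data):

```python
import pandas as pd

# Toy stand-in for the Titanic test set (made-up values)
test_data = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 2, 1],
})

# Element-wise addition of the two columns
test_data["family_size"] = test_data["SibSp"] + test_data["Parch"]

print(test_data["family_size"].tolist())  # [1, 2, 4]
```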
I've commented out the lower performing line.
```
########## Calibrate Test Data ##########
##########################################

# Create new categories
#test_data["family_size"] = test_data["SibSp"] + test_data["Parch"]

# Drop irrelevant categories
test_data = test_data.drop(["Name", "Ticket", "Cabin"], axis=1)

# Use SimpleImputer to fill missing data
x = imputer.transform(test_data)
test_data_transformed = pd.DataFrame(x, columns=test_data.columns, index=test_data.index)

# Use OneHotEncoder to encode alpha data, and numerical with more than three possibilities
test_data_encoded = pd.DataFrame(encoder.fit_transform(test_data_transformed[["Pclass", "Sex", "Embarked"]]))
test_data_transformed_encoded = test_data_transformed.join(test_data_encoded)
test_data_transformed_encoded = test_data_transformed_encoded.drop(["Pclass", "Sex", "Embarked"], axis=1)

# Create new categories
test_data_transformed_encoded["family_size"] = test_data_transformed_encoded["SibSp"] + test_data_transformed_encoded["Parch"]
test_data_transformed_encoded = test_data_transformed_encoded.drop(["SibSp", "Parch"], axis=1)

# Scale the data
prepared_test_data = scaler.fit_transform(test_data_transformed_encoded.astype(np.float64))

########## FINAL MODEL ##########
#################################

forest_clf = RandomForestClassifier(random_state=42)

# Fine Tune using GridSearch
param_grid = [{'n_estimators': [10, 30, 60], 'max_features': [1, 2, 4, 8, 14]},]
grid_search = GridSearchCV(forest_clf, param_grid, cv=3)
grid_search.fit(prepared_train_data_predictors, train_data_labels)
final_model = grid_search.best_estimator_
```

Thank you!
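To check my understanding of the mechanics, here's a standalone toy example (not my actual pipeline, and not necessarily the cause of my score difference) showing that a fitted SimpleImputer is sensitive to which columns exist at transform time, which is one reason the placement of a column-creating line could change behavior:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the training predictors (made-up values)
train = pd.DataFrame({"SibSp": [1.0, np.nan, 3.0],
                      "Parch": [0.0, 2.0, np.nan]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train)  # the imputer now expects exactly these two columns

# Adding a new column *before* transform changes the shape the imputer sees
test = train.copy()
test["family_size"] = test["SibSp"] + test["Parch"]

try:
    imputer.transform(test)
    mismatch = False
except ValueError:
    mismatch = True  # raised: 3 columns passed, but 2 were seen at fit time

print("column mismatch raised:", mismatch)  # True
```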