Python Forum
Kaggle Titanic - new category placement
#1
I'm pretty new to Python and even newer to machine learning, so apologies if the answer to my question is obvious.

I've performed the common step of combining SibSp and Parch into a single 'family_size' column, and it helped a lot. But where I place the code for this step has a big effect on my final result, and I can't see why. Can anyone help me understand? I also don't understand why my final Kaggle score drops significantly when I drop the PassengerId column, but I'm not concerned about that right now.

I won't paste in all my code, but everything I do in this 'test_data' section I also do for 'train_data'.

If I place
test_data["family_size"] = test_data["SibSp"] + test_data["Parch"]
at the top of my code, I get a final score (on Kaggle, that is) of 0.767...; if I place it at the bottom, I get 0.789....
I've commented out the lower-performing line.

########## Calibrate Test Data ##########
##########################################

# Create new categories
#test_data["family_size"] = test_data["SibSp"] + test_data["Parch"]

# Drop irrelevant categories
test_data = test_data.drop(["Name", "Ticket", "Cabin"], axis=1)

# Use SimpleImputer to fill missing data
x = imputer.transform(test_data)
test_data_transformed = pd.DataFrame(x, columns=test_data.columns, index=test_data.index)

# Use OneHotEncoder to encode non-numeric columns, plus numeric columns with more than two categories
test_data_encoded = pd.DataFrame(encoder.fit_transform(test_data_transformed[["Pclass", "Sex", "Embarked"]]))
test_data_transformed_encoded = test_data_transformed.join(test_data_encoded)
test_data_transformed_encoded = test_data_transformed_encoded.drop(["Pclass", "Sex", "Embarked"], axis=1)

# Create new categories
test_data_transformed_encoded["family_size"] = test_data_transformed_encoded["SibSp"] + test_data_transformed_encoded["Parch"]
test_data_transformed_encoded = test_data_transformed_encoded.drop(["SibSp", "Parch"], axis=1)

# Scale the data
prepared_test_data = scaler.fit_transform(test_data_transformed_encoded.astype(np.float64))

########## FINAL MODEL ##########
#################################

forest_clf = RandomForestClassifier(random_state=42)

# Fine-tune using GridSearchCV
param_grid = [{'n_estimators': [10, 30, 60], 'max_features': [1, 2, 4, 8, 14]},]
grid_search = GridSearchCV(forest_clf, param_grid, cv=3)
grid_search.fit(prepared_train_data_predictors, train_data_labels)

final_model = grid_search.best_estimator_
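In case it's relevant, here's my rough mental model (toy data I made up, not the real Titanic columns) of how creating the combined column before vs. after imputation could give different values:

```python
import numpy as np
import pandas as pd

# Toy frame with a missing SibSp value (made-up data for illustration)
df = pd.DataFrame({"SibSp": [1.0, np.nan], "Parch": [0.0, 2.0]})

# Summing BEFORE imputation: the NaN propagates into family_size,
# so the imputer would later fill family_size as a whole
early = df["SibSp"] + df["Parch"]           # [1.0, NaN]

# Imputing the components FIRST (mean fill here), then summing:
imputed = df.fillna(df.mean())
late = imputed["SibSp"] + imputed["Parch"]  # [1.0, 3.0]
```

I don't know if that's actually what's happening in my case, since I'm not even sure SibSp/Parch have missing values in the test set, but it's the kind of order-dependence I'm trying to understand.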
Thank you!


