Python Forum
Random Forest high R2 Score but poor prediction
#1
Hi guys,

I am working on a regression task where the goal is to predict the number of likes of an Instagram picture based on features given in a dataframe. I have attached a small part of that dataframe to give you a better idea.

[Image: 85hnHxf]

This is what my code looks like. It's pretty simple and, as stated in the title, the R2 score is pretty good (0.93). But as soon as I try to predict the likes for random input data, the model always predicts roughly the average number of likes, i.e. it can't predict the lower and higher like counts. Unfortunately I can't figure out the problem, and I would really appreciate some ideas about what it might be. Thanks in advance!

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import MinMaxScaler

# load data
df = pd.read_csv("C:/Users/Flo/Desktop/SeminarORIGINAL/data/stud_df_train.csv")

# drop the columns which will not be useful for further analysis
new_df = df.drop(columns=["image_height", "image_path", "image_width", "image_upload_date", "account_name", "image_comments"])
# create dummy variables for background and account category
one_hot = pd.get_dummies(new_df[["image_Background", "account_category"]])
# drop both columns as they are now encoded
df = new_df.drop(columns=["image_Background", "account_category"])
# add the encoded columns to the dataframe
data = df.join(one_hot)

# bring data into X_train, Y_train format
Y_train = data["image_likes"].values
X_train = data.drop(columns=["image_likes"]).values

# scale features to [0, 1]; fit_transform fits and transforms in one step
scaler = MinMaxScaler(feature_range=(0, 1))
X_train = scaler.fit_transform(X_train)

# create and fit the random forest regressor model
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, Y_train)

# predict on the same rows the model was fitted on,
# so the metrics below are training scores
y_pred = rf.predict(X_train)

print("R2: ", r2_score(Y_train, y_pred))
print("MAE: ", mean_absolute_error(Y_train, y_pred))
print("MSE: ", mean_squared_error(Y_train, y_pred))
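The scores printed above are computed on the same rows the forest was fitted on, so they may say little about new inputs. A minimal sketch of scoring on held-out rows instead, assuming synthetic data in place of the Instagram dataframe (the arrays X and y and the 75/25 split are illustrative assumptions, not part of the original code):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: one informative feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = 3.0 * X[:, 0] + rng.normal(scale=2.0, size=500)

# Hold out a quarter of the rows that the model never sees during fitting.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_tr, y_tr)

# Compare the score on the fitted rows with the score on the held-out rows.
train_r2 = r2_score(y_tr, rf.predict(X_tr))
test_r2 = r2_score(y_te, rf.predict(X_te))
print(f"train R2: {train_r2:.3f}")
print(f"test  R2: {test_r2:.3f}")
```

A large gap between the two numbers would suggest that the training R2 overstates how well the model predicts unseen pictures.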
#2
When I get good training but bad prediction, I immediately think of overfitting. Have you tried reducing the number of estimators? What if you put the model-fitting and scoring part of your code (from creating the regressor through the metric printouts) in a loop that varies n_estimators in steps of 10 from 10 to 100 to see what happens? Or get fancy: do the loop but graph your results using matplotlib.
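The loop suggested here could look roughly like the sketch below. It assumes synthetic data and a held-out validation split (both illustrative, not from the original post) and records the validation MAE for each forest size; the collected values could then be plotted with matplotlib.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data for the real feature matrix and like counts.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=300)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit one forest per n_estimators value and score it on the validation rows.
val_mae = {}
for n in range(10, 101, 10):
    model = RandomForestRegressor(n_estimators=n, random_state=0)
    model.fit(X_tr, y_tr)
    val_mae[n] = mean_absolute_error(y_val, model.predict(X_val))

for n, mae in val_mae.items():
    print(f"n_estimators={n:3d}  validation MAE={mae:.3f}")
```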
#3
Thanks for your response!
Instead of coding a loop that goes through different numbers of estimators, couldn't I run a RandomizedSearchCV with ranges for the relevant RF parameters and read off the best combination from the best_params_ attribute? I have already done that a couple of times, and surprisingly the R2 and MAE got even worse. I might have to drop some features and try the random search again.

For example, the code would look something like this:

import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# rf, X_train and Y_train come from the code in the first post

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=5000)]
# Number of features to consider at every split
# (1.0 = all features, which is what 'auto' meant for regressors;
# 'auto' was removed in scikit-learn 1.3)
max_features = [1.0, 'sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15, 20]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10, 15]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)
# the target array is named Y_train (capital Y) in the first post
search = rf_random.fit(X_train, Y_train)
search.best_params_
#4
But you've done that and it did not work.
Random forests are supposed to be relatively resistant to overfitting, true. But getting better on the training data while getting worse on the validation/test data means the model is fitting closer and closer to the training data, while the validation data is different enough to give you bad results.
#5
If one is looking for the best parameters, is there a way to search for the parameters that yield the lowest Mean Absolute Error? Or is this done by default?
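For reference, when no scoring argument is given, RandomizedSearchCV ranks candidates with the estimator's own score method, which for a regressor is R2. A minimal sketch of ranking by MAE instead, assuming synthetic data and a deliberately small parameter grid (both illustrative): scikit-learn maximizes scores, so the MAE is passed as "neg_mean_absolute_error" and the sign is flipped when reading the result.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical stand-in data for the real feature matrix and like counts.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

# Small illustrative grid; the real search would use wider ranges.
param_distributions = {
    "n_estimators": [20, 50],
    "max_depth": [3, 6, None],
    "min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="neg_mean_absolute_error",  # rank candidates by cross-validated MAE
    cv=3,
    random_state=42,
)
search.fit(X, y)

# best_score_ is the negated MAE, so flip the sign to report it.
best_mae = -search.best_score_
print("best params:", search.best_params_)
print(f"cross-validated MAE: {best_mae:.3f}")
```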
#6
You are stretching me - good. I think it is usually loss that is plotted, but if you vary the parameter of interest and plot the loss (or MSE) against it, there is usually an elbow. Pick the low point of the elbow.
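One way to get the numbers for such a plot is scikit-learn's validation_curve, sketched below under assumptions: the data is synthetic, max_depth is picked as the parameter of interest purely for illustration, and the MSE is cross-validated over 3 folds. The resulting val_mse values can be plotted against depths to look for the elbow.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import validation_curve

# Hypothetical stand-in data for the real feature matrix and like counts.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

# Parameter of interest and the values to try (illustrative choice).
depths = [1, 2, 4, 8, 16]

train_scores, val_scores = validation_curve(
    RandomForestRegressor(n_estimators=30, random_state=0),
    X,
    y,
    param_name="max_depth",
    param_range=depths,
    cv=3,
    scoring="neg_mean_squared_error",
)

# Scores come back negated with one column per fold; average and flip the sign.
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)

for d, tr, va in zip(depths, train_mse, val_mse):
    print(f"max_depth={d:2d}  train MSE={tr:.3f}  validation MSE={va:.3f}")
```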