Random Forest high R2 Score but poor prediction

donnertrud · (This post was last modified: Jan-13-2020, 07:30 AM by donnertrud.)

Hi guys,

I am working on a regression task where one has to predict the number of likes of an Instagram picture based on features which are given in a dataframe. I have attached a small part of that dataframe to give you a better idea.

[Image: 85hnHxf]

This is what my code looks like. It's pretty simple and, as stated in the title, the R2 score is pretty good (0.93), but as soon as I try to predict the likes for random input data, the model always predicts roughly the average number of likes. I.e. it can't predict the lower and higher values of likes. Unfortunately I can't figure out the problem, and I would really appreciate some ideas about what it might be. Thanks in advance!

# imports needed to run this snippet
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import MinMaxScaler

# load data
df = pd.read_csv("C:/Users/Flo/Desktop/SeminarORIGINAL/data/stud_df_train.csv")

# drop the columns which will not be useful for further analysis
new_df = df.drop(columns=["image_height", "image_path", "image_width", "image_upload_date", "account_name", "image_comments"])
# create dummy variables for background and account category
one_hot = pd.get_dummies(new_df[["image_Background", "account_category"]])
# drop both columns as they are now encoded
df = new_df.drop(columns=["image_Background", "account_category"])
# add the encoded columns to the dataframe
data = df.join(one_hot)

# bring data into X_train, Y_train format
Y_train = data["image_likes"].values
X_train = data.drop(columns=["image_likes"]).values

# normalize the features to [0, 1]; fit_transform fits the scaler and transforms in one step
scaler = MinMaxScaler(feature_range=(0, 1))
X_train = scaler.fit_transform(X_train)

# create and fit the random forest regressor model
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, Y_train)

# predict on the same rows the model was fitted on, so the scores below are training scores
y_pred = rf.predict(X_train)

print("R2: ", r2_score(Y_train, y_pred))
print("MAE: ", mean_absolute_error(Y_train, y_pred))
print("MSE: ", mean_squared_error(Y_train, y_pred))

jefsummers · Jan-13-2020, 04:31 PM

When I get good training but bad prediction, I immediately think of overfitting. Have you tried reducing the number of estimators? What if you put the fit, predict, and scoring steps (from creating the regressor through the printed metrics) in a loop that stepped n_estimators by 10s from 10 to 100 to see what happens (or get fancy, do the loop but graph your results using matplotlib)?
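
A minimal sketch of such a loop, assuming a synthetic dataset from make_regression as a stand-in for the Instagram dataframe (the real X_train / Y_train arrays would replace it). The held-out split is an addition that is not in the original code; it is there so the training score and the score on unseen rows can be compared side by side.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption, not the poster's dataframe).
X, y = make_regression(n_samples=300, n_features=8, noise=25.0, random_state=0)

# Hold out rows that the forest never sees during fitting.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

results = []
for n in range(10, 101, 10):
    rf = RandomForestRegressor(n_estimators=n, random_state=0)
    rf.fit(X_fit, y_fit)
    train_r2 = r2_score(y_fit, rf.predict(X_fit))  # score on rows used for fitting
    val_r2 = r2_score(y_val, rf.predict(X_val))    # score on unseen rows
    results.append((n, train_r2, val_r2))
    print(f"n_estimators={n:3d}  train R2={train_r2:.3f}  held-out R2={val_r2:.3f}")
```

If the held-out R2 stays well below the training R2 for every value of n_estimators, the gap is probably not caused by the number of trees.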

donnertrud · (This post was last modified: Jan-13-2020, 04:45 PM by donnertrud.)

Thanks for your response!
Instead of coding a loop which goes through different numbers of estimators, couldn't I run a RandomizedSearchCV with ranges for the relevant RF parameters and look up the best combination with the best_params_ attribute? Because I already did that a couple of times now and, surprisingly, the R2 and MAE got even worse. I might have to drop some features and try the random search again.

E.g. the code would look something like this:

import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=5000)]
# Number of features to consider at every split
# (1.0 uses all features; the old 'auto' option was removed in scikit-learn 1.3)
max_features = [1.0, 'sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15, 20]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10, 15]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

# rf is the RandomForestRegressor from the first snippet
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)
# Y_train (capital Y) is the target array defined in the first snippet
search = rf_random.fit(X_train, Y_train)
search.best_params_

jefsummers · Jan-13-2020, 09:03 PM

But you've done that and it did not work.
Random forests are supposed to be relatively resistant to overfitting, true. But getting better on the training data while getting worse on the validation/test data means the model is fitting closer and closer to the training data, while the validation data is different enough to give you bad results.
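
One way to see that gap directly, sketched here on synthetic make_regression data (an assumption standing in for the real dataframe), is to compare the R2 on the rows used for fitting with a cross-validated R2, where each fold is scored by a forest that never saw it.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data (assumption).
X, y = make_regression(n_samples=300, n_features=8, noise=25.0, random_state=1)

# R2 on the same rows the forest was fitted on.
rf = RandomForestRegressor(n_estimators=100, random_state=1)
rf.fit(X, y)
train_r2 = rf.score(X, y)

# R2 on held-out folds: each fold is predicted by a forest fitted on the other folds.
cv_r2 = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=1),
    X, y, cv=5, scoring="r2",
)

print(f"training R2:        {train_r2:.3f}")
print(f"cross-validated R2: {cv_r2.mean():.3f} (+/- {cv_r2.std():.3f})")
```

The cross-validated number is the one to compare against how the model behaves on new input.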

donnertrud · Jan-13-2020, 09:58 PM

If one is looking for the best parameters, is there a way to search for the parameters which yield the lowest Mean Absolute Error? Or is this done by default?
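
For reference, a small sketch of what that could look like: RandomizedSearchCV takes a scoring argument, and without it the search falls back to the estimator's own score method, which is R2 for RandomForestRegressor. The data and the small parameter ranges below are illustrative assumptions, not the ones from the real task.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real data (assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=25.0, random_state=0)

# Deliberately small, illustrative search space.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 6, None],
    "min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_mean_absolute_error",  # maximizing negative MAE = minimizing MAE
    random_state=0,
)
search.fit(X, y)

# best_score_ is the negated MAE, so flip the sign to read it as an error.
best_mae = -search.best_score_
print(search.best_params_)
print(f"cross-validated MAE of best setting: {best_mae:.2f}")
```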

jefsummers · Jan-13-2020, 11:23 PM

You are stretching me - good. I think it is usually loss that is plotted, but if you vary the parameter of interest and plot loss (or mse) against that, there is usually an elbow. Pick the low point of the elbow.
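
A rough sketch of that idea using scikit-learn's validation_curve, with max_depth as the parameter of interest and synthetic data standing in for the real dataframe (both are assumptions chosen for illustration). It prints the numbers instead of plotting them; the same arrays could be passed to matplotlib.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import validation_curve

# Synthetic stand-in for the real data (assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=25.0, random_state=0)

depths = [1, 2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X, y,
    param_name="max_depth",
    param_range=depths,
    cv=3,
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE with shape (n_depths, n_folds); average over folds.
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)

for depth, tr, va in zip(depths, train_mse, val_mse):
    print(f"max_depth={depth:2d}  train MSE={tr:10.1f}  validation MSE={va:10.1f}")

# Depth with the lowest validation error among the values tried.
best_depth = depths[int(np.argmin(val_mse))]
print("lowest validation MSE at max_depth =", best_depth)
```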
