Random Forest high R2 Score but poor prediction

donnertrud Silly Frenchman Posts: 21 Threads: 14 Joined: Dec 2019 Reputation: 0 Likes received: 0 #1 Jan-13-2020, 07:30 AM (This post was last modified: Jan-13-2020, 07:30 AM by donnertrud. Edited 1 time in total.)

Hi guys,

I am working on a regression task where one has to predict the number of likes of an Instagram picture based on features given in a dataframe. I have attached a small part of that dataframe to give you a better idea. This is what my code looks like. It's pretty simple and, as stated in the title, the R2 score is pretty good (0.93), but as soon as I try to predict the likes for random input data, the model always predicts roughly the average number of likes, i.e. it can't predict the lower and higher values. Unfortunately I can't figure out the problem, and I would really appreciate some ideas about what it might be. Thanks in advance!

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import MinMaxScaler

# load data
df = pd.read_csv("C:/Users/Flo/Desktop/SeminarORIGINAL/data/stud_df_train.csv")

# drop the columns which will not be useful for further analysis
new_df = df.drop(columns=["image_height", "image_path", "image_width",
                          "image_upload_date", "account_name", "image_comments"])

# create dummy variables for background and account category
one_hot = pd.get_dummies(new_df[["image_Background", "account_category"]])

# drop both columns as they are now encoded
df = new_df.drop(columns=["image_Background", "account_category"])

# add the encoded columns to the dataframe
data = df.join(one_hot)

# bring data into X_train, Y_train format
Y_train = data["image_likes"].values
X_train = data.drop(columns=["image_likes"]).values

# normalize the features to [0, 1] with scikit-learn
scaler = MinMaxScaler(feature_range=(0, 1))
X_train = scaler.fit_transform(X_train)

# create and fit the random forest regressor model
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, Y_train)

# predict y values (note: on the same data the model was fitted on)
y_pred = rf.predict(X_train)
print("R2: ", r2_score(Y_train, y_pred))
print("MAE: ", mean_absolute_error(Y_train, y_pred))
print("MSE: ", mean_squared_error(Y_train, y_pred))
```

jefsummers Verb Conjugator Posts: 688 Threads: 1 Joined: May 2019 Reputation: 67 Likes received: 94 #2 Jan-13-2020, 04:31 PM

When I get good training but bad prediction, I immediately think of overfitting. Have you tried reducing the number of estimators? What if you put lines 22-31 of your code (the part that fits and scores the model) in a loop that varied n_estimators in steps of 10 from 10 to 100 to see what happens (or get fancy: do the loop but graph your results using matplotlib)?

donnertrud #3 Jan-13-2020, 04:45 PM (This post was last modified: Jan-13-2020, 04:45 PM by donnertrud. Edited 2 times in total.)

Thanks for your response! Instead of coding a loop which goes through different numbers of estimators, couldn't I run a RandomizedSearchCV with ranges for the relevant RF parameters and look up the best combination with the best_params_ attribute? I already did that a couple of times and the R2 and MAE got even worse, surprisingly. I might have to drop some features and try the random search again. E.g.
the code would look something like this:

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# rf, X_train and Y_train are the objects created in the code of post #1

# number of trees in the random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=5000)]
# number of features to consider at every split
# (1.0 = all features; older scikit-learn versions called this 'auto')
max_features = [1.0, 'sqrt', 'log2']
# maximum number of levels in a tree
max_depth = [int(x) for x in np.linspace(10, 110)]
max_depth.append(None)
# minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15, 20]
# minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10, 15]
# method of selecting samples for training each tree
bootstrap = [True, False]

# create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=100, cv=3, verbose=2,
                               random_state=42, n_jobs=-1)
search = rf_random.fit(X_train, Y_train)
search.best_params_
```

jefsummers #4 Jan-13-2020, 09:03 PM

But you've done that and it did not work. Random forests are supposed to be relatively resistant to overfitting, true. But getting better on the training data while getting worse on the validation/test data means the model is fitting closer and closer to the training data, while the validation data is different enough to give you bad results.

donnertrud #5 Jan-13-2020, 09:58 PM

If one is looking for the best parameters, is there a way to search for the parameters which yield the lowest mean absolute error? Or is this done by default?

jefsummers #6 Jan-13-2020, 11:23 PM

You are stretching me - good. I think it is usually loss that is plotted, but if you vary the parameter of interest and plot loss (or MSE) against it, there is usually an elbow. Pick the low point of the elbow.
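
On the question in post #5, a minimal sketch under stated assumptions may help. By default (scoring=None), RandomizedSearchCV ranks candidates with the estimator's own score method, which for a regressor is R2; passing scoring="neg_mean_absolute_error" makes it rank by MAE instead. The sketch also scores the chosen model on a held-out split rather than on the training data, which is where the 0.93 in post #1 was measured. The synthetic data from make_regression, the small parameter ranges, and the variable names are illustrative assumptions, not the original poster's dataframe or settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Assumption: synthetic data stands in for the Instagram dataframe.
X, y = make_regression(n_samples=300, n_features=8, noise=25.0, random_state=0)

# Hold out a test set that is never used for fitting or for the search.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Assumption: a deliberately small search space so the example runs quickly.
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["sqrt", "log2", 1.0],
}

# scoring="neg_mean_absolute_error" selects the candidate with the lowest
# cross-validated MAE (scikit-learn maximizes scores, hence the negation).
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="neg_mean_absolute_error",
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

# best_estimator_ is refitted on the full training split by default.
best_rf = search.best_estimator_
cv_mae = -search.best_score_
train_r2 = r2_score(y_train, best_rf.predict(X_train))
test_r2 = r2_score(y_test, best_rf.predict(X_test))
test_mae = mean_absolute_error(y_test, best_rf.predict(X_test))

print("best params:", search.best_params_)
print("cross-validated MAE:", cv_mae)
print("train R2:", train_r2)
print("test R2:", test_r2)
print("test MAE:", test_mae)
```

Comparing the train R2 with the test R2 is the quickest check here: a large gap between the two would point toward the overfitting suspected in posts #2 and #4, whereas similar values would suggest looking elsewhere, for example at the features.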
