Jan-13-2020, 07:30 AM
Hi guys,
I am working on a regression task where one has to predict the number of likes of an Instagram picture based on features which are given in a dataframe. I have attached a small part of that dataframe to give you a better idea.
Sample of the dataframe: https://ibb.co/85hnHxf
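In case the image doesn't load, here is a minimal sketch of how the dataframe can be summarised (the path and the `image_likes` column are the ones used in my code below):

```python
import pandas as pd

# load the training data and get a quick overview of the columns
df = pd.read_csv("C:/Users/Flo/Desktop/SeminarORIGINAL/data/stud_df_train.csv")

print(df.shape)                       # number of rows / columns
print(df.dtypes)                      # column names and their types
print(df["image_likes"].describe())   # distribution of the target variable
```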
This is what my code looks like. It's pretty simple and, as stated in the title, the R² score is pretty good (0.93), but as soon as I try to predict the likes given random input data, the model always predicts roughly the average number of likes, i.e. it can't predict the lower and higher values of likes. Unfortunately I can't figure out the problem and I would really appreciate some ideas about what it might be. Thanks in advance!
```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# load data
df = pd.read_csv("C:/Users/Flo/Desktop/SeminarORIGINAL/data/stud_df_train.csv")

# drop the columns which will not be useful for further analysis
new_df = df.drop(columns=["image_height", "image_path", "image_width",
                          "image_upload_date", "account_name", "image_comments"])

# create dummy variables for background and account category
one_hot = pd.get_dummies(new_df[["image_Background", "account_category"]])

# drop both columns as they are now encoded
df = new_df.drop(columns=["image_Background", "account_category"])

# add the encoded columns to the dataframe
data = df.join(one_hot)

# bring data into X_train, y_train format
Y_train = data["image_likes"].values
X_train = data.drop(["image_likes"], axis=1).values

# normalize the features to [0, 1] with scikit-learn
scaler = MinMaxScaler(feature_range=(0, 1))
X_train = scaler.fit_transform(X_train)

# create and fit the random forest regressor model
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, Y_train)

# predict y values (on the training data) and score the fit
y_pred = rf.predict(X_train)
print("R2: ", r2_score(Y_train, y_pred))
print("MAE: ", mean_absolute_error(Y_train, y_pred))
print("MSE: ", mean_squared_error(Y_train, y_pred))
```
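One thing I'm not sure about: the R² above is computed on the same data the forest was trained on. Here is a minimal sketch of how I could score the model on a held-out split instead (assuming `X_train` and `Y_train` from the code above; the split size and `random_state` values are arbitrary), and of how to compare the spread of the predictions with the spread of the actual likes:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error

# hold out 20% of the rows so the model is scored on data it has not seen
X_tr, X_te, y_tr, y_te = train_test_split(X_train, Y_train,
                                          test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_tr, y_tr)

# compare in-sample vs. out-of-sample scores
print("train R2:", r2_score(y_tr, rf.predict(X_tr)))
print("test  R2:", r2_score(y_te, rf.predict(X_te)))
print("test MAE:", mean_absolute_error(y_te, rf.predict(X_te)))

# check whether the test predictions collapse towards the average
preds = rf.predict(X_te)
print("actual    likes: min", y_te.min(), "mean", y_te.mean(), "max", y_te.max())
print("predicted likes: min", preds.min(), "mean", preds.mean(), "max", preds.max())
```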