May-04-2017, 04:14 PM
Hello Python experts,
I'm relatively new to Python but have to solve a problem for a university project. I hope you guys can help me.
My task is to do a 10-fold cross-validation on a time series, in which 90% of the data should be used for training and 10% for testing. In the end, I should evaluate the test set with the RMSE. Furthermore, the data should not be shuffled, as it is a time series. After doing some research, I came up with this:
    import numpy as np
    from sklearn import metrics
    from sklearn.linear_model import LinearRegression

    # tscv is created earlier (not shown), presumably a scikit-learn TimeSeriesSplit
    X = mergedf['AnzahlTweets']
    y = mergedf['Kurs']
    for train_index, test_index in tscv.split(X):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        linreg = LinearRegression()
        linreg.fit(X_train, y_train)
        y_pred = linreg.predict(X_test)
        print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Now I have two questions: Is this the right approach for my task? And if so, why do I get the error "ValueError: Input contains NaN, infinity or a value too large for dtype('float64')."? It seems that X_train, X_test, y_train and y_test consist only of NaN, although X and y contain values.
I hope you can help me. Thanks in advance!
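A possible explanation, offered as a sketch rather than a definitive diagnosis: tscv.split(X) yields integer positions, but X[train_index] on a pandas Series with an integer index is a lookup by label. If the index of mergedf does not run from 0 to n-1 (for example after filtering rows), those labels may not exist, which older pandas versions turned into NaN. Selecting by position with .iloc avoids this. The example below assumes tscv is a TimeSeriesSplit with n_splits=10 and uses a made-up stand-in for mergedf, since the real data is not shown. It also passes X as a one-column DataFrame, because LinearRegression expects a 2-D feature array.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical stand-in for mergedf: synthetic values and an integer index
# that does NOT start at 0, as filtering or dropping rows can leave behind.
rng = np.random.default_rng(0)
n = 110
mergedf = pd.DataFrame(
    {"AnzahlTweets": rng.integers(0, 500, size=n).astype(float)},
    index=np.arange(1000, 1000 + n),
)
mergedf["Kurs"] = 0.05 * mergedf["AnzahlTweets"] + rng.normal(0, 1, size=n)

X = mergedf[["AnzahlTweets"]]  # double brackets keep X two-dimensional
y = mergedf["Kurs"]

# Assumption: 10 ordered splits; each test block lies after its training block.
tscv = TimeSeriesSplit(n_splits=10)

rmse_per_fold = []
for train_index, test_index in tscv.split(X):
    # .iloc selects by position, which is what split() returns
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)

    rmse_per_fold.append(np.sqrt(mean_squared_error(y_test, y_pred)))

print("RMSE per fold:", np.round(rmse_per_fold, 3))
print("Mean RMSE:", np.mean(rmse_per_fold))
```

On the first question: TimeSeriesSplit keeps the chronological order and never shuffles, so it fits the "no shuffling" requirement. Note, though, that its training window grows from fold to fold, so the split is not a fixed 90%/10% in every fold. Whether that matches the assignment is worth checking with whoever set the task.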