10fold cross-validation on time series

ulrich48155 · May-04-2017, 04:14 PM

Hello Python experts,
I'm relatively new to Python but have to solve a problem for a university project. I hope you guys can help me.

My task is to do a 10-fold cross-validation on a time series in which 90% should be training data and 10% should be for testing. In the end I should evaluate the testing set with the RMSE. Furthermore, the data should not be shuffled, as it is a time series. After doing some research I came up with this:

X = mergedf['AnzahlTweets']
y = mergedf['Kurs']

for train_index, test_index in tscv.split(X):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]

linreg=LinearRegression()
linreg.fit(X_train,y_train)
y_pred=linreg.predict(X_test)
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Now I have two questions: is this the right approach for my task, and if so, why do I get the error 'ValueError: Input contains NaN, infinity or a value too large for dtype('float64').'? It seems like X_train, X_test, y_train and y_test consist only of 'NaN', although X and y have values.

I hope you can help me. Thanks in advance!

***zivoni*** · May-04-2017, 09:16 PM

You do not show your input data, so it's hard to say what is going wrong. As sklearn works with numpy arrays, the indices from tscv.split() will be integers in range(0, len(df)) (assuming that tscv is an instance of sklearn.model_selection.TimeSeriesSplit). If your dataframe has an index with values other than 0...len(df)-1, then subsetting with test_index or train_index would lead to NaN for some index values.
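To illustrate the index mismatch described above, here is a minimal sketch with a made-up Series (the values and index labels are invented, not taken from the poster's data). As far as I know, older pandas versions returned NaN for missing labels when a Series was indexed with a list, while newer ones raise a KeyError, so reindex is used here only to reproduce the NaN effect. Selecting with .iloc goes by position and is not affected by the index labels.

```python
import numpy as np
import pandas as pd

# Made-up data: a filtered frame keeps its original row labels (here 10..13)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=[10, 11, 12, 13])

# Positional indices, like the ones a splitter's split() hands back
positions = np.array([0, 1])

# Label-based lookup: labels 0 and 1 do not exist, so every value is NaN
by_label = s.reindex(positions)

# Position-based lookup: takes the first two rows regardless of labels
by_position = s.iloc[positions]

print(by_label.isna().all())
print(by_position.tolist())
```

Converting to a numpy array with .values, or calling reset_index(drop=True) after filtering, should avoid the mismatch in the same way.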

You probably want to put the model fitting/predicting and the RMSE calculation inside your for loop, so that you use every fold and train one model per split (9 if the data is cut into 10 blocks); right now all you do is prediction/evaluation on the last fold. And for a time series it's not exactly 90% train data, 10% test data: as you generally want to avoid prediction based on "future" data, for testing on the k-th fold you can train only on data from fold 1 to fold k-1. So the first train fold is the first 10% of the data, the first test fold is the following 10% of the data, the second train fold is the first 20% of the data, the second test fold is the following 10% of the data, and so on.
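A minimal sketch of moving the fit, the prediction and the RMSE into the loop could look like the following. The data is synthetic (311 made-up rows with one feature), and the names fold_rmse and model are just illustrative. Note that TimeSeriesSplit(n_splits=10) cuts the data into 11 blocks and yields 10 train/test pairs, so 10 models are trained here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in data, ordered in time (assumption: one feature column)
rng = np.random.default_rng(0)
X = rng.normal(size=(311, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=311)

tscv = TimeSeriesSplit(n_splits=10)
fold_rmse = []

for train_index, test_index in tscv.split(X):
    # Fit a fresh model on the "past" rows only
    model = LinearRegression()
    model.fit(X[train_index], y[train_index])

    # Evaluate on the block that immediately follows the training rows
    y_pred = model.predict(X[test_index])
    fold_rmse.append(np.sqrt(mean_squared_error(y[test_index], y_pred)))

print(len(fold_rmse), "fold RMSEs:", np.round(fold_rmse, 4))
```

Keeping one RMSE per fold in a list leaves the choice of how to combine them until afterwards.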

ulrich48155 · May-05-2017, 01:24 PM

Thanks for your reply! I should've posted the whole code. It looks like this:

import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# Preparing data
tweets=pd.read_csv('numTweets.csv', names=['Zeitstempel','Waehrung','AnzahlTweets']) 
prices=pd.read_csv('prices.csv', names=['Zeitstempel','Waehrung','Kurs','Volumen']) 
tweets1 = tweets.dropna(axis=1)
merged = prices.merge(tweets1, on='Zeitstempel')
del merged['Waehrung_y']
merged=merged.rename(columns={'Waehrung_x':'Waehrung'})
# Filter currency
mergedf=merged[(merged.Waehrung == 'BellaCoin')]

tscv = TimeSeriesSplit(n_splits=10)
print(tscv)  

X = mergedf['AnzahlTweets']
y = mergedf['Kurs']

X=X.values.reshape(-1,1)
y=y.values.reshape(-1,1)

for train_index, test_index in tscv.split(X):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]

linreg=LinearRegression()
linreg.fit(X_train,y_train)
y_pred=linreg.predict(X_test)
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

After adding the two reshape calls (X.values.reshape(-1,1) and y.values.reshape(-1,1)) I don't get an error message anymore. So I think my first problem is solved.

If I got this right, the cross validation for time series has to look like this!?

Train: 1 Test: 2
Train: 1,2 Test 3
Train: 1,2,3 Test 4
...

Is the RMSE then calculated as the mean of all the test results, and is my approach right then?

Thanks in advance!!!

***zivoni*** · May-05-2017, 11:17 PM

(May-05-2017, 01:24 PM)ulrich48155 Wrote: If I got this right, the cross validation for time series has to look like this!?

Train: 1 Test: 2
Train: 1,2 Test 3
Train: 1,2,3 Test 4

Yes, TimeSeriesSplit works this way. As was mentioned before, your prediction is based only on "past" folds.
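A tiny sketch that prints this pattern, assuming four made-up observations labelled 1 to 4 and n_splits=3 (so each "fold" is a single row):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Four made-up observations labelled 1..4, already in time order
blocks = np.arange(1, 5).reshape(-1, 1)

# Collect (train labels, test labels) for every split
splits = [
    (blocks[tr].ravel().tolist(), blocks[te].ravel().tolist())
    for tr, te in TimeSeriesSplit(n_splits=3).split(blocks)
]

for train, test in splits:
    print("Train:", train, "Test:", test)
# Train: [1] Test: [2]
# Train: [1, 2] Test: [3]
# Train: [1, 2, 3] Test: [4]
```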

(May-05-2017, 01:24 PM)ulrich48155 Wrote: Is the RMSE then calculated as the mean of all the test results

I think the "final" RMSE should be calculated as the square root of (the sum of the squared per-fold RMSEs divided by the number of folds).
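As a small worked sketch of that formula (the error values are invented, and pooled_rmse is a hypothetical helper name): when all test folds have the same size, this combination equals the RMSE computed over all test errors at once, whereas a plain mean of the per-fold RMSEs generally gives a different number.

```python
import numpy as np

def pooled_rmse(fold_rmses):
    """Square root of the mean of the squared per-fold RMSEs."""
    fold_rmses = np.asarray(fold_rmses, dtype=float)
    return float(np.sqrt(np.mean(fold_rmses ** 2)))

# Invented prediction errors for two equally sized test folds
errors_fold1 = np.array([1.0, -1.0])
errors_fold2 = np.array([3.0, -3.0])

rmse1 = float(np.sqrt(np.mean(errors_fold1 ** 2)))  # 1.0
rmse2 = float(np.sqrt(np.mean(errors_fold2 ** 2)))  # 3.0

combined = pooled_rmse([rmse1, rmse2])               # sqrt((1 + 9) / 2)
plain_mean = float(np.mean([rmse1, rmse2]))          # 2.0

# RMSE over all test errors at once, for comparison
all_errors = np.concatenate([errors_fold1, errors_fold2])
overall = float(np.sqrt(np.mean(all_errors ** 2)))

print(combined, overall, plain_mean)
```

If the folds had different sizes, the squared RMSEs would need to be weighted by fold size to match the overall figure.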

ulrich48155 · May-06-2017, 09:15 AM

Is my approach right then?

ulrich48155 · May-08-2017, 04:36 PM

I think my problem is that I don't know how to implement the calculation of the RMSE for the different folds. I am splitting my set into 10 folds, which seems to work:

Output:
TRAIN: [ 0  1  2 ..., 28 29 30] TEST: [31 32 33 ..., 56 57 58]
TRAIN: [ 0  1  2 ..., 56 57 58] TEST: [59 60 61 ..., 84 85 86]
TRAIN: [ 0  1  2 ..., 84 85 86] TEST: [ 87  88  89 ..., 112 113 114]
TRAIN: [  0   1   2 ..., 112 113 114] TEST: [115 116 117 ..., 140 141 142]
TRAIN: [  0   1   2 ..., 140 141 142] TEST: [143 144 145 ..., 168 169 170]
TRAIN: [  0   1   2 ..., 168 169 170] TEST: [171 172 173 ..., 196 197 198]
TRAIN: [  0   1   2 ..., 196 197 198] TEST: [199 200 201 ..., 224 225 226]
TRAIN: [  0   1   2 ..., 224 225 226] TEST: [227 228 229 ..., 252 253 254]
TRAIN: [  0   1   2 ..., 252 253 254] TEST: [255 256 257 ..., 280 281 282]
TRAIN: [  0   1   2 ..., 280 281 282] TEST: [283 284 285 ..., 308 309 310]

When checking y_test and y_pred, I think I'm just getting the results from one fold:

Output:
y_test: ['0.01141000', '0.01124000', '0.01157000', '0.01215000', '0.01224000', '0.01278000', '0.01201000', '0.01222000', '0.01246000', '0.01231000', '0.01187000', '0.01180000', '0.01197000', '0.01195900', '0.01193326', '0.01187548', '0.01191003', '0.01187855', '0.01190000', '0.01175494', '0.01140000', '0.01176000', '0.01144880', '0.01146409', '0.01199000', '0.01203000', '0.01190000', '0.01233999']

y_pred: ['0.00976531', '0.00875308', '0.00835700', '0.00941323', '0.00778487', '0.00826898', '0.00879709', '0.00976531', '0.00708071', '0.00628854', '0.00747680', '0.00712472', '0.00686066', '0.00602448', '0.00963328', '0.00712472', '0.00580443', '0.00589245', '0.00642057', '0.00642057', '0.00919318', '0.00848903', '0.00738878', '0.00831299', '0.00976531', '0.00774086', '0.00769685', '0.00721274']

This is why I think my RMSE is based on only one fold and not on all 10. Is this right, or am I misunderstanding this?

Thanks in advance!!!
