Python Forum
10fold cross-validation on time series
#1
Hello Python experts,
I'm relatively new to Python but have to solve a problem for a university project. I hope you guys can help me.

My task is to do a 10-fold cross-validation on a time series in which 90% should be training data and 10% should be for testing. In the end I should evaluate the testing set with the RMSE. Furthermore, the data should not be shuffled, as it is a time series. After doing some research I came up with this:

X = mergedf['AnzahlTweets']
y = mergedf['Kurs']

for train_index, test_index in tscv.split(X):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]

linreg=LinearRegression()
linreg.fit(X_train,y_train)
y_pred=linreg.predict(X_test)
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Now I have two questions: is this the right approach for my task and, if so, why do I get the error 'ValueError: Input contains NaN, infinity or a value too large for dtype('float64').'? It seems like X_train, X_test, y_train and y_test consist only of NaN, although X and y have values.

I hope you can help me. Thanks in advance!
Reply
#2
You do not show your input data, so it's hard to say what is going wrong. As sklearn works with numpy arrays, the indices from tscv.split() will be integer positions in range(0, len(df)) (assuming that tscv is an instance of sklearn.model_selection.TimeSeriesSplit). If your dataframe has an index with values other than 0...len(df)-1, then subsetting with test_index or train_index is a label lookup, and it would lead to NaN for the positions that do not exist as index labels.
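
To illustrate the index mismatch described above, here is a minimal sketch. The small dataframe and its column names are made up for the example, not taken from your data. After filtering, the row labels are no longer 0...n-1, so selecting by position with .iloc is the safe option. Depending on the pandas version, a plain X[train_index] on such a Series gives NaN or raises a KeyError.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: filtering rows keeps the original index labels.
df = pd.DataFrame({"x": np.arange(10.0), "keep": [True, False] * 5})
filtered = df[df["keep"]]      # index labels are now 0, 2, 4, 6, 8
X = filtered["x"]

tscv = TimeSeriesSplit(n_splits=2)
for train_index, test_index in tscv.split(X):
    # split() returns positions, so select by position with .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    print("positions:", test_index, "labels:", list(X_test.index))
```

Converting to a numpy array with X.values, or calling filtered.reset_index(drop=True) before splitting, should avoid the problem as well.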

You probably want to put the model fitting/predicting and the RMSE calculation inside your for loop, so that you use every fold and train one model per split (right now all you do is prediction/evaluation on the last fold). And for time series it's not exactly 90% train data, 10% test data - as you generally want to avoid prediction based on "future" data, for testing on the k-th fold you can use only data from fold 1 to fold k-1. So the first training set is roughly the first 10% of the data and the first test set is the following 10%, the second training set is the first 20% and the second test set is the following 10%, and so on.
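
A minimal sketch of what "inside the loop" could look like. It uses synthetic stand-in data (an assumption, since the real input is not shown), and the list name fold_rmse is just chosen for the example. One model is fitted per split and one RMSE is stored per test fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in data (assumption); replace with your own X and y.
rng = np.random.default_rng(0)
X = np.arange(110, dtype=float).reshape(-1, 1)
y = 0.5 * X.ravel() + rng.normal(scale=1.0, size=110)

tscv = TimeSeriesSplit(n_splits=10)
fold_rmse = []
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit and evaluate inside the loop: one model per split
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    fold_rmse.append(np.sqrt(mean_squared_error(y_test, y_pred)))

for k, rmse in enumerate(fold_rmse, start=1):
    print(f"fold {k}: RMSE = {rmse:.4f}")
```

With the fitting outside the loop, only the variables left over from the last iteration are used, which is why a single RMSE comes out.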
Reply
#3
Thanks for your reply! I should've posted the whole code. It looks like this:

import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# Preparing data
tweets=pd.read_csv('numTweets.csv', names=['Zeitstempel','Waehrung','AnzahlTweets']) 
prices=pd.read_csv('prices.csv', names=['Zeitstempel','Waehrung','Kurs','Volumen']) 
tweets1 = tweets.dropna(axis=1)
merged = prices.merge(tweets1, on='Zeitstempel')
del merged['Waehrung_y']
merged=merged.rename(columns={'Waehrung_x':'Waehrung'})
# Filter currency
mergedf=merged[(merged.Waehrung == 'BellaCoin')]

tscv = TimeSeriesSplit(n_splits=10)
print(tscv)  

X = mergedf['AnzahlTweets']
y = mergedf['Kurs']

X=X.values.reshape(-1,1)
y=y.values.reshape(-1,1)

for train_index, test_index in tscv.split(X):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]

linreg=LinearRegression()
linreg.fit(X_train,y_train)
y_pred=linreg.predict(X_test)
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
After adding the two reshape calls (X.values.reshape(-1,1) and y.values.reshape(-1,1)) I don't get an error message anymore. So I think my first problem is solved.

If I got this right, the cross validation for time series has to look like this!?

Train: 1 Test: 2
Train: 1,2 Test 3
Train: 1,2,3 Test 4
...


Is the RMSE then calculated as the mean of all the test results, and is my approach right then?

Thanks in advance!!!
Reply
#4
(May-05-2017, 01:24 PM)ulrich48155 Wrote: If I got this right, the cross validation for time series has to look like this!?

Train: 1 Test: 2
Train: 1,2 Test 3
Train: 1,2,3 Test 4
Yes, TimeSeriesSplit works this way - as was mentioned before, your prediction is based only on "past" folds.


(May-05-2017, 01:24 PM)ulrich48155 Wrote: Is the RMSE then calculated as the mean of all the test results
I think that the "final" RMSE should be calculated as the square root of (the sum of the squared per-fold RMSEs divided by the number of folds).
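
As a small sketch of that calculation, with made-up error values for three test folds (purely illustrative numbers, not results from the thread): when all test folds have the same size, combining the per-fold RMSEs this way gives the same number as the RMSE over all test predictions pooled together.

```python
import numpy as np

# Hypothetical prediction errors (y_test - y_pred) for three equally sized test folds
fold_errors = [
    np.array([1.0, -1.0, 2.0]),
    np.array([0.5, 0.5, -0.5]),
    np.array([3.0, 0.0, -1.0]),
]

# RMSE of each fold on its own
fold_rmse = np.array([np.sqrt(np.mean(e ** 2)) for e in fold_errors])

# Square root of (sum of squared per-fold RMSEs / number of folds)
combined_rmse = np.sqrt(np.sum(fold_rmse ** 2) / len(fold_rmse))

# RMSE over all test errors pooled together, for comparison
pooled_rmse = np.sqrt(np.mean(np.concatenate(fold_errors) ** 2))

print(combined_rmse, pooled_rmse)
```

A plain mean of the per-fold RMSEs is also commonly reported, but it is in general a slightly different number.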
Reply
#5
Is my approach right then?
Reply
#6
I think my problem is that I don't know how to implement the calculation of the RMSE for the different folds. I am splitting my set into 10 folds, which seems to work:
Output:
TRAIN: [ 0  1  2 ..., 28 29 30] TEST: [31 32 33 ..., 56 57 58]
TRAIN: [ 0  1  2 ..., 56 57 58] TEST: [59 60 61 ..., 84 85 86]
TRAIN: [ 0  1  2 ..., 84 85 86] TEST: [ 87  88  89 ..., 112 113 114]
TRAIN: [  0   1   2 ..., 112 113 114] TEST: [115 116 117 ..., 140 141 142]
TRAIN: [  0   1   2 ..., 140 141 142] TEST: [143 144 145 ..., 168 169 170]
TRAIN: [  0   1   2 ..., 168 169 170] TEST: [171 172 173 ..., 196 197 198]
TRAIN: [  0   1   2 ..., 196 197 198] TEST: [199 200 201 ..., 224 225 226]
TRAIN: [  0   1   2 ..., 224 225 226] TEST: [227 228 229 ..., 252 253 254]
TRAIN: [  0   1   2 ..., 252 253 254] TEST: [255 256 257 ..., 280 281 282]
TRAIN: [  0   1   2 ..., 280 281 282] TEST: [283 284 285 ..., 308 309 310]
When checking y_test and y_pred, I think I'm just getting the results from one fold:
Output:
y_test['0.01141000', '0.01124000', '0.01157000', '0.01215000', '0.01224000', '0.01278000', '0.01201000', '0.01222000', '0.01246000', '0.01231000', '0.01187000', '0.01180000', '0.01197000', '0.01195900', '0.01193326', '0.01187548', '0.01191003', '0.01187855', '0.01190000', '0.01175494', '0.01140000', '0.01176000', '0.01144880', '0.01146409', '0.01199000', '0.01203000', '0.01190000', '0.01233999']
y_pred['0.00976531', '0.00875308', '0.00835700', '0.00941323', '0.00778487', '0.00826898', '0.00879709', '0.00976531', '0.00708071', '0.00628854', '0.00747680', '0.00712472', '0.00686066', '0.00602448', '0.00963328', '0.00712472', '0.00580443', '0.00589245', '0.00642057', '0.00642057', '0.00919318', '0.00848903', '0.00738878', '0.00831299', '0.00976531', '0.00774086', '0.00769685', '0.00721274']
This is why I think my RMSE is based on only one fold and not on all 10. Is this right, or am I misunderstanding something?

Thanks in advance!!!
Reply

