Python Forum

For a project I needed to calculate and display a 10-fold cross-validation on a time series. After plotting, my results look like this:

http://imgur.com/a/15wbF

As you can see, both plots also contain the first fold, which I circled in green. This fold is not noteworthy and I would like to remove it. Because I work with time-series data, my 10-fold cross-validation has this structure:

Train 0 - Test 1
Train 1 - Test 2
Train 1,2 - Test 3
Train 1,2,3 - Test 4
...
Train 1,2,3,4,5,6,7,8,9 - Test 10
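
For reference, the expanding-window folds that TimeSeriesSplit produces can be printed directly. This is only a small sketch on 11 made-up samples (X_demo and tscv_demo are illustrative names, not part of the code below), chosen so that each chunk holds exactly one sample:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 11 dummy samples: with n_splits=10 the data is cut into 11 chunks of one sample each
X_demo = np.arange(11).reshape(-1, 1)
tscv_demo = TimeSeriesSplit(n_splits=10)

# Collect the index lists of every fold so they can be inspected
folds = [(train.tolist(), test.tolist()) for train, test in tscv_demo.split(X_demo)]
for train, test in folds:
    print("Train", train, "- Test", test)
```

Chunk 0 is only ever used for training, so no prediction is made for it.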

My code looks like this:

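# Imports assumed from the names used below (pl is taken to be matplotlib.pyplot)
import numpy as np
import matplotlib.pyplot as pl
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=10)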

X = mergedf['AnzahlTweets']
y = mergedf['Kurs']

X=X.values.reshape(-1,1)
y=y.values.reshape(-1,1)

# Cross-validation
linreg=LinearRegression()
rmse=[]
prediction=np.zeros(y.shape)
for train_index, test_index in tscv.split(X):
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]
   linreg.fit(X_train,y_train)
   y_pred=linreg.predict(X_test)
   prediction[test_index]=y_pred
   rmse.append(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  
   print('RMSE: %.10f' % np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

# Plotting
fig, axes = pl.subplots()
pl.plot(y,label='Actual')
pl.plot(prediction, color='red',label='Predicted',)
pl.ylabel('Price')
pl.xlabel('Fold')
pl.gca().xaxis.grid(True)
# 11 tick positions need 11 labels; the last boundary gets an empty label
pl.setp(axes, xticks=[51,98,145,192,239,286,333,380,427,474,521], xticklabels=['          1','          2','          3', '          4','          5','          6','          7','          8','          9','          10',''])
pl.legend()
pl.show()

prediction = prediction[:,0]
y = y[:,0]

m, b = np.polyfit(prediction, y, 1)

plrange=np.arange(0,0.000001,0.00000005)

pl.plot(prediction, y,'ro')
pl.plot(prediction, m*prediction + b)
pl.xlabel('Predicted')
pl.ylabel('Actual')
pl.xlim()
pl.gca().xaxis.grid(True)
pl.show()

Now my question: Is it possible to remove the first fold (Train 0 - Test 1) before plotting?

Thanks in advance!

Remove the first elements from prediction (and y) by slicing when plotting. You can get the length of the first training split either by computing it directly with

skip_size = len(X) - 10 * (len(X) // (10 + 1))   # for n_splits=10

or by using tscv.split again (or you could record it in the first iteration of your for loop ...)

skip_size = len(next(tscv.split(X)[0]))

After that it's just

pl.plot(y[skip_size:])
...
pl.plot(prediction[skip_size:], y[skip_size:], 'ro')
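
The loop variant mentioned above could look roughly like the following. It is a minimal sketch on synthetic data (X_demo, y_demo and the other *_demo names are made up for illustration), and it leaves the plotting out so that it only shows the slicing:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in data: 55 samples, one feature, roughly linear target
rng = np.random.default_rng(0)
X_demo = rng.random((55, 1))
y_demo = 2.0 * X_demo + rng.normal(scale=0.1, size=(55, 1))

tscv_demo = TimeSeriesSplit(n_splits=10)
prediction_demo = np.zeros(y_demo.shape)
skip_size = None

for train_index, test_index in tscv_demo.split(X_demo):
    if skip_size is None:
        # First iteration: the training indices are exactly the chunk
        # that never receives a prediction
        skip_size = len(train_index)
    model = LinearRegression().fit(X_demo[train_index], y_demo[train_index])
    prediction_demo[test_index] = model.predict(X_demo[test_index])

# Drop the never-predicted leading chunk before plotting
y_plot = y_demo[skip_size:]
prediction_plot = prediction_demo[skip_size:]
```

The first skip_size entries of the prediction array are still the zeros it was initialised with, which is why that stretch stands out in the plot. Note that after slicing, the x positions start at 0 again, so hard-coded xticks would have to be shifted by skip_size.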

Your plot is not piecewise linear, so it seems that your time series is not a time series (= data points in time order).

Thank you very much!

Is there a difference between the first and the second approach? When I try to implement the second one into my loop I get this message:

Error:
TypeError: 'generator' object is not subscriptable

Sorry for the late reply. There was a misplaced ), it should be:

skip_size = len(next(tscv.split(X))[0])

tscv.split(X) returns a generator object; calling next on it returns a tuple of arrays containing the indices of the first train and test split. We want the size of the first train split, so [0] is used to extract the train indices.
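
To answer the question about a difference between the two approaches: they should give the same number. Here is a small sketch that checks this, assuming 521 samples (a guess based on the tick positions in the plotting code; X_demo is a placeholder array, not the real data):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_splits = 10
X_demo = np.zeros((521, 1))  # assumed sample count, values do not matter here
tscv_demo = TimeSeriesSplit(n_splits=n_splits)

# split() yields (train_indices, test_indices) lazily, so take the first pair with next()
splits = tscv_demo.split(X_demo)
first_train, first_test = next(splits)
skip_from_generator = len(first_train)

# Same value computed directly: total length minus n_splits equally sized test chunks
skip_from_formula = len(X_demo) - n_splits * (len(X_demo) // (n_splits + 1))

print(skip_from_generator, skip_from_formula)
```

Under that assumption both values come out as 51, which matches the first tick position used in the plot.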

ulrich48155

zivoni
