Python Forum
Cross-validation: evaluating estimator performance
#1
I'm trying to understand Cross-validation.

Very simply, if I want to do a train/test split I load the following, yet there's no mention of importing a dataset?

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> from sklearn import datasets
>>> from sklearn import svm
So if I have a large dataset in CSV format, how do I import it and split it? Also, I don't understand ((150, 4), (150,)) below; are these rows or columns?
>>> iris = datasets.load_iris()
>>> iris.data.shape, iris.target.shape
((150, 4), (150,))
#2
(Jun-27-2018, 11:24 PM)Grin Wrote: I don't understand ((150, 4), (150,)) below; are these rows or columns?

150 is the number of rows (samples) and 4 is the number of columns (features). When you execute iris = datasets.load_iris(), the variable iris becomes an instance of sklearn's internally defined Bunch class. iris then has several attributes: iris.data is a NumPy array of morphometric measurements of irises; iris.target holds the encoded species of each iris; iris.target_names lists the species names. Another useful attribute is DESCR; you can execute print(iris.DESCR) to see a full description of the iris dataset.
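A quick check of those attributes (assuming scikit-learn is installed):

```python
from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)    # (150, 4): 150 samples (rows), 4 features (columns)
print(iris.target.shape)  # (150,): one encoded species label per sample
print(iris.target_names)  # the three species names
```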

The purpose of the train_test_split(*arrays, ...) function is simply to split the arrays passed to it according to some rules: each array is split into two disjoint sets. In particular, such splitting is needed when estimating the performance of classification algorithms. If you have some classification method and try to estimate its accuracy by applying it to the source dataset (the dataset on which the method was trained), you will likely overestimate its accuracy, and the obtained estimates will be unrealistically good. This is because the method has already seen (was trained on) that data. To get a more reliable estimate of the classifier's accuracy, you can hold out a piece of data that is not used during training; once the classifier is trained, you evaluate it on that held-out, unseen piece of the data.
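As an illustration of that overestimation (my own sketch, not a prescribed recipe): a 1-nearest-neighbor classifier effectively memorizes its training data, so scoring it on the training set is maximally optimistic, while the held-out score is a more honest estimate.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

# 1-NN predicts each training point from its own (stored) label,
# so training accuracy is a perfect, and meaningless, 1.0.
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))
```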

You can use train_test_split to split any arrays passed to it.

import numpy as np
from sklearn.model_selection import train_test_split

y = np.arange(10, 20)
X, _ = np.meshgrid(np.arange(10), np.arange(10))  # X is a 10x10 array
train_X, test_X, train_y, test_y = train_test_split(X, y)
# you can pass another array with the same number of rows/items
z = np.arange(30, 40)
train_X, test_X, train_y, test_y, train_z, test_z = train_test_split(X, y, z)
# Note: all arrays are split randomly, but in the same way!
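As for the CSV part of the question: pandas is the usual way to load a CSV, and the resulting columns can be passed straight to train_test_split. A self-contained sketch (the file name, column names, and test_size here are made up for illustration; the block writes its own tiny CSV so it runs as-is):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Write a tiny example CSV so this sketch is runnable;
# in practice you would already have your own file.
with open("data.csv", "w") as f:
    f.write("a,b,label\n")
    for i in range(10):
        f.write(f"{i},{i * 2},{i % 2}\n")

df = pd.read_csv("data.csv")
X = df[["a", "b"]].values  # feature columns
y = df["label"].values     # label column
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)  # 7 training rows, 3 test rows
```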