Python Forum
Cross-validation: evaluating estimator performance
#1
I'm trying to understand Cross-validation.

Very simply, if I want to do a train/test split I load the following, yet there's no mention of importing a dataset?

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> from sklearn import datasets
>>> from sklearn import svm
So if I have a large dataset in CSV format, how do I import it and split it? Also, I don't understand ((150, 4), (150,)) below; are these rows or columns?
>>> iris = datasets.load_iris()
>>> iris.data.shape, iris.target.shape
((150, 4), (150,))
#2
(Jun-27-2018, 11:24 PM)Grin Wrote: I don't understand ((150, 4), (150,)) below; are these rows or columns?

150 is the number of rows (samples) and 4 is the number of columns (features). When you execute iris = datasets.load_iris(), the variable iris becomes an instance of sklearn's internally defined Bunch class. iris then has several attributes: iris.data is a NumPy array of morphometric measurements of irises; iris.target holds the encoded species of each iris; iris.target_names lists the species names. Another useful attribute is DESCR; you can execute print(iris.DESCR) to see a full description of the iris dataset.
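A quick check of those attributes (assuming scikit-learn is installed):

```python
from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)    # (150, 4): 150 samples (rows), 4 features (columns)
print(iris.target.shape)  # (150,): one encoded species label per sample
print(iris.target_names)  # the three species names
```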

The purpose of the train_test_split(*arrays, ...) function is simply to split the arrays passed to it according to some rules: each array is split into two disjoint sets. In particular, such splitting is needed when estimating the performance of classification algorithms. If you have some classification method and try to estimate its accuracy by applying it to the source dataset (the dataset on which the method was trained), you will likely overestimate its accuracy, and the obtained estimates will be unrealistically good. This is because the method has already seen (was trained on) that data. To get a more reliable estimate of the classifier's accuracy, you can hold out a piece of data that is not used during training; once the classifier is trained, you evaluate it on that held-out, unseen piece of the data.
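As an illustration of that overestimation (my own sketch, not a prescribed recipe): a 1-nearest-neighbor classifier effectively memorizes its training data, so scoring it on the training set is maximally optimistic, while the held-out score is a more honest estimate.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

# 1-NN predicts each training point from its own (stored) label,
# so training accuracy is a perfect, and meaningless, 1.0.
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))
```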

You can use train_test_split to split any arrays passed to it.

import numpy as np
from sklearn.model_selection import train_test_split

y = np.arange(10, 20)
X, _ = np.meshgrid(np.arange(10), np.arange(10))  # X is a 10x10 array
train_X, test_X, train_y, test_y = train_test_split(X, y)
# you can pass another array with the same number of rows/items
z = np.arange(30, 40)
train_X, test_X, train_y, test_y, train_z, test_z = train_test_split(X, y, z)
# Note: all arrays are split randomly, but in the same way!
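As for the CSV part of the question: pandas is the usual way to load a CSV, and the resulting columns can be passed straight to train_test_split. A self-contained sketch (the file name, column names, and test_size here are made up for illustration; the block writes its own tiny CSV so it runs as-is):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Write a tiny example CSV so this sketch is runnable;
# in practice you would already have your own file.
with open("data.csv", "w") as f:
    f.write("a,b,label\n")
    for i in range(10):
        f.write(f"{i},{i * 2},{i % 2}\n")

df = pd.read_csv("data.csv")
X = df[["a", "b"]].values  # feature columns
y = df["label"].values     # label column
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)  # 7 training rows, 3 test rows
```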