Nov-11-2017, 05:08 PM
(This post was last modified: Nov-11-2017, 05:09 PM by PythonNewbie.)
Hello all,
This is my first post here, and I hope to find some help.
I am trying to reproduce the results of an example. The example isn't provided in full, so I had to write some parts myself with my limited knowledge of Python. In it, a function reads the file "seeds.tsv" and returns the data and labels as follows (the function is defined in a separate file called "load.py"):
import numpy as np

def load_dataset(dataset_name):
    '''
    data,labels = load_dataset(dataset_name)

    Load a given dataset

    Returns
    -------
    data : numpy ndarray
    labels : list of str
    '''
    data = []
    labels = []
    with open('{0}.tsv'.format(dataset_name)) as ifile:
        for line in ifile:
            tokens = line.strip().split('\t')
            data.append([float(tk) for tk in tokens[:-1]])
            labels.append(tokens[-1])
    data = np.array(data)
    labels = np.array(labels)
    return data, labels

After reading the file, I used k-fold cross-validation for the nearest neighbor algorithm as follows:
from load import load_dataset
import numpy as np
import random

feature_neames = ['area',
                  'perimeter',
                  'compactness',
                  'length of kernel',
                  'width of kernel',
                  'asymmetry coefficient',
                  'length of kernel groove']

data, lables = load_dataset('seeds')

"""
rndInx = random.sample(range(len(lables)), len(lables))
data = data[rndInx]
lables = lables[rndInx]
print(lables)
"""
#print(lables.shape)

#This function returns the distance between two points in N-dimensional space
def distance(f1, f2):
    return np.sum((f1 - f2)**2)

#10-fold cross validation
fold = 10 #number of folds and blocks in each fold
elem = int(len(lables)/fold)#number of elements in each block
error = 0.0
for fi in range(fold):
    nearestLable = []
    training = np.ones(len(lables), bool)
    training[fi*fold: fi*fold + elem] = False
    testing = ~ training
    data_tr = data[training]
    data_ts = data[testing]
    labels_tr = lables[training]
    labels_ts = lables[testing]
    for x_ts in data_ts:
        dists = np.array([distance(x_ts, y_tr) for y_tr in data_tr])
        nearest = dists.argmin()
        nearestLable.append(labels_tr[nearest])
    error += np.sum(nearestLable != labels_ts)

print("\n\nThe accuracy of the nearest neighbor"
      " \nclassifier using %i-fold cross "
      "\nvalidation is: %1.2f" %(fold, (1-(error/len(lables)))))

When I run the above code with 10-fold cross-validation and without randomizing the data, I get an accuracy of ~0.86 (it should be 0.88, as reported in the original example). But when I randomize the data using the random indices rndInx (the block commented out with triple quotes in the second code segment), I get an accuracy of 0.38! I am not sure why. The original data is ordered, in the sense that examples of the same class are placed contiguously. And when I use 70-fold cross-validation, I get an accuracy of 0.98!
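For comparison, here is a minimal sketch of fold masks that split the indices into contiguous, non-overlapping blocks, so that every example is tested exactly once. The helper name `fold_masks` and the size `n = 210` are assumptions for illustration and are not part of the original example. In this sketch each block starts at `fi * elem`, which may be worth checking against the slice used in the loop above.

```python
import numpy as np

def fold_masks(n, fold):
    """Yield (training, testing) boolean masks for contiguous folds.

    Hypothetical helper, not taken from the original example.
    """
    elem = n // fold  # number of elements in each block
    for fi in range(fold):
        testing = np.zeros(n, bool)
        # Assumption: block fi covers indices [fi*elem, (fi+1)*elem)
        testing[fi * elem:(fi + 1) * elem] = True
        yield ~testing, testing

n = 210  # illustrative size (assumed), divisible by 10
counts = np.zeros(n, int)  # how many times each index lands in a test block
for training, testing in fold_masks(n, 10):
    counts += testing

print(counts.min(), counts.max())  # prints: 1 1
```

If `n` is not divisible by `fold`, the last `n % fold` examples are never tested in this sketch, so the error would need to be divided by the number of examples actually tested.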
Am I doing something wrong?
Thanks in advance