Classification with shuffling

PythonNewbie · (This post was last modified: Nov-11-2017, 05:09 PM by PythonNewbie.)

Hello all,

This is my first post here, and I hope to find some help.

I am trying to reproduce the results of an example. The example isn't provided in full, so I had to write some parts myself with my limited knowledge of Python. In it, a function reads the file "seeds.tsv" and returns the data and labels as follows (the function is defined in a separate file called "load.py"):

import numpy as np


def load_dataset(dataset_name):
    '''
    data,labels = load_dataset(dataset_name)

    Load a given dataset

    Returns
    -------
    data : numpy ndarray
    labels : numpy ndarray of str
    '''
    data = []
    labels = []
    with open('{0}.tsv'.format(dataset_name)) as ifile:
        for line in ifile:
            tokens = line.strip().split('\t')
            data.append([float(tk) for tk in tokens[:-1]])
            labels.append(tokens[-1])
    data = np.array(data)
    labels = np.array(labels)
    return data, labels

After reading the file, I used 10-fold cross validation with the nearest neighbor algorithm as follows:

from load import load_dataset
import numpy as np
import random

feature_names = ['area',
                  'perimeter',
                  'compactness',
                  'length of kernel',
                  'width of kernel',
                  'asymmetry coefficient',
                  'length of kernel groove']

data, lables = load_dataset('seeds')
"""
rndInx = random.sample(range(len(lables)), len(lables))

data = data[rndInx]
lables = lables[rndInx]

print(lables)
"""
#print(lables.shape)

#This function returns the squared Euclidean distance between two points in N-dimensional space
def distance(f1, f2):
    return np.sum((f1 - f2)**2)


#10-fold cross validation

fold = 10 #number of folds and blocks in each fold
elem = int(len(lables)/fold)#number of elements in each block

error = 0.0
for fi in range(fold):
    nearestLable = []
    training = np.ones(len(lables), bool)
    training[fi*fold: fi*fold + elem] = False
    testing = ~ training
    data_tr = data[training]
    data_ts = data[testing]
    labels_tr = lables[training]
    labels_ts = lables[testing]
    for x_ts in data_ts:
        dists = np.array([distance(x_ts, y_tr) for y_tr in data_tr])
    nearest = dists.argmin()
    nearestLable.append(labels_tr[nearest])
    error += np.sum(nearestLable != labels_ts)

print("\n\nThe accuracy of the nearest neighbor"
      " \nclassifier using %i-fold cross "
      "\nvalidation is: %1.2f" %(fold, (1-(error/len(lables)))))

When I run the above code without randomizing the data for 10-fold cross validation, I get an accuracy of ~0.86 (it should be 0.88, as reported in the original example). But when I randomize the data using the random indices rndInx (lines 15-18 in the second code segment), I get an accuracy of 0.38, and I am not sure why. The original data is ordered in the sense that examples of the same class are placed contiguously. And when I use 70-fold cross validation, I get an accuracy of 0.98!

Am I doing something wrong?

Thanks in advance
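
Two details in the loop above may account for these numbers. The snippet below is only a sketch to illustrate them, assuming 210 examples (the size of the UCI seeds dataset) and made-up labels; `covered` is a hypothetical helper, not part of the original example. First, the slice `fi*fold: fi*fold + elem` steps by the number of folds rather than by the block size, so the test blocks overlap and never reach the end of the data. Second, with `nearest` and `append` outside the inner loop, `nearestLable` holds a single prediction per fold, and NumPy broadcasts it against every test label.

```python
import numpy as np

n, fold = 210, 10   # assumed dataset size; 10 folds as in the post
elem = n // fold    # 21 examples per block


def covered(step):
    """Return the set of indices ever used for testing with the given step."""
    seen = set()
    for fi in range(fold):
        seen.update(range(fi * step, min(fi * step + elem, n)))
    return seen


# Stepping by `fold`: blocks start at 0, 10, ..., 90 and overlap.
print(len(covered(fold)))
# Stepping by `elem`: blocks start at 0, 21, ..., 189 and do not overlap.
print(len(covered(elem)))

# A one-element prediction list is broadcast against all test labels.
labels_ts = np.array(['a', 'b', 'c', 'a', 'b', 'c'])  # hypothetical shuffled labels
one_prediction = ['a']
mismatches = np.sum(one_prediction != labels_ts)
print(mismatches)
```

If this reading is right, shuffled test blocks contain a mix of all three classes, so a single prediction disagrees with roughly two thirds of them, which would be in the neighborhood of the 0.38 reported above.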

PythonNewbie · Nov-12-2017, 10:23 AM

Could anyone comment on this, please? I realize it needs some machine learning background.
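
For comparison, here is a minimal sketch of shuffled k-fold cross validation with a 1-nearest-neighbor classifier. It is not the code from the original example: the function name `nn_cross_val_accuracy`, the `seed` parameter, and the synthetic three-cluster data standing in for "seeds.tsv" are all assumptions made for illustration. It shuffles data and labels with the same permutation, splits the indices into non-overlapping blocks, and makes one prediction per test example inside the inner loop.

```python
import numpy as np


def nn_cross_val_accuracy(data, labels, fold=10, seed=0):
    """Estimate 1-nearest-neighbor accuracy with shuffled k-fold cross validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))  # same permutation for data and labels
    data, labels = data[order], labels[order]
    # np.array_split tolerates sizes that are not a multiple of `fold`.
    blocks = np.array_split(np.arange(len(labels)), fold)
    errors = 0
    for test_idx in blocks:
        training = np.ones(len(labels), bool)
        training[test_idx] = False
        data_tr, labels_tr = data[training], labels[training]
        for x_ts, y_ts in zip(data[test_idx], labels[test_idx]):
            # Squared Euclidean distance to every training example.
            dists = np.sum((data_tr - x_ts) ** 2, axis=1)
            # One prediction per test example, compared with its own label.
            errors += int(labels_tr[dists.argmin()] != y_ts)
    return 1.0 - errors / len(labels)


# Hypothetical stand-in for seeds.tsv: three well-separated 2-D clusters.
rng = np.random.default_rng(1)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
demo_data = np.vstack([c + rng.normal(size=(70, 2)) for c in centres])
demo_labels = np.repeat(np.array(['A', 'B', 'C']), 70)

accuracy = nn_cross_val_accuracy(demo_data, demo_labels)
print("accuracy: %1.2f" % accuracy)
```

With the real data, the call would presumably be `nn_cross_val_accuracy(data, lables)` after `load_dataset('seeds')`. The features are not normalized here, so the result need not match the figure reported in the original example.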
