Python Forum
Classification with shuffling
#1


Hello all,

This is my first post here, and I hope to find some help.

I am trying to reproduce the results of an example. The example isn't provided in full, so I had to write some parts myself with my limited knowledge of Python. A function reads the file "seeds.tsv" and returns the data and labels as follows (the function is defined in a separate file called "load.py"):

import numpy as np


def load_dataset(dataset_name):
    '''
    data,labels = load_dataset(dataset_name)

    Load a given dataset

    Returns
    -------
    data : numpy ndarray
    labels : numpy ndarray of str
    '''
    data = []
    labels = []
    with open('{0}.tsv'.format(dataset_name)) as ifile:
        for line in ifile:
            tokens = line.strip().split('\t')
            data.append([float(tk) for tk in tokens[:-1]])
            labels.append(tokens[-1])
    data = np.array(data)
    labels = np.array(labels)
    return data, labels
After reading the file, I used k-fold cross validation with the nearest neighbor algorithm as follows:

from load import load_dataset
import numpy as np
import random

feature_names = ['area',
                 'perimeter',
                 'compactness',
                 'length of kernel',
                 'width of kernel',
                 'asymmetry coefficient',
                 'length of kernel groove']

data, lables = load_dataset('seeds')
"""
rndInx = random.sample(range(len(lables)), len(lables))

data = data[rndInx]
lables = lables[rndInx]

print(lables)
"""
#print(lables.shape)

#This function returns the distance between two points in N-dimensional space
def distance(f1, f2):
    return np.sum((f1 - f2)**2)


#10-fold cross validation

fold = 10 #number of folds and blocks in each fold
elem = int(len(lables)/fold)#number of elements in each block

error = 0.0
for fi in range(fold):
    nearestLable = []
    training = np.ones(len(lables), bool)
    training[fi*fold: fi*fold + elem] = False
    testing = ~ training
    data_tr = data[training]
    data_ts = data[testing]
    labels_tr = lables[training]
    labels_ts = lables[testing]
    for x_ts in data_ts:
        dists = np.array([distance(x_ts, y_tr) for y_tr in data_tr])
    nearest = dists.argmin()
    nearestLable.append(labels_tr[nearest])
    error += np.sum(nearestLable != labels_ts)

print("\n\nThe accuracy of the nearest neighbor"
      " \nclassifier using %i-fold cross "
      "\nvalidation is: %1.2f" %(fold, (1-(error/len(lables)))))
When I run the code above with 10-fold cross validation and without randomizing the data, I get an accuracy of ~0.86 (the original example reports 0.88). When I randomize the data using the random indices rndInx (lines 15-18 in the second code segment, shown disabled inside the triple-quoted string), the accuracy drops to 0.38, and I am not sure why. The original data is ordered, in the sense that examples of the same class are placed contiguously. With 70-fold cross validation, however, I get an accuracy of 0.98.

Am I doing something wrong?

Thanks in advance
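
One thing worth checking in the posted loop is which samples each fold actually holds out. The slice training[fi*fold: fi*fold + elem] strides by the number of folds rather than by the block size, so the test blocks may overlap and may miss part of the data. The sketch below only reproduces that index arithmetic; it assumes 210 samples (the size of the UCI seeds dataset, which the post does not state), and the helper names held_out_ranges and covered are made up for illustration.

```python
def held_out_ranges(n_samples, fold):
    """Return the (start, stop) test slices produced by the posted indexing.

    Slices are clamped to n_samples, which mirrors how NumPy treats
    out-of-range slice bounds.
    """
    elem = int(n_samples / fold)  # block size, as in the post
    ranges = []
    for fi in range(fold):
        start = min(fi * fold, n_samples)
        stop = min(fi * fold + elem, n_samples)
        ranges.append((start, stop))
    return ranges


def covered(ranges):
    """Set of sample indices that land in a test block at least once."""
    idx = set()
    for start, stop in ranges:
        idx.update(range(start, stop))
    return idx


n = 210  # assumed sample count, not taken from the post

r10 = held_out_ranges(n, 10)
r70 = held_out_ranges(n, 70)

print("10 folds, first three test slices:", r10[:3])
print("10 folds, distinct samples tested:", len(covered(r10)))
print("70 folds, non-empty test slices:",
      sum(1 for start, stop in r70 if stop > start))
print("70 folds, distinct samples tested:", len(covered(r70)))
```

Under that assumption, the 10-fold run tests overlapping blocks drawn only from the front of the file, and the 70-fold run tests only a handful of samples, which would make both reported accuracies hard to interpret. Striding by the block size (fi*elem to (fi+1)*elem) gives disjoint blocks.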
#2
Could anyone comment on this, please? I realize it needs some machine learning background.
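
A possible explanation, offered as a sketch rather than a confirmed diagnosis: in the posted loop, the lines that compute nearest and append to nearestLable sit outside the inner for loop, so only the last test point of each fold gets a prediction, and the comparison nearestLable != labels_ts then broadcasts that single label against the whole test block. On class-ordered data this can look accurate by accident, while on shuffled data it would fall toward chance. The version below keeps the prediction inside the inner loop and uses disjoint test blocks. The function name nn_cross_val_accuracy, the seed parameter, and the synthetic three-cluster data are assumptions made for illustration; they stand in for seeds.tsv, which is not available here.

```python
import numpy as np


def nn_cross_val_accuracy(data, labels, fold=10, seed=0):
    """Hypothetical helper: 1-nearest-neighbor accuracy under k-fold CV."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))   # shuffle the indices once, up front
    blocks = np.array_split(order, fold)   # disjoint blocks covering every sample
    errors = 0
    for test_idx in blocks:
        training = np.ones(len(labels), bool)
        training[test_idx] = False
        data_tr, labels_tr = data[training], labels[training]
        predicted = []
        for x_ts in data[test_idx]:
            # Squared Euclidean distance to every training point
            dists = np.sum((data_tr - x_ts) ** 2, axis=1)
            # Inside the loop: one prediction per test point
            predicted.append(labels_tr[dists.argmin()])
        errors += np.sum(np.array(predicted) != labels[test_idx])
    return 1.0 - errors / len(labels)


# Synthetic stand-in for the real dataset: three well-separated 2-D clusters,
# stored class by class, like the ordered file described in the post.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
data = np.vstack([c + rng.normal(size=(30, 2)) for c in centers])
labels = np.repeat(np.array(['a', 'b', 'c']), 30)

acc = nn_cross_val_accuracy(data, labels, fold=10)
print("accuracy: %1.2f" % acc)
```

Because every sample is tested exactly once and each test point gets its own prediction, the result should no longer depend on whether the rows are stored in class order.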