Python Forum
ValueError: Found input variables with inconsistent numbers of samples: [5, 6]
#1
I am working with a small test dataset and I am getting ValueError: Found input variables with inconsistent numbers of samples: [5, 6]. How can I make X and y the same size? I added the line:

dataset.dropna(inplace=True)

to drop NA values so that the two arrays end up the same size. However, I still get the ValueError. The code is:
# Importing Libraries 
import numpy as np   
import pandas as pd  
  
# Import dataset 
dataset = pd.read_csv("../output.tsv", delimiter = '\t')


# library to clean data 
import re  
  
# Natural Language Tool Kit 
import nltk  
  
nltk.download('stopwords') 
  
# to remove stopword 
from nltk.corpus import stopwords 
  
# for stemming purposes
from nltk.stem.porter import PorterStemmer 
  
# Initialize empty array 
# to append clean text  
corpus = []  
  
# rows (reviews) to clean
for i in range(0, 5):  
      
    # column : "Review", row ith 
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i]) 
    
      
    # convert all cases to lower cases 
    review = review.lower()  
      
    # split to array(default delimiter is " ") 
    review = review.split()  
      
    # creating PorterStemmer object to 
    # take main stem of each word 
    ps = PorterStemmer()  
      
    # loop for stemming each word 
    # in string array at ith row     
    review = [ps.stem(word) for word in review
                if word not in set(stopwords.words('english'))]
                  
    # rejoin all string array elements 
    # to create back into a string 
    review = ' '.join(review)   
      
    # append each string to create 
    # array of clean text  
    corpus.append(review)

# Creating the Bag of Words model 
from sklearn.feature_extraction.text import CountVectorizer 
  
# Keep at most "max_features" features.
# "max_features" is a parameter to
# experiment with to get better results
cv = CountVectorizer(max_features = 9)  
  
# X contains the bag-of-words features (independent variable)
X = cv.fit_transform(corpus).toarray()  
  
# y contains answers if review 
# is positive or negative 
y = dataset.iloc[:, 1].values 

# Splitting the dataset into 
# the Training set and Test set 
from sklearn.model_selection import train_test_split


dataset.dropna(inplace=True)
print(X.shape)
print(y.shape)


# experiment with "test_size" 
# to get better results 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
print(X_train.shape)
print(y_train.shape)
The output from the code (the shapes of X and y) is:
(5, 9)
(6,)

The error is ValueError: Found input variables with inconsistent numbers of samples: [5, 6]
#2
Please always post the complete, unmodified error traceback.
It contains valuable debugging information.
#3
The full output is:

Error:
(5, 9)
(6,)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-65-67c92addcc9a> in <module>
     82 # experiment with "test_size"
     83 # to get better results
---> 84 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
     85 print(X_train.shape)
     86 print(y_train.shape)

~\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in train_test_split(*arrays, **options)
   2094         raise TypeError("Invalid parameters passed: %s" % str(options))
   2095
-> 2096     arrays = indexable(*arrays)
   2097
   2098     n_samples = _num_samples(arrays[0])

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in indexable(*iterables)
    228         else:
    229             result.append(np.array(X))
--> 230     check_consistent_length(*result)
    231     return result
    232

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
    203     if len(uniques) > 1:
    204         raise ValueError("Found input variables with inconsistent numbers of"
--> 205                          " samples: %r" % [int(l) for l in lengths])
    206
    207

ValueError: Found input variables with inconsistent numbers of samples: [5, 6]
#4
If I'm not mistaken, your X and y have different numbers of samples (5 and 6). They have to be equal: train_test_split needs one label in y for every row of X. Somewhere in your code the output ends up with a different number of rows than the input.
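A quick way to catch this before calling train_test_split is to compare the row counts yourself. The sketch below is only an illustration: assert_same_rows is a made-up helper (not part of scikit-learn), and the zero-filled arrays just mimic the shapes from the post.

```python
import numpy as np

def assert_same_rows(X, y):
    """Hypothetical helper: raise if X and y disagree on the number of samples."""
    if X.shape[0] != y.shape[0]:
        raise ValueError(f"X has {X.shape[0]} rows but y has {y.shape[0]}")

# Placeholder arrays with the same shapes as in the question.
X = np.zeros((5, 9))
y = np.zeros(6)

try:
    assert_same_rows(X, y)
    message = ""
except ValueError as err:
    message = str(err)

print(message)
```

Running a check like this right after building X and y points at the step that produced the mismatch, rather than at the split.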
#5
The range was incorrect. The file has 6 reviews, but the loop only processed the first 5:

for i in range(0, 5): 
I have corrected that and the code works fine.
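For anyone landing here later, one way to avoid hard-coding the row count is to loop over the column itself. This is a minimal sketch, not the original script: the six-row DataFrame and the "Liked" column name are invented stand-ins for output.tsv, and the stopword removal and stemming steps are left out to keep it short.

```python
import re
import pandas as pd

# Hypothetical stand-in for output.tsv: six reviews with labels.
dataset = pd.DataFrame({
    "Review": ["Loved it!", "Not good.", "Great food",
               "Terrible service", "Would return", "Never again"],
    "Liked": [1, 0, 1, 0, 1, 0],
})

corpus = []
# Iterating over the column covers every row, however many there are.
for text in dataset["Review"]:
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    corpus.append(" ".join(words))

y = dataset.iloc[:, 1].values

# corpus and y now always have the same length.
print(len(corpus), len(y))
```

If you prefer to keep the index-based loop, range(len(dataset)) would do the same job as long as the index is the default 0..n-1.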
#6
(Nov-07-2019, 03:26 AM)bongielondy Wrote: I am working on a small test data. I am getting a ValueError: Found input variables with inconsistent numbers of samples: [5, 6]. How can I make the X and y shapes to be the same size.
#7
I faced a similar problem while fitting a regression model. In my case, the number of rows in X was not equal to the number of rows in y. You are likely running into this because rows containing nulls are removed from X_train and y_train independently of each other. y_train probably has few or no nulls, and X_train probably has some. When a row is removed from X_train but the same row is not removed from y_train, the data goes out of sync and the lengths differ. Instead, remove nulls before you separate X and y.
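Roughly, that ordering could look like the sketch below. The DataFrame contents and the "Review"/"Liked" column names are assumptions made up for the example; the point is only that dropna runs on the whole frame before X and y are taken from it.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing review and one missing label.
dataset = pd.DataFrame({
    "Review": ["good", None, "bad", "fine", "awful", "great"],
    "Liked": [1, 0, 0, 1, np.nan, 1],
})

# Drop incomplete rows first, so text and label are removed together.
# reset_index keeps positional lookups like clean["Review"][i] valid.
clean = dataset.dropna().reset_index(drop=True)

# Only now separate the features from the target.
X_text = clean["Review"]
y = clean["Liked"].values

print(len(X_text), len(y))
```

Calling dropna after y has already been extracted, as in the first post, does not change y, because .values had already been taken from the unfiltered frame.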

