Nov-07-2019, 03:26 AM
I am working on a small test dataset and I am getting ValueError: Found input variables with inconsistent numbers of samples: [5, 6]. How can I make X and y the same size? I added the line:
dataset.dropna(inplace=True)
to drop NA values so that the two arrays have the same number of samples. However, I still get the ValueError. The code is:
```python
# Importing Libraries
import numpy as np
import pandas as pd

# Import dataset
dataset = pd.read_csv("../output.tsv", delimiter = '\t')

# library to clean data
import re

# Natural Language Tool Kit
import nltk

nltk.download('stopwords')

# to remove stopword
from nltk.corpus import stopwords

# for Stemming propose
from nltk.stem.porter import PorterStemmer

# Initialize empty array
# to append clean text
corpus = []

# 1000 (reviews) rows to clean
for i in range(0, 5):

    # column : "Review", row ith
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])

    # convert all cases to lower cases
    review = review.lower()

    # split to array(default delimiter is " ")
    review = review.split()

    # creating PorterStemmer object to
    # take main stem of each word
    ps = PorterStemmer()

    # loop for stemming each word
    # in string array at ith row
    review = [ps.stem(word) for word in review
              if not word in set(stopwords.words('english'))]

    # rejoin all string array elements
    # to create back into a string
    review = ' '.join(review)

    # append each string to create
    # array of clean text
    corpus.append(review)

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer

# To extract max 1500 feature.
# "max_features" is attribute to
# experiment with to get better results
cv = CountVectorizer(max_features = 9)

# X contains corpus (dependent variable)
X = cv.fit_transform(corpus).toarray()

# y contains answers if review
# is positive or negative
y = dataset.iloc[:, 1].values

# Splitting the dataset into
# the Training set and Test set
from sklearn.model_selection import train_test_split

dataset.dropna(inplace=True)
print(X.shape)
print(y.shape)

# experiment with "test_size"
# to get better results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
print(X_train.shape)
print(y_train.shape)
```

The output from the code (for X shape and y shape) is
(5, 9)
(6,)
Error is ValueError: Found input variables with inconsistent numbers of samples: [5, 6]
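
A minimal sketch of where the two counts probably diverge, using pandas only. The sample DataFrame below is a hypothetical stand-in for output.tsv (the column name "Liked" and the missing review in the last row are assumptions, not taken from the real file). It suggests two things: the loop over range(0, 5) builds the corpus from 5 rows while y is read from all 6, and calling dropna after y has been extracted does not shrink the existing array. Cleaning the frame first and deriving both the corpus and y from it keeps the lengths aligned.

```python
import pandas as pd

# Hypothetical stand-in for output.tsv: six rows, the last review is missing.
dataset = pd.DataFrame({
    "Review": ["good food", "bad service", "loved it",
               "not tasty", "great place", None],
    "Liked": [1, 0, 1, 0, 1, 0],
})

# Reproduce the mismatch: the loop visits 5 rows, y is taken from all 6.
corpus_before = [dataset["Review"][i] for i in range(0, 5)]
y_before = dataset.iloc[:, 1].values
print(len(corpus_before), y_before.shape)

# Dropping rows now changes the DataFrame, not the array extracted earlier.
dataset.dropna(inplace=True)
dataset = dataset.reset_index(drop=True)
print(y_before.shape)

# Possible fix: clean first, then build the corpus and y from the same rows.
corpus = [text.lower() for text in dataset["Review"]]
y = dataset.iloc[:, 1].values
print(len(corpus), y.shape)
```

With the lengths matching, the same corpus can be passed through CountVectorizer and the resulting X should be accepted by train_test_split together with y. Looping over range(len(dataset)) instead of a hard-coded 5 avoids the mismatch returning when the file grows.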