Thread: KeyError -read multiple lines (/thread-22232.html), Python Forum (https://python-forum.io), General Coding Help
KeyError -read multiple lines - bongielondy - Nov-04-2019

I am new to Python. The example review code I am working from uses single-line reviews and runs well. My reviews span multiple lines. I converted the csv file to tsv. The reviews file has 2 columns, Review and Liked. Liked contains 0 or 1, for 'not liked' or 'liked'. This is for natural language processing.

    # Importing Libraries
    import numpy as np
    import pandas as pd

    # Import dataset
    dataset = pd.read_csv("../AfricanPride_b.txt", delimiter = '\t', error_bad_lines = False)

    # library to clean data
    import re

    # Natural Language Tool Kit
    import nltk

    nltk.download('stopwords')

    # to remove stopword
    from nltk.corpus import stopwords

    # for Stemming purpose
    from nltk.stem.porter import PorterStemmer

    # Initialize empty array
    # to append clean text
    corpus = []

    # 1000 (reviews) rows to clean
    for i in range(0, 5000):

        # column : "Review", row ith
        review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])

        # convert all cases to lower cases
        review = review.lower()

        # split to array(default delimiter is " ")
        review = review.split()

        # creating PorterStemmer object to
        # take main stem of each word
        ps = PorterStemmer()

        # loop for stemming each word
        # in string array at ith row
        review = [ps.stem(word) for word in review
                  if not word in set(stopwords.words('english'))]

        # rejoin all string array elements
        # to create back into a string
        review = ' '.join(review)

        # append each string to create
        # array of clean text
        corpus.append(review)

This results in a KeyError.

    KeyError                                  Traceback (most recent call last)
    <ipython-input-8-0f0b9d7dcfd5> in <module>
         21
         22     # column : "Review", row ith
    ---> 23     review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
         24
         25     # convert all cases to lower cases

The rest of the code is

    # Creating the Bag of Words model
    from sklearn.feature_extraction.text import CountVectorizer

    # To extract max 1500 features.
    # "max_features" is an attribute to
    # experiment with to get better results
    cv = CountVectorizer(max_features = 1500)

    # X contains corpus (features, i.e. the independent variable)
    X = cv.fit_transform(corpus).toarray()

    # y contains answers if review
    # is positive or negative
    y = dataset.iloc[:, 1].values

    # Splitting the dataset into
    # the Training set and Test set
    from sklearn.model_selection import train_test_split

    # experiment with "test_size"
    # to get better results
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

    # Fitting Random Forest Classification
    # to the Training set
    from sklearn.ensemble import RandomForestClassifier

    # n_estimators can be said as number of
    # trees, experiment with n_estimators
    # to get better results
    model = RandomForestClassifier(n_estimators = 501,
                                   criterion = 'entropy')

    model.fit(X_train, y_train)

    # Predicting the Test set results
    y_pred = model.predict(X_test)

    y_pred

    # Making the Confusion Matrix
    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(y_test, y_pred)

    cm

RE: KeyError -read multiple lines - MckJohan - Nov-04-2019

Can you investigate the code below, especially the value of ps and its contents? Try adding a couple of print statements. The KeyError should not be difficult to track down.
    # creating PorterStemmer object to
    # take main stem of each word
    ps = PorterStemmer()

    # loop for stemming each word
    # in string array at ith row
    review = [ps.stem(word) for word in review
              if not word in set(stopwords.words('english'))]

You can try adding

    try:
        code here
    except KeyError as e:
        print something here

RE: KeyError -read multiple lines - bongielondy - Nov-06-2019

Thank you. I have updated the code to

    for i in range(0, 1000):

        # column : "Review", row ith
        try:
            review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])

            # convert all cases to lower cases
            review = review.lower()

            # split to array(default delimiter is " ")
            review = review.split()

            # creating PorterStemmer object to
            # take main stem of each word
            ps = PorterStemmer()

            # loop for stemming each word
            # in string array at ith row
            review = [ps.stem(word) for word in review
                      if not word in set(stopwords.words('english'))]

            # rejoin all string array elements
            # to create back into a string
            review = ' '.join(review)

            # append each string to create
            # array of clean text
            corpus.append(review)
        except KeyError as e:
            print(ps.stem(review))

I seem to get numerous lines of the same review. I will look at the source file again and give feedback. The output is:

    wife visit johannesburg famili function stay locat citi never felt comfort secur stay african pride melros arch mani restaur importantli servic attent staff african pride provid except cannot rave enough stay accommod would use citi world class
    wife visit johannesburg famili function stay locat citi never felt comfort secur stay african pride melros arch mani restaur importantli servic attent staff african pride provid except cannot rave enough stay accommod would use citi world class
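
One possible reading of the repeated output, offered as an assumption and not a confirmed diagnosis: if the file yields fewer rows than the loop bound, every out-of-range i raises KeyError before review is reassigned, so the except branch prints the last successfully cleaned review again. The sketch below reproduces that with a hypothetical three-row DataFrame standing in for the reviews file. The helper name clean and the sample rows are illustrative only, and stemming and stopword removal are left out so it runs without NLTK downloads.

```python
import re

import pandas as pd

# Hypothetical three-row stand-in for the reviews file (not the real data).
dataset = pd.DataFrame({
    "Review": ["Great stay, world class!", "Noisy room.", "Attentive staff"],
    "Liked": [1, 0, 1],
})


def clean(text):
    """Keep letters only, lower-case the text, and collapse whitespace."""
    return " ".join(re.sub("[^a-zA-Z]", " ", text).lower().split())


# 1) Reproduce the symptom: the loop bound (5) exceeds the row count (3).
printed = []
review = None
for i in range(0, 5):
    try:
        review = clean(dataset["Review"][i])
    except KeyError:
        # `review` was not reassigned, so it still holds the last good row.
        printed.append(review)

# 2) Iterate over the rows that actually exist instead of a fixed count.
corpus = [clean(text) for text in dataset["Review"]]

print(printed)
print(len(corpus), len(dataset))
```

Here the failing iterations (i = 3 and i = 4) both report the cleaned third review, which mirrors the duplicated lines in the output above. Comparing len(dataset) with the loop bound, and checking dataset.columns for the exact name Review, would show whether this explanation applies to the real file.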