Nov-04-2019, 09:19 PM
I am new to Python. An example reviews script that uses single-line reviews runs well, but mine has multi-line reviews. I converted the CSV file to TSV. The reviews file has two columns, Review and Liked; Liked contains 0 or 1, for 'not liked' or 'liked'. This is for natural language processing.
KeyError Traceback (most recent call last)
<ipython-input-8-0f0b9d7dcfd5> in <module>
21
22 # column : "Review", row ith
---> 23 review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
24
25 # convert all cases to lower cases
The rest of the code is
# Importing Libraries
import numpy as np
import pandas as pd

# Import dataset
dataset = pd.read_csv("../AfricanPride_b.txt", delimiter = '\t', error_bad_lines = False)

# library to clean data
import re

# Natural Language Tool Kit
import nltk
nltk.download('stopwords')

# to remove stopwords
from nltk.corpus import stopwords

# for stemming purposes
from nltk.stem.porter import PorterStemmer

# Initialize empty array
# to append clean text
corpus = []

# 5000 (review) rows to clean
for i in range(0, 5000):
    # column : "Review", row ith
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])

    # convert all cases to lower case
    review = review.lower()

    # split to array (default delimiter is " ")
    review = review.split()

    # creating PorterStemmer object to
    # take the main stem of each word
    ps = PorterStemmer()

    # loop for stemming each word
    # in the string array at the ith row
    review = [ps.stem(word) for word in review
              if not word in set(stopwords.words('english'))]

    # rejoin all string array elements
    # back into a single string
    review = ' '.join(review)

    # append each string to create
    # the array of clean text
    corpus.append(review)

This results in a KeyError.
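For reference, here is a minimal version of the cleaning loop on a tiny made-up frame (the two rows are invented stand-ins for my data, and I dropped the NLTK stemming/stopword steps to keep it self-contained). This sketch runs cleanly, which makes me suspect the real file simply has fewer than 5000 rows, since `dataset['Review'][i]` looks rows up by index label and raises KeyError when a label is missing:

```python
import re
import pandas as pd

# Two invented rows standing in for ../AfricanPride_b.txt
dataset = pd.DataFrame({"Review": ["Great food, loved it!", "Not good at all."],
                        "Liked": [1, 0]})

corpus = []
for i in range(len(dataset)):          # iterate only over rows that exist
    # .iloc looks up by position, so it cannot miss an index label
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'].iloc[i])
    corpus.append(' '.join(review.lower().split()))

print(corpus)  # ['great food loved it', 'not good at all']
```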
The rest of the code is
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer

# To extract max 1500 features.
# "max_features" is the attribute to
# experiment with to get better results
cv = CountVectorizer(max_features = 1500)

# X contains the vectorized corpus (independent variables)
X = cv.fit_transform(corpus).toarray()

# y contains the answers: whether a review
# is positive or negative
y = dataset.iloc[:, 1].values

# Splitting the dataset into
# the Training set and Test set
from sklearn.model_selection import train_test_split

# experiment with "test_size"
# to get better results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Fitting Random Forest Classification
# to the Training set
from sklearn.ensemble import RandomForestClassifier

# n_estimators is the number of trees;
# experiment with n_estimators
# to get better results
model = RandomForestClassifier(n_estimators = 501, criterion = 'entropy')
model.fit(X_train, y_train)

# Predicting the Test set results
y_pred = model.predict(X_test)
y_pred

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm
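For completeness, once `cm` is available I plan to read the accuracy off the confusion matrix. A small sketch of that relationship (the matrix entries below are made up, not results from my run):

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows are true labels,
# columns are predicted labels; the diagonal holds correct predictions
cm = np.array([[50, 10],
               [5, 60]])

accuracy = cm.trace() / cm.sum()   # correct predictions / all predictions
print(accuracy)  # 0.88
```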