Python Forum
KeyError -read multiple lines
#1
I am new to Python. The example review code I am following uses single-line reviews and runs well, but my reviews span multiple lines. I converted the CSV file to TSV. The reviews file has two columns, Review and Liked. Liked contains 0 or 1, for 'not liked' or 'liked'. This is for natural language processing.
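
For reference, a tab-separated file can hold a review that spans several lines as long as the field is quoted. The sketch below uses the standard-library csv module on a made-up two-row sample (the text and values are invented for illustration, not taken from AfricanPride_b.txt) to show that such a record parses as a single row.

```python
import csv
import io

# Made-up sample: the first review contains an embedded newline inside quotes.
raw = 'Review\tLiked\n"Lovely hotel.\nGreat staff."\t1\nToo noisy\t0\n'

# newline="" leaves line endings untouched so the csv module can handle them.
reader = csv.DictReader(io.StringIO(raw, newline=""), delimiter="\t")
rows = list(reader)

for row in rows:
    # repr() makes the embedded newline visible as \n
    print(repr(row["Review"]), row["Liked"])
```

If the reviews in the real file are not quoted, each physical line is likely to be read as its own row, which is worth checking by comparing len(dataset) with the number of reviews you expect.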

# Importing Libraries 
import numpy as np   
import pandas as pd  
  
# Import dataset 
dataset = pd.read_csv("../AfricanPride_b.txt", delimiter = '\t', error_bad_lines = False)

# library to clean data 
import re  
  
# Natural Language Tool Kit 
import nltk  
  
nltk.download('stopwords') 
  
# to remove stopword 
from nltk.corpus import stopwords 
  
# for stemming purposes  
from nltk.stem.porter import PorterStemmer 
  
# Initialize empty array 
# to append clean text  
corpus = []  
  
# 5000 (reviews) rows to clean 
for i in range(0, 5000):  
      
    # column : "Review", row ith 
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])  
      
    # convert all cases to lower cases 
    review = review.lower()  
      
    # split to array(default delimiter is " ") 
    review = review.split()  
      
    # creating PorterStemmer object to 
    # take main stem of each word 
    ps = PorterStemmer()  
      
    # loop for stemming each word 
    # in string array at ith row     
    review = [ps.stem(word) for word in review 
                if word not in set(stopwords.words('english'))]  
                  
    # rejoin all string array elements 
    # to create back into a string 
    review = ' '.join(review)   
      
    # append each string to create 
    # array of clean text  
    corpus.append(review) 
This results in a KeyError.
KeyError Traceback (most recent call last)
<ipython-input-8-0f0b9d7dcfd5> in <module>
21
22 # column : "Review", row ith
---> 23 review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
24
25 # convert all cases to lower cases
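
One possible cause, assuming the DataFrame has the default integer index: dataset['Review'][i] looks i up as a label, so any i at or beyond the number of rows actually loaded raises KeyError. With error_bad_lines = False (on_bad_lines = 'skip' in newer pandas), rows that pandas cannot parse are skipped, so the frame may hold fewer than 5000 rows. A minimal sketch with a made-up three-row frame (the data and the helper name first_missing_label are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the reviews file: only 3 rows were loaded.
dataset = pd.DataFrame({
    "Review": ["Great stay!", "Too noisy.\nWould not return.", "Friendly staff"],
    "Liked": [1, 0, 1],
})


def first_missing_label(frame, column, upper):
    """Return the first i in range(upper) whose lookup raises KeyError, else None."""
    for i in range(upper):
        try:
            frame[column][i]
        except KeyError:
            return i
    return None


missing = first_missing_label(dataset, "Review", 5000)
print(len(dataset), missing)

# Iterating over the column itself never asks for a row that is not there.
lowered = [text.lower() for text in dataset["Review"]]
print(len(lowered))
```

If this is what is happening, looping over dataset['Review'] directly, or over range(len(dataset)), avoids the hard-coded 5000.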

The rest of the code is:
# Creating the Bag of Words model 
from sklearn.feature_extraction.text import CountVectorizer 
  
# Keep at most 1500 features. 
# "max_features" is a parameter to 
# experiment with to get better results 
cv = CountVectorizer(max_features = 1500)  
  
# X contains the bag-of-words features (independent variable) 
X = cv.fit_transform(corpus).toarray()  
  
# y contains the labels: whether each 
# review is positive or negative 
y = dataset.iloc[:, 1].values 

# Splitting the dataset into 
# the Training set and Test set 
from sklearn.model_selection import train_test_split 
  
# experiment with "test_size" 
# to get better results 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25) 

# Fitting Random Forest Classification 
# to the Training set 
from sklearn.ensemble import RandomForestClassifier 
  
# n_estimators is the number of trees 
# in the forest; experiment with it 
# to get better results  
model = RandomForestClassifier(n_estimators = 501, 
                            criterion = 'entropy') 
                              
model.fit(X_train, y_train)

# Predicting the Test set results 
y_pred = model.predict(X_test) 
  
y_pred 

# Making the Confusion Matrix 
from sklearn.metrics import confusion_matrix 
  
cm = confusion_matrix(y_test, y_pred) 
  
cm
#2
Can you investigate the code below, especially the value of ps and its contents? Try adding a couple of print statements. The KeyError should not be difficult to track down.

    # creating PorterStemmer object to 
    # take main stem of each word 
    ps = PorterStemmer()  
       
    # loop for stemming each word 
    # in string array at ith row     
    review = [ps.stem(word) for word in review 
                if word not in set(stopwords.words('english'))]  
You can try wrapping it like this:
try:
    ...  # code here
except KeyError as e:
    print(e)  # print something here, such as the missing key
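
To make that suggestion concrete, here is a small sketch that records which lookups fail instead of only printing. It assumes the KeyError comes from the row lookup itself and uses a plain dict as a hypothetical stand-in for dataset['Review'], so the names reviews, failed and cleaned are illustrative only.

```python
# Hypothetical stand-in for dataset['Review']: labels 0..2 exist, nothing else.
reviews = {0: "Great stay", 1: "Too noisy", 2: "Friendly staff"}

failed = []   # keys whose lookup raised KeyError
cleaned = []  # lower-cased text of the reviews that were found

for i in range(5):
    try:
        text = reviews[i]
    except KeyError as e:
        # e.args[0] is the key that could not be found
        failed.append(e.args[0])
        continue
    cleaned.append(text.lower())

print("missing keys:", failed)
print("cleaned:", cleaned)
```

Keeping only the lookup inside the try block makes it clear which statement raised, and the list of missing keys shows where the data stops.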
#3
Thank you. I have updated the code to:

for i in range(0, 1000):  
     
        
    # column : "Review", row ith
    try:
        review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])  
      
        # convert all cases to lower cases 
        review = review.lower()  
      
        # split to array(default delimiter is " ") 
        review = review.split()  
      
        # creating PorterStemmer object to 
        # take main stem of each word 
        ps = PorterStemmer()  
      
        # loop for stemming each word 
        # in string array at ith row     
        review = [ps.stem(word) for word in review 
                if word not in set(stopwords.words('english'))]  
                  
        # rejoin all string array elements 
        # to create back into a string 
        review = ' '.join(review)   
      
        # append each string to create 
        # array of clean text  
        corpus.append(review)
    except KeyError as e:
        print(ps.stem(review))
I seem to get numerous lines of the same review. I will look at the source file again and give feedback. The output is:

wife visit johannesburg famili function stay locat citi never felt comfort secur stay african pride melros arch mani restaur importantli servic attent staff african pride provid except cannot rave enough stay accommod would use citi world class
wife visit johannesburg famili function stay locat citi never felt comfort secur stay african pride melros arch mani restaur importantli servic attent staff african pride provid except cannot rave enough stay accommod would use citi world class
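
A likely reading of the repeated lines, assuming the KeyError is raised by the lookup on the first line of the try block: when that lookup fails, review is never reassigned, so the except branch prints whatever the last successful iteration left in it. The sketch below reproduces the effect with a plain dict as a hypothetical stand-in for the column.

```python
# Hypothetical stand-in for dataset['Review'] with only two rows.
reviews = {0: "first review", 1: "second review"}

printed = []   # what the except branch would have printed
review = None

for i in range(4):
    try:
        review = reviews[i].upper()
    except KeyError:
        # review still holds the value from the last successful iteration
        printed.append(review)

print(printed)  # ['SECOND REVIEW', 'SECOND REVIEW']
```

If that matches your case, the repeated review would be the last row that loaded successfully, and every later i up to 999 fails the same way.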