Python Forum
Analyzing large text file with nltk.corpus (stopwords )
I’ve got a script, which @snippsat helped me with previously, that ranks the 10 most commonly used words in a large public domain book such as Alice in Wonderland. The text file is attached to this forum post.

Here is the script as it appears now:

from collections import Counter
import re

def open_file():
   with open('Alice.txt') as f:
       text = f.read().lower()
   return text

def main(text):
   words = re.findall(r'\w+', text)
   top_10 = Counter(words).most_common(10)
   for word,count in top_10:
       print(f'{word:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
   text = open_file()
   main(text)
When I run the script, here is the correct, predictable and expected output:

$ python script.py
the  -->  1818
and  -->   940
to   -->   809
a    -->   690
of   -->   631
it   -->   610
she  -->   553
i    -->   543
you  -->   481
said -->   462

Great!

Now I am trying to extend this script by filtering out all the common words (such as ‘the’, ‘and’, ‘to’). I’m using a guide called “Using NLTK To Remove Stopwords From A Text File”. Here is the latest iteration of my script (with new code added at lines 2, 11 and 12):

from collections import Counter
from nltk.corpus import stopwords
import re

def open_file():
   with open('Alice.txt') as f:
       text = f.read().lower()
   return text

def main(text):
   stoplist = stopwords.words('english') # Bring in the default English NLTK stop words
   clean = [word for word in text.split() if word not in stoplist] # Keep only the words that are not stop words
   words = re.findall('\w+', clean)
   top_10 = Counter(words).most_common(10)
   for word,count in top_10:
       print(f'{word:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
   text = open_file()
   main(text)
Here is my traceback:

$ python daniel_script6_with_snippsat.py
Traceback (most recent call last):
  File "daniel_script6_with_snippsat.py", line 20, in <module>
    main(text)
  File "daniel_script6_with_snippsat.py", line 13, in main
    words = re.findall('\w+', clean)
  File "/usr/lib/python3.7/re.py", line 223, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object

What is this traceback trying to say?

The list comprehension psychs me out. I’m not sure how to convert it back to a regular nested for loop. I’m not even sure if this is the issue. Here is my attempt at converting what I have at line 12 into a regular nested for loop:

clean = for word in text.split():
       if word not in stoplist:
            word +=
So I suppose my three questions are:
  1. What is the traceback trying to say?
  2. How would you people modify the script to successfully process stopwords so as to eliminate the ranking of common words?
  3. How would you people re-write the list comprehension loop into a regular nested for loop?
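
For reference, here is a minimal sketch touching on all three questions. It is one possible approach, not a definitive fix. It first reproduces the TypeError in isolation by passing a list to re.findall, which appears to be what happens at line 13, since the list comprehension produces a list rather than a string. It then tokenizes the text before filtering, and spells the comprehension out as an ordinary for loop. STOPLIST is a hypothetical stand-in for stopwords.words('english') so that the sketch runs without NLTK, and remove_stopwords, top_words and the sample sentence are illustrative names and data, not part of the original script.

```python
import re
from collections import Counter

# Hypothetical stand-in for stopwords.words('english'); the real NLTK list is longer.
STOPLIST = {'the', 'and', 'to', 'a', 'of', 'it', 'she', 'i', 'you'}

# Question 1: re.findall expects a string, so handing it a list raises TypeError.
try:
    re.findall(r'\w+', ['a', 'list', 'of', 'words'])
except TypeError as exc:
    error_message = str(exc)

def remove_stopwords(words, stoplist):
    """Question 3: loop form of [word for word in words if word not in stoplist]."""
    clean = []
    for word in words:
        if word not in stoplist:
            clean.append(word)
    return clean

def top_words(text, stoplist, n=10):
    """Question 2: tokenize the string first, then filter the resulting list."""
    words = re.findall(r'\w+', text.lower())
    clean = remove_stopwords(words, stoplist)
    return Counter(clean).most_common(n)

# Illustrative sample text, not taken from Alice.txt.
sample = 'The cat and the hat. The cat said: you and I!'
result = top_words(sample, STOPLIST, n=2)
for word, count in result:
    print(f'{word:<4} {"-->":^4} {count:>4}')
```

In the real script, the text returned by open_file() would take the place of sample, and stopwords.words('english') would take the place of STOPLIST. Wrapping the NLTK list in set() is optional but makes each membership test faster.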

Attached Files

.txt   Alice.txt (Size: 159.97 KB / Downloads: 1,863)
Messages In This Thread
Analyzing large text file with nltk.corpus (stopwords ) - by Drone4four - May-31-2019, 07:25 PM
