Analyzing large text file with nltk.corpus (stopwords)

Drone4four · (This post was last modified: May-31-2019, 08:05 PM by Drone4four.)

I’ve got a script, which @snippsat helped me with previously, that ranks the 10 most commonly used words in a large public domain book such as Alice in Wonderland. The text file is attached to this forum post.

Here is the script as it appears now:

from collections import Counter
import re

def open_file():
   with open('Alice.txt') as f:
       text = f.read().lower()
   return text

def main(text):
   words = re.findall('\w+', text)
   top_10 = Counter(words).most_common(10)
   for word,count in top_10:
       print(f'{word:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
   text = open_file()
   main(text)

When I run the script, I get the expected output:

$ python script.py
the --> 1818
and --> 940
to --> 809
a --> 690
of --> 631
it --> 610
she --> 553
i --> 543
you --> 481
said --> 462

Great!

Now I am trying to extend this script by filtering out the common words (such as ‘the’, ‘and’, ‘to’). I’m following a guide called “Using NLTK To Remove Stopwords From A Text File”. Here is the latest iteration of my script (with new code added at lines 2, 11 and 12):

from collections import Counter
from nltk.corpus import stopwords
import re

def open_file():
   with open('Alice.txt') as f:
       text = f.read().lower()
   return text

def main(text):
   stoplist = stopwords.words('english') # Bring in the default English NLTK stop words
   clean = [word for word in text.split() if word not in stoplist] # Keep only the words that are not stop words
   words = re.findall('\w+', clean)
   top_10 = Counter(words).most_common(10)
   for word,count in top_10:
       print(f'{word:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
   text = open_file()
   main(text)

Here is my traceback:

$ python daniel_script6_with_snippsat.py
Traceback (most recent call last):
  File "daniel_script6_with_snippsat.py", line 20, in <module>
    main(text)
  File "daniel_script6_with_snippsat.py", line 13, in main
    words = re.findall('\w+', clean)
  File "/usr/lib/python3.7/re.py", line 223, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object

What is this traceback trying to say?

The list comprehension psychs me out. I’m not sure how to convert it back to a regular nested for loop. I’m not even sure if this is the issue. Here is my attempt at converting what I have at line 12 into a regular nested for loop:

clean = for word in text.split():
       if word not in stoplist:
            word +=
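
For reference, a list comprehension of the form [x for x in items if condition] generally corresponds to a plain (single, not nested) for loop that starts with an empty list and appends every item that passes the test. Below is a minimal sketch of that idea. The tiny stop list and the sample sentence are made up for illustration and stand in for the NLTK list and Alice.txt:

```python
# Hypothetical stand-ins for stopwords.words('english') and the book text
stoplist = ['the', 'and', 'to']
text = 'the cat and the hat went to town'

# List-comprehension form, as on line 12 of the script
clean = [word for word in text.split() if word not in stoplist]

# Equivalent plain for loop: start with an empty list, append the keepers
clean_loop = []
for word in text.split():
    if word not in stoplist:
        clean_loop.append(word)

print(clean_loop)  # ['cat', 'hat', 'went', 'town']
```

Both forms build a list of strings, not a single string.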

So I suppose my three questions are:

1. What is the traceback trying to say?
2. How would you people modify the script to successfully process stopwords so as to eliminate the ranking of common words?
3. How would you people re-write the list comprehension loop into a regular nested for loop?
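
On the traceback: re.findall() expects a string (or bytes) as its second argument, but clean is a list, which is why line 13 raises TypeError. One possible way around it is to tokenize the text with re.findall() first and filter the resulting words afterwards. The sketch below assumes a hypothetical helper named top_words and a small made-up STOPLIST so that it runs without the NLTK data; with NLTK installed and the stopwords corpus downloaded via nltk.download('stopwords'), set(stopwords.words('english')) could be passed in instead.

```python
from collections import Counter
import re

# Hypothetical mini stop list standing in for stopwords.words('english')
STOPLIST = {'the', 'and', 'to', 'a', 'of', 'it', 'she', 'i', 'you'}

def top_words(text, stoplist=STOPLIST, n=10):
    """Return the n most common words in text that are not in stoplist."""
    # Tokenize first: re.findall() needs a string and returns a list of words
    words = re.findall(r'\w+', text.lower())
    # Then filter the list of words against the stop list
    kept = [word for word in words if word not in stoplist]
    return Counter(kept).most_common(n)

# Made-up sample sentence in place of Alice.txt
sample = "Alice said to the cat, and the cat said nothing to Alice."
for word, count in top_words(sample, n=3):
    print(f'{word:<8} --> {count:>4}')
```

Filtering after tokenizing also avoids a side effect of text.split(), which leaves punctuation attached to words (for example 'the,'), so those would not match the stop list.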
