May-31-2019, 07:25 PM
(This post was last modified: May-31-2019, 08:05 PM by Drone4four.)
I have a script, which @snippsat helped me with previously, that ranks the 10 most commonly used words in a large public domain book such as Alice in Wonderland. The text file is attached to this forum post.
Here is the script as it appears now:
from collections import Counter
import re

def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text

def main(text):
    words = re.findall('\w+', text)
    top_10 = Counter(words).most_common(10)
    for word,count in top_10:
        print(f'{word:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
    text = open_file()
    main(text)

When I run the script, here is the correct, predictable and expected output:
$ python script.py
the --> 1818
and --> 940
to --> 809
a --> 690
of --> 631
it --> 610
she --> 553
i --> 543
you --> 481
said --> 462
Great!
Now I am trying to extend this script by filtering out all the common words (such as ‘the’, ‘and’, ‘to’). I’m using a guide called “Using NLTK To Remove Stopwords From A Text File”. Here is the latest iteration of my script (with new code added at lines 3, 11 and 12):
from collections import Counter
from nltk.corpus import stopwords
import re

def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text

def main(text):
    stoplist = stopwords.words('english') # Bring in the default English NLTK stop words
    clean = [word for word in text.split() if word not in stoplist] #
    words = re.findall('\w+', clean)
    top_10 = Counter(words).most_common(10)
    for word,count in top_10:
        print(f'{word:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
    text = open_file()
    main(text)

Here is my traceback:
$ python daniel_script6_with_snippsat.py
Traceback (most recent call last):
  File "daniel_script6_with_snippsat.py", line 20, in <module>
    main(text)
  File "daniel_script6_with_snippsat.py", line 13, in main
    words = re.findall('\w+', clean)
  File "/usr/lib/python3.7/re.py", line 223, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
What is this traceback trying to say?
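One possible reading, offered as a sketch rather than a definitive diagnosis: `re.findall` expects a string (or bytes) as its second argument, while `clean` is a list, because a list comprehension always builds a list. The snippet below reproduces the same kind of `TypeError` in isolation. The sample sentence and the tiny `stoplist` are made up for illustration and only stand in for the book text and the NLTK stop word list.

```python
import re

# Hypothetical sample text and stop words, standing in for Alice.txt
# and stopwords.words('english').
text = "alice was beginning to get very tired of sitting by her sister"
stoplist = ["was", "to", "of", "by", "her"]

# A list comprehension produces a list of words, not a string.
clean = [word for word in text.split() if word not in stoplist]
print(type(clean).__name__)

error_message = None
try:
    # Passing the list where a string is expected raises TypeError.
    re.findall(r'\w+', clean)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# The same pattern works once the words are joined back into one string.
words = re.findall(r'\w+', " ".join(clean))
print(words)
```

If that reading is right, the traceback is pointing at the type of `clean`, not at the regular expression itself.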
The list comprehension psychs me out. I’m not sure how to convert it back to a regular nested for loop. I’m not even sure if this is the issue. Here is my attempt at converting what I have at line 12 into a regular nested for loop:
clean =
for word in text.split():
    if word not in stoplist:
        word +=

So I suppose my three questions are:
- What is the traceback trying to say?
- How would you people modify the script to successfully process stopwords so as to eliminate the ranking of common words?
- How would you people re-write the list comprehension loop into a regular nested for loop?
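For the last two questions, here is one possible approach as a sketch under stated assumptions: tokenize the text with `re.findall` first, filter the resulting list of words, and only then count. The comprehension expands into a single `for` loop with an `if` inside it (it is not really nested), appending each kept word to a list. The names `remove_stopwords`, `top_words`, `STOPLIST` and the sample sentence are hypothetical, chosen so the snippet runs without NLTK data; in the real script `stopwords.words('english')` would presumably take the place of `STOPLIST`.

```python
from collections import Counter
import re

# Hypothetical stand-in for stopwords.words('english').
STOPLIST = {"the", "and", "to", "a", "of", "it", "she", "i", "you"}

def remove_stopwords(words, stoplist):
    """Loop form of: [word for word in words if word not in stoplist]"""
    clean = []
    for word in words:
        if word not in stoplist:
            clean.append(word)
    return clean

def top_words(text, stoplist, n=10):
    # Tokenize the string first, so re.findall receives a str.
    words = re.findall(r'\w+', text.lower())
    # Then filter the list of tokens and count what is left.
    clean = remove_stopwords(words, stoplist)
    return Counter(clean).most_common(n)

# Made-up sample sentence, used instead of Alice.txt.
sample = "The Queen said to Alice, and Alice said to the Queen: off with the head!"
result = top_words(sample, STOPLIST, n=3)
for word, count in result:
    print(f'{word:<6} {"-->":^4} {count:>4}')
```

Filtering after tokenizing also means punctuation is already stripped when each word is compared against the stop list, which `text.split()` alone would not guarantee.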