Jan-23-2019, 03:19 AM
(This post was last modified: Jan-23-2019, 03:19 AM by Drone4four.)
I’m back. I successfully implemented the feature where the user is prompted to choose 1 of 3 potential books to analyze. I achieved this by adding a function which, as I said yesterday, declares a dictionary, prompts the user to pick one of the options, and then selects the right key-value pair using the built-in get method, the result of which I have assigned to a new variable. This variable I pass into the original main() function at run time. I have it running really well.
As per @steve_shambles’s suggestion, I looked up stop words and SEO and found a few guides. The Natural Language Toolkit (NLTK) is highly recommended around the web. The guide I settled on is titled “Using NLTK To Remove Stopwords From A Text File”. I successfully installed nltk using pip and called the method I need. I attempted to pull the english list of stop words as described in the tutorial. I thought I had removed the most commonly used words from the text file. This is the point where I’m not sure how to proceed. When I run the script, I get this traceback.
Output:
$ python Daniel_with_ntlk.py
Choose from this list of books:
1. Tolstoy
2. Alice
3. Chesterton
What is your pick? 1.? 2.? or 3.? >> 3
You picked: Chesterton!
A total of 63159 words can be found inside this text file.
Traceback (most recent call last):
  File "Daniel_with_ntlk.py", line 33, in <module>
    main()
  File "Daniel_with_ntlk.py", line 30, in main
    rank_words(clean)
  File "Daniel_with_ntlk.py", line 18, in rank_words
    words = re.findall('\w+', clean)
  File "/usr/lib/python3.7/re.py", line 223, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
$
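A minimal sketch of what the last line of the traceback seems to describe, assuming the value reaching re.findall is a list of words rather than one string. The helper name try_findall and the sample sentence are made up for illustration and are not part of the script:

```python
import re

def try_findall(value):
    """Return the matched words, or the name of the exception that was raised."""
    try:
        return re.findall(r'\w+', value)
    except TypeError as exc:
        return type(exc).__name__

as_string = "the man who was thursday"
as_list = as_string.split()  # a list of str, like the result of a list comprehension

print(try_findall(as_string))  # a str is accepted and split into words
print(try_findall(as_list))    # a list is rejected with a TypeError
```

If that assumption holds, the pattern itself is fine and the type of the second argument is the thing to check.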
This output points to line 33 (below if __name__ == '__main__':), line 30 where I call the rank_words(clean) function, and line 18 inside the rank_words(clean) function. It then points to line 223 in the re library. I’m at a loss here. Could anyone help out with a more detailed explanation?
Here is my script as it appears now:
from collections import Counter
from nltk.corpus import stopwords
import re

def choose_book():
    options = {'1':'Tolstoy.txt','2':'Alice.txt','3':'Chesterton.txt'}
    print("Choose from this list of books: \n 1. Tolstoy \n 2. Alice \n 3. Chesterton")
    pick = input("What is your pick? 1.? 2.? or 3.? >> ")
    selection = options.get(pick)
    print(f"You picked: {selection[:-4]}!") # for testing
    return selection

def word_count(text):
    wordlist = text.split()
    print(f"A total of {len(wordlist)} words can be found inside this text file.")

def rank_words(clean):
    words = re.findall('\w+', clean)
    top_10 = Counter(words).most_common(50)
    for word,count in top_10:
        print(f'{word:<4} {"-->":^4} {count:>4}')

def main():
    selection = choose_book()
    with open(selection) as f:
        text = f.read().lower()
    stoplist = stopwords.words('english') # Bring in the default English NLTK stop words
    clean = [word for word in text.split() if word not in stoplist]
    word_count(text)
    rank_words(clean)

if __name__ == '__main__':
    main()
    pass

Edit: Attached is Alice.txt. I attempted to attach Chesterton.txt, but it is 356.9 kB, which is apparently too large for the forum software to handle. Tolstoy.txt is 3.4 MB - ha! I purposely chose War and Peace because it is 500 000 words long and I like to see Python take a long time to process. Here is a link to all 3 txt files on my Dropbox in case any of you would like to try it out.
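One possible way to order the steps in the script above, sketched under assumptions rather than tested against the three books: tokenize the raw string with re.findall first, filter the resulting list against the stop words, and hand that list straight to Counter. SAMPLE_STOPLIST is a made-up stand-in for stopwords.words('english') so the sketch runs without the NLTK data, and this rank_words takes different arguments from the one in my script:

```python
import re
from collections import Counter

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
SAMPLE_STOPLIST = {"the", "a", "and", "of", "was", "who"}

def rank_words(text, stoplist, top_n=10):
    """Tokenize a string, drop stop words, and return the most common words."""
    words = re.findall(r'\w+', text.lower())          # str in, list of words out
    clean = [w for w in words if w not in stoplist]   # still a list of words
    return Counter(clean).most_common(top_n)          # Counter accepts the list directly

sample = "The man who was Thursday and the man who was Sunday"
for word, count in rank_words(sample, SAMPLE_STOPLIST, top_n=3):
    print(f'{word:<10} --> {count:>4}')
```

Using a set for the stop words keeps each membership check cheap, which may matter for a text as long as Tolstoy.txt.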
Attached Files
Alice.txt (Size: 159.97 KB / Downloads: 152)