Extending my text file word count ranker and calculator

Drone4four · (This post was last modified: Jan-23-2019, 03:19 AM by Drone4four.)

I’m back. I successfully implemented the feature where the user is prompted to choose one of three books to analyze. I did this by adding a function which, as I said yesterday, declares a dictionary, prompts the user to pick one of the options, and then looks up the matching value with the dictionary’s built-in get() method. I assign the result to a new variable, which the original main() function uses at run time as the name of the file to open. I have it running really well.
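
A minimal sketch of that lookup, separated from the prompt so it can be tried without typing anything. The names lookup_book and describe_pick are hypothetical and not part of my script. The one thing it adds is a check for the case where dict.get() returns None because the pick is not one of the keys; without that check, slicing the result would fail.

```python
# Same mapping as in choose_book(); the helper names below are hypothetical.
OPTIONS = {'1': 'Tolstoy.txt', '2': 'Alice.txt', '3': 'Chesterton.txt'}

def lookup_book(pick, options=OPTIONS):
    """Return the filename for `pick`, or None if it is not a valid key."""
    # dict.get() returns None (instead of raising KeyError) for unknown keys.
    return options.get(pick.strip())

def describe_pick(pick):
    """Build the confirmation message, guarding against an invalid pick."""
    selection = lookup_book(pick)
    if selection is None:
        return f"Sorry, {pick!r} is not one of the options."
    # Drop the trailing ".txt" to show just the book name.
    return f"You picked: {selection[:-4]}!"

print(describe_pick('3'))
print(describe_pick('7'))
```

In the real script the value of pick would come from input(), and an invalid pick could simply trigger the prompt again.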

As per @steve_shambles’s suggestion, I looked up stop words and SEO and found a few guides. The Natural Language Toolkit (NLTK) is highly recommended around the web. The guide I settled on is titled “Using NLTK To Remove Stopwords From A Text File”. I successfully installed nltk using pip and called the method I need. I attempted to pull the English list of stop words as described in the tutorial. I thought I had removed the most commonly used words from the text file, but this is the point where I’m not sure how to proceed. When I run the script, I get this traceback.

Output:
$ python Daniel_with_ntlk.py
Choose from this list of books: 
 1. Tolstoy 
 2. Alice 
 3. Chesterton
What is your pick? 1.? 2.? or 3.? >>  3
You picked: Chesterton!
A total of 63159 words can be found inside this text file.
Traceback (most recent call last):
  File "Daniel_with_ntlk.py", line 33, in <module>
    main()
  File "Daniel_with_ntlk.py", line 30, in main
    rank_words(clean)
  File "Daniel_with_ntlk.py", line 18, in rank_words
    words = re.findall('\w+', clean)
  File "/usr/lib/python3.7/re.py", line 223, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
$

This output points to line 33 (below if __name__ == '__main__':), line 30 where I call rank_words(clean), and line 18 inside the rank_words(clean) function. It then points to line 223 in the re library. I’m at a loss here. Could anyone help out with a more detailed explanation?

Here is my script as it appears now:

from collections import Counter
from nltk.corpus import stopwords
import re
 
def choose_book():
    options = {'1':'Tolstoy.txt','2':'Alice.txt','3':'Chesterton.txt'}
    print("Choose from this list of books: \n 1. Tolstoy \n 2. Alice \n 3. Chesterton")
    pick = input("What is your pick? 1.? 2.? or 3.? >>  ")
    selection = options.get(pick)    
    print(f"You picked: {selection[:-4]}!") # for testing
    return selection

def word_count(text):
    wordlist = text.split()
    print(f"A total of {len(wordlist)} words can be found inside this text file.")
    
def rank_words(clean):
    words = re.findall('\w+', clean)
    top_10 = Counter(words).most_common(50)
    for word,count in top_10:
        print(f'{word:<4} {"-->":^4} {count:>4}')
    
def main():
    selection = choose_book()
    with open(selection) as f:
        text = f.read().lower()
        stoplist = stopwords.words('english') # Bring in the default English NLTK stop words
        clean = [word for word in text.split() if word not in stoplist]
    word_count(text)
    rank_words(clean)
        
if __name__ == '__main__':
    main()
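
A minimal sketch of what the traceback appears to be saying, under the assumption that the cause is the type of clean: the list comprehension in main() produces a list of words, while re.findall() expects a single string. The tiny STOPLIST set and the remove_stopwords helper are hypothetical stand-ins (the real list would come from stopwords.words('english')), and joining the list back into a string is only one possible way around the error.

```python
import re
from collections import Counter

# Hypothetical stand-in for stopwords.words('english'); the NLTK list is longer.
STOPLIST = {"the", "a", "of", "and"}

def remove_stopwords(text, stoplist):
    """Return the words of `text` that are not in `stoplist`, as a list."""
    return [word for word in text.split() if word not in stoplist]

def rank_words(clean, top_n=3):
    """Rank words by frequency; `clean` is the list from remove_stopwords."""
    # re.findall() needs a string, so join the list back into one first.
    words = re.findall(r'\w+', ' '.join(clean))
    return Counter(words).most_common(top_n)

text = "the cat and the dog and the cat"
clean = remove_stopwords(text, STOPLIST)

error_message = None
try:
    re.findall(r'\w+', clean)  # passing the list directly raises TypeError
except TypeError as err:
    error_message = str(err)

ranking = rank_words(clean)
print(type(clean).__name__, error_message)
print(ranking)
```

Since clean is already a list of words, another option would be to skip the regular expression entirely and pass the list straight to Counter.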

Edit: Attached is Alice.txt. I attempted to attach Chesterton.txt, but it is 356.9 kB, which is apparently too large for the forum software to handle. Tolstoy.txt is 3.4 MB - ha! I purposely chose War and Peace because it is 500 000 words long and I like to see Python take a long time to process. Here is a link to all 3 txt files on my Dropbox in case any of you would like to try it out.
