Python Forum
Extending my text file word count ranker and calculator
#8
I’m back. I successfully implemented the feature where the user is prompted to choose one of three books to analyze. As I described yesterday, I added a function that declares a dictionary, prompts the user to pick one of the options, and looks up the matching filename with the dictionary’s built-in get() method. The function returns that filename, and main() calls the function at run time to find out which file to open. I have it running really well.

As per @steve_shambles’s suggestion, I looked up stop words and SEO and found a few guides. The Natural Language Toolkit (NLTK) is highly recommended around the web. The guide I settled on is titled “Using NLTK To Remove Stopwords From A Text File”. I successfully installed nltk using pip and imported the stopwords corpus I need. I then pulled the English list of stop words as described in the tutorial and tried to remove those most commonly used words from the text file. This is the point where I’m not sure how to proceed. When I run the script, I get this traceback.

Output:
$ python Daniel_with_ntlk.py
Choose from this list of books: 
 1. Tolstoy 
 2. Alice 
 3. Chesterton
What is your pick? 1.? 2.? or 3.? >>  3
You picked: Chesterton!
A total of 63159 words can be found inside this text file.
Traceback (most recent call last):
  File "Daniel_with_ntlk.py", line 33, in <module>
    main()
  File "Daniel_with_ntlk.py", line 30, in main
    rank_words(clean)
  File "Daniel_with_ntlk.py", line 18, in rank_words
    words = re.findall('\w+', clean)
  File "/usr/lib/python3.7/re.py", line 223, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
$
This traceback points to line 33 (the main() call below if __name__ == '__main__':), line 30 where I call rank_words(clean), and line 18 inside the rank_words() function. It then points to line 223 in the re library. I’m at a loss here. Could anyone help with a more detailed explanation?
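
For what it’s worth, the same TypeError can be reproduced in isolation. The sketch below assumes the cause is that re.findall() is being handed a list instead of a single string. The tokens list is a made-up sample standing in for my clean variable, not data from the books.

```python
import re

# Made-up sample standing in for `clean`: a list of words, not a string.
tokens = ["the", "man", "who", "was", "thursday"]

error_message = None
try:
    # re.findall() expects a string (or bytes) as its second argument.
    re.findall(r"\w+", tokens)
except TypeError as err:
    error_message = str(err)
    print(f"TypeError: {error_message}")

# Joining the tokens back into one string gives re.findall() what it expects.
words = re.findall(r"\w+", " ".join(tokens))
print(words)
```

If that assumption is right, the error has nothing to do with NLTK itself; it comes from the type of object passed to re.findall().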

Here is my script as it appears now:

from collections import Counter
from nltk.corpus import stopwords
import re
 
def choose_book():
    options = {'1':'Tolstoy.txt','2':'Alice.txt','3':'Chesterton.txt'}
    print("Choose from this list of books: \n 1. Tolstoy \n 2. Alice \n 3. Chesterton")
    pick = input("What is your pick? 1.? 2.? or 3.? >>  ")
    selection = options.get(pick)    
    print(f"You picked: {selection[:-4]}!") # for testing
    return selection

def word_count(text):
    wordlist = text.split()
    print(f"A total of {len(wordlist)} words can be found inside this text file.")
    
def rank_words(clean):
    words = re.findall('\w+', clean)
    top_10 = Counter(words).most_common(50)
    for word,count in top_10:
        print(f'{word:<4} {"-->":^4} {count:>4}')
    
def main():
    selection = choose_book()
    with open(selection) as f:
        text = f.read().lower()
        stoplist = stopwords.words('english') # Bring in the default English NLTK stop words
        clean = [word for word in text.split() if word not in stoplist]
    word_count(text)
    rank_words(clean)
        
if __name__ == '__main__':
    main()
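
One possible way around the error, sketched under the assumption that the list comprehension is what turns clean into a list: tokenize with re.findall() first, filter the resulting words against the stop list, and hand the filtered list straight to Counter. STOPLIST here is a hypothetical stand-in for stopwords.words('english') so the sketch runs without the NLTK data, and remove_stopwords() and the limit parameter are names I made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for stopwords.words('english'); a set makes lookups fast.
STOPLIST = {"the", "a", "of", "and", "was", "who"}


def remove_stopwords(text, stoplist):
    """Return the lowercased words of `text` that are not in `stoplist`."""
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if word not in stoplist]


def rank_words(words, limit=10):
    """Return the `limit` most common words as (word, count) pairs."""
    return Counter(words).most_common(limit)


# Made-up sample sentence instead of a whole book.
sample = "The man who was Thursday and the man who was Sunday"
clean = remove_stopwords(sample, STOPLIST)
ranking = rank_words(clean, limit=3)
for word, count in ranking:
    print(f"{word:<10} --> {count:>4}")
```

Counter accepts any iterable of words, so with this arrangement there is no need to call re.findall() on the filtered result a second time.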
Edit: Attached is Alice.txt. I attempted to attach Chesterton.txt, but it is 356.9 kB, which is apparently too large for the forum software to handle. Tolstoy.txt is 3.4 MB - ha! I purposely chose War and Peace because it is 500 000 words long and I like to see Python take a long time to process. Here is a link to all 3 txt files on my Dropbox in case any of you would like to try it out.

Attached Files

.txt   Alice.txt (Size: 159.97 KB / Downloads: 152)


Messages In This Thread
RE: Extending my text file word count ranker and calculator - by Drone4four - Jan-23-2019, 03:19 AM
