Python Forum
Extending my text file word count ranker and calculator
#1
I am playing with large plain text files like Alice in Wonderland and trying to rank the most commonly occurring words. Naturally, you’d expect to encounter many instances of “the”, “and”, and “a”.

With a little help from @snippsat in my previous thread, check out this script we were working with:

from collections import Counter
import re
  
with open('Alice.txt') as f:
    text = f.read().lower()
  
words = re.findall(r'\w+', text)
top_10 = Counter(words).most_common(10)
for word,count in top_10:
    print(f'{word:<4} {"-->":^4} {count:>4}')
Here is the smooth output:
Output:
$ python with_word_count.py
the  -->  1818
and  -->   940
to   -->   809
a    -->   690
of   -->   631
it   -->   610
she  -->   553
i    -->   543
you  -->   481
said -->   462
It works really well. I have already extended it by adding a feature which provides the total word count. Here is the code I added:

 
wordlist = text.split()
print("A total of " + str(len(wordlist)) + " words can be found inside this text file.") 
I have now set out to extend this script further. Right now I am just trying to reorganize and consolidate these operations into separate functions. The script looks a little different. Here it is:

from collections import Counter
import re
 
def word_count(text):
    wordlist = text.split()
    print("A total of " + str(len(wordlist)) + " words can be found inside this text file.")

def rank_words():
    words = re.findall(r'\w+', text)
    top_10 = Counter(words).most_common(10)
    for word,count in top_10:
        print(f'{word:<4} {"-->":^4} {count:>4}')

def main():
    with open('Alice.txt') as f:
        text = f.read().lower()
        return text
        
if __name__ == '__main__':
    main()
    word_count(text)
    rank_words()
    pass 
Here is the output:
Output:
$ python with_word_count.py
Traceback (most recent call last):
  File "with_word_count.py", line 21, in <module>
    word_count(text)
NameError: name 'text' is not defined
The NameError says the variable text “isn’t defined” when word_count() is called at line 21. But text is defined in main(), which is the first function I call under if __name__ == '__main__':. If any of you are wondering why I chose to organize my script this way, I am following @ichabod801's example from another recent thread I was working on here. By the time word_count() is called, text should already have been returned by the previous call to main(), right?

Would anyone care to elaborate on what the Python interpreter is saying in this traceback? What am I missing? What would I need to fix in my script for it to run as intended?

Attached is the public domain text file I am working with.

Attached Files

.txt   Alice.txt (Size: 159.97 KB / Downloads: 432)
#2
Because text is assigned inside main(), it is a local variable. Even though text is returned, nothing stores the returned value (e.g. x = main()), so it ceases to exist once main() returns.

The purpose of main() is to run all the code required for the program, so lines 21 and 22 should be inside main(). A main() function is the one function that should not have a return.
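The scoping point above can be seen in a minimal sketch. The names load_text and book_text, and the sample sentence, are hypothetical stand-ins for main() and the contents of Alice.txt; the sketch only relies on ordinary Python scoping rules.

```python
def load_text():
    # 'text' is local to load_text(); it is gone once the function returns.
    text = "alice was beginning to get very tired"
    return text


# Calling the function without storing the result discards the returned value.
load_text()
try:
    print(text)  # there is no module-level name 'text'
except NameError as err:
    error_message = str(err)
    print(f"NameError: {error_message}")

# Binding the return value to a name makes it available to later calls.
book_text = load_text()
print(len(book_text.split()))
```

Either approach works here: keep the calls inside main() as suggested above, or capture the return value and pass it on explicitly.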
#3
An f-string can be used in all of these cases, which avoids a rather ugly string like this:
print("A total of " + str(len(wordlist)) + " words can be found inside this text file.")
To make the code work, note that a function can take arguments; here each function receives text as a parameter:
# word_count.py
from collections import Counter
import re

def word_count(text):
    wordlist = text.split()
    print(f"A total of {len(wordlist)} words can be found inside this text file.")

def rank_words(text):
    words = re.findall(r'\w+', text)
    top_10 = Counter(words).most_common(10)
    for word,count in top_10:
        print(f'{word:<4} {"-->":^4} {count:>4}')

def file_read():
    with open('alice.txt') as f:
        text = f.read().lower()
        return text

if __name__ == '__main__':
    text = file_read()
    word_count(text)
    rank_words(text)
Now for a demo of how if __name__ == '__main__': works.
Running the word_count.py script directly:
Output:
A total of 29465 words can be found inside this text file.
the  -->  1818
and  -->   940
to   -->   809
a    -->   690
of   -->   631
it   -->   610
she  -->   553
i    -->   543
you  -->   481
said -->   462
Now import the script instead.
>>> import word_count
>>> # Nothing happens
This is what we want when importing code as a module.
We don't want it to run at import.
Use it like this:
>>> import word_count

>>> text = word_count.file_read()

>>> word_count.word_count(text)
A total of 29465 words can be found inside this text file.

>>> word_count.rank_words(text)
the  -->  1818
and  -->   940
to   -->   809
a    -->   690
of   -->   631
it   -->   610
she  -->   553
i    -->   543
you  -->   481
said -->   462
Now remove if __name__ == '__main__': and leave the three calls at module level.
The code then runs on import, which in almost all cases is not wanted.
>>> import word_count
A total of 29465 words can be found inside this text file.
the  -->  1818
and  -->   940
to   -->   809
a    -->   690
of   -->   631
it   -->   610
she  -->   553
i    -->   543
you  -->   481
said -->   462
#4
Thank you @stullis for the clarification. I’ve moved lines 21 and 22 into my main() function, and it runs as expected. Next time I’ll keep all the code my program needs to run inside main(), and I’ll avoid a return statement there as well.

For what it’s worth, here is the updated working script:

from collections import Counter
import re
 
def word_count(text):
    wordlist = text.split()
    print("A total of " + str(len(wordlist)) + " words can be found inside this text file.")

def rank_words(text):
    words = re.findall(r'\w+', text)
    top_10 = Counter(words).most_common(10)
    for word,count in top_10:
        print(f'{word:<4} {"-->":^4} {count:>4}')

def main():
    with open('Alice.txt') as f:
        text = f.read().lower()
    word_count(text)
    rank_words(text)
        
if __name__ == '__main__':
    main()
    pass
And the output:
Output:
$ python with_word_count.py
A total of 29465 words can be found inside this text file.
the  -->  1818
and  -->   940
to   -->   809
a    -->   690
of   -->   631
it   -->   610
she  -->   553
i    -->   543
you  -->   481
said -->   462
@snippsat: You are right that f-string formatting is more readable and more concise. When I wrote that string concatenation line initially, I just used whatever was most obvious. On my lunch break tomorrow I’ll read a tutorial I found called “Python 3's f-Strings: An Improved String Formatting Syntax (Guide)”.

@snippsat: I like the polish, and I appreciate the detailed explanation of why you need if __name__ == '__main__': when a script is imported as a module in the interpreter. I’ve looked up if __name__ == '__main__': on Google and there are many tutorials and guides. Right now I find the most upvoted question and answers on Stack Overflow explaining this mechanism overly confusing. I suppose that once I learn more about classes and gain more general experience with Python, the logic and syntax of if __name__ == '__main__': will make more sense. Compared to Stack Overflow, your explanation is easier to understand because it's in the context of my script. Thank you.
#5
Here is a class example in which neither the file name nor the most_common count is hard-coded.
The function version could be changed the same way, but this shows how to do it with a class:
from collections import Counter
import re

class WordCounter:
    def file_read(self, file_name):
        with open(file_name) as f:
            self.text = f.read().lower()

    @property
    def word_count(self):
        wordlist = self.text.split()
        print(f"A total of {len(wordlist)} words can be found inside this text file.")

    def rank_words(self, count_amount: int) -> None:
        '''Print the count_amount most common words in the text'''
        words = re.findall(r'\w+', self.text)
        top_10 = Counter(words).most_common(count_amount)
        for word,count in top_10:
            print(f'{word:<4} {"-->":^4} {count:>4}')
Use class:
>>> text = WordCounter()

>>> text.file_read('Alice.txt')
>>> text.word_count
A total of 29465 words can be found inside this text file.

>>> text.rank_words(5)
the  -->  1818
and  -->   940
to   -->   809
a    -->   690
of   -->   631
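As noted above, the function version can drop the hard-coded values too. A minimal sketch of that idea, assuming the file name and the ranking length are simply passed in as parameters; read_text, top_words and report are hypothetical names, not part of the script in this thread, and the sample sentence stands in for a real book.

```python
from collections import Counter
import re


def read_text(file_name):
    """Read a text file and return its contents in lower case."""
    with open(file_name) as f:
        return f.read().lower()


def top_words(text, n=10):
    """Return the n most common words in text as (word, count) pairs."""
    words = re.findall(r'\w+', text.lower())
    return Counter(words).most_common(n)


def report(text, n=10):
    """Print the total word count followed by the n most common words."""
    print(f"A total of {len(text.split())} words can be found inside this text.")
    for word, count in top_words(text, n):
        print(f'{word:<4} {"-->":^4} {count:>4}')


# A short sample string is used here instead of a file on disk.
sample = "the cat and the hat and the bat"
report(sample, n=2)
```

With a real file, the call would look like report(read_text('Alice.txt'), n=5). Returning the ranking from top_words() instead of printing it also makes the result reusable elsewhere.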
#6
@snippsat: A thousand thank yous! Reorganizing my script into a Class structure is even more instructive than before. When I read your modified script carefully line-by-line, I understand perhaps 80% of it. It will take me a little more practice with Python before I can begin forming my own classes with the same ease, flow and elegance that you have demonstrated, @snippsat. In the future I am going to refer back to this thread often because this is such a great example of how to write and use classes. I am grateful for your contribution to this discussion so far. Thank you, my friend. :)

Rather than starting a fresh forum thread, I am going to use this one to explore extending my script even further. Up to now the operations have been organized into functions (and a class). Now I am setting out to prompt the user to choose from a selection of 3 public domain .txt files. I’ve already created a new function which declares a dictionary, prompts the user for a choice and then looks up the matching value with the dictionary's get method. Next I need to open the .txt files properly on my Unix filesystem. But I’ve run out of time for tonight. I will be back soon to finish what I started.
#7
Drone, I like your script; it is coming along nicely.
Can I make a suggestion for a future feature
that you may not be aware of: stop words.

If you can filter stop words out of your
ranked words, the output would be much more useful.

Here is a short list of stop words and what they are about:
https://kb.yoast.com/kb/list-stop-words/

It's just an idea, up to you.
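A minimal sketch of what such a filter could look like, assuming a small hand-written stop list. STOP_WORDS and rank_without_stop_words are hypothetical names, and the list is deliberately tiny; a real list such as the one linked above would be much longer.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = {"the", "and", "a", "to", "of", "it", "she", "i", "you"}


def rank_without_stop_words(text, n=10, stop_words=STOP_WORDS):
    """Return the n most common words in text, ignoring stop words."""
    words = re.findall(r'\w+', text.lower())
    kept = [word for word in words if word not in stop_words]
    return Counter(kept).most_common(n)


sample = "The Queen said to Alice, and Alice said to the Queen"
for word, count in rank_without_stop_words(sample, n=3):
    print(f'{word:<5} {"-->":^4} {count:>4}')
```

Keeping the stop list in a set makes each membership test cheap, which matters for a text the size of War and Peace.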
#8
I’m back. I successfully implemented the feature where the user is prompted to choose 1 of 3 potential books to analyze. I achieved this by adding a function which, as I said yesterday, declares a dictionary, prompts the user to pick one of the options, and then looks up the matching value with the dictionary's get method, which I assign to a new variable. This variable I use inside the original main() function at run time. I have it running really well.

As per @steve_shambles' suggestion, I looked up stop words and SEO and found a few guides. NLTK (the Natural Language Toolkit) is highly recommended around the web. The guide I settled on is titled “Using NLTK To Remove Stopwords From A Text File”. I installed nltk using pip and imported the stopwords corpus I need. I then pulled the English list of stop words as described in the tutorial, and I thought I had removed the most commonly used words from the text. This is the point where I’m not sure how to proceed. When I run the script, I get this traceback.

Output:
$ python Daniel_with_ntlk.py
Choose from this list of books:
1. Tolstoy
2. Alice
3. Chesterton
What is your pick? 1.? 2.? or 3.? >> 3
You picked: Chesterton!
A total of 63159 words can be found inside this text file.
Traceback (most recent call last):
  File "Daniel_with_ntlk.py", line 33, in <module>
    main()
  File "Daniel_with_ntlk.py", line 30, in main
    rank_words(clean)
  File "Daniel_with_ntlk.py", line 18, in rank_words
    words = re.findall('\w+', clean)
  File "/usr/lib/python3.7/re.py", line 223, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
$
This output points to line 33 (below if __name__ == '__main__':), to line 30 where I call rank_words(clean), and to line 18 inside the rank_words() function. Then it points to line 223 in the re library. I’m at a loss here. Could anyone help with a more detailed explanation?

Here is my script as it appears now:

from collections import Counter
from nltk.corpus import stopwords
import re
 
def choose_book():
    options = {'1':'Tolstoy.txt','2':'Alice.txt','3':'Chesterton.txt'}
    print("Choose from this list of books: \n 1. Tolstoy \n 2. Alice \n 3. Chesterton")
    pick = input("What is your pick? 1.? 2.? or 3.? >>  ")
    selection = options.get(pick)    
    print(f"You picked: {selection[:-4]}!") # for testing
    return selection

def word_count(text):
    wordlist = text.split()
    print(f"A total of {len(wordlist)} words can be found inside this text file.")
    
def rank_words(clean):
    words = re.findall('\w+', clean)
    top_10 = Counter(words).most_common(50)
    for word,count in top_10:
        print(f'{word:<4} {"-->":^4} {count:>4}')
    
def main():
    selection = choose_book()
    with open(selection) as f:
        text = f.read().lower()
        stoplist = stopwords.words('english') # Bring in the default English NLTK stop words
        clean = [word for word in text.split() if word not in stoplist]
    word_count(text)
    rank_words(clean)
        
if __name__ == '__main__':
    main()
    pass
Edit: Attached is Alice.txt. I attempted to attach Chesterton.txt, but at 356.9 kB it is apparently too large for the forum software to handle. Tolstoy.txt is 3.4 MB - ha! I purposely chose War and Peace because it is 500,000 words long and I like to see Python take a long time to process. Here is a link to all 3 txt files on my Dropbox in case any of you would like to try it out.
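One possible reading of the TypeError in the traceback above: re.findall() expects a string, but clean is a list, because the list comprehension in main() builds one. The sketch below reproduces that error on a short sample and then shows one way around it, assuming the goal is to rank the words that survive filtering. The stoplist set is a small hypothetical stand-in for stopwords.words('english') so that the sketch runs without NLTK's data, and rank_words() is changed to return its result rather than print it.

```python
from collections import Counter
import re

# Hypothetical stand-in for stopwords.words('english').
stoplist = {"the", "and", "to", "a"}
text = "the hatter and the hare went to a tea party, a mad tea party"

# As in the script above: the list comprehension produces a list, not a string.
clean = [word for word in text.split() if word not in stoplist]

try:
    re.findall(r'\w+', clean)  # a list is not a string or bytes-like object
except TypeError as err:
    print(f"TypeError: {err}")


def rank_words(words, n=10):
    """Rank an already tokenized list of words."""
    return Counter(words).most_common(n)


# Tokenize with the regex first, then filter, then count the list directly.
tokens = re.findall(r'\w+', text)
clean_tokens = [word for word in tokens if word not in stoplist]
top = rank_words(clean_tokens, 2)
for word, count in top:
    print(f'{word:<5} {"-->":^4} {count:>4}')
```

Tokenizing before filtering also avoids a quieter problem: text.split() leaves punctuation attached, so a token such as "party," would be counted separately from "party", and a stop word followed by a comma would slip past the filter.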

Attached Files

.txt   Alice.txt (Size: 159.97 KB / Downloads: 146)
#9
Nice try, Drone. I installed nltk and tried to run
your script with alice.txt in the same directory, and got:

Choose from this list of books:
1. Tolstoy
2. Alice
3. Chesterton
What is your pick? 1.? 2.? or 3.? >> Searched in:2
Traceback (most recent call last):
  File "C:/Python365/alice.py", line 35, in <module>
    main()
  File "C:/Python365/alice.py", line 26, in main
    selection = choose_book()
  File "C:/Python365/alice.py", line 12, in choose_book
    print(f"You picked: {selection[:-4]}!") # for testing
TypeError: 'NoneType' object is not subscriptable

I don't have a clue how to help, sorry.
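A possible explanation for this second traceback: options.get(pick) returns None whenever pick is not exactly one of the keys '1', '2' or '3' (for example, if the typed input carries extra characters), and None cannot be sliced. A minimal sketch of a choose_book() that re-prompts instead of failing; the ask parameter is a hypothetical addition so the function can be exercised without a keyboard.

```python
def choose_book(ask=input):
    """Prompt until the user picks one of the listed books."""
    options = {'1': 'Tolstoy.txt', '2': 'Alice.txt', '3': 'Chesterton.txt'}
    print("Choose from this list of books:\n 1. Tolstoy\n 2. Alice\n 3. Chesterton")
    while True:
        pick = ask("What is your pick? 1.? 2.? or 3.? >>  ").strip()
        selection = options.get(pick)  # None when pick is not a key
        if selection is not None:
            print(f"You picked: {selection[:-4]}!")
            return selection
        print(f"{pick!r} is not one of the options, please try again.")


# Simulated answers: one invalid entry followed by a valid one.
answers = iter(["two", "2"])
chosen = choose_book(ask=lambda prompt: next(answers))
```

In normal use the function is called as choose_book() and reads from the keyboard through input().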