Python Forum
Python word counter and ranker
#1
I’ve got an older Python 2 script from an outdated Udemy course. The script opens a plain text file (such as a large public domain novel like Alice in Wonderland), counts all the words, and ranks the 10 most common. Naturally, you can expect many occurrences of ‘the’, ‘is’, ‘a’.

It runs as expected using the Python 2 interpreter. Attached is Alice in Wonderland in .txt format. Here is the Python 2 script:

#!/usr/bin/env python
# encoding: utf-8
"""
alice_file.py

Created by Jason Elbourne on 2011-12-29.
Copyright (c) 2011 Jason Elbourne. All rights reserved.
"""
import operator

## Get each word - Turn to Lower case (.lower())
## Count Duplicates of words
## Dictionary {word:count,word2:count2}
## Sort this based on most used word
## Print the Top 10 Words

def rank_words(f):
	"""
		Takes in a file, then ranks all the words within the file
		
		Args: a file
		
		Return: A sorted list of tuples
	"""
	word_dict = {} # Start with empty python Dictionary
	words = [] # Start with empty python List
	for line in f:
		list_of_words = line.split()
		for w in list_of_words:
			words.append(w.lower()) # Add Word to List

	for word in words:
		if word_dict.has_key(word):
			word_dict[word] += 1 # Incr. value in Dict.
		else:
			word_dict[word] = 1 # Add word and value to Dict.
        # This will sort the dictionary and return a list of Tuples
	return sorted(word_dict.iteritems(), reverse=True, \
					key=operator.itemgetter(1))


def main():
	# Files
	f = open('Alice.txt', 'rU')

	ranked_words_list = rank_words(f)

	f.close()

        # Print the results
	for w in list(ranked_words_list[:10]):
		print w[0],"---", w[1]


if __name__ == '__main__':
	main()
Here is the expected output:

Quote:$ python2 pycounter.py
the --- 1605
and --- 766
to --- 706
a --- 614
she --- 518
of --- 493
said --- 421
it --- 362
in --- 351
was --- 333

It runs. Pretty neat, eh?

But on its first run under the Python 3 interpreter, the traceback points to line 52:

Quote:$ python pycounter.py
File "pycounter.py", line 52
print w[0],"---", w[1]
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(w[0],"---", w[1])?

So I add parentheses on that line: one before w[0] and one after w[1].

When I run the script next I get this traceback:

Quote:$ python pycounter.py
pycounter.py:44: DeprecationWarning: 'U' mode is deprecated
f = open('Alice.txt', 'rU')
Traceback (most recent call last):
File "pycounter.py", line 56, in <module>
main()
File "pycounter.py", line 46, in main
ranked_words_list = rank_words(f)
File "pycounter.py", line 38, in rank_words
return sorted(word_dict.iteritems(), reverse=True, \
AttributeError: 'dict' object has no attribute 'iteritems'

The first issue is the U flag in the mode passed to the open function, which is deprecated in Python 3. The official docs say so here. So I remove the U. Problem solved. But I can’t make sense of the other lines indicated in the traceback. Line 56 is the main() call under the __name__ check. I’m not sure what the problem is there. It looks normal and correct to me.

Could someone here lend a helping hand to get this script to run in Python 3?
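On reading the traceback itself: Python lists the frames from the outermost call down to the innermost one, so the earlier lines (the module level, then main) only show how execution got there. The last frame and the final line describe the actual failure. A minimal sketch, using stand-in functions that only borrow the names from the script (an assumption for illustration, not the original code), that reproduces the same error and prints the frame order:

```python
import traceback


def rank_words(word_dict):
    # Stand-in for the real function: fails on Python 3 because
    # dict.iteritems() only exists on Python 2.
    return sorted(word_dict.iteritems())


def main():
    return rank_words({"the": 2, "alice": 1})


def failing_frames():
    """Run main() and return (function names outermost-first, error type name)."""
    try:
        main()
    except AttributeError as exc:
        frames = traceback.extract_tb(exc.__traceback__)
        return [frame.name for frame in frames], type(exc).__name__
    return [], None


if __name__ == "__main__":
    names, error = failing_frames()
    print(" -> ".join(names), "raised", error)
```

The last name in the chain is rank_words, which is where the fix belongs. The frames before it need no change.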
#2
if word_dict.has_key(word):
# To
if word in word_dict:

word_dict.iteritems()
# To
word_dict.items()

f = open('Alice.txt', 'rU')
# To
f = open('Alice.txt')
An example of more modern Python power.
This one is also more accurate, as it removes punctuation (as should be done when counting words in a larger text).
from collections import Counter
import re

with open('alice.txt') as f:
    text = f.read().lower()

words = re.findall(r'\w+', text)
top_10 = Counter(words).most_common(10)
for word,count in top_10:
    print(f'{word:<4} {"-->":^4} {count:>4}')
Output:
the --> 1818
and --> 940
to --> 809
a --> 690
of --> 631
it --> 610
she --> 553
i --> 543
you --> 481
said --> 462
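The two scripts report different counts because they define a "word" differently. A small sketch, using a made-up sample sentence rather than the novel, of how str.split() keeps punctuation attached to words while the regular expression pulls out only runs of word characters:

```python
import re
from collections import Counter

# Made-up sample text, not taken from Alice.txt.
sample = "The cat saw the dog. 'The dog,' said the cat, 'is the best.'"

# Whitespace splitting keeps quotes, commas and full stops glued to words,
# so "'the" and "dog." are counted as separate entries.
split_counts = Counter(sample.lower().split())

# \w+ matches runs of letters, digits and underscores only.
regex_counts = Counter(re.findall(r"\w+", sample.lower()))

if __name__ == "__main__":
    print("split():  ", split_counts["the"], "x the,", split_counts["dog"], "x dog")
    print("findall():", regex_counts["the"], "x the,", regex_counts["dog"], "x dog")
```

One side effect of \w+ is that contractions and possessives are cut at the apostrophe, so a bare "s" or "t" can show up as a word in the ranking.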
#3
Thank you @snippsat for your reply. I made the three corrections you suggested. I replaced the has_key call with the in operator. I replaced iteritems with items. I removed the ’rU’ mode argument from the call to open. Having made these changes, my script now looks like this:

#!/usr/bin/env python
# encoding: utf-8
"""
alice_file.py

Created by Jason Elbourne on 2011-12-29.
Copyright (c) 2011 Jason Elbourne. All rights reserved.
"""
import operator

## Get each word - Turn to Lower case (.lower())
## Count Duplicates of words
## Dictionary {word:count,word2:count2}
## Sort this based on most used word
## Print the Top 10 Words

def rank_words(f):
	"""
		Takes in a file, then ranks all the words within the file
		
		Args: a file
		
		Return: A sorted list of tuples
	"""
	word_dict = {} # Start with empty python Dictionary
	words = [] # Start with empty python List
	for line in f:
		list_of_words = line.split()
		for w in list_of_words:
			words.append(w.lower()) # Add Word to List

	for word in words:
		if word in word_dict:
			word_dict[word] += 1 # Incr. value in Dict.
		else:
			word_dict[word] = 1 # Add word and value to Dict.
        # This will sort the dictionary and return a list of Tuples
	return sorted(word_dict.items(), reverse=True, \
					key=operator.itemgetter(1))


def main():
	# Files
	f = open('Alice.txt')

	ranked_words_list = rank_words(f)

	f.close()

        # Print the results
	for w in list(ranked_words_list[:10]):
		print(w[0],"---", w[1])


if __name__ == '__main__':
	main()
I’m expecting the output you shared.

When I execute the Python script in my unix shell, there is no traceback which is good. But now there is no output at all. Why? The shell just prompts me to enter another command.

The same thing happens when I run your much more elegant alternative, @snippsat. When I run $ python pycounter.py, nothing happens. Why?

I figure the reason why will probably be pretty trivial and obvious once someone here clarifies. (I am ready to be hit over the head with a clue-bat. har har)
#4
(Jan-15-2019, 02:00 AM)Drone4four Wrote: When I run $ python pycounter.py, nothing happens. Why?
You need to run $ python3 pycounter.py to use Python 3 if you have a default Linux OS setup.
Test of code:
Output:
the --- 1777
and --- 833
to --- 782
a --- 670
of --- 610
she --- 518
said --- 421
in --- 412
it --- 374
was --- 334
Quote:I’m expecting the output you shared.
The output was from the code in my post, so it's different, as punctuation is removed.
It also uses f-string formatting so it looks better.
Output:
the --> 1818
and --> 940
to --> 809
a --> 690
of --> 631
it --> 610
she --> 553
i --> 543
you --> 481
said --> 462
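The format specifiers in that f-string do the lining up: <4 left-aligns in four columns, ^4 centres, and >4 right-aligns. A short sketch with a hypothetical helper, format_row (not part of either script), fed a few of the counts from the output above:

```python
def format_row(word, count):
    # <4 pads the word on the right, ^4 centres the arrow,
    # >4 pads the count on the left so the numbers line up.
    return f'{word:<4} {"-->":^4} {count:>4}'


if __name__ == "__main__":
    for word, count in [("the", 1818), ("a", 690), ("said", 462)]:
        print(format_row(word, count))
```

Every row comes out 14 characters wide as long as the word fits in four columns and the count in four digits. Longer words simply push the rest of the row to the right.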
#5
Hi @snippsat! Thanks for your reply.

I am very much looking forward to seeing a working demo on my system of both f-string formatting and the words counted with variations of punctuation but all my output is still blank.

You said:

(Jan-15-2019, 02:40 AM)snippsat Wrote:
(Jan-15-2019, 02:00 AM)Drone4four Wrote: When I run $ python pycounter.py, nothing happens. Why?
You need to run $ python3 pycounter.py to use Python 3 if you have a default Linux OS setup.

You are right that Debian, Fedora, and most of their derivatives still don’t have Python 3 installed by default. It’s still Python 2 for those users. But I’m running Manjaro stable. Here is the version of Python in my shell:

Output:
$ python --version
Python 3.7.1
I attempt to run my two scripts using: $ python pycounter.py and $ python snippsat.py (I’ve named your suggested alternate script after your forum alias.)

The output is still empty. After entering these commands, my unix shell just prompts me for my next command.

For the record, here are my two scripts verbatim:

pycounter.py:

#!/usr/bin/env python
# encoding: utf-8
"""
alice_file.py

Created by Jason Elbourne on 2011-12-29.
Copyright (c) 2011 Jason Elbourne. All rights reserved.
"""
import operator

## Get each word - Turn to Lower case (.lower())
## Count Duplicates of words
## Dictionary {word:count,word2:count2}
## Sort this based on most used word
## Print the Top 10 Words

def rank_words(f):
	"""
		Takes in a file, then ranks all the words within the file
		
		Args: a file
		
		Return: A sorted list of tuples
	"""
	word_dict = {} # Start with empty python Dictionary
	words = [] # Start with empty python List
	for line in f:
		list_of_words = line.split()
		for w in list_of_words:
			words.append(w.lower()) # Add Word to List

	for word in words:
		if word in word_dict:
			word_dict[word] += 1 # Incr. value in Dict.
		else:
			word_dict[word] = 1 # Add word and value to Dict.
        # This will sort the dictionary and return a list of Tuples
	return sorted(word_dict.items(), reverse=True, \
					key=operator.itemgetter(1))


def main():
	# Files
	f = open('Alice.txt')

	ranked_words_list = rank_words(f)

	f.close()

        # Print the results
	for w in list(ranked_words_list[:10]):
		print(w[0],"---", w[1])


if __name__ == '__main__':
	main()
snippsat.py:

#!/usr/bin/env python
"""
This script was provided by snippsat on python-forum.io.
Here is the URL: https://python-forum.io/Thread-Python-word-counter-and-ranker
"""
from collections import Counter
import re
 
with open('Alice.txt') as f:
    text = f.read().lower()
 
words = re.findall(r'\w+', text)
top_10 = Counter(words).most_common(10)
for word,count in top_10:
    print(f'{word:<4} {"-->":^4} {count:>4}')
I'm at a loss here. Would anyone care to jump in here and save the day?

Edit: It's also worth pointing out that I have triple-checked that the text file present in my directory and the text file as referenced in both scripts match. The name of the text file in all locations is Alice.txt.
#6
(Jan-16-2019, 02:25 AM)Drone4four Wrote:
Output:
$ python --version
Python 3.7.1
I attempt to run my two scripts using: $ python pycounter.py and $ python snippsat.py (I’ve named your suggested alternate script after your forum alias.)
Then it should work; both scripts are okay. Alice.txt must be in the same folder as the script, as no path is given.
Type only python pycounter.py; the $ is the prompt your shell already gives you.

Does a simple hello world work?
# hello.py
import sys

print('hello world')
print(sys.version)
Run from command line with python hello.py

Other stuff to check.
which python
pip -V
Try with and without the shebang line #!/usr/bin/env python.
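A bare filename such as 'Alice.txt' is resolved against the shell's current working directory, which only matches the script's folder when the script is run from there. A rough diagnostic sketch, assuming a hypothetical helper named describe_input (not from either script), that reports which interpreter is running, where the relative name resolves to, and whether the file is really there:

```python
import os
import sys


def describe_input(name):
    """Collect basic facts about a data file given by a relative or absolute name."""
    path = os.path.abspath(name)  # relative names resolve against os.getcwd()
    exists = os.path.isfile(path)
    return {
        "interpreter": sys.executable,
        "cwd": os.getcwd(),
        "path": path,
        "exists": exists,
        "size_bytes": os.path.getsize(path) if exists else None,
    }


if __name__ == "__main__":
    for key, value in describe_input("Alice.txt").items():
        print(f"{key}: {value}")
```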
#7
Thanks @snippsat for your help so far. ;)

(Jan-16-2019, 07:04 AM)snippsat Wrote: stuff to check.
which python
pip -V

Here they are:
Output:
$ which python
/usr/sbin/python
$ pip -V
pip 18.1 from /usr/lib/python3.7/site-packages/pip (python 3.7)
Then @snippsat asks:

Quote:Does a simple hello world work?
# hello.py
import sys

print('hello world')
print(sys.version)
Run from command line with python hello.py

I’ve copied these lines of code into a new hello.py script. When I run it, I get the expected output:
Output:
$ python hello.py
hello world
3.7.2 (default, Dec 29 2018, 21:15:15)
[GCC 8.2.1 20181127]
Quote:Try with and without the shebang line #!/usr/bin/env python.

The following is my output when running the two scripts without shebang lines:

Output:
$ python pycounter.py
$ python snippsat.py
See? Still no output.

Quote:Alice.txt must be in same folder as script as no Path is given.

You are right that I do not give a path. So when I refer to Alice.txt in my scripts, it is referring to the immediate current active directory. Here are the contents of my project folder:
Output:
$ ls -la
total 1744
drwxr-xr-x 2 gnull gnull    4096 Jan 18 09:04 .
drwxr-xr-x 9 gnull gnull    4096 Jan 15 21:23 ..
-rw-r--r-- 1 gnull gnull    1277 Mar 21  2016 Alice-modified.py~
-rw-r--r-- 1 gnull gnull       0 Jan 13 21:17 Alice.txt
-rw-r--r-- 1 gnull gnull  356949 Mar 21  2016 Chesterton-World.txt
-rw-r--r-- 1 gnull gnull 1254856 Mar 21  2016 heilbrons-galileo.txt
-rw-r--r-- 1 gnull gnull      64 Jan 18 08:54 hello.py
-rw-r--r-- 1 gnull gnull      79 Jan 12 23:03 .~lock.Chesterton-World.txt#
-rw-r--r-- 1 gnull gnull    1209 Jan 18 09:04 pycounter.py
-rw-r--r-- 1 gnull gnull     391 Jan 18 09:04 snippsat.py
As you can see, I do have some other text files that I am working with for further experimentation like Chesterton and Heilbron’s Galileo. Don’t mind these files. Everything else reflects what you would expect, right? You can see all four files we are working with: Alice.txt along with pycounter.py, snippsat.py along with the recently created hello.py.

!!! EDIT !!! : When I swap out Alice.txt for heilbrons-galileo.txt, I get this:

Output:
$ python snippsat.py
the --> 11765
of --> 6526
and --> 5511
to --> 5341
in --> 4243
a --> 3876
galileo --> 2956
s --> 2143
that --> 2120
his --> 1918
When I open Alice.txt, it's an empty txt file. I have no idea how or why my Alice.txt is empty. I don't recall ever emptying it. Ah well. But at least we figured it out. Thanks for all your help @snippsat! Both scripts run beautifully now!!!
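Since the root cause was a zero-byte input file, which both the dictionary loop and Counter handle without complaint, a small guard would have surfaced the problem on the first run. A minimal sketch, assuming a hypothetical helper named top_words that is not part of either script in this thread, and assuming the text is UTF-8:

```python
import os
import re
from collections import Counter


def top_words(path, n=10):
    """Return the n most common words in the file at path.

    Raises ValueError instead of silently returning nothing when the
    file is empty or contains no word characters.
    """
    if os.path.getsize(path) == 0:
        raise ValueError(f"{path} is empty (0 bytes)")
    # encoding="utf-8" is an assumption about the input file.
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"\w+", f.read().lower())
    if not words:
        raise ValueError(f"{path} contains no words")
    return Counter(words).most_common(n)
```

Calling top_words('Alice.txt') on the empty file would then stop with a ValueError naming the file, rather than returning to the prompt with no output.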

