For Word, Count in List (Counts.Items())

new_coder_231013 · Jul-14-2022, 12:04 PM

Hello,

I'm going through an introductory Python book and there's some code for a program that finds the most common word in a text file. Please see below for the code:

name = input('Enter file:')
handle = open(name, 'r')
counts = dict()

for line in handle:
    words = line.split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1

bigcount = None
bigword = None
for word, count in list(counts.items()):
    if bigcount is None or count > bigcount:
        bigword = word
        bigcount = count

print(bigword, bigcount)

I think that I understand the first two parts: They simply open a file (as input by the user), create an empty dictionary, and loop through the file to count each of the words. But I'm getting stuck on the last part of the code.

1 ) Could someone please explain what is going on in the "for word, count in list(counts.items())" line? I think this is creating the for loop that will iterate through and define what the most common word is (as dictated by the if statement that follows). But "list" hasn't previously been mentioned or defined in the code; can someone please explain what "list" is here?

Relatedly, I know that the items() method returns a "view object" containing the key-value pairs for a dictionary. Is that "view object" something that is only displayed on the terminal or is it something Python can work with? (I take it the latter option is correct but any insight would be appreciated).

2 ) The other five lines of code in the third part are relatively simple but I'm unclear on what exactly is happening or why. For example, bigcount is defined as "None" on the first of these lines. Since it's not subsequently mentioned or modified, won't it always be "None"? And, if so, why is there a need for the other "count > bigcount" condition in the or statement that appears a few lines down? To the extent that I understand the code, it seems to me that the "bigcount = None," "bigword = None," and "if bigcount is None or count > bigcount:" lines might be unnecessary.

3 ) How does Python "know" what the most common word is? I can see (with the exception of the line with "list" in it that I don't understand) how it figures out the largest number of times that each word appears ("bigcount"). But I'm unclear on how exactly it arrives at the word that corresponds to. One of the final lines simply says "bigword = word" but I'm not clear on how that results in the most common word (though I figure the answer is probably in the line with "list" in it that I don't understand).

I anticipate that these are truly novice questions so apologies in advance as I am trying to teach myself. Any insights would be greatly appreciated. Thank you.

ndc85430 · (This post was last modified: Jul-14-2022, 01:43 PM by ndc85430.)

I don't think you need the call to list on line 12 - the view object that items returns is already something that can be iterated over.

list, though, is a built-in like dict. Called with no arguments it creates an empty list; given an iterable, it builds a list from its items.
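
As a small sketch of both points (the counts dictionary below is made-up sample data, not from the book), looping over the view from items() directly gives the same pairs as looping over a list built from it:

```python
# Made-up sample data standing in for the word counts built from a file.
counts = {'hello': 1, 'world': 2}

# items() returns a view object; it can be looped over directly.
view = counts.items()
pairs_from_view = [(word, count) for word, count in view]

# list() copies the pairs from the view into a real list.
pairs_from_list = list(counts.items())

print(pairs_from_view)   # [('hello', 1), ('world', 2)]
print(pairs_from_list)   # [('hello', 1), ('world', 2)]
print(list())            # []
print(list('abc'))       # ['a', 'b', 'c']
```

So the view is something Python can work with, not just something shown on the terminal.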

On point 2, bigcount is modified, on line 15.

It might help if you stepped through the code with pen and paper acting out what the computer would do. That should help you understand the code.
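
If pen and paper feels slow, roughly the same trace can be had by adding a print inside the loop. This is only a sketch: the counts dictionary is made up for illustration, and the extra print is not part of the book's code.

```python
# Made-up counts, standing in for the dictionary built from the file.
counts = {'the': 3, 'cat': 1, 'sat': 5, 'mat': 2}

bigcount = None
bigword = None
for word, count in counts.items():
    # True on the first pass (bigcount is still None) and whenever
    # the current count beats the best one seen so far.
    if bigcount is None or count > bigcount:
        bigword = word
        bigcount = count
    print(f'after {word!r}: bigword={bigword!r}, bigcount={bigcount}')

print(bigword, bigcount)  # sat 5
```

bigword and bigcount are always updated together, which is how the word stays in step with its count. The "is None" test is only there for the first pass, because comparing a number with None using > raises a TypeError in Python 3.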

new_coder_231013 · Jul-18-2022, 12:28 PM

Thanks ndc85430. I was able to visualize the code execution using the tool available on the Python Tutor website and it helped me figure it out.

***snippsat*** · Jul-18-2022, 02:11 PM

new_coder_231013 Wrote:introductory Python book and there's some code for a program that finds the most common word in a text file
I anticipate that these are truly novice questions so apologies in advance as I am trying to teach myself. Any insights would be greatly appreciated.

The code is okay, but on most text files it will give the wrong result because it doesn't take punctuation into account.
For example, given hello world world. you would in most cases want to count 1 hello and 2 world.
As the code is now, it counts world. and world as different words.
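
A quick way to see this, running the counting loop from the book on that one sample line instead of a file:

```python
# The sample sentence from the post above.
line = 'hello world world.'

counts = dict()
for word in line.split():
    counts[word] = counts.get(word, 0) + 1

print(counts)  # {'hello': 1, 'world': 1, 'world.': 1}
```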

You can write your own solution, or you can use Counter from the standard library.
It also has a most_common() method that fits this task well.

>>> from collections import Counter
>>> 
>>> lst = ['hello', 'world', 'world']
>>> count = Counter(lst)
>>> count
Counter({'world': 2, 'hello': 1})
>>> count.most_common()
[('world', 2), ('hello', 1)]
>>> count.most_common(1)
[('world', 2)]

Putting it together, and also removing punctuation; I use a regex here.
Using alice_in_wonderland.txt as a test file.
The code you posted gives the 1605 for this file, so 39 fewer occurrences of the word than the code below finds.

from collections import Counter
import re

with open('alice_wonderland.txt') as f:
    text = f.read().lower()
words = re.findall(r'\w+', text)
top_10 = Counter(words).most_common(10)
for word, count in top_10:
    print(f'{word:<8} --> {count:>5}')

Output:
the      -->  1644
and      -->   872
to       -->   729
a        -->   632
it       -->   595
she      -->   553
i        -->   543
of       -->   514
said     -->   462
you      -->   411

If you do this only for words of length 5 or more, you can guess what the story is about from the result.

from collections import Counter
import re

with open('alice_wonderland.txt') as f:
    text = f.read().lower()
words = re.findall(r'\w+', text)
top_100 = Counter(words).most_common(100)
for word, count in top_100:
    if len(word) >= 5:
        print(f'{word:<8} --> {count:>5}')

Output:
alice    -->   399
little   -->   128
there    -->    99
about    -->    94
would    -->    83
again    -->    83
herself  -->    83
could    -->    77
queen    -->    75
thought  -->    74
turtle   -->    59
began    -->    58
hatter   -->    56
quite    -->    55
gryphon  -->    55
think    -->    53
their    -->    52
rabbit   -->    51
first    -->    51
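
One possible variation, sketched here on a made-up sample string rather than the file: filter out the short words before counting, so that most_common(n) returns exactly n long words.

```python
from collections import Counter
import re

# Made-up sample text; the posts above read the text from a file instead.
text = 'Alice thought the rabbit was late. The rabbit thought so too, Alice!'

words = re.findall(r'\w+', text.lower())
# Keep only words of 5 or more letters, then count what is left.
long_words = [word for word in words if len(word) >= 5]
top_3 = Counter(long_words).most_common(3)
print(top_3)
```

With the take-the-top-100-then-filter approach, the number of rows printed depends on how many of the top 100 happen to be long (19 in the output above). Filtering first gives exactly the number asked for, as long as the text has that many distinct long words.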

new_coder_231013 · Jul-20-2022, 11:17 AM

Thanks snippsat! I've looked over what you wrote and it actually generated some more questions.

1 ) I’m curious why the code from the introductory Python book ignores punctuation. Does it have something to do with the way the split() method works? Or possibly the get function (which appears two lines down from the split() method in the code)?

2 ) You provided three separate sets of code. Does the first set of code you provided need to go with the second set of code in order for the second set of code to work? I think the answer is "No" but I'm not sure.

3 ) On line 8 in the second set of code, doesn’t the for loop need to read “for word in words” (since word hasn’t been previously defined and Python needs to know which part of the code to iterate through)?

Also on lines 8 and 9, where is count coming from or what is it doing? Is this the count() method (used for lists)? Or is it what was already defined as “count = Counter(lst)” in the previous set of code? Which method or function is "count" here?

I’m trying to figure out what is actually going on in the print statement on the last line of the second set of code you wrote. At first I thought that it would only print those sets of words that have 1) A “value” of less than 8 (that value being defined as the number of characters in the string) and 2) That appear less than five times in the text. But I modified the code slightly to experiment with this and it returned words that violated these conditions:

from collections import Counter
import re

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse porta neque nulla, condimentum ultrices tortor vehicula gravida. Suspendisse a lacinia nunc. Etiam lobortis nunc ipsum, ac malesuada erat fermentum id. Sed arcu neque, fringilla ut mauris sit amet, vehicula auctor augue. Pellentesque ut mi erat. Donec consequat eleifend aliquam. Proin"
readtext = text.lower()
words = re.findall(r'\w+', readtext)
top_10 = Counter(words).most_common(10)
for word, count in top_10:
    print(f'{word:<8} --> {count:>5}')

Output:
ipsum    -->     2
sit      -->     2
amet     -->     2
suspendisse -->     2
neque    -->     2
vehicula -->     2
nunc     -->     2
erat     -->     2
ut       -->     2
lorem    -->     1

I’m also unclear on what is going on with “count” in the for loop (I’m familiar with using “in” to check whether something appears in something else but I’m not sure how “count” – which hasn’t previously been defined and which I’m not sure which function or method it is – would play into that).

Thanks again for your help and have a great day.

***snippsat*** · (This post was last modified: Jul-20-2022, 01:47 PM by snippsat.)

(Jul-20-2022, 11:17 AM)new_coder_231013 Wrote: 1 ) I’m curious why the code from the introductory Python book ignores punctuation. Does it have something to do with the way the split() method works? Or possibly the get function (which appears two lines down from the split() method in the code)?

It's easy to miss: if you only test with a small text file, it can sometimes appear to work.
It's not a fault in split(), which only splits on whitespace; the code in the book just doesn't do anything about punctuation.
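
If you want to keep the structure of the book's loop, one possible fix is to strip punctuation from each word before counting it. This is a sketch on made-up sample lines; str.strip() and string.punctuation are both in the standard library, and the lowercasing is an extra assumption about what should count as the same word.

```python
import string

# Made-up sample text; in the book's program these lines come from the file.
lines = ['Hello world, world.', 'hello again!']

counts = dict()
for line in lines:
    for word in line.split():
        # Strip punctuation from both ends and fold case before counting.
        word = word.strip(string.punctuation).lower()
        if word:  # skip tokens that were nothing but punctuation
            counts[word] = counts.get(word, 0) + 1

print(counts)  # {'hello': 2, 'world': 2, 'again': 1}
```

This only trims the ends of each word, so a word like don't stays in one piece, whereas a \w+ regex would split it into don and t.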

(Jul-20-2022, 11:17 AM)new_coder_231013 Wrote: 2 ) You provided three separate sets of code. Does the first set of code you provided need to go with the second set of code in order for the second set of code to work? I think the answer is "No" but I'm not sure.

The first code is just a demonstration of how Counter works.
Are you familiar with interactive testing (the lines with >>>)? Try my first code out yourself in the interactive interpreter.
The last two are almost the same; the only difference is line 9 in the last one, which prints only words of length 5 or more.

Quote:3 ) On line 8 in the second set of code, doesn’t the for loop need to read “for word in words” (since word hasn’t been previously defined and Python needs to know which part of the code to iterate through)?

Here it loops over the list top_10, which comes from the Counter most_common(10) method.
Learn to look at what code does (use print and the interactive interpreter); then you can just test it out.

>>> top_10
[('the', 1644),
 ('and', 872),
 ('to', 729),
 ('a', 632),
 ('it', 595),
 ('she', 553),
 ('i', 543),
 ('of', 514),
 ('said', 462),
 ('you', 411)]
>>> 
>>> type(top_10)
<class 'list'>
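
On where word and count come from: they are just variable names that the for statement creates, not the count() method and not the earlier Counter object. Each item in the list is a two-element tuple, and writing two names after "for" unpacks it. A small sketch using a shortened copy of the list above (the name top_3 is only for this example):

```python
# A shortened copy of the top_10 list shown above.
top_3 = [('the', 1644), ('and', 872), ('to', 729)]

# Each item is a (word, count) tuple; looping with one name gives the tuple.
for pair in top_3:
    print(pair)          # ('the', 1644), then ('and', 872), ...

# Two names after "for" unpack each tuple into its two parts.
for word, count in top_3:
    print(word, count)   # the 1644, then and 872, ...

# The same unpacking works outside a loop.
first_word, first_count = top_3[0]
print(first_word, first_count)  # the 1644
```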

Quote:I’m trying to figure out what is actually going on in the print statement on the last line of the second set of code you wrote. At first I thought that it would only print those sets of words that have 1) A “value” of less than 8 (that value being defined as the number of characters in the string) and 2) That appear less than five times in the text. But I modified the code slightly to experiment with this and it returned words that violated these conditions:

In the print I use an f-string; the numbers are for padding and aligning the strings, not for filtering.
If you change it to this, it will look okay for longer strings.

print(f'{word:<12} --> {count:>5}')

Output:
ipsum        -->     2
sit          -->     2
amet         -->     2
suspendisse  -->     2
neque        -->     2
vehicula     -->     2
nunc         -->     2
erat         -->     2
ut           -->     2
lorem        -->     1
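
The number in a format spec like :<8 is a minimum width: shorter strings are padded, longer ones are left whole, which is why suspendisse pushed its column out of line earlier. Instead of guessing a width, it can be computed from the data. A minimal sketch, with a made-up pairs list standing in for the most_common() result:

```python
# Made-up sample pairs, including one word longer than 8 characters.
pairs = [('ipsum', 2), ('suspendisse', 2), ('ut', 2)]

# <8 pads short words to 8 characters; longer words are not cut off.
print(f"{'ut':<8}|")            # ut      |
print(f"{'suspendisse':<8}|")   # suspendisse|

# Compute the column width from the longest word instead of hard-coding it.
width = max(len(word) for word, _ in pairs)
rows = [f'{word:<{width}} --> {count:>5}' for word, count in pairs]
for row in rows:
    print(row)
```

The nested braces in {word:<{width}} let the width itself be a variable.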

Here is a demo of f-strings that you can test out.

>>> name = 'f-string'
>>> print(f"String formatting is called {name.upper():*^20}")
String formatting is called ******F-STRING******
 
# f-strings can take any Python expressions inside the curly braces.
>>> cost = 99.75999
>>> finance = 50000
>>> print(f'Total cost {cost + finance:.2f}')
Total cost 50099.76
   
>>> for word in 'f-strings are cool'.split():
...     print(f'{word.upper():~^20}')
...
~~~~~F-STRINGS~~~~~~
~~~~~~~~ARE~~~~~~~~~
~~~~~~~~COOL~~~~~~~~

If I change it to this, you can see the padding and alignment change.

>>> name = 'f-string'
>>> print(f"String formatting is called {name.upper():*>20}")
String formatting is called ************F-STRING

new_coder_231013 · Jul-21-2022, 02:51 PM

Thanks snippsat. I looked over everything you wrote and looked a few things up and everything is clear now. Thanks again.
