Python Forum
For Word, Count in List (Counts.Items())
#1
Hello,

I'm going through an introductory Python book and there's some code for a program that finds the most common word in a text file. Please see below for the code:

name = input('Enter file:')
handle = open(name, 'r')
counts = dict()

for line in handle:
     words = line.split()
     for word in words:
          counts[word] = counts.get(word, 0) + 1

bigcount = None
bigword = None
for word, count in list(counts.items()):
     if bigcount is None or count > bigcount:
          bigword = word
          bigcount = count

print(bigword, bigcount)
I think that I understand the first two parts: They simply open a file (as input by the user), create an empty dictionary, and loop through the file to count each of the words. But I'm getting stuck on the last part of the code.

1 ) Could someone please explain what is going on in the "for word, count in list(counts.items())" line? I think this is creating the for loop that will iterate through and define what the most common word is (as dictated by the if statement that follows). But "list" hasn't previously been mentioned or defined in the code; can someone please explain what "list" is here?

Relatedly, I know that the items() method returns a "view object" containing the key-value pairs for a dictionary. Is that "view object" something that is only displayed on the terminal or is it something Python can work with? (I take it the latter option is correct but any insight would be appreciated).

2 ) The other five lines of code in the third part are relatively simple but I'm unclear on what exactly is happening or why. For example, bigcount is defined as "None" on the first of these lines. Since it's not subsequently mentioned or modified, won't it always be "None"? And, if so, why is there a need for the other "count > bigcount" condition in the or statement that appears a few lines down? To the extent that I understand the code, it seems to me that the "bigcount = None," "bigword = None," and "if bigcount is None or count > bigcount:" lines might be unnecessary.

3 ) How does Python "know" what the most common word is? I can see (with the exception of the line with "list" in it that I don't understand) how it figures out the largest number of times that each word appears ("bigcount"). But I'm unclear on how exactly it arrives at the word that corresponds to. One of the final lines simply says "bigword = word" but I'm not clear on how that results in the most common word (though I figure the answer is probably in the line with "list" in it that I don't understand).

I anticipate that these are truly novice questions so apologies in advance as I am trying to teach myself. Any insights would be greatly appreciated. Thank you.
Reply
#2
I don't think you need the call to list on line 12 - the view object that items returns is already something that can be iterated over.

list, though, is like dict: it is a built-in type. Called with no arguments it creates an empty list; called with one iterable argument it builds a list from that iterable's items.
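As a small illustration of both points, here is a sketch using a made-up dictionary (the sample data is an assumption, not taken from the book). It shows that the view object returned by items() is a real object Python can loop over, and that list() only copies the pairs into a list.

```python
# Hypothetical sample data standing in for the word counts built from a file.
counts = {'hello': 1, 'world': 2}

items_view = counts.items()   # a dict view object, usable in code, not just for display
as_list = list(items_view)    # list() copies the (key, value) pairs into a real list

# Both can be looped over; each element is a (key, value) tuple.
pairs_from_view = [(word, count) for word, count in items_view]
pairs_from_list = [(word, count) for word, count in as_list]

print(type(items_view).__name__)           # dict_items
print(as_list)                             # [('hello', 1), ('world', 2)]
print(pairs_from_view == pairs_from_list)  # True
print(list() == [])                        # True: no arguments gives an empty list
```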

On point 2, bigcount is modified, on line 15.

It might help if you stepped through the code with pen and paper acting out what the computer would do. That should help you understand the code.
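If pen and paper is awkward, the same stepping-through can be done in code. This is only a sketch with a made-up dictionary; the trace list is a hypothetical helper added to record what bigword and bigcount hold after each pass of the loop.

```python
# Hypothetical counts, as the first loop might have built them.
counts = {'hello': 1, 'world': 2, 'again': 1}

bigcount = None
bigword = None
trace = []  # records (word, count, bigword, bigcount) after each pass

for word, count in counts.items():
    # First pass: bigcount is None, so the first pair is always taken.
    # Later passes: a pair is taken only if its count beats the best so far.
    if bigcount is None or count > bigcount:
        bigword = word      # remember the word ...
        bigcount = count    # ... and its count, together
    trace.append((word, count, bigword, bigcount))

for step in trace:
    print(step)
print(bigword, bigcount)
```

Because bigword and bigcount are always updated in the same if block, the word stored in bigword is the one that belongs to the largest count seen so far.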
Reply
#3
Thanks ndc85430. I was able to visualize the code execution using the tool available on the Python Tutor website and it helped me figure it out.
Reply
#4
new_coder_231013 Wrote:introductory Python book and there's some code for a program that finds the most common word in a text file
I anticipate that these are truly novice questions so apologies in advance as I am trying to teach myself. Any insights would be greatly appreciated.
The code is only okay: on most text files it will give the wrong result because it doesn't account for punctuation.
For example, given hello world world., in most cases you want to count 1 hello and 2 world.
As the code is now, it counts world. and world as different words.
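A quick way to see this is to run the book's counting step on that one line. This is a minimal sketch with the line hard-coded instead of read from a file.

```python
line = 'hello world world.'

words = line.split()  # splits on whitespace only, so the period stays attached
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

print(words)   # ['hello', 'world', 'world.']
print(counts)  # {'hello': 1, 'world': 1, 'world.': 1}
```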

You can write your own solution, or you can use Counter from the standard library.
It also has a most_common() method that fits this task well.
>>> from collections import Counter
>>> 
>>> lst = ['hello', 'world', 'world']
>>> count = Counter(lst)
>>> count
Counter({'world': 2, 'hello': 1})
>>> count.most_common()
[('world', 2), ('hello', 1)]
>>> count.most_common(1)
[('world', 2)]
Putting it together and also removing punctuation; I use a regex here.
I use alice_in_wonderland.txt as a test file.
The code you posted gives the 1605 as its result, which is 39 fewer occurrences of the than the code below finds.
from collections import Counter
import re

with open('alice_wonderland.txt') as f:
    text = f.read().lower()
words = re.findall(r'\w+', text)
top_10 = Counter(words).most_common(10)
for word, count in top_10:
    print(f'{word:<8} --> {count:>5}')
Output:
the      -->  1644
and      -->   872
to       -->   729
a        -->   632
it       -->   595
she      -->   553
i        -->   543
of       -->   514
said     -->   462
you      -->   411
If you restrict this to words of 5 or more characters, you can guess what the story is about from the result.
from collections import Counter
import re

with open('alice_wonderland.txt') as f:
    text = f.read().lower()
words = re.findall(r'\w+', text)
top_100 = Counter(words).most_common(100)
for word, count in top_100:
    if len(word) >= 5:
        print(f'{word:<8} --> {count:>5}')
Output:
alice    -->   399
little   -->   128
there    -->    99
about    -->    94
would    -->    83
again    -->    83
herself  -->    83
could    -->    77
queen    -->    75
thought  -->    74
turtle   -->    59
began    -->    58
hatter   -->    56
quite    -->    55
gryphon  -->    55
think    -->    53
their    -->    52
rabbit   -->    51
first    -->    51
Reply
#5
Thanks snippsat! I've looked over what you wrote and it actually generated some more questions.

1 ) I’m curious why the code from the introductory Python book ignores punctuation. Does it have something to do with the way the split() method works? Or possibly the get function (which appears two lines down from the split() method in the code)?

2 ) You provided three separate sets of code. Does the first set of code you provided need to go with the second set of code in order for the second set of code to work? I think the answer is "No" but I'm not sure.

3 ) On line 8 in the second set of code, doesn’t the for loop need to read “for word in words” (since word hasn’t been previously defined and Python needs to know which part of the code to iterate through)?

Also on lines 8 and 9, where is count coming from or what is it doing? Is this the count() method (used for lists)? Or is it what was already defined as “count = Counter(lst)” in the previous set of code? Which method or function is "count" here?

I’m trying to figure out what is actually going on in the print statement on the last line of the second set of code you wrote. At first I thought that it would only print those sets of words that have 1) A “value” of less than 8 (that value being defined as the number of characters in the string) and 2) That appear less than five times in the text. But I modified the code slightly to experiment with this and it returned words that violated these conditions:

from collections import Counter
import re

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse porta neque nulla, condimentum ultrices tortor vehicula gravida. Suspendisse a lacinia nunc. Etiam lobortis nunc ipsum, ac malesuada erat fermentum id. Sed arcu neque, fringilla ut mauris sit amet, vehicula auctor augue. Pellentesque ut mi erat. Donec consequat eleifend aliquam. Proin"
readtext = text.lower()
words = re.findall(r'\w+', readtext)
top_10 = Counter(words).most_common(10)
for word, count in top_10:
     print(f'{word:<8} --> {count:>5}')
Output:
ipsum    -->     2
sit      -->     2
amet     -->     2
suspendisse -->     2
neque    -->     2
vehicula -->     2
nunc     -->     2
erat     -->     2
ut       -->     2
lorem    -->     1
I’m also unclear on what is going on with “count” in the for loop (I’m familiar with using “in” to check whether something appears in something else but I’m not sure how “count” – which hasn’t previously been defined and which I’m not sure which function or method it is – would play into that).

Thanks again for your help and have a great day.
Reply
#6
(Jul-20-2022, 11:17 AM)new_coder_231013 Wrote: 1 ) I’m curious why the code from the introductory Python book ignores punctuation. Does it have something to do with the way the split() method works? Or possibly the get function (which appears two lines down from the split() method in the code)?
This is easy to miss: if you only test with a small text file, the code can appear to work.
It is not a fault in split(), which only splits on whitespace; the code in the book simply doesn't do anything about punctuation.
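For comparison, here is one possible way to keep the book's dictionary approach and still deal with punctuation. It is only a sketch: count_words is a hypothetical helper name, and stripping with string.punctuation is a different technique from the regex used earlier in this thread.

```python
import string

def count_words(lines):
    """Count words, ignoring case and leading/trailing punctuation."""
    counts = {}
    for line in lines:
        for raw in line.split():
            # Remove punctuation from both ends only, then lowercase.
            word = raw.strip(string.punctuation).lower()
            if word:  # skip tokens that were nothing but punctuation
                counts[word] = counts.get(word, 0) + 1
    return counts

counts = count_words(['Hello world, world.', '-- hello!'])
print(counts)  # {'hello': 2, 'world': 2}
```

Note that this keeps a word such as don't in one piece, while re.findall(r'\w+', ...) would split it into don and t, so the two approaches can give slightly different counts.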

(Jul-20-2022, 11:17 AM)new_coder_231013 Wrote: 2 ) You provided three separate sets of code. Does the first set of code you provided need to go with the second set of code in order for the second set of code to work? I think the answer is "No" but I'm not sure.
The first block is just a demonstration of how Counter works.
Are you familiar with interactive testing (the lines starting with >>>)? Try my first block yourself in the interactive interpreter.
The last two are almost the same; the only difference is line 9 of the last one, which keeps only words of length 5 or more.
Quote:3 ) On line 8 in the second set of code, doesn’t the for loop need to read “for word in words” (since word hasn’t been previously defined and Python needs to know which part of the code to iterate through)?
It loops over top_10, the list returned by the Counter most_common(10) method.
Learn to inspect what code does (use print and the interactive interpreter); then you can just test it out.
>>> top_10
[('the', 1644),
 ('and', 872),
 ('to', 729),
 ('a', 632),
 ('it', 595),
 ('she', 553),
 ('i', 543),
 ('of', 514),
 ('said', 462),
 ('you', 411)]
>>> 
>>> type(top_10)
<class 'list'>
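So word and count are not functions or methods; they are just names that the for loop assigns on each pass by unpacking one tuple from the list. A minimal sketch, with top_3 as a made-up shortened copy of the list shown above:

```python
# First three pairs from the top_10 list shown above.
top_3 = [('the', 1644), ('and', 872), ('to', 729)]

# Long form: take one tuple at a time, then unpack it into two names.
for pair in top_3:
    word, count = pair
    print(word, count)

# Short form used in the thread: the unpacking happens in the for line itself.
for word, count in top_3:
    print(word, count)
```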
Quote:I’m trying to figure out what is actually going on in the print statement on the last line of the second set of code you wrote. At first I thought that it would only print those sets of words that have 1) A “value” of less than 8 (that value being defined as the number of characters in the string) and 2) That appear less than five times in the text. But I modified the code slightly to experiment with this and it returned words that violated these conditions:
In the print call I use an f-string; the values are for padding and aligning strings.
If you change it to this, the output will also line up for longer strings.
print(f'{word:<12} --> {count:>5}') 
Output:
ipsum        -->     2
sit          -->     2
amet         -->     2
suspendisse  -->     2
neque        -->     2
vehicula     -->     2
nunc         -->     2
erat         -->     2
ut           -->     2
lorem        -->     1
Here is a demo of f-strings that you can test out.
>>> name = 'f-string'
>>> print(f"String formatting is called {name.upper():*^20}")
String formatting is called ******F-STRING******
 
# f-strings can take any Python expressions inside the curly braces.
>>> cost = 99.75999
>>> finance = 50000
>>> print(f'Total cost {cost + finance:.2f}')
Total cost 50099.76
   
>>> for word in 'f-strings are cool'.split():
...     print(f'{word.upper():~^20}')
...
~~~~~F-STRINGS~~~~~~
~~~~~~~~ARE~~~~~~~~~
~~~~~~~~COOL~~~~~~~~
If I change it to this, you can see the padding and alignment change.
>>> name = 'f-string'
>>> print(f"String formatting is called {name.upper():*>20}")
String formatting is called ************F-STRING
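One more point that may clear up the earlier confusion: the number in {word:<8} is a minimum field width, not a filter and not a cut-off. A short sketch (the trailing | is only there to make the padding visible):

```python
short_word = 'ut'
long_word = 'suspendisse'
count = 2

padded_short = f'{short_word:<8}|'  # left-aligned, padded with spaces to 8 characters
padded_long = f'{long_word:<8}|'    # longer than 8: printed in full, nothing is cut
padded_count = f'{count:>5}|'       # right-aligned in a field 5 characters wide

print(padded_short)  # ut      |
print(padded_long)   # suspendisse|
print(padded_count)  #     2|
```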
Reply
#7
Thanks snippsat. I looked over everything you wrote and looked a few things up and everything is clear now. Thanks again.
Reply

