Return five most frequent words

Elisabet · (This post was last modified: Dec-06-2023, 11:07 PM by deanhystad.)

Hello, I'm pretty new to coding and have started attending an online course on programming techniques. I'm stuck on one part of an assignment where I have to return the 5 most frequent words in a text file, disregarding any stop words. I'm not allowed to use any modules. My function only returns the index numbers and not the actual words. Does anyone have any tips on what I can do?

# Returns 5 most frequent words in a text
def important_words(an_index, stop_words):
    mydict = index_text(an_index)   # takes the result from index_text function
    all_words = []                  # Create an empty list to put in all words from an_index in
    
    # Remove stop_words from the text
    for item in stop_words:
        if item in an_index:
            del an_index[item]
    
    # Combine all the words into a single list
    for key in mydict:
        all_words.extend(mydict[key])    
        
    # Count occurrences of each word
    word_counter = {}                    # This dictionary stores words as keys and their respective numbers as values
    for word in all_words:
        if word not in stop_words:       # If the word is a stop-word, it's ignored and not included in word_counter
            if word in word_counter:    
                word_counter[word] += 1  # If the word has already been inserted into the dictionary, it adds to the count by 1
            else:
                word_counter[word] = 1   # If the word isn't already in word_counter, it will be added
                
    sorted_words = sorted(word_counter.items(), key=lambda x: x[1], reverse=True)   # Sorts tuple, x[1] takes the second elements of the tuple (the values).
    # The order of the tuple is reversed, starting with the largest values to the smallest.
    top_words = [word[0] for word in sorted_words[:5]]
    return top_words   # Returns the five most frequent words in the text

deanhystad write Dec-06-2023, 11:07 PM:
Please post all code, output and errors (in its entirety) between their respective tags. Refer to BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.

rob101 · (This post was last modified: Dec-07-2023, 12:40 AM by rob101.)

Possibly, you're over-engineering this.

I would start with something basic, like this:

text = """This is a text object.
   This object has many words. I will now count
   the words and return most frequent ones, in
   ascending order."""

word_list = [word for word in text.split()]

word_list.sort()

for word in word_list:
    print(word)

... and then figure out how to count the occurrences of the words in that list. Note that you'll need to work out how to distinguish between object and object. in the above example, because the trailing period means that the two strings are not equal.

There may be a flaw in my crude example which makes it unworkable, but it's a starting point and (like I say) where I would start from.

For the so-called "stop words", simply iterate over the word list, and remove them. That way, they'll not form part of the counting process.
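
A minimal sketch of those steps, assuming that stripping a handful of punctuation characters from each end of a word is enough for this text. The stop-word list and the name counts are made up for illustration and are not part of the assignment.

```python
text = """This is a text object.
   This object has many words. I will now count
   the words and return most frequent ones, in
   ascending order."""

# Hypothetical stop-word list, for illustration only.
stop_words = ["this", "is", "a", "the", "and", "in", "i"]

# Lowercase each word and strip punctuation from both ends,
# so that "object." and "object" are treated as the same word.
word_list = [word.strip(".,;:!?").lower() for word in text.split()]

# Drop the stop words before counting.
word_list = [word for word in word_list if word not in stop_words]

# Count the occurrences with a plain dictionary (no modules needed).
counts = {}
for word in word_list:
    if word in counts:
        counts[word] += 1
    else:
        counts[word] = 1

print(counts["object"])  # 2
print(counts["words"])   # 2
```

Note that strip() only touches the ends of each word, so punctuation inside a word (an apostrophe, say) is left alone.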

DPaul · (This post was last modified: Dec-07-2023, 07:28 AM by DPaul.)

Hi,
As Rob stated, you have to get rid of the punctuation.
And take upper and lower case into account (see upper() and lower()).
Also, I'm assuming numbers count as "words".
Max 4 lines of code to do this.
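
One possible way to do that in a few lines, offered as a sketch and not necessarily the version Paul had in mind. The function name normalize is hypothetical. It lowercases everything and turns any character that is not a letter or digit into a space, so numbers survive as "words".

```python
def normalize(text):
    """Lowercase the text and replace anything that is not a letter or digit with a space."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()

print(normalize("The object. the OBJECT, 42 times!"))
# ['the', 'object', 'the', 'object', '42', 'times']
```

A side effect of this approach is that a word such as don't is split into don and t, which may or may not matter for the assignment.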

Paul

**deanhystad** · Dec-07-2023, 05:25 PM

The question has to do with why the function returns a list of numbers instead of a list of words. We can assume that the original word list has already been processed so there is no punctuation and capitalization issues are resolved before calling the function.

You did not include the code for the index_text() function, but I think the mistake may be there. Looking at your code, logic dictates that index_text returns an index: word dictionary. When I write the function so that it returns a word: index dictionary instead, your code returns a list of index numbers. So I wrote the function like this:

text = "i will now count the words and return the most frequent words in ascending order".split()
an_index = list(range(len(text)))

def index_text(index):
    return {text[i]: i for i in index}

This uncovers other errors. For instance, this code does nothing:

    for item in stop_words:
        if item in an_index:
            del an_index[item]

By the time this code runs, your function is done using an_index.

This code is a problem. Read about list.extend() and list.append(). You are using the wrong one.

    # Combine all the words into a single list
    for key in mydict:
        all_words.extend(mydict[key])

extend(mydict[key]) treats mydict[key] as a sequence, so a string is split into letters that are each appended to all_words. You want append(mydict[key]). If you read about dictionaries, you will see there is a method that returns all the dictionary values, so you don't need this loop at all.

Your dictionary reading would also uncover a cleaner way to do this:
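
A small illustration of the difference, using a made-up index: word dictionary (the real output of index_text was never posted, so the contents of mydict here are an assumption):

```python
letters = []
letters.extend("count")      # extend() iterates over its argument
print(letters)               # ['c', 'o', 'u', 'n', 't']

words = []
words.append("count")        # append() adds the argument as one item
print(words)                 # ['count']

# Hypothetical index: word dictionary.
mydict = {0: "the", 1: "words", 2: "the"}
all_words = list(mydict.values())   # all the values, no loop needed
print(all_words)             # ['the', 'words', 'the']
```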

            if word in word_counter:    
                word_counter[word] += 1  # If the word has already been inserted into the dictionary, it adds to the count by 1
            else:
                word_counter[word] = 1   # If the word isn't already in word_counter, it will be added

if statements in Python often indicate you are doing something the long way. There is no error in this code; it is just longer than it needs to be.
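
One shorter form, possibly the one hinted at here, uses the dictionary's get() method with a default value. The word list below is made up for the example.

```python
all_words = ["the", "words", "the", "count"]

word_counter = {}
for word in all_words:
    # get() returns the current count, or 0 if the word is not a key yet.
    word_counter[word] = word_counter.get(word, 0) + 1

print(word_counter)  # {'the': 2, 'words': 1, 'count': 1}
```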

Elisabet · Dec-11-2023, 09:08 PM

Thank you for the help. I realized that the course I'm taking is not very beginner-friendly, and that this task was too much for me at the moment. I'm grateful that you guys tried to explain this for me anyway!

karimali · (This post was last modified: Jan-19-2025, 09:02 AM by Gribouillis.)

I can help with this. The good news is that it looks like you're almost there, but there are a couple of issues in your code that need fixing.

First, the issue you're encountering is likely related to how you are manipulating the an_index variable and its structure. In the loop where you’re removing stop words from an_index, you're deleting items from an_index directly, but an_index is passed as an argument, and it’s not clear how it’s structured. It might not be a dictionary, or it could be that the keys you're attempting to delete don’t match the structure you expect.

Here’s a revised version of your function:

def important_words(an_index, stop_words):
    mydict = index_text(an_index)  # assumes this returns a dictionary {index: [words]}
    all_words = []  # To collect all words from the index

    # Remove stop words from the text
    for key in mydict:
        filtered_words = [word for word in mydict[key] if word not in stop_words]
        all_words.extend(filtered_words)  # Add the filtered words to the list

    # Count occurrences of each word
    word_counter = {}
    for word in all_words:
        word_counter[word] = word_counter.get(word, 0) + 1  # Count the word

    # Sort words by frequency in descending order
    sorted_words = sorted(word_counter.items(), key=lambda x: x[1], reverse=True)
    
    # Get the top 5 frequent words
    top_words = [word[0] for word in sorted_words[:5]]
    return top_words
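
To make the {index: [words]} assumption concrete, here is a sketch with a hypothetical stand-in for index_text(). The real function was never posted, so the structure it returns is a guess. With this shape each dictionary value is a list of words, which is why extend() is the right call in the filtering loop.

```python
# Hypothetical stand-in for index_text(); the real one was never posted.
def index_text(lines):
    """Return {line_number: [words on that line]}."""
    return {number: line.lower().split() for number, line in enumerate(lines)}

stop_words = ["the", "a", "of"]
mydict = index_text(["The hunt of the whale", "A whale of a tale"])
print(mydict)
# {0: ['the', 'hunt', 'of', 'the', 'whale'], 1: ['a', 'whale', 'of', 'a', 'tale']}

all_words = []
for key in mydict:
    # Each value is a list of words, so extend() adds them one by one.
    all_words.extend(word for word in mydict[key] if word not in stop_words)
print(all_words)  # ['hunt', 'whale', 'whale', 'tale']
```

If index_text() instead maps each index to a single word, extend() would split that word into letters and append() would be needed.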

Gribouillis write Jan-19-2025, 09:02 AM:
Clickbait link removed. Please read What to NOT include in a post

Larz60+ write Jan-19-2025, 08:57 AM:
Please post all code, output and errors (in its entirety) between their respective tags. Refer to BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
Tags have been added this time. Please use BBCode tags on future posts.

Pedroski55 · (This post was last modified: Jan-23-2025, 05:23 AM by Pedroski55.)

Funny, in school I never liked homework!

Just out of interest, I tried like this.

path2text = '/home/pedro/temp/Frankenstein_Letter_1.txt'

with open(path2text) as f:
    words = f.read()
    # make everything capital letters or The and the will be 2 different words
    words = words.upper()

# find punctuation and numbers
unwanted = []
for w in words:
    if not w.isalpha() and w not in unwanted:
        unwanted.append(w)

# unwanted  looks like: ['_', ' ', '.', ',', '\n', '1', '7', '—', '?', ';', '’', '-', '!', ':', "'"]
len(unwanted) # 15

words_list = words.split()
len(words_list) # 1221
for i in range(len(words_list)):
    for u in unwanted:
        if u in words_list[i]:
            words_list[i] = words_list[i].replace(u, '')
    
words_set = set(words_list)
len(words_set) # 586

# a dictionary to hold the count of each word        
words_dict = {w:0 for w in words_set}

# loop through words_list and increase the count for each dictionary key
for word in words_list:
    words_dict[word] +=1

# make a list of (word, count) tuples
tups = [(key, words_dict[key]) for key in words_dict.keys()]
# sort tups by tup[1], the count reversed so the highest count comes first
tups.sort(key=lambda tup: tup[1], reverse=True)
# show the results
for i in range(5):
    print(tups[i])

Output:
('THE', 67)
('I', 46)
('AND', 44)
('MY', 42)
('OF', 36)
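
One detail the sort leaves open is what happens when several words have the same count. Python's sort is stable, so tied words simply keep the order they had before sorting. If a predictable order is wanted, a tuple key can break ties alphabetically. A sketch with made-up counts:

```python
# Made-up counts, for illustration only.
word_counter = {"whale": 3, "sea": 5, "ship": 3, "captain": 3, "boat": 1, "sail": 5}

# Sort by count (highest first), then alphabetically among equal counts.
sorted_words = sorted(word_counter.items(), key=lambda x: (-x[1], x[0]))
top_words = [word for word, count in sorted_words[:5]]
print(top_words)  # ['sail', 'sea', 'captain', 'ship', 'whale']
```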

If you are allowed to use a regex, you can build words_list more easily. Words like The and the will still count as different words unless you change everything to uppercase or lowercase first.

I put some other words in my English text like: 'Ödipus', 'Müttern', 'über', 'Vätern', 'dächten', 'wäre', 'naïve' just to see how the regex coped with them. No problems!

import re

# allow for other characters than a-zA-Z   
e = re.compile(r'\b[A-Za-züäÜÄÖöï]+\b')
words_list = e.findall(words)
# carry on from here as above, but no need to lose numbers and punctuation
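
As an alternative to listing the accented characters one by one, the pattern below matches any Unicode letter. This is a sketch with a made-up sample sentence, not a replacement for the pattern above.

```python
import re

words = "The Ödipus story: the naïve über-hero, 1797!"

# [^\W\d_] matches any Unicode letter: a word character that is
# neither a digit nor an underscore.
pattern = re.compile(r"[^\W\d_]+")
words_list = pattern.findall(words.lower())
print(words_list)
# ['the', 'ödipus', 'story', 'the', 'naïve', 'über', 'hero']
```

Lowercasing before matching makes The and the count as the same word, and the number 1797 is dropped because digits are excluded from the pattern.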
