Python Forum

Return five most frequent words
Hello, I'm pretty new to coding and have started attending an online course on programming techniques. I'm stuck on one part of an assignment where I have to return the 5 most frequent words in a text file, disregarding any stop words. I'm not allowed to use any modules. My function only returns the index numbers and not the actual words. Does anyone have any tips on what I can do?

# Returns 5 most frequent words in a text
def important_words(an_index, stop_words):
    mydict = index_text(an_index)   # takes the result from index_text function
    all_words = []                  # Empty list that will collect all the words from an_index

    # Remove stop_words from the text
    for item in stop_words:
        if item in an_index:
            del an_index[item]
    
    # Combine all the words into a single list
    for key in mydict:
        all_words.extend(mydict[key])    
        
    # Count occurrences of each word
    word_counter = {}                    # This dictionary stores words as keys and their respective numbers as values
    for word in all_words:
        if word not in stop_words:       # If the word is a stop-word, it's ignored and not included in word_counter
            if word in word_counter:    
                word_counter[word] += 1  # If the word has already been inserted into the dictionary, it adds to the count by 1
            else:
                word_counter[word] = 1   # If the word isn't already in word_counter, it will be added
                
    sorted_words = sorted(word_counter.items(), key=lambda x: x[1], reverse=True)   # Sorts the (word, count) tuples by x[1], the second element of each tuple (the count).
    # reverse=True puts the largest counts first.
    top_words = [word[0] for word in sorted_words[:5]]
    return top_words   # Returns the five most frequent words in the text
Possibly, you're over-engineering this.

I would start with something basic, like this:

text = """This is a text object.
   This object has many words. I will now count
   the words and return most frequent ones, in
   ascending order."""

word_list = [word for word in text.split()]

word_list.sort()

for word in word_list:
    print(word)
... and then figure out how to count the occurrences of the words in that list. Note that you'll need to figure out how to treat "object" and "object." as the same word in the above example, because the trailing period means the two strings are not equal.

There may be a flaw in my crude example that makes it unworkable, but it's a starting point and (like I say) where I would start from.

For the so-called "stop words", simply iterate over the word list, and remove them. That way, they'll not form part of the counting process.
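
A minimal sketch of that clean-up step, using no modules. The sample text, the stop-word list and the set of punctuation characters are assumptions made up for illustration; adjust them to the assignment.

```python
# Hypothetical sample data: the text, stop words and punctuation set are assumptions.
text = "This is a text object. This object has many words."
stop_words = ["this", "is", "a"]
punctuation = ".,;:!?\"'()"

cleaned = []
for raw in text.split():
    # Strip punctuation from both ends and lowercase,
    # so "object." and "object" (or "This" and "this") compare equal.
    word = raw.strip(punctuation).lower()
    # Skip empty strings and stop words so they never reach the counting step.
    if word and word not in stop_words:
        cleaned.append(word)

print(cleaned)  # ['text', 'object', 'object', 'has', 'many', 'words']
```

Note that strip() only removes characters at the ends of a word, so an apostrophe inside a word such as "you'll" is left alone.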
Hi,
As Rob stated, you have to get rid of the punctuation.
Also take upper and lower case into account (see upper() and lower()).
I'm also assuming numbers count as "words".
Max 4 lines of code to do this. :)
Paul
The question has to do with why the function returns a list of numbers instead of a list of words. We can assume that the original word list has already been processed so there is no punctuation and capitalization issues are resolved before calling the function.

You did not include the code for the index_text() function, but I think the mistake may be there. Looking at your code, logic dictates that index_text returns an index: word dictionary. When I write the function so that it returns a word: index dictionary, your code returns a list of index numbers. So I wrote the function like this:
text = "i will now count the words and return the most frequent words in ascending order".split()
an_index = list(range(len(text)))

def index_text(index):
    return {text[i]: i for i in index}
This uncovers other errors. For instance, this code does nothing:
    for item in stop_words:
        if item in an_index:
            del an_index[item]
By the time this code runs, your function is done using an_index.

This code is a problem. Read about list.extend() and list.append(). You are using the wrong one.
    # Combine all the words into a single list
    for key in mydict:
        all_words.extend(mydict[key])    
If mydict[key] is a string, extend(mydict[key]) treats it as a sequence of letters and appends each letter to all_words separately. You want append(mydict[key]). If you read about dictionaries you would see there is a method, values(), that returns all the dictionary values, so you don't need this loop at all.
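
To see the difference, here is a small sketch. The dictionary contents are made up for illustration and assume index_text() returns an index: word mapping, which may not match the real function.

```python
# extend() iterates over its argument; a string is iterated letter by letter.
letters = []
letters.extend("count")
print(letters)      # ['c', 'o', 'u', 'n', 't']

# append() adds its argument as one single element.
whole = []
whole.append("count")
print(whole)        # ['count']

# Hypothetical index: word dictionary standing in for the result of index_text().
mydict = {0: "the", 1: "words", 2: "the"}

# values() gives every value in the dictionary, so no loop is needed.
all_words = list(mydict.values())
print(all_words)    # ['the', 'words', 'the']
```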

Reading about dictionaries would also uncover a cleaner way to do this:
            if word in word_counter:    
                word_counter[word] += 1  # If the word has already been inserted into the dictionary, it adds to the count by 1
            else:
                word_counter[word] = 1   # If the word isn't already in word_counter, it will be added
An if/else block like this in Python often indicates there is a simpler way. There is no error in this code; it is just longer than it needs to be.
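
One possible shorter version uses the dictionary method get(), which returns a default value when the key is missing. This is only a sketch: top_five is a hypothetical name, and it assumes the words have already been collected into a plain list of strings.

```python
def top_five(words, stop_words):
    """Return up to five of the most frequent words, ignoring stop words."""
    word_counter = {}
    for word in words:
        if word not in stop_words:
            # get() returns 0 the first time a word is seen, so no if/else is needed.
            word_counter[word] = word_counter.get(word, 0) + 1
    # Sort the (word, count) pairs by count, largest first.
    sorted_words = sorted(word_counter.items(), key=lambda pair: pair[1], reverse=True)
    return [word for word, count in sorted_words[:5]]


words = "i will now count the words and return the most frequent words in ascending order".split()
print(top_five(words, ["i", "the", "and", "in"]))
# ['words', 'will', 'now', 'count', 'return']
```

Only "words" appears twice here; the other four are tied at one occurrence each and come out in the order they first appear, because Python's sort is stable.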
Thank you for the help. I realized that the course I'm taking is not very beginner-friendly, and that this task was too much for me at the moment. I'm grateful that you guys tried to explain this to me anyway!