Python Forum
Counting the most relevant words in a text file - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Counting the most relevant words in a text file (/thread-29797.html)



Counting the most relevant words in a text file - caiomartins - Sep-20-2020

I am totally new to Python, and as a personal study project I have recently been struggling with reading a text file.

I have "built" this code based on some examples I saw online, but it is just not quite working. I describe in the code what the goal is for each line. Does anyone know why it is not working? And also, how could I fill in the empty lines?

Here is my code:

 
import re
import collections
from collections import Counter
import nltk
from nltk.corpus import stopwords


# Open text file to look for the most common words
tex = open("SchindlerListEng.txt","rt")

# List of not relevant words in english
stopwords = stopwords.words('english')

# Instantiate a dictionary, and for every word in the file. 
# Add to the dictionary if it doesn't exist. If it does, increase the count.
wordcount = {}

for line in tex:
        for word in line.split():
            word = word.replace(".","")
            word = word.replace(",","")
            word = word.replace(":","")
            word = word.replace("\"","")
            word = word.replace("!","")
            word = word.replace("“","")
            word = word.replace("‘","")
            word = word.replace("*","")
            if word not in stopwords:
                if word not in wordcount:
                    wordcount[word] = 1
            else:
                wordcount[word] += 1

# Remove words within brackets [] - Not sure how to do it
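# One guess (untested) would be to strip the bracketed text from each line
# before splitting it into words, e.g. line = re.sub(r"\[[^]]*\]", "", line)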

# Create a new list that contains punctuation we wish to clean. I also added new words to clean
punctuations = ['(',')',';',':','[',']',',','.','!','-','?',"''",'...','mr','mrs','one','two','said','PLASZOW','That','This','FACTORY',"'s",'And',"'ve","n't","'d",'What','There','EXT','INT','SCHINDLER','GOETH','STERN','A','one','three','I','The','He','It','DAY',"'re",'They','You',"'ll","'m",'``',"'S",'NIGHT','BRINNLITZ','As']

#Create a list comprehension that only returns a list of words that are NOT IN stop_words and NOT IN punctuations.
keywords = [word for word in tex if not word in stopwords and not word in punctuations]

#Count the number of times that a relevant word appear
kcounter = Counter(keywords)

#Show the top 100
print (kcounter.most_common(100))

#Export the top 100 counter to a table format ( Word | Count ) to a CSV file

The output I am getting is:
[Image: 9Wy6N.png]

Here is an example of the output I would like to get. This is the one I got by analysing a PDF that had similar content to my text file.
[Image: FLVb4.png]

On top of that, I would like to export the output in a table format ( Word | Count ) to a CSV file, but I have no idea how to write that part.
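My rough guess, using the csv module, is something like the lines below (the file name top100.csv is just made up), but I don't know if this is right:

import csv

# write a header row and then one (word, count) row per entry
with open("top100.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["Word", "Count"])
    writer.writerows(kcounter.most_common(100))   # most_common already returns (word, count) pairs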

About the packages

From what I studied for this project, NLTK is a package focused on providing functions for working with human language. stopwords.words('english') returns all the English words that are not very relevant when evaluating a text, which we can understand as frequent words such as "I", "it", "the" and others.
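For example, this is roughly how the list is loaded (the stopwords corpus has to be downloaded once with nltk.download('stopwords') first):

from nltk.corpus import stopwords

english_stopwords = stopwords.words('english')   # common words such as 'i', 'me', 'the', ...
print(len(english_stopwords))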

collections is the package that provides the Counter() class, which counts how many times each word appears in the text.
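For example, with some made-up words:

from collections import Counter

counts = Counter(['night', 'list', 'night', 'factory', 'night'])
print(counts.most_common(2))   # [('night', 3), ('list', 1)]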

I am using Python 3.8

PS: I tried to post on Stack Overflow, but I am probably too much of a beginner and the question wasn't quite suitable there.

I much appreciate any help!!


RE: Counting the most relevant words in a text file - deanhystad - Sep-20-2020

import collections
 
punctuation = [' ', '(', ')', ';', ':', '[' ,']', ',', '.', '!', '-', '?', "'", '"']
stopwords = ['', 'a', 'A', 'I', 'am', 'the', 'he', 'her']
wordcount = collections.Counter()
with open("debug.txt", "rt") as tex:
    for line in tex:
        for word in line.split():
            for p in punctuation:
                word = word.strip(p)
            if word not in stopwords:
                wordcount.update([word])

print(wordcount)
I think regex would be a better choice for stripping out punctuation. An exercise left for the reader.
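Something roughly like this, for example (untested), using re.findall to pull the words out instead of stripping characters one at a time:

import collections
import re

stopwords = ['', 'a', 'i', 'am', 'the', 'he', 'her']
wordcount = collections.Counter()
with open("debug.txt", "rt") as tex:
    for line in tex:
        # \w+ matches runs of letters/digits/underscore, so punctuation never makes it into a word
        for word in re.findall(r"\w+", line.lower()):
            if word not in stopwords:
                wordcount.update([word])

print(wordcount.most_common(10))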


RE: Counting the most relevant words in a text file - caiomartins - Sep-21-2020


Hi Dean, thank you very much!
I believe that will work perfectly. I will have to read more about regex to understand the difference.
Thanks again!