Counting the most relevant words in a text file
#1
I am totally new to Python, and as a personal study project I have recently been struggling to read and analyse a text file.

I have built this code based on some examples I saw online, but it is just not quite working. I describe in the comments what the goal of each step is. Does anyone know why it is not working? And how could I fill in the empty steps?

Here is my code:

 
import re
import collections
from collections import Counter
import nltk
from nltk.corpus import stopwords


# Open text file to look for the most common words
tex = open("SchindlerListEng.txt","rt")

# List of not relevant words in english
stopwords = stopwords.words('english')

# Instantiate a dictionary. For every word in the file, add it to the
# dictionary if it doesn't exist; if it does, increase the count.
wordcount = {}

for line in tex:
    for word in line.split():
        word = word.replace(".", "")
        word = word.replace(",", "")
        word = word.replace(":", "")
        word = word.replace("\"", "")
        word = word.replace("!", "")
        word = word.replace("“", "")
        word = word.replace("‘", "")
        word = word.replace("*", "")
        if word not in stopwords:
            if word not in wordcount:
                wordcount[word] = 1
        else:
            wordcount[word] += 1

# Remove words within brackets [] - Not sure how to do it
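# (One suggestion I have seen is to strip bracketed spans with a regex
# before splitting each line, e.g. line = re.sub(r"\[[^]]*\]", "", line),
# but I have not tested this.)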

# Create a new list that contains punctuation we wish to clean. I also added new words to clean
punctuations = ['(',')',';',':','[',']',',','.','!','-','?',"''",'...','mr','mrs','one','two','said','PLASZOW','That','This','FACTORY',"'s",'And',"'ve","n't","'d",'What','There','EXT','INT','SCHINDLER','GOETH','STERN','A','one','three','I','The','He','It','DAY',"'re",'They','You',"'ll","'m",'``',"'S",'NIGHT','BRINNLITZ','As']

# Create a list comprehension that returns only the words that are NOT in stopwords and NOT in punctuations.
keywords = [word for word in tex if not word in stopwords and not word in punctuations]

# Count the number of times that each relevant word appears
kcounter = Counter(keywords)

#Show the top 100
print (kcounter.most_common(100))

#Export the top 100 counter to a table format ( Word | Count ) to a CSV file
The output I am getting is:
[Image: 9Wy6N.png]

An example of the output I would like to get: this is one I got by analysing a PDF that had similar content to my text file.
[Image: FLVb4.png]

On top of that, I would like to export the output in a table format (Word | Count), but I have no idea how to write that part.
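From examples of the csv module, I imagine something along these lines, though I have not managed to get it working myself (the output file name is just a placeholder):

import csv

# write the top 100 (word, count) pairs as a two-column table
with open("top_words.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Word", "Count"])
    writer.writerows(kcounter.most_common(100))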

About the packages

From what I studied for this project, NLTK is a package focused on providing functions for working with human language. stopwords.words('english') locates all the words in the English language that are not very relevant when evaluating a text; we can understand these as frequent words such as "I", "it", "the" and others.
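For example, a quick check in the interpreter (this assumes the stopwords corpus has already been fetched once with nltk.download('stopwords')):

from nltk.corpus import stopwords

english_stops = stopwords.words('english')
print(english_stops[:5])  # ['i', 'me', 'my', 'myself', 'we']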

collections is the package that provides the function Counter(), which will count how many times each word appears in the text.
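For example:

from collections import Counter

counts = Counter("the boy and the dog".split())
print(counts.most_common(2))  # [('the', 2), ('boy', 1)]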

I am using Python 3.8

PS: I tried to post on Stack Overflow, but I am probably too much of a beginner, and the question wasn't quite suitable there.

I much appreciate any help!!
#2
import collections

# characters to strip from the ends of each word
punctuation = [' ', '(', ')', ';', ':', '[', ']', ',', '.', '!', '-', '?', "'", '"']
# a minimal stop word list for demonstration
stopwords = ['', 'a', 'A', 'I', 'am', 'the', 'he', 'her']
wordcount = collections.Counter()
with open("debug.txt", "rt") as tex:
    for line in tex:
        for word in line.split():
            # strip leading/trailing punctuation, one character at a time
            for p in punctuation:
                word = word.strip(p)
            if word not in stopwords:
                wordcount.update([word])

print(wordcount)
I think regex would be a better choice for stripping out punctuation. An exercise left for the reader.
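A minimal sketch of the regex version, using the same sample stop word list (the pattern is just one possible choice):

import re
from collections import Counter

stopwords = {'', 'a', 'i', 'am', 'the', 'he', 'her'}
word_re = re.compile(r"[a-z']+")  # runs of lowercase letters and apostrophes

wordcount = Counter()
with open("debug.txt", "rt") as tex:
    for line in tex:
        # findall extracts word-like tokens, dropping punctuation entirely
        tokens = word_re.findall(line.lower())
        wordcount.update(t for t in tokens if t not in stopwords)

print(wordcount.most_common(10))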
#3
(Sep-20-2020, 04:34 PM)deanhystad Wrote:

Hi Dean, thank you very much!
I believe that will work perfectly. I will have to read more about regex to understand the difference.
Thanks again!