Counting the most relevant words in a text file
#1
I am totally new to Python, and as a personal study project I have recently been struggling to read and analyse a text file.

I have built this code based on some examples I saw online, but it is just not quite working. I describe in the comments what the goal of each step is. Does anyone know why it is not working? And how could I fill in the empty steps?

Here is my code:

 
import re
import collections
from collections import Counter
import nltk
from nltk.corpus import stopwords


# Open text file to look for the most common words
tex = open("SchindlerListEng.txt","rt")

# List of not relevant words in english
stopwords = stopwords.words('english')

# Instantiate a dictionary. For every word in the file, add it to the
# dictionary if it doesn't exist; if it does, increase the count.
wordcount = {}

for line in tex:
    for word in line.split():
        word = word.replace(".", "")
        word = word.replace(",", "")
        word = word.replace(":", "")
        word = word.replace("\"", "")
        word = word.replace("!", "")
        word = word.replace("“", "")
        word = word.replace("‘", "")
        word = word.replace("*", "")
        if word not in stopwords:
            if word not in wordcount:
                wordcount[word] = 1
        else:
            wordcount[word] += 1

# Remove words within brackets [] - Not sure how to do it
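# (One suggestion I have seen is to strip bracketed spans with a regex
# before splitting each line, e.g. line = re.sub(r"\[[^]]*\]", "", line),
# but I have not tested this.)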

# Create a new list that contains punctuation we wish to clean. I also added new words to clean
punctuations = ['(',')',';',':','[',']',',','.','!','-','?',"''",'...','mr','mrs','one','two','said','PLASZOW','That','This','FACTORY',"'s",'And',"'ve","n't","'d",'What','There','EXT','INT','SCHINDLER','GOETH','STERN','A','one','three','I','The','He','It','DAY',"'re",'They','You',"'ll","'m",'``',"'S",'NIGHT','BRINNLITZ','As']

# Create a list comprehension that returns only the words that are NOT in stopwords and NOT in punctuations.
keywords = [word for word in tex if not word in stopwords and not word in punctuations]

# Count the number of times that each relevant word appears
kcounter = Counter(keywords)

#Show the top 100
print (kcounter.most_common(100))

#Export the top 100 counter to a table format ( Word | Count ) to a CSV file
The output I am getting is:
[Image: 9Wy6N.png]

An example of the output I would like to get: this is one I got by analysing a PDF that had similar content to my text file.
[Image: FLVb4.png]

On top of that, I would like to export the output in a table format (Word | Count), but I have no idea how to write that part.
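From examples of the csv module, I imagine something along these lines, though I have not managed to get it working myself (the output file name is just a placeholder):

import csv

# write the top 100 (word, count) pairs as a two-column table
with open("top_words.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Word", "Count"])
    writer.writerows(kcounter.most_common(100))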

About the packages

From what I studied for this project, NLTK is a package focused on providing functions for working with human language. stopwords.words('english') locates all the words in the English language that are not very relevant when evaluating a text; we can understand these as frequent words such as "I", "it", "the" and others.
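For example, a quick check in the interpreter (this assumes the stopwords corpus has already been fetched once with nltk.download('stopwords')):

from nltk.corpus import stopwords

english_stops = stopwords.words('english')
print(english_stops[:5])  # ['i', 'me', 'my', 'myself', 'we']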

collections is the package that provides the function Counter(), which will count how many times each word appears in the text.
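For example:

from collections import Counter

counts = Counter("the boy and the dog".split())
print(counts.most_common(2))  # [('the', 2), ('boy', 1)]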

I am using Python 3.8

PS: I tried to post on Stack Overflow, but I am probably too much of a beginner, and the question wasn't quite suitable there.

I much appreciate any help!!
#2
import collections

# characters to strip from the ends of each word
punctuation = [' ', '(', ')', ';', ':', '[', ']', ',', '.', '!', '-', '?', "'", '"']
# a minimal stop word list for demonstration
stopwords = ['', 'a', 'A', 'I', 'am', 'the', 'he', 'her']
wordcount = collections.Counter()
with open("debug.txt", "rt") as tex:
    for line in tex:
        for word in line.split():
            # strip leading/trailing punctuation, one character at a time
            for p in punctuation:
                word = word.strip(p)
            if word not in stopwords:
                wordcount.update([word])

print(wordcount)
I think regex would be a better choice for stripping out punctuation. An exercise left for the reader.
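A minimal sketch of the regex version, using the same sample stop word list (the pattern is just one possible choice):

import re
from collections import Counter

stopwords = {'', 'a', 'i', 'am', 'the', 'he', 'her'}
word_re = re.compile(r"[a-z']+")  # runs of lowercase letters and apostrophes

wordcount = Counter()
with open("debug.txt", "rt") as tex:
    for line in tex:
        # findall extracts word-like tokens, dropping punctuation entirely
        tokens = word_re.findall(line.lower())
        wordcount.update(t for t in tokens if t not in stopwords)

print(wordcount.most_common(10))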
#3
(Sep-20-2020, 04:34 PM)deanhystad Wrote:

Hi Dean, thank you very much!
I believe that will work perfectly. I will have to read more about regex to understand the difference.
Thanks again!