Python Forum
Counting the most relevant words in a text file
#1
I am totally new to Python, and recently I have been struggling to read a text file, as a personal study project.

I built this code based on some examples I found online, but it is just not quite working. I describe in the comments what the goal of each part is. Does anyone know why it is not working? And also, how could I fill in the empty lines?

Here is my code:

 
from collections import Counter
from nltk.corpus import stopwords


# Open text file to look for the most common words
tex = open("SchindlerListEng.txt","rt")

# List of not relevant words in english
stopwords = stopwords.words('english')

# Instantiate a dictionary; for every word in the file,
# add it if it doesn't exist yet, otherwise increase its count.
wordcount = {}

for line in tex:
    for word in line.split():
        word = word.replace(".", "")
        word = word.replace(",", "")
        word = word.replace(":", "")
        word = word.replace("\"", "")
        word = word.replace("!", "")
        word = word.replace("“", "")
        word = word.replace("‘", "")
        word = word.replace("*", "")
        # The else must pair with the inner if; otherwise a stop word
        # reaches wordcount[word] += 1 before the key exists (KeyError)
        if word not in stopwords:
            if word not in wordcount:
                wordcount[word] = 1
            else:
                wordcount[word] += 1

tex.close()

# Remove words within brackets [] - Not sure how to do it

# Create a new list that contains punctuation we wish to clean. I also added new words to clean
punctuations = ['(',')',';',':','[',']',',','.','!','-','?',"''",'...','mr','mrs','one','two','said','PLASZOW','That','This','FACTORY',"'s",'And',"'ve","n't","'d",'What','There','EXT','INT','SCHINDLER','GOETH','STERN','A','one','three','I','The','He','It','DAY',"'re",'They','You',"'ll","'m",'``',"'S",'NIGHT','BRINNLITZ','As']

#Keep only the counted words that are NOT IN punctuations (stop words were already
#filtered above; note the file object can't be looped over a second time once read).
keywords = {word: count for word, count in wordcount.items() if word not in punctuations}

#Count the number of times each relevant word appears
kcounter = Counter(keywords)

#Show the top 100
print (kcounter.most_common(100))

#Export the top 100 counter to a table format ( Word | Count ) to a CSV file
The output I am getting is:
[Image: 9Wy6N.png]

An example of the output I would like to get - this is the one I got by analysing a PDF that had similar content to my text file.
[Image: FLVb4.png]

But on top of that, I would like to export the output in a table format ( Word | Count ), and I have no idea how to write that part of the code.

About the packages

From what I studied for this project, NLTK is a package focused on providing functions for human language. stopwords.words('english') returns all the words in the English language that are not very relevant when evaluating a text: frequent words such as "I", "it", "the" and others.

collections is the package that provides Counter(), which counts how many times each word appears in the text.
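A tiny example of what I mean by counting (the sentence here is made up):

```python
from collections import Counter

# Count word frequencies in a small sample sentence
words = "the cat sat on the mat near the door".split()
counts = Counter(words)
print(counts.most_common(2))  # [('the', 3), ('cat', 1)]
```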

I am using Python 3.8

PS: I tried to post on Stackoverflow, but I am probably too much of a beginner, and the question wasn't quite suitable there.

I much appreciate any help!!
#2
import collections
 
punctuation = [' ', '(', ')', ';', ':', '[' ,']', ',', '.', '!', '-', '?', "'", '"']
stopwords = ['', 'a', 'A', 'I', 'am', 'the', 'he', 'her']
wordcount = collections.Counter()
with open("debug.txt", "rt") as tex:
    for line in tex:
        for word in line.split():
            for p in punctuation:
                word = word.strip(p)
            if word not in stopwords:
                wordcount.update([word])

print(wordcount)
I think regex would be a better choice for stripping out punctuation. An exercise left for the reader.
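In case it helps, a rough sketch of that regex route, plus the CSV export asked about in the first post. The sample text, the small stop-word set, and the "wordcount.csv" file name are all made up for illustration; the real script would read the actual file and use the NLTK stop words.

```python
import csv
import re
from collections import Counter

# Hand-picked stop words for this sketch; the real script would use
# nltk.corpus.stopwords.words('english')
stop_words = {'a', 'i', 'an', 'is', 'the', 'and', 'of', 'to'}

# Stand-in for the file contents
text = "The list is an absolute good. [INT. FACTORY] The list is life."

text = re.sub(r'\[.*?\]', ' ', text)          # remove words within brackets []
words = re.findall(r"[a-z']+", text.lower())  # keep only letters and apostrophes
wordcount = Counter(w for w in words if w not in stop_words)

print(wordcount.most_common(2))  # [('list', 2), ('absolute', 1)]

# Export as Word | Count rows to a CSV file
with open("wordcount.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Word", "Count"])
    writer.writerows(wordcount.most_common(100))
```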
#3
(Sep-20-2020, 04:34 PM)deanhystad Wrote: (code from post #2, quoted above)

Hi Dean, thank you very much!
I believe that will work perfectly. I will have to read more about regex to understand the difference.
Thanks again!


