Sep-20-2020, 02:01 PM
(This post was last modified: Sep-20-2020, 02:02 PM by caiomartins.)
I am totally fresh to Python and recently I have been struggling to read a text file. As a personal studying project.
I have "built" this code based on some examples I saw only, but it is just not quite working. I am describing on the code, what the goal would be for each line. Does anyone know why is it not working? And also how could I fill the empty lines?
Here it follows my code:
![[Image: 9Wy6N.png]](https://i.stack.imgur.com/9Wy6N.png)
An example of the output I would like to get - this is the one I got by analysing a PDF, that had similar content to my text file.
![[Image: FLVb4.png]](https://i.stack.imgur.com/FLVb4.png)
But on top of that, I would like to export the output in a table format ( Word | Count ), but for that one, I have no idea of how to write the line of code.
About the packages
From what I studied for this project, NLTK is a package focused on providing functions to human language. Stopwords(text,'english') would locate all the words in the English language that are not very relant when evaluating a text, we can understand as "frequent words such as "I, It, the" and others.
Collections is the package to call the function counter(), that will count how many times each word appears in the text.
I am using Python 3.8
PS: I tried to post on Stackoverflow, but I am probably too much of a beginner, and the question wasn't quite suitable there.
I much appreciate any help!!
I have "built" this code based on some examples I saw only, but it is just not quite working. I am describing on the code, what the goal would be for each line. Does anyone know why is it not working? And also how could I fill the empty lines?
Here it follows my code:
import re import collections from collections import Counter import nltk from nltk.corpus import stopwords # Open text file to look for the most common words tex = open("SchindlerListEng.txt","rt") # List of not relevant words in english stopwords = stopwords.words('english') # Instantiate a dictionary, and for every word in the file. # Add to the dictionary if it doesn't exist. If it does, increase the count. wordcount = {} for line in tex: for word in line.split(): word = word.replace(".","") word = word.replace(",","") word = word.replace(":","") word = word.replace("\"","") word = word.replace("!","") word = word.replace("“","") word = word.replace("‘","") word = word.replace("*","") if word not in stopwords: if word not in wordcount: wordcount[word] = 1 else: wordcount[word] += 1 # Remove words within brackets [] - Not sure how to do it # Create a new list that contains punctuation we wish to clean. I also added new words to clean punctuations = ['(',')',';',':','[',']',',','.','!','-','?',"''",'...','mr','mrs','one','two','said','PLASZOW','That','This','FACTORY',"'s",'And',"'ve","n't","'d",'What','There','EXT','INT','SCHINDLER','GOETH','STERN','A','one','three','I','The','He','It','DAY',"'re",'They','You',"'ll","'m",'``',"'S",'NIGHT','BRINNLITZ','As'] #Create a list comprehension that only returns a list of words that are NOT IN stop_words and NOT IN punctuations. keywords = [word for word in tex if not word in stopwords and not word in punctuations] #Count the number of times that a relevant word appear kcounter = Counter(keywords) #Show the top 100 print (kcounter.most_common(100)) #Export the top 100 counter to a table format ( Word | Count ) to a CSV fileThe output I am getting is:
![[Image: 9Wy6N.png]](https://i.stack.imgur.com/9Wy6N.png)
An example of the output I would like to get - this is the one I got by analysing a PDF, that had similar content to my text file.
![[Image: FLVb4.png]](https://i.stack.imgur.com/FLVb4.png)
But on top of that, I would like to export the output in a table format ( Word | Count ), but for that one, I have no idea of how to write the line of code.
About the packages
From what I studied for this project, NLTK is a package focused on providing functions to human language. Stopwords(text,'english') would locate all the words in the English language that are not very relant when evaluating a text, we can understand as "frequent words such as "I, It, the" and others.
Collections is the package to call the function counter(), that will count how many times each word appears in the text.
I am using Python 3.8
PS: I tried to post on Stackoverflow, but I am probably too much of a beginner, and the question wasn't quite suitable there.
I much appreciate any help!!