Jun-06-2019, 09:30 PM
(This post was last modified: Jun-06-2019, 09:30 PM by Drone4four.)
I asked a question on Stack Overflow titled: “Filtering stop words out of a large text file (using package: nltk.corpus)”. Within about an hour, I got more than one answer. This is unusual because in my past experience with SO, my questions usually got locked right away as duplicates or for violating one of the other stupid terms of service on SO. So I’m surprised my question actually has merit. Anyway, based on the two answers I received there, I’ve got an updated working script which now properly filters out "i", "it", "you", "and", "that". But I am not sure why it works. I’ve got some more follow-up questions on Python syntax but can’t ask them on SO, so I’ll ask them here on Python Forum.
Here is my working script as it appears today:
from collections import Counter
from nltk.corpus import stopwords
import re

def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text

def main(text):
    stoplist = stopwords.words('english')
    stoplist.extend(["said", "gutenberg", "could", "would"])
    # clean = [word.lower() for word in re.split(r"\W+", text) if word not in stoplist]
    clean = []
    for word in re.split(r"\W+", text):
        if word not in stoplist:
            clean.append(word)
    top_10 = Counter(clean).most_common(10)
    for word, count in top_10:
        print(f'{word!r:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
    text = open_file()
    main(text)

Here is my attempt at explaining, in plain, readable English prose, each line of my Python script above:
Quote:
1. From the collections library, import the Counter class
2. From inside the natural language toolkit's corpus package, import stopwords (a function? or a method?)
3. Import the re module (Python's built-in regular expression module)
4. Whitespace
5. This line defines the open_file function
6. Open the Alice in Wonderland text file as the variable f
7. Declare a text variable by reading and lower-casing the contents of f
8. Return the text variable whenever open_file() is called in the future
9. Whitespace
10. This line defines the main function, with the text variable submitted as a parameter
11. Declare a variable called stoplist based on the English list of words found within the stopwords method (or is this a function?) imported earlier
12. The list of stopwords (part of the stoplist variable) is extended by adding words such as: "said", "gutenberg", "could", "would"
13. This line (commented out) is the loop below written in list comprehension form, which is beyond my novice understanding at this point, so I rewrote it and spread it out as a simple loop in the following 4 lines
14. Declare an empty list as a variable called clean
15. For every word inside the text variable (split into a list)
16. If the word is not inside the stoplist, then:
17. Append the word to the clean list. This effectively re-creates the book in list form without the most commonly used words (found inside the stoplist generated earlier)
18. Declare a new variable called top_10, which represents the clean list of words submitted for counting inside the Counter class
19. For the two iterables inside the top_10 variable:
20. Print the first iterable (word) and the second iterable (count), divided with a nice --> in between the two
21. Whitespace
22. Just a silly programming convention
23. Declare the text variable based on the value returned by the open_file() function
24. Call the main function with the newly created text variable
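To check my reading of lines 15–17 (and the r"\W+" pattern I ask about below), I ran the split on a small made-up sentence (the sample text here is my own, not from Alice.txt):

```python
import re
from collections import Counter

# A small made-up sample to see what re.split(r"\W+", ...) actually does.
text = "alice said: 'curiouser, and curiouser!'"

# \W matches any character that is NOT a word character (letters, digits, _),
# and + means "one or more in a row", so each run of punctuation/whitespace
# becomes a single split point.
words = re.split(r"\W+", text)
print(words)  # ['alice', 'said', 'curiouser', 'and', 'curiouser', '']

# Note the trailing '': re.split leaves an empty string when the text
# ends with a non-word character.

# re.findall(r"\w+", ...) is the mirror image: instead of splitting on
# non-word runs, it collects the word runs directly, with no empty strings.
words2 = re.findall(r"\w+", text)
print(words2)  # ['alice', 'said', 'curiouser', 'and', 'curiouser']

# Counting works the same either way:
print(Counter(words2).most_common(2))  # [('curiouser', 2), ('alice', 1)]
```

So splitting on non-word runs (\W+) and collecting word runs (\w+) give essentially the same word list, which seems to be why either approach works in the script.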
Here are my questions for all of you at this point:
1. Could someone on this forum verify the accuracy of my above plain-English interpretation of my Python script? The two problem areas I can identify are lines 15 and 20.
2. Python’s official regex doc may as well be Chinese to me because it’s written by programmers for programmers, not for novices such as myself. On Google I came across a related SO question titled: “Python regex - (\w+) results different output when used with complex expression”, but I don’t understand this either. Would someone kindly explain what r"\W+" does at line 15?
3. Why were we previously re-joining the list of clean text with clean_text = ' '.join(clean) and then finding all instances of "\W+" in text? I believe this was @ichabod801’s initial suggestion earlier.
Here is my final, successful output:
Quote:'alice' --> 403
'little' --> 128
'one' --> 106
'know' --> 88
'project' --> 87
'like' --> 85
'went' --> 83
'queen' --> 75
'thought' --> 74
'time' --> 71
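As a side check on line 13 of the script (the commented-out list comprehension that I said was beyond me), I tried both versions on a toy list to convince myself they build the same thing (the sample words here are made up):

```python
stoplist = ['the', 'and', 'a']
words = ['the', 'rabbit', 'and', 'the', 'queen']

# The spread-out loop, as in lines 14-17 of my script:
clean = []
for word in words:
    if word not in stoplist:
        clean.append(word)

# The one-line comprehension: same filter, read left to right as
# "word, for each word in words, if word is not in stoplist".
clean2 = [word for word in words if word not in stoplist]

print(clean)            # ['rabbit', 'queen']
print(clean == clean2)  # True
```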
Here, @snippsat managed to centre-align the output using an awesome technique available with f-strings. I’ve used @snippsat’s f-string line verbatim, yet my output looks messy and inconsistent. Here is a screenshot of my terminal showing my output on top (formatted incorrectly) and @snippsat’s output at the bottom (formatted correctly):
How can I make my output more consistent and appear tidier like @snippsat’s?
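I'm not sure this is the cause of my messy columns, but here is a small experiment I ran with the width part of the format spec (the words and counts are placeholders). With a width of only 4, '<4' adds no padding because every word is already longer than 4 characters, so the columns drift; padding to a width wider than the longest word keeps them flush:

```python
rows = [("'alice'", 403), ("'little'", 128), ("'one'", 106)]

# Width 4: every word is longer than 4 chars, so '<4' never pads
# and the arrow column shifts with each word's length.
for word, count in rows:
    print(f'{word:<4} {"-->":^4} {count:>4}')

# Width 12: each word is padded out to 12 chars, so the arrows line up.
for word, count in rows:
    print(f'{word:<12} {"-->":^4} {count:>4}')
```

In a format spec like {word:<12}, the number is a minimum field width: shorter values get padded with spaces up to it, longer values are printed as-is.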