Analyzing large text file with nltk.corpus (stopwords) - Printable Version
Python Forum (https://python-forum.io) > Python Coding > General Coding Help > Thread: Analyzing large text file with nltk.corpus (stopwords) (/thread-18785.html)
Analyzing large text file with nltk.corpus (stopwords) - Drone4four - May-31-2019

I’ve got a script, which @snippsat helped me with previously, that ranks the top 10 most commonly used words in a large public-domain book such as Alice in Wonderland. The text file is attached to this forum post. Here is the script as it appears now:

from collections import Counter
import re

def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text

def main(text):
    words = re.findall(r'\w+', text)
    top_10 = Counter(words).most_common(10)
    for word, count in top_10:
        print(f'{word:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
    text = open_file()
    main(text)

When I run the script, here is the correct, predictable and expected output:

Quote:
$ python script.py

Great! Now I am trying to extend this script by filtering out all the common words (such as ‘the’, ‘and’, ‘to’). I’m using a guide called “Using NLTK To Remove Stopwords From A Text File”. Here is my latest iteration of my script (with new code added at lines 3, 11 and 12):

from collections import Counter
from nltk.corpus import stopwords
import re

def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text

def main(text):
    stoplist = stopwords.words('english')  # Bring in the default English NLTK stop words
    clean = [word for word in text.split() if word not in stoplist]
    words = re.findall(r'\w+', clean)
    top_10 = Counter(words).most_common(10)
    for word, count in top_10:
        print(f'{word:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
    text = open_file()
    main(text)

Here is my traceback:

Quote:
$ python daniel_script6_with_snippsat.py

What is this traceback trying to say? The list comprehension psychs me out. I’m not sure how to convert it back to a regular nested for loop. I’m not even sure if this is the issue.
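The counting pipeline in the first script above can be exercised on a tiny input to see how Counter.most_common behaves; the sample string below is made up, standing in for the contents of Alice.txt:

```python
from collections import Counter
import re

# Made-up sample text standing in for the contents of Alice.txt
sample = "the cat saw the hat and the cat ran"

# Same pipeline as the script: lowercase, extract word runs, count
words = re.findall(r'\w+', sample.lower())
top_2 = Counter(words).most_common(2)  # [(word, count), ...] sorted by count
print(top_2)  # [('the', 3), ('cat', 2)]
```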
Here is my attempt at converting what I have at line 12 into a regular nested for loop:

clean =
for word in text.split():
    if word not in stoplist:
        word +=

So I suppose my three questions are:
RE: Analyzing large text file with nltk.corpus (stopwords) - ichabod801 - May-31-2019

1. clean is a list, and re wants a string.
2. Convert clean back to a string: clean_text = ' '.join(clean).
3. It's not a nested loop, it's a simple loop. Note that

data = [operation(item) for item in sequence if condition]

converts to:

data = []
for item in sequence:
    if condition:
        data.append(operation(item))

RE: Analyzing large text file with nltk.corpus (stopwords) - Drone4four - May-31-2019

Thank you, @ichabod801! Converting clean into a string as you described eliminates the traceback. The script runs now. However, common words are still showing up in the top 10. Here is the output:

Quote:
$ python3 script6.py

I’m not sure why. Could the issue be with the way I’ve joined my list items together? Any other ideas? As a reminder, so you don't need to go digging for it, here is the original list comprehension line:

clean = [word for word in text.split() if word not in stoplist]

Here is your pseudocode:

data = []
for item in sequence:
    if condition:
        data.append(operation(item))

Here is my attempt at translating your helpful pseudocode onto the list comprehension line above:

clean = []
for word in text.split():
    if word not in stoplist:
        clean.append  # not sure what else to put here

I’m fairly confident lines 2 and 3 are correct, but I’m not sure exactly how to form the final line. What else might you recommend, @ichabod801? I’m also not sure whether it is necessary to declare clean as an empty list at line 1.
RE: Analyzing large text file with nltk.corpus (stopwords) - ichabod801 - Jun-01-2019

It could be the way you joined the words, but I'm not sure how you did that, so I don't know. It could be that the words are not what they appear to be (try printing the repr of the words), or that stopwords is not what you expect. I would do a check and see whether those words actually are in stopwords. As for the loop, you want clean.append(word). And you do need to start with an empty list; otherwise you have nothing to append to.
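Putting ichabod801's pieces together, the comprehension-to-loop conversion can be verified on a small made-up example (the numbers and the doubling operation below are illustrative only):

```python
sequence = [1, 2, 3, 4, 5]

# Comprehension form: data = [operation(item) for item in sequence if condition]
comp = [n * 10 for n in sequence if n % 2 == 0]

# Equivalent loop form
loop = []                    # start with an empty list to append to
for n in sequence:
    if n % 2 == 0:           # the comprehension's `if` clause
        loop.append(n * 10)  # append(operation(item))

print(comp, loop)  # [20, 40] [20, 40]
```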
RE: Analyzing large text file with nltk.corpus (stopwords) - Drone4four - Jun-01-2019

(Jun-01-2019, 12:53 AM)ichabod801 Wrote: As for the loop, you want

Thanks for the clarification.

Quote: It could be the way you joined the words, but I'm not sure how you did that, so I don't know.

I should have included my working script in my last post. I’m not sure how I missed that. Sorry for the confusion. Here it is now:

from collections import Counter
from nltk.corpus import stopwords
import re

def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text

def main(text):
    stoplist = stopwords.words('english')  # Bring in the default English NLTK stop words
    clean = [word for word in text.split() if word not in stoplist]
    clean_text = ' '.join(clean)
    words = re.findall(r'\w+', clean_text)
    top_10 = Counter(words).most_common(10)
    for word, count in top_10:
        print(f'{word:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
    text = open_file()
    main(text)

Here is the output again:

Quote:
$ python3 script6.py

I would think “it” would be included in the stopword library for sure. I suspect the reason it isn't filtered has to do with the splitting and rejoining of “it” and “s” for the instances of “it’s” as they appear in Alice in Wonderland. If you look at the output above, “and” comes in at number 8. That can’t be right. I can’t explain what is going on, but perhaps you, @ichabod801, can better assess the output now that I have included my updated script here.

Quote: It could be that the words are not what they appear (try printing the repr of the words), or stop words is not what you expect. I would do a check and see if those words actually are in stopwords.

I’ll look into Python’s repr builtin for the words tomorrow. I’m off to bed for the evening. Thank you, @ichabod801, for your help and your patience so far!
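Drone4four's hypothesis about splitting and rejoining can be tested in miniature. The sketch below uses a hard-coded stand-in for the NLTK stoplist and a made-up sentence; it shows how whitespace tokens keep their punctuation, slip past the stoplist check, and then reappear as bare words once re.findall strips the punctuation:

```python
from collections import Counter
import re

# A few entries from NLTK's English stoplist, hard-coded here so the
# sketch runs without NLTK installed
stoplist = ['and', 'it', 'the', 'she']

text = "and the queen said, 'it, it, it!' and she ran"

# Step 1: filter whitespace tokens -- punctuation is still attached,
# so tokens like "'it," are NOT caught by the stoplist
clean = [w for w in text.split() if w not in stoplist]

# Step 2: re-join and tokenize -- punctuation is stripped only now,
# so the "filtered" word 'it' reappears in the counts
clean_text = ' '.join(clean)
words = re.findall(r'\w+', clean_text)
print(Counter(words).most_common(1))  # [('it', 3)]
```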
RE: Analyzing large text file with nltk.corpus (stopwords) - ichabod801 - Jun-01-2019

(Jun-01-2019, 03:04 AM)Drone4four Wrote: I’ll look into Python’s repr builtin for the words tomorrow. I’m off to bed for the evening.

Look at the f-string syntax. Using print(f'{word!r:<4} {"-->":^4} {count:>4}') should work.
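For readers unfamiliar with !r: it applies repr() to the value inside an f-string, which surrounds it with quotes and makes otherwise-invisible characters such as tabs visible (a minimal sketch with a made-up word):

```python
word = "alice\t"        # a word with a trailing tab character

# Plain interpolation: the tab is invisible in normal printing
print(f'{word}|end')    # alice	|end

# !r interpolation: repr() shows the quotes and the \t escape
print(f'{word!r}|end')  # 'alice\t'|end
```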
RE: Analyzing large text file with nltk.corpus (stopwords) - Drone4four - Jun-02-2019

I substituted my f-string for the one you suggested. I noticed right away that the output was the same as before. It’s identical. So I thought I must have executed the wrong file name in my shell. In the end I included both f-strings (the original and @ichabod801’s) on the same line, separated by a string of three pipes. Here is what my line 17 looks like now:

print(f'{word!r:<4} {"-->":^4} {count:>4} {"|||":^4} {word:<4} {"-->":^4} {count:>4}')

The first word variable on the left includes !r whereas the word variable on the right does not. Yet the output remains the same:

Quote:
$ python3 script.py

!r is not catching the apostrophes. I read the Python doc @ichabod801 shared and I understand some of it. I took it that !r should filter out apostrophes that are part of a word in a string (or, in my case, throughout a full-length book). So my hypothesis was that the “s” and “it” and “it’s” would be removed from the output. Apparently I need a new hypothesis. I’m all out of ideas. What do you all think the issue could be here? Remember: I’m trying to filter out all the common stopwords in a large text file so that the output shows the most commonly used nouns in the book Alice in Wonderland.
For what it’s worth, here is my entire script so far:

from collections import Counter
from nltk.corpus import stopwords
import re

def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text

def main(text):
    stoplist = stopwords.words('english')  # Bring in the default English NLTK stop words
    clean = [word for word in text.split() if word not in stoplist]
    clean_text = ' '.join(clean)
    words = re.findall(r'\w+', clean_text)
    top_10 = Counter(words).most_common(10)
    for word, count in top_10:
        print(f'{word!r:<4} {"-->":^4} {count:>4} {"|||":^4} {word:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
    text = open_file()
    main(text)

Thanks again, @ichabod801.

RE: Analyzing large text file with nltk.corpus (stopwords) - ichabod801 - Jun-02-2019

No, !r is causing the apostrophes. What it's showing is how the computer sees the words, and that the words are exactly what was expected. The !r would show whitespace or non-printing characters we would not see in the standard string print form, but it's not showing any such thing. So the problem is not with the words you are counting. You need to check stoplist to make sure it's what you think it is. You need to make sure the words you think shouldn't be in the output actually are in stoplist. Looking at the nltk book, they should be, so it's not clear to me what is going on. Print stoplist so we can be sure they're in there.
RE: Analyzing large text file with nltk.corpus (stopwords) - Drone4four - Jun-03-2019

Here are the contents of the generated list of stopwords:

Quote: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

As you can see, the very first list item is "i", which is inconsistent with the final output, because the script is still counting the instances of "i" in the Alice in Wonderland text file. I added "said", "i", "it", "you", "and" and "that" to my stoplist variable using this line:

stoplist.extend(["said", "i", "it", "you", "and", "that"])
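One way to reconcile the list above with the output is to test membership directly, as ichabod801 suggested. The sketch below hard-codes a subset of the quoted list; the curly-apostrophe case at the end is a hypothesis (Project Gutenberg texts often use typographic quotes), not something confirmed in this thread:

```python
# Subset of the NLTK English stopword list quoted above
stoplist = ['i', 'me', 'it', "it's", 'and', 's']

# The bare lowercase words are all present:
print('i' in stoplist, 'it' in stoplist, 's' in stoplist)  # True True True

# So if they still reach the output, the tokens being compared cannot be
# these bare strings. For example, a token containing a typographic
# (curly) apostrophe fails the membership test:
print('it’s' in stoplist)  # False: the list holds "it's" with a straight quote
```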
Even with the addition of these individual words, only "said" is omitted from the output. @ichabod801: I appreciate your help up to this point. I know you've said that it's not clear to you what the issue could be. Where else could I ask my question? Perhaps Stack Overflow would be a good next step. Here is the output now:

Quote:
$ python3 script.py

Here is what my script looks like now:

from collections import Counter
from nltk.corpus import stopwords
import re

def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text

def main(text):
    stoplist = stopwords.words('english')  # Bring in the default English NLTK stop words
    print(stoplist)
    stoplist.extend(["said", "i", "it", "you", "and", "that"])
    print(stoplist)
    clean = [word for word in text.split() if word not in stoplist]
    clean_text = ' '.join(clean)
    words = re.findall(r'\w+', clean_text)
    top_10 = Counter(words).most_common(10)
    for word, count in top_10:
        print(f'{word!r:<4} {"-->":^4} {count:>4} {"|||":^4} {word:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
    text = open_file()
    main(text)

For the sake of completeness, also find Alice.txt attached.

RE: Analyzing large text file with nltk.corpus (stopwords) - Drone4four - Jun-06-2019

I asked on Stack Overflow, in a question I titled “Filtering stop words out of a large text file (using package: nltk.corpus)”. Within about an hour, I got more than one answer. This is unusual, because in my past experience with SO, my questions usually got locked right away as duplicates or for violating some other term of service. So I’m surprised my question actually has merit. Anyway, based on the two answers I received there, I’ve got an updated working script which now properly filters out "i", "it", "you", "and" and "that", but I am not sure why. I’ve got some follow-up questions on Python syntax that I can’t ask on SO, so I’ll ask them here on Python Forum.
Here is my working script as it appears today:

from collections import Counter
from nltk.corpus import stopwords
import re

def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text

def main(text):
    stoplist = stopwords.words('english')
    stoplist.extend(["said", "gutenberg", "could", "would"])
    # clean = [word.lower() for word in re.split(r"\W+", text) if word not in stoplist]
    clean = []
    for word in re.split(r"\W+", text):
        if word not in stoplist:
            clean.append(word)
    top_10 = Counter(clean).most_common(10)
    for word, count in top_10:
        print(f'{word!r:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
    text = open_file()
    main(text)

Here is my attempt at explaining, in readable English prose, each line of my Python script above:

Quote: 1. From the collections library, import the Counter class

Here are my questions for all of you at this point:

1. Could someone on this forum verify the accuracy of my English interpretation of my Python script? The two problem areas I can identify are lines 15 and 20.
2. Python’s official regex doc may as well be Chinese to me, because it’s written by programmers for programmers, not for novices such as myself. On Google I came across a related SO question titled “Python regex - (\w+) results different output when used with complex expression”, but I don’t understand that either. Would someone kindly explain what r"\W+" does at line 15?
3. Why were we previously re-joining the list of clean text with clean_text = ' '.join(clean) and then finding all instances of "\w+" in the text? I believe this was @ichabod801’s initial suggestion earlier.

Here is my final, successful output:

Quote: 'alice' --> 403

Here @snippsat managed to get the output to centre-align using an awesome technique available with f-strings. I’ve used @snippsat’s f-string line verbatim, yet my output looks messy and inconsistent.
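On question 2 above: in a regular expression, \w matches a word character ([a-zA-Z0-9_] by default), \W matches any character that is not one, + means "one or more", and the r prefix makes the pattern a raw string so Python leaves the backslash alone. So re.split(r"\W+", text) splits on every run of punctuation or whitespace, which is why "it's" arrives at the stoplist check as the two bare tokens "it" and "s" (a minimal sketch on a made-up string):

```python
import re

sample = "it's late -- off with her head!"

# Split on runs of non-word characters; apostrophes, spaces and dashes
# all act as separators, so "it's" becomes "it" and "s"
parts = re.split(r"\W+", sample)
print(parts)  # ['it', 's', 'late', 'off', 'with', 'her', 'head', '']
```

Note the trailing empty string produced by the final "!". This also bears on question 3: the working script tokenizes first and filters second, so each bare fragment is checked against the stoplist, whereas the earlier version filtered whole whitespace tokens (punctuation still attached) and only stripped the punctuation afterwards.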
Here is a screenshot of my terminal showing my output on top (formatted incorrectly) with @snippsat’s output at the bottom (formatted correctly). How can I make my output more consistent and tidier, like @snippsat’s?