Analyzing large text file with nltk.corpus (stopwords) - Printable Version
Python Forum (https://python-forum.io) > Python Coding > General Coding Help > Thread: Analyzing large text file with nltk.corpus (stopwords) (/thread-18785.html)
Analyzing large text file with nltk.corpus (stopwords) - Drone4four - May-31-2019

I’ve got a script, which @snippsat helped me with previously, that ranks the top 10 most commonly used words in a large public-domain book such as Alice in Wonderland. The text file is attached to this forum post. Here is the script as it appears now:

from collections import Counter
import re

def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text

def main(text):
    words = re.findall(r'\w+', text)
    top_10 = Counter(words).most_common(10)
    for word, count in top_10:
        print(f'{word:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
    text = open_file()
    main(text)

When I run the script, here is the correct, predictable and expected output:

Quote:
$ python script.py

Great! Now I am trying to extend this script by filtering out all the common words (such as ‘the’, ‘and’, ‘to’). I’m using a guide called “Using NLTK To Remove Stopwords From A Text File”. Here is my latest iteration of my script (with new code added at lines 3, 11 and 12):

from collections import Counter
from nltk.corpus import stopwords
import re

def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text

def main(text):
    stoplist = stopwords.words('english')  # Bring in the default English NLTK stop words
    clean = [word for word in text.split() if word not in stoplist]
    words = re.findall(r'\w+', clean)
    top_10 = Counter(words).most_common(10)
    for word, count in top_10:
        print(f'{word:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
    text = open_file()
    main(text)

Here is my traceback:

Quote:
$ python daniel_script6_with_snippsat.py

What is this traceback trying to say? The list comprehension psychs me out. I’m not sure how to convert it back to a regular nested for loop. I’m not even sure if this is the issue.
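The counting pipeline in the first script above can be exercised on a tiny input to see how Counter.most_common behaves; the sample string below is made up, standing in for the contents of Alice.txt:

```python
from collections import Counter
import re

# Made-up sample text standing in for the contents of Alice.txt
sample = "the cat saw the hat and the cat ran"

# Same pipeline as the script: lowercase, extract word runs, count
words = re.findall(r'\w+', sample.lower())
top_2 = Counter(words).most_common(2)  # [(word, count), ...] sorted by count
print(top_2)  # [('the', 3), ('cat', 2)]
```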
Here is my attempt at converting what I have at line 12 into a regular nested for loop:

clean =
for word in text.split():
    if word not in stoplist:
        word +=

So I suppose my three questions are:
RE: Analyzing large text file with nltk.corpus (stopwords) - ichabod801 - May-31-2019

1. clean is a list, and re wants a string.
2. Convert clean back to a string: clean_text = ' '.join(clean).
3. It's not a nested loop, it's a simple loop. Note that

data = [operation(item) for item in sequence if condition]

converts to:

data = []
for item in sequence:
    if condition:
        data.append(operation(item))

RE: Analyzing large text file with nltk.corpus (stopwords) - Drone4four - May-31-2019

Thank you, @ichabod801! Converting clean into a string as you described eliminates the traceback. The script runs now. However, common words are still showing up in the top 10. Here is the output:

Quote:
$ python3 script6.py

I’m not sure why. Could the issue be with the way I’ve joined my list items together? Any other ideas? As a reminder, so you don't need to go digging for it, here is the original list comprehension line:

clean = [word for word in text.split() if word not in stoplist]

Here is your pseudocode:

data = []
for item in sequence:
    if condition:
        data.append(operation(item))

Here is my attempt at translating your helpful pseudocode onto the list comprehension line above:

clean = []
for word in text.split():
    if word not in stoplist:
        clean.append  # not sure what else to put here

I’m fairly confident lines 2 and 3 are correct, but I’m not sure exactly how to form the final line. What else might you recommend, @ichabod801? I’m also not sure whether it is necessary to declare clean as an empty list at line 1.
RE: Analyzing large text file with nltk.corpus (stopwords) - ichabod801 - Jun-01-2019

It could be the way you joined the words, but I'm not sure how you did that, so I don't know. It could be that the words are not what they appear to be (try printing the repr of the words), or that stopwords is not what you expect. I would do a check and see whether those words actually are in stopwords. As for the loop, you want clean.append(word). And you do need to start with an empty list; otherwise you have nothing to append to.
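Putting ichabod801's pieces together, the comprehension-to-loop conversion can be verified on a small made-up example (the numbers and the doubling operation below are illustrative only):

```python
sequence = [1, 2, 3, 4, 5]

# Comprehension form: data = [operation(item) for item in sequence if condition]
comp = [n * 10 for n in sequence if n % 2 == 0]

# Equivalent loop form
loop = []                    # start with an empty list to append to
for n in sequence:
    if n % 2 == 0:           # the comprehension's `if` clause
        loop.append(n * 10)  # append(operation(item))

print(comp, loop)  # [20, 40] [20, 40]
```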
RE: Analyzing large text file with nltk.corpus (stopwords) - Drone4four - Jun-01-2019

(Jun-01-2019, 12:53 AM)ichabod801 Wrote: As for the loop, you want

Thanks for the clarification.

Quote: It could be the way you joined the words, but I'm not sure how you did that, so I don't know.

I should have included my working script in my last post. I’m not sure how I missed that. Sorry for the confusion. Here it is now:

from collections import Counter
from nltk.corpus import stopwords
import re

def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text

def main(text):
    stoplist = stopwords.words('english')  # Bring in the default English NLTK stop words
    clean = [word for word in text.split() if word not in stoplist]
    clean_text = ' '.join(clean)
    words = re.findall(r'\w+', clean_text)
    top_10 = Counter(words).most_common(10)
    for word, count in top_10:
        print(f'{word:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
    text = open_file()
    main(text)

Here is the output again:

Quote:
$ python3 script6.py

I would think “it” would be included in the stopword library for sure. I suspect the reason it isn't filtered has to do with the splitting and rejoining of “it” and “s” for the instances of “it’s” as they appear in Alice in Wonderland. If you look at the output above, “and” comes in at number 8. That can’t be right. I can’t explain what is going on, but perhaps you, @ichabod801, can better assess the output now that I have included my updated script here.

Quote: It could be that the words are not what they appear (try printing the repr of the words), or stop words is not what you expect. I would do a check and see if those words actually are in stopwords.

I’ll look into Python’s repr builtin for the words tomorrow. I’m off to bed for the evening. Thank you, @ichabod801, for your help and your patience so far!
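Drone4four's hypothesis about splitting and rejoining can be tested in miniature. The sketch below uses a hard-coded stand-in for the NLTK stoplist and a made-up sentence; it shows how whitespace tokens keep their punctuation, slip past the stoplist check, and then reappear as bare words once re.findall strips the punctuation:

```python
from collections import Counter
import re

# A few entries from NLTK's English stoplist, hard-coded here so the
# sketch runs without NLTK installed
stoplist = ['and', 'it', 'the', 'she']

text = "and the queen said, 'it, it, it!' and she ran"

# Step 1: filter whitespace tokens -- punctuation is still attached,
# so tokens like "'it," are NOT caught by the stoplist
clean = [w for w in text.split() if w not in stoplist]

# Step 2: re-join and tokenize -- punctuation is stripped only now,
# so the "filtered" word 'it' reappears in the counts
clean_text = ' '.join(clean)
words = re.findall(r'\w+', clean_text)
print(Counter(words).most_common(1))  # [('it', 3)]
```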
RE: Analyzing large text file with nltk.corpus (stopwords) - ichabod801 - Jun-01-2019

(Jun-01-2019, 03:04 AM)Drone4four Wrote: I’ll look into Python’s repr builtin for the words tomorrow. I’m off to bed for the evening.

Look at the f-string syntax. Using print(f'{word!r:<4} {"-->":^4} {count:>4}') should work.
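For readers unfamiliar with !r: it applies repr() to the value inside an f-string, which surrounds it with quotes and makes otherwise-invisible characters such as tabs visible (a minimal sketch with a made-up word):

```python
word = "alice\t"        # a word with a trailing tab character

# Plain interpolation: the tab is invisible in normal printing
print(f'{word}|end')    # alice	|end

# !r interpolation: repr() shows the quotes and the \t escape
print(f'{word!r}|end')  # 'alice\t'|end
```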
RE: Analyzing large text file with nltk.corpus (stopwords) - Drone4four - Jun-02-2019

I substituted my f-string for the one you suggested. I noticed right away that the output was the same as before. It’s identical. So I thought I must have executed the wrong file name in my shell. In the end I included both f-strings (the original and @ichabod801’s) on the same line, separated by a string of three pipes. Here is what my line 17 looks like now:

print(f'{word!r:<4} {"-->":^4} {count:>4} {"|||":^4} {word:<4} {"-->":^4} {count:>4}')

The first word variable on the left includes !r whereas the word variable on the right does not. Yet the output remains the same:

Quote:
$ python3 script.py

!r is not catching the apostrophes. I read the Python doc @ichabod801 shared and I understand some of it. I took it that !r should filter out apostrophes that are part of a word in a string (or, in my case, throughout a full-length book). So my hypothesis was that the “s” and “it” and “it’s” would be removed from the output. Apparently I need a new hypothesis. I’m all out of ideas. What do you all think the issue could be here? Remember: I’m trying to filter out all the common stopwords in a large text file so that the output shows the most commonly used nouns in the book Alice in Wonderland.
For what it’s worth, here is my entire script so far:

from collections import Counter
from nltk.corpus import stopwords
import re

def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text

def main(text):
    stoplist = stopwords.words('english')  # Bring in the default English NLTK stop words
    clean = [word for word in text.split() if word not in stoplist]
    clean_text = ' '.join(clean)
    words = re.findall(r'\w+', clean_text)
    top_10 = Counter(words).most_common(10)
    for word, count in top_10:
        print(f'{word!r:<4} {"-->":^4} {count:>4} {"|||":^4} {word:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
    text = open_file()
    main(text)

Thanks again, @ichabod801.

RE: Analyzing large text file with nltk.corpus (stopwords) - ichabod801 - Jun-02-2019

No, !r is causing the apostrophes. What it's showing is how the computer sees the words, and that the words are exactly what was expected. The !r would show whitespace or non-printing characters we would not see in the standard string print form, but it's not showing any such thing. So the problem is not with the words you are counting. You need to check stoplist to make sure it's what you think it is. You need to make sure the words you think shouldn't be in the output actually are in stoplist. Looking at the nltk book, they should be, so it's not clear to me what is going on. Print stoplist so we can be sure they're in there.
RE: Analyzing large text file with nltk.corpus (stopwords) - Drone4four - Jun-03-2019

Here are the contents of the generated list of stopwords:

Quote: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

As you can see, the very first list item is "i", which is inconsistent with the final output, because the script is still counting the instances of "i" in the Alice in Wonderland text file. I added "said", "i", "it", "you", "and" and "that" to my stoplist variable using this line:

stoplist.extend(["said", "i", "it", "you", "and", "that"])
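One way to reconcile the list above with the output is to test membership directly, as ichabod801 suggested. The sketch below hard-codes a subset of the quoted list; the curly-apostrophe case at the end is a hypothesis (Project Gutenberg texts often use typographic quotes), not something confirmed in this thread:

```python
# Subset of the NLTK English stopword list quoted above
stoplist = ['i', 'me', 'it', "it's", 'and', 's']

# The bare lowercase words are all present:
print('i' in stoplist, 'it' in stoplist, 's' in stoplist)  # True True True

# So if they still reach the output, the tokens being compared cannot be
# these bare strings. For example, a token containing a typographic
# (curly) apostrophe fails the membership test:
print('it’s' in stoplist)  # False: the list holds "it's" with a straight quote
```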
Even with the addition of these individual words, only "said" is omitted from the output. @ichabod801: I appreciate your help up to this point. I know you've said that it's not clear to you what the issue could be. Where else could I ask my question? Perhaps Stack Overflow would be a good next step. Here is the output now:

Quote:
$ python3 script.py

Here is what my script looks like now:

from collections import Counter
from nltk.corpus import stopwords
import re

def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text

def main(text):
    stoplist = stopwords.words('english')  # Bring in the default English NLTK stop words
    print(stoplist)
    stoplist.extend(["said", "i", "it", "you", "and", "that"])
    print(stoplist)
    clean = [word for word in text.split() if word not in stoplist]
    clean_text = ' '.join(clean)
    words = re.findall(r'\w+', clean_text)
    top_10 = Counter(words).most_common(10)
    for word, count in top_10:
        print(f'{word!r:<4} {"-->":^4} {count:>4} {"|||":^4} {word:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
    text = open_file()
    main(text)

For the sake of completeness, also find Alice.txt attached.

RE: Analyzing large text file with nltk.corpus (stopwords) - Drone4four - Jun-06-2019

I asked on Stack Overflow, in a question I titled “Filtering stop words out of a large text file (using package: nltk.corpus)”. Within about an hour, I got more than one answer. This is unusual, because in my past experience with SO, my questions usually got locked right away as duplicates or for violating some other term of service. So I’m surprised my question actually has merit. Anyway, based on the two answers I received there, I’ve got an updated working script which now properly filters out "i", "it", "you", "and" and "that", but I am not sure why. I’ve got some follow-up questions on Python syntax that I can’t ask on SO, so I’ll ask them here on Python Forum.
Here is my working script as it appears today:

from collections import Counter
from nltk.corpus import stopwords
import re

def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text

def main(text):
    stoplist = stopwords.words('english')
    stoplist.extend(["said", "gutenberg", "could", "would"])
    # clean = [word.lower() for word in re.split(r"\W+", text) if word not in stoplist]
    clean = []
    for word in re.split(r"\W+", text):
        if word not in stoplist:
            clean.append(word)
    top_10 = Counter(clean).most_common(10)
    for word, count in top_10:
        print(f'{word!r:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
    text = open_file()
    main(text)

Here is my attempt at explaining, in readable English prose, each line of my Python script above:

Quote: 1. From the collections library, import the Counter class

Here are my questions for all of you at this point:

1. Could someone on this forum verify the accuracy of my English interpretation of my Python script? The two problem areas I can identify are lines 15 and 20.
2. Python’s official regex doc may as well be Chinese to me, because it’s written by programmers for programmers, not for novices such as myself. On Google I came across a related SO question titled “Python regex - (\w+) results different output when used with complex expression”, but I don’t understand that either. Would someone kindly explain what r"\W+" does at line 15?
3. Why were we previously re-joining the list of clean text with clean_text = ' '.join(clean) and then finding all instances of "\w+" in the text? I believe this was @ichabod801’s initial suggestion earlier.

Here is my final, successful output:

Quote: 'alice' --> 403

Here @snippsat managed to get the output to centre-align using an awesome technique available with f-strings. I’ve used @snippsat’s f-string line verbatim, yet my output looks messy and inconsistent.
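On question 2 above: in a regular expression, \w matches a word character ([a-zA-Z0-9_] by default), \W matches any character that is not one, + means "one or more", and the r prefix makes the pattern a raw string so Python leaves the backslash alone. So re.split(r"\W+", text) splits on every run of punctuation or whitespace, which is why "it's" arrives at the stoplist check as the two bare tokens "it" and "s" (a minimal sketch on a made-up string):

```python
import re

sample = "it's late -- off with her head!"

# Split on runs of non-word characters; apostrophes, spaces and dashes
# all act as separators, so "it's" becomes "it" and "s"
parts = re.split(r"\W+", sample)
print(parts)  # ['it', 's', 'late', 'off', 'with', 'her', 'head', '']
```

Note the trailing empty string produced by the final "!". This also bears on question 3: the working script tokenizes first and filters second, so each bare fragment is checked against the stoplist, whereas the earlier version filtered whole whitespace tokens (punctuation still attached) and only stripped the punctuation afterwards.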
Here is a screenshot of my terminal showing my output on top (formatted incorrectly) with @snippsat’s output at the bottom (formatted correctly). How can I make my output more consistent and tidier, like @snippsat’s?