Jun-06-2019, 09:30 PM
(This post was last modified: Jun-06-2019, 09:30 PM by Drone4four.)
I asked a question on Stack Overflow titled: “Filtering stop words out of a large text file (using package: nltk.corpus)”. Within about an hour, I got more than one answer. This is unusual because in my past experience with SO, my questions usually got locked right away as duplicates or for violating one of the other stupid terms of service on SO. So I’m surprised my question actually has merit. Anyway, based on the two answers I received there, I’ve got an updated working script which now properly filters out "i", "it", "you", "and", "that". But I am not sure why it works. I’ve got some more follow-up questions on Python syntax but can’t ask them on SO, so I’ll ask them here on Python Forum.
Here is my working script as it appears today:
from collections import Counter
from nltk.corpus import stopwords
import re

def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text

def main(text):
    stoplist = stopwords.words('english')
    stoplist.extend(["said", "gutenberg", "could", "would"])
    # clean = [word.lower() for word in re.split(r"\W+", text) if word not in stoplist]
    clean = []
    for word in re.split(r"\W+", text):
        if word not in stoplist:
            clean.append(word)
    top_10 = Counter(clean).most_common(10)
    for word, count in top_10:
        print(f'{word!r:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
    text = open_file()
    main(text)

Here is my attempt at explaining, in plain, readable English prose, each line of my Python script above:
Quote:
1. From the collections library, import the Counter class
2. From inside the natural language toolkit's corpus package, import stopwords (a function? or a method?)
3. Import the re module (Python's built-in regular expression module)
4. Whitespace
5. This line defines the open_file function
6. Open the Alice in Wonderland text file as the variable f
7. Declare a text variable by reading and lower-casing the contents of f
8. Return the text variable whenever open_file() is called in the future
9. Whitespace
10. This line defines the main function, with the text variable submitted as a parameter
11. Declare a variable called stoplist based on the English list of words found within the stopwords method (or is this a function?) imported earlier
12. The list of stopwords (part of the stoplist variable) is extended by adding words such as: "said", "gutenberg", "could", "would"
13. This line (commented out) is the loop below written in list comprehension form, which is beyond my novice understanding at this point, so I rewrote it and spread it out as a simple loop in the following 4 lines
14. Declare an empty list as a variable called clean
15. For every word inside the text variable (split into a list)
16. If the word is not inside the stoplist, then:
17. Append the word to the clean list. This effectively re-creates the book in list form without the most commonly used words (found inside the stoplist generated earlier)
18. Declare a new variable called top_10, which represents the clean list of words submitted for counting inside the Counter class
19. For the two iterables inside the top_10 variable:
20. Print the first iterable (word) and the second iterable (count), divided with a nice --> in between the two
21. Whitespace
22. Just a silly programming convention
23. Declare the text variable based on the value returned by the open_file() function
24. Call the main function with the newly created text variable
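To check my reading of lines 15–17 (and the r"\W+" pattern I ask about below), I ran the split on a small made-up sentence (the sample text here is my own, not from Alice.txt):

```python
import re
from collections import Counter

# A small made-up sample to see what re.split(r"\W+", ...) actually does.
text = "alice said: 'curiouser, and curiouser!'"

# \W matches any character that is NOT a word character (letters, digits, _),
# and + means "one or more in a row", so each run of punctuation/whitespace
# becomes a single split point.
words = re.split(r"\W+", text)
print(words)  # ['alice', 'said', 'curiouser', 'and', 'curiouser', '']

# Note the trailing '': re.split leaves an empty string when the text
# ends with a non-word character.

# re.findall(r"\w+", ...) is the mirror image: instead of splitting on
# non-word runs, it collects the word runs directly, with no empty strings.
words2 = re.findall(r"\w+", text)
print(words2)  # ['alice', 'said', 'curiouser', 'and', 'curiouser']

# Counting works the same either way:
print(Counter(words2).most_common(2))  # [('curiouser', 2), ('alice', 1)]
```

So splitting on non-word runs (\W+) and collecting word runs (\w+) give essentially the same word list, which seems to be why either approach works in the script.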
Here are my questions for all of you at this point:
1. Could someone on this forum verify the accuracy of my above plain-English interpretation of my Python script? The two problem areas I can identify are lines 15 and 20.
2. Python’s official regex doc may as well be Chinese to me because it’s written by programmers for programmers, not for novices such as myself. On Google I came across a related SO question titled: “Python regex - (\w+) results different output when used with complex expression”, but I don’t understand this either. Would someone kindly explain what r"\W+" does at line 15?
3. Why were we previously re-joining the list of clean text with clean_text = ' '.join(clean) and then finding all instances of "\W+" in text? I believe this was @ichabod801’s initial suggestion earlier.
Here is my final, successful output:
Quote:'alice' --> 403
'little' --> 128
'one' --> 106
'know' --> 88
'project' --> 87
'like' --> 85
'went' --> 83
'queen' --> 75
'thought' --> 74
'time' --> 71
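As a side check on line 13 of the script (the commented-out list comprehension that I said was beyond me), I tried both versions on a toy list to convince myself they build the same thing (the sample words here are made up):

```python
stoplist = ['the', 'and', 'a']
words = ['the', 'rabbit', 'and', 'the', 'queen']

# The spread-out loop, as in lines 14-17 of my script:
clean = []
for word in words:
    if word not in stoplist:
        clean.append(word)

# The one-line comprehension: same filter, read left to right as
# "word, for each word in words, if word is not in stoplist".
clean2 = [word for word in words if word not in stoplist]

print(clean)            # ['rabbit', 'queen']
print(clean == clean2)  # True
```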
Here, @snippsat managed to centre-align the output using an awesome technique available with f-strings. I’ve used @snippsat’s f-string line verbatim, yet my output looks messy and inconsistent. Here is a screenshot of my terminal showing my output on top (formatted incorrectly) and @snippsat’s output at the bottom (formatted correctly):
How can I make my output more consistent and appear tidier like @snippsat’s?
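I'm not sure this is the cause of my messy columns, but here is a small experiment I ran with the width part of the format spec (the words and counts are placeholders). With a width of only 4, '<4' adds no padding because every word is already longer than 4 characters, so the columns drift; padding to a width wider than the longest word keeps them flush:

```python
rows = [("'alice'", 403), ("'little'", 128), ("'one'", 106)]

# Width 4: every word is longer than 4 chars, so '<4' never pads
# and the arrow column shifts with each word's length.
for word, count in rows:
    print(f'{word:<4} {"-->":^4} {count:>4}')

# Width 12: each word is padded out to 12 chars, so the arrows line up.
for word, count in rows:
    print(f'{word:<12} {"-->":^4} {count:>4}')
```

In a format spec like {word:<12}, the number is a minimum field width: shorter values get padded with spaces up to it, longer values are printed as-is.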