Python Forum
Analyzing large text file with nltk.corpus (stopwords )
#10
I asked on Stack Overflow in a question titled "Filtering stop words out of a large text file (using package: nltk.corpus)". Within about an hour, I got more than one answer. This is unusual because, in my past experience with SO, my questions usually got locked right away for being duplicates or for violating one of the other stupid terms of service on SO. So I’m surprised my question actually has merit. Anyway, based on the two answers I received there, I’ve got an updated working script which now properly filters out "i", "it", "you", "and", and "that". But I am not sure why it works. I’ve got some follow-up questions on Python syntax that I can’t ask on SO, so I’ll ask them here on Python Forum.

Here is my working script as it appears today:

from collections import Counter
from nltk.corpus import stopwords
import re

def open_file():
   with open('Alice.txt') as f:
       text = f.read().lower()
   return text

def main(text):
   stoplist = stopwords.words('english')
   stoplist.extend(["said","gutenberg", "could", "would",])
   # clean = [word.lower() for word in re.split(r"\W+",text) if word not in stoplist]
   clean = []
   for word in re.split(r"\W+",text):
       if word not in stoplist:
           clean.append(word)
   top_10 = Counter(clean).most_common(10)
   for word,count in top_10:
       print(f'{word!r:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
   text = open_file()
   main(text)
Here is my attempt at explaining, in cerebral, readable English prose, each line of my Python script above:

1. From the collections library, import the Counter class
2. From inside the Natural Language Toolkit, import the corpus package, specifically the stopwords function (or method?)
3. Import the regex package (which is a Python builtin)
4. Whitespace
5. This line defines and declares the open_file function
6. Open the Alice in Wonderland text file as variable f
7. Declare a text variable by reading and lowercasing the contents of f
8. Enable the text variable to be returned whenever open_file() is called in the future
9. Whitespace
10. This line defines and declares the main function with the text variable submitted as a parameter
11. Declare a variable called stoplist based on the English list of words found within the stopwords method (or is this a function?) imported earlier
12. The list of stopwords (part of the stoplist variable) is extended by adding words such as: "said", "gutenberg", "could", "would"
13. This line (commented out) is a highly advanced loop in list comprehension form which is beyond my novice understanding at this point, so I rewrote it and spread it out as a simple loop in the following 4 lines
14. Declare an empty list as a variable called clean
15. For every word inside the text variable (split into a list)
16. If the word is not inside the stoplist, then:
17. Append the word to the clean variable (a list). This effectively re-creates the book in list form without the most commonly used words (found inside the stoplist generated earlier)
18. Declare a new variable called top_10 which represents the clean list of words submitted for counting inside the Counter class
19. For the two iterators inside the top_10 variable:
20. Print the first iterable (word) and the second iterable (count), divided by a nice --> in between the two
21. Whitespace
22. Just a silly programming convention
23. Declare the text variable based on the variable created (returned) in the open_file() function
24. Call the main function with the newly created text variable
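
As a point of comparison for items 13 to 17, here is a minimal sketch of how the commented-out list comprehension relates to the spread-out loop. The sample sentence and the three-word stop list are made up for illustration; they stand in for Alice.txt and the NLTK stop list.

```python
import re

# Hypothetical sample text and stop list (not the real book or NLTK's list).
text = "the cat and the hat sat on the mat"
stoplist = ["the", "and", "on"]

# Explicit loop, as in lines 14-17 of the script.
clean_loop = []
for word in re.split(r"\W+", text):
    if word not in stoplist:
        clean_loop.append(word)

# List comprehension in the style of the commented-out line 13:
# [what to keep  for each item  if condition]
clean_comp = [word for word in re.split(r"\W+", text) if word not in stoplist]

print(clean_loop)
print(clean_comp == clean_loop)
```

The commented-out line in the script also calls word.lower(), which should make no difference there because open_file() has already lowercased the whole text.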

Here are my questions for all of you at this point:

1. Could someone on this forum verify the accuracy of my cerebral, English interpretation of my Python script above? I suppose the two problem areas that I can identify are lines 15 and 20.
2. Python’s official regex doc may as well be Chinese to me because it’s written by programmers for programmers, not for novices such as myself. On Google I came across a related SO question titled "Python regex - (\w+) results different output when used with complex expression", but I don’t understand this either. Would someone kindly explain what r"\W+" does at line 15?
3. Why were we previously re-joining the list of clean text with clean_text = ' '.join(clean) and then finding all instances of "\W+" in text? I believe this was @ichabod801’s initial suggestion earlier.
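
For question 2, a small sketch of what re.split(r"\W+", ...) does may help. The sample sentence is invented for illustration. In Python's re module, \W matches a single character that is not a letter, digit, or underscore, and + means "one or more in a row", so a run such as ", " or "... " is treated as one separator. The r prefix makes it a raw string, so the backslash reaches the regex engine unchanged.

```python
import re

# Hypothetical sample line; any text with punctuation would do.
sample = "alice was beginning, to get very tired... of sitting!"

# Split wherever one or more non-word characters appear.
words = re.split(r"\W+", sample)
print(words)

# Punctuation at the very end leaves an empty string as the last item.
print(words[-1] == "")
```

Because of that trailing empty string, a script like the one above may end up counting "" as a word unless empty strings are filtered out as well.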

Here is my final, successful output:

'alice' --> 403
'little' --> 128
'one' --> 106
'know' --> 88
'project' --> 87
'like' --> 85
'went' --> 83
'queen' --> 75
'thought' --> 74
'time' --> 71
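
Output of this shape comes from looping over what Counter(...).most_common(n) returns, which is a list of (word, count) tuples. A minimal sketch, using a made-up six-word list in place of the filtered book text:

```python
from collections import Counter

# Hypothetical word list standing in for the filtered book text.
clean = ["alice", "queen", "alice", "little", "queen", "alice"]

# most_common(2) returns the 2 most frequent items as (word, count) tuples.
top_2 = Counter(clean).most_common(2)
print(top_2)

# On each pass through the loop, one tuple is unpacked into two names.
for word, count in top_2:
    print(word, count)
```

So in `for word, count in top_10`, word and count are not two iterators; they are simply the two values inside each tuple of the list.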

Here @snippsat managed to get the output to centre-align using an awesome technique available with f-strings. I’ve used @snippsat’s f-string line verbatim, yet my output looks messy and inconsistent. Here is a screenshot of my terminal showing my output on top (formatted incorrectly) with @snippsat’s output at the bottom (formatted correctly):

[Image: ls0JC8J.jpg]

How can I make my output more consistent and appear tidier like @snippsat’s?
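
One possible cause, offered as an assumption since the screenshot is not reproduced here: in `{word!r:<4}` the 4 is a minimum width, and a quoted word such as 'alice' is already longer than that, so no padding is added and the columns drift. A rough sketch that pads every word to the width of the longest one instead (the three pairs are taken from the output above; the name width is introduced only for this example):

```python
# Sample (word, count) pairs taken from the output above.
top = [("alice", 403), ("little", 128), ("one", 106)]

# Width of the longest quoted word; repr() adds the surrounding quotes.
width = max(len(repr(word)) for word, _ in top)

# Nested braces let the field width come from a variable.
lines = [f"{word!r:<{width}} {'-->':^4} {count:>4}" for word, count in top]
for line in lines:
    print(line)
```

A fixed width such as `{word!r:<10}` would also work as long as no word in the top 10 is longer than that.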

Messages In This Thread
RE: Analyzing large text file with nltk.corpus (stopwords ) - by Drone4four - Jun-06-2019, 09:30 PM