Python Forum
Analyzing large text file with nltk.corpus (stopwords)
I swapped my original f-string out for the one you suggested. I noticed right away that the output was the same as before, completely identical, so at first I thought I had run the wrong file name in my shell. To rule that out, I put both f-strings (my original and @ichabod801’s) on the same line, separated by three pipes. So here is what my line 17 looks like now:

print(f'{word!r:<4} {"-->":^4} {count:>4} {"|||":^4} {word:<4} {"-->":^4} {count:>4}')
The word variable on the left includes !r, whereas the word variable on the right does not. Yet the words and counts come out the same:

Quote: $ python3 script.py
'said' --> 462 ||| said --> 462
'alice' --> 403 ||| alice --> 403
'i' --> 283 ||| i --> 283
'it' --> 205 ||| it --> 205
's' --> 184 ||| s --> 184
'little' --> 128 ||| little --> 128
'you' --> 115 ||| you --> 115
'and' --> 107 ||| and --> 107
'one' --> 106 ||| one --> 106
'gutenberg' --> 93 ||| gutenberg --> 93

!r is not catching the apostrophes. I read the Python doc @ichabod801 shared and I understand some of it. My reading was that !r would filter out apostrophes that are part of a word in a string (or, in my case, throughout a full-length book), so my hypothesis was that “s”, “it”, and “it’s” would be removed from the output. Apparently I need a new hypothesis. I’m all out of ideas. What do you think the issue could be here?
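
Here is a quick check I ran in the interpreter afterwards. From what I can tell, !r just wraps the word in quotes (it calls repr() on the value) rather than stripping anything out, which matches the quotes showing up in the left column above. The word here is a throwaway example, not taken from the book:

word = "it's"
print(f'{word}')    # it's
print(f'{word!r}')  # "it's"  <- repr() adds quotes around the string; nothing is removed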

Remember: I’m trying to filter out all the common stopwords in a large text file so that the output shows the most commonly used nouns in the book Alice in Wonderland.
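
While poking at this I also ran a couple of quick membership checks against the NLTK stoplist, to see how the tokens coming out of text.split() compare to it. These are throwaway strings, not lines from the book:

import re
from nltk.corpus import stopwords

stoplist = stopwords.words('english')
print('and' in stoplist)              # True  -- the bare word is a stopword
print('and,' in stoplist)             # False -- with a trailing comma it slips past my filter
print(re.findall(r'\w+', "alice's"))  # ['alice', 's'] -- the regex later splits off a bare 's'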

For what it’s worth, here is my entire script so far:

from collections import Counter
from nltk.corpus import stopwords
import re


def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text


def main(text):
    stoplist = stopwords.words('english')  # bring in the default English NLTK stop words
    clean = [word for word in text.split() if word not in stoplist]  # drop stopword tokens
    clean_text = ' '.join(clean)
    words = re.findall(r'\w+', clean_text)  # raw string so \w isn't treated as an escape
    top_10 = Counter(words).most_common(10)
    for word, count in top_10:
        print(f'{word!r:<4} {"-->":^4} {count:>4} {"|||":^4} {word:<4} {"-->":^4} {count:>4}')


if __name__ == "__main__":
    text = open_file()
    main(text)
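
In case it helps the discussion, here is a rearrangement I was thinking of trying next, where the regex tokenization runs before the stopword filter and the stoplist is turned into a set for faster lookups. It’s just a sketch; I haven’t swapped it into my script yet:

from collections import Counter
from nltk.corpus import stopwords
import re


def main(text):
    stoplist = set(stopwords.words('english'))  # set for fast membership tests
    words = re.findall(r'\w+', text)            # tokenize first, so punctuation is already gone
    clean = [word for word in words if word not in stoplist]
    for word, count in Counter(clean).most_common(10):
        print(f'{word:<10} {"-->":^4} {count:>4}')
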
Thanks again, @ichabod801.