Python Forum
Analyzing large text file with nltk.corpus (stopwords)
I swapped my original f-string out for the one you suggested. I noticed right away that the output was the same as before, completely identical, so at first I thought I had run the wrong file name in my shell. To rule that out, I put both f-strings (my original and @ichabod801’s) on the same line, separated by three pipes. So here is what my line 17 looks like now:

print(f'{word!r:<4} {"-->":^4} {count:>4} {"|||":^4} {word:<4} {"-->":^4} {count:>4}')
The word variable on the left includes !r, whereas the word variable on the right does not. Yet the words and counts come out the same:

Quote: $ python3 script.py
'said' --> 462 ||| said --> 462
'alice' --> 403 ||| alice --> 403
'i' --> 283 ||| i --> 283
'it' --> 205 ||| it --> 205
's' --> 184 ||| s --> 184
'little' --> 128 ||| little --> 128
'you' --> 115 ||| you --> 115
'and' --> 107 ||| and --> 107
'one' --> 106 ||| one --> 106
'gutenberg' --> 93 ||| gutenberg --> 93

!r is not catching the apostrophes. I read the Python doc @ichabod801 shared and I understand some of it. My reading was that !r would filter out apostrophes that are part of a word in a string (or, in my case, throughout a full-length book), so my hypothesis was that “s”, “it”, and “it’s” would be removed from the output. Apparently I need a new hypothesis. I’m all out of ideas. What do you think the issue could be here?
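
Here is a quick check I ran in the interpreter afterwards. From what I can tell, !r just wraps the word in quotes (it calls repr() on the value) rather than stripping anything out, which matches the quotes showing up in the left column above. The word here is a throwaway example, not taken from the book:

word = "it's"
print(f'{word}')    # it's
print(f'{word!r}')  # "it's"  <- repr() adds quotes around the string; nothing is removed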

Remember: I’m trying to filter out all the common stopwords in a large text file so that the output shows the most commonly used nouns in the book Alice in Wonderland.
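
While poking at this I also ran a couple of quick membership checks against the NLTK stoplist, to see how the tokens coming out of text.split() compare to it. These are throwaway strings, not lines from the book:

import re
from nltk.corpus import stopwords

stoplist = stopwords.words('english')
print('and' in stoplist)              # True  -- the bare word is a stopword
print('and,' in stoplist)             # False -- with a trailing comma it slips past my filter
print(re.findall(r'\w+', "alice's"))  # ['alice', 's'] -- the regex later splits off a bare 's'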

For what it’s worth, here is my entire script so far:

from collections import Counter
from nltk.corpus import stopwords
import re


def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text


def main(text):
    stoplist = stopwords.words('english')  # bring in the default English NLTK stop words
    clean = [word for word in text.split() if word not in stoplist]  # drop stopword tokens
    clean_text = ' '.join(clean)
    words = re.findall(r'\w+', clean_text)  # raw string so \w isn't treated as an escape
    top_10 = Counter(words).most_common(10)
    for word, count in top_10:
        print(f'{word!r:<4} {"-->":^4} {count:>4} {"|||":^4} {word:<4} {"-->":^4} {count:>4}')


if __name__ == "__main__":
    text = open_file()
    main(text)
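
In case it helps the discussion, here is a rearrangement I was thinking of trying next, where the regex tokenization runs before the stopword filter and the stoplist is turned into a set for faster lookups. It’s just a sketch; I haven’t swapped it into my script yet:

from collections import Counter
from nltk.corpus import stopwords
import re


def main(text):
    stoplist = set(stopwords.words('english'))  # set for fast membership tests
    words = re.findall(r'\w+', text)            # tokenize first, so punctuation is already gone
    clean = [word for word in words if word not in stoplist]
    for word, count in Counter(clean).most_common(10):
        print(f'{word:<10} {"-->":^4} {count:>4}')
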
Thanks again, @ichabod801.