Python Forum

Full Version: Saving a download of stopwords (nltk)
I’ve got a basic Django project. One feature I am working on counts the most commonly occurring words in a .txt file, such as a large public domain book. I’ve used the Python Natural Language Toolkit (NLTK) to filter out “stopwords” (in SEO language, that means redundant words such as ‘the’, ‘you’, etc.).

Anyways, I’m getting this traceback on my Django server:

Quote: Resource stopwords not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('stopwords')

For more information see: https://www.nltk.org/data.html

After Googling around, I discovered that I need to download the stopwords corpus. To resolve the issue, I simply open a Python REPL on my remote server and run these two straightforward lines:

>>> import nltk 
>>> nltk.download('stopwords')
That resolves the issue, but only temporarily. As soon as the REPL session is terminated, the error returns. I figure I need to use the built-in .save class method, but I am not sure which attribute to pair it with.

Here are the relevant lines from my utils.py file:

import re
from collections import Counter
from nltk.corpus import stopwords #library used to filter out common english words to produce more meaningful output
from blogs.models import Posts

def top_word_counts(text):
	# A set gives fast membership tests in the loop below
	stoplist = set(stopwords.words('english'))
	stoplist.update(["said", "gutenberg", "could", "would", "shall", "unto", "thou", "thy", "ye", "thee","upon", "hath","came", "come","things", "also", "saying", "say"])
	# Also exclude the integers 0 through 1999, converted to strings with map()
	stoplist.update(map(str, range(0, 2000)))
	clean = []
	# Lowercase the text so capitalised stopwords such as "The" are filtered too
	for word in re.split(r"\W+", text.lower()):
		# re.split can yield empty strings at the edges of the text; skip them
		if word and word not in stoplist:
			clean.append(word)
	top_10 = Counter(clean).most_common(10)
	return top_10
I tried adding import nltk to the top of this script and adding nltk.download('stopwords') to different locations within the top_word_counts function but that didn’t work.

So my question is: How do I invoke nltk.download('stopwords') so that it automatically runs once without having to manually load it in the Python REPL?

Here is the utility file in full in my GitHub repo.

I decided to post to the General Coding Help forum instead of web development because the answer to my question is more to do with Python in general rather than being specific to Django.
It downloads to a shared nltk_data directory outside your project (under the user profile by default), so it's a one-time operation for that user account.
E.g. on Windows:
>>> import nltk 
>>> nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Tom\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
So now the import will work every time.
>>> from nltk.corpus import stopwords
>>> stoplist = stopwords.words('english')
>>> stoplist[:5]
['i', 'me', 'my', 'myself', 'we']
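If you would rather not run the download by hand, one option is to check for the corpus first and download it only when it is missing. A minimal sketch, assuming a hypothetical helper name ensure_stopwords (it is not part of NLTK). It relies on nltk.data.find raising LookupError when a resource is absent. nltk is imported inside the function only so the snippet stands alone; a top-level import works just as well:

```python
def ensure_stopwords(download_dir=None):
    """Download the NLTK stopwords corpus only if it is not already installed.

    `ensure_stopwords` is a hypothetical helper name, not part of NLTK.
    Returns True if a download was attempted, False if the corpus was found.
    """
    import nltk  # lazy import; a top-level import works just as well

    try:
        # Raises LookupError if the corpus is in none of the nltk.data.path folders
        nltk.data.find("corpora/stopwords")
        return False
    except LookupError:
        # download_dir=None lets NLTK pick its default data directory
        nltk.download("stopwords", download_dir=download_dir)
        return True
```

Calling ensure_stopwords() once near the top of utils.py, before stopwords.words('english') is first used, should be enough. Later calls find the corpus on disk and skip the download.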
When using Django, it is most common (and highly advisable) to run in a virtual environment.
You can then point the download to that folder, so NLTK gets its data from the environment folder rather than the system-wide one.
>>> import nltk 
>>> nltk.download('stopwords', download_dir='E:/div_code/django_env/nltk_data')
[nltk_data] Downloading package stopwords to
[nltk_data]     E:/div_code/django_env/nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
True
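NLTK only finds data in folders listed in nltk.data.path, so a custom download_dir has to be one of them. As far as I know it already searches an nltk_data folder directly under sys.prefix (the root of the virtual environment), and the NLTK_DATA environment variable adds further folders if it is set before nltk is first imported. A rough sketch of the environment-variable approach, assuming a hypothetical helper name use_nltk_data_dir:

```python
import os
from pathlib import Path


def use_nltk_data_dir(folder):
    """Add `folder` to the NLTK_DATA search path for this process.

    Hypothetical helper, not part of NLTK. It must run before the first
    `import nltk`, because NLTK reads NLTK_DATA when it is imported.
    """
    folder = str(Path(folder))
    existing = os.environ.get("NLTK_DATA", "")
    # NLTK_DATA may hold several folders separated by os.pathsep
    parts = [p for p in existing.split(os.pathsep) if p]
    if folder not in parts:
        # Put the new folder first so it is searched before the others
        parts.insert(0, folder)
    os.environ["NLTK_DATA"] = os.pathsep.join(parts)
    return folder
```

For example, call use_nltk_data_dir('E:/div_code/django_env/nltk_data') in manage.py or settings.py before anything imports nltk. The nltk.download call with the same download_dir then only needs to run once.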