Python Forum
Saving a download of stopwords (nltk)
#1
I’ve got a basic Django project. One feature I am working on counts the most frequently occurring words in a .txt file, such as a large public domain book. I’ve used the Python Natural Language Toolkit (NLTK) to filter out “stopwords” (in SEO language, that means redundant words such as ‘the’, ‘you’, etc.).

Anyways, I’m getting this traceback on my Django server:

Quote: Resource stopwords not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('stopwords')

For more information see: https://www.nltk.org/data.html

After Googling around, I discovered that the cause is that I need to download the stopwords corpus. To resolve the issue, I simply open a Python REPL on my remote server and run these two straightforward lines:

>>> import nltk 
>>> nltk.download('stopwords')
That resolves the issue, but only temporarily. As soon as the REPL session is terminated, the error returns. I figure I need to use the built-in .save class method, but I am not sure which attribute to pair it with.

Here are the relevant lines from my utils.py file:

import re
from collections import Counter
from nltk.corpus import stopwords #library used to filter out common english words to produce more meaningful output
from blogs.models import Posts

def top_word_counts(text):
	stoplist = stopwords.words('english')
	stoplist.extend(["said", "gutenberg", "could", "would", "shall", "unto", "thou", "thy", "ye", "thee","upon", "hath","came", "come","things", "also", "saying", "say"])
	# Also treat the integers 0 through 1999 as stopwords
	extendinteger = list(range(0, 2000))
	# map(str, ...) converts each integer to its string form so it can be
	# compared against the words produced by re.split() below
	stoplist.extend(list(map(str,extendinteger)))
	clean = []
	for word in re.split(r"\W+", text):
		if word not in stoplist:
			clean.append(word)
	top_10 = Counter(clean).most_common(10)
	return top_10
I tried adding import nltk to the top of this script and adding nltk.download('stopwords') to different locations within the top_word_counts function but that didn’t work.

So my question is: How do I invoke nltk.download('stopwords') so that it automatically runs once without having to manually load it in the Python REPL?

Here is the utility file in full in my GitHub repo.

I decided to post to the General Coding Help forum instead of web development because the answer to my question is more to do with Python in general rather than being specific to Django.
#2
The download is saved to a persistent nltk_data directory on disk (by default under the user's profile, outside the project), so it's a one-time operation.
For example, on Windows:
>>> import nltk 
>>> nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Tom\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
So now the import will work every time.
>>> from nltk.corpus import stopwords
>>> stoplist = stopwords.words('english')
>>> stoplist[:5]
['i', 'me', 'my', 'myself', 'we']
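For the original question of running the download automatically, one option is to check for the corpus first and download it only when it is missing. The following is a minimal sketch, not a definitive implementation: `ensure_nltk_resource` is a hypothetical helper name, and the `finder` and `downloader` parameters are assumptions added only so the check can be exercised without NLTK or network access. It relies on `nltk.data.find` raising `LookupError` when a resource is not found.

```python
def ensure_nltk_resource(resource_path, package, finder=None, downloader=None):
    """Download an NLTK package only if it is not already on disk.

    Returns True if a download was triggered, False if the resource
    was already available.
    """
    if finder is None or downloader is None:
        # Imported lazily so the check itself can be tested with stand-ins.
        import nltk
        finder = finder or nltk.data.find
        downloader = downloader or nltk.download
    try:
        finder(resource_path)  # raises LookupError when the resource is missing
        return False
    except LookupError:
        downloader(package, quiet=True)
        return True


# Hypothetical usage near the top of utils.py, before stopwords is used:
# ensure_nltk_resource("corpora/stopwords", "stopwords")
```

Because the check runs before any download, repeated server restarts should not trigger repeated downloads once the data is in place.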
When using Django, it is most common (and highly advisable) to run in a virtual environment.
You can then point the download to that folder, so the data is read from the environment folder rather than from the default location.
>>> import nltk 
>>> nltk.download('stopwords', download_dir='E:/div_code/django_env/nltk_data')
[nltk_data] Downloading package stopwords to
[nltk_data]     E:/div_code/django_env/nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
True
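NLTK looks for data in the directories listed in `nltk.data.path` (which can also be extended through the `NLTK_DATA` environment variable), so a custom download directory is only useful if it appears in that list. As a hedged sketch, the hypothetical helper below puts a directory at the front of such a list; it takes the list as an argument so it can be tried without NLTK installed.

```python
import os


def register_nltk_data_dir(data_dir, search_path):
    """Put data_dir at the front of an NLTK-style search path list."""
    data_dir = os.path.abspath(data_dir)
    if data_dir in search_path:
        # Avoid duplicates when the directory was registered earlier.
        search_path.remove(data_dir)
    search_path.insert(0, data_dir)
    return search_path


# Hypothetical usage (the directory is the one from the example above):
# import nltk
# register_nltk_data_dir("E:/div_code/django_env/nltk_data", nltk.data.path)
# nltk.download("stopwords", download_dir=nltk.data.path[0])
```

Registering the directory before importing anything that reads the corpus keeps the lookup and the download pointed at the same place.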