Python Forum
Naive Bayes too slow
#1
So, after fooling around with this algorithm, I've noticed that it's far too slow, since it's meant as a learning toolkit, especially for analyzing large sets of data.

I want to be able to retain the function of Naive Bayes without the insane amount of time it takes to process.

Can I use scikit-learn as a wrapper of some sort instead?

That seems like it would be better equipped to deal with the problem.

Here's my code, feel free to make revisions in addition to helping me speed up the processing time:

import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))

featuresets = [(find_features(rev), category) for (rev, category) in documents]

training_set = featuresets[:1900]
# hold out the last 100 reviews for testing
testing_set = featuresets[1900:]

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)
Output:
False, u'effected': False, u'compared': False, u'nonetheless': False, u'deadly': False, u'purproses': False, u'lately': False, u'kerrigans': False, u'compares': False, u'details': False, u'behold': False, u'vulgarize': False, u'illusion': False, u'ponytail': False, u'rebelled': False, u'repeat': False, u'zhou': False, u'treason': False, u'allotting': False, u'impregnating': False, u'tinier': False, u'trunchbull': False, u'laude': False, u'exposure': False, u'searches': False, u'ustinov': False, u'disatisfaction': False, u'mishears': False, u'torrid': False, u'compete': False, u'lestat': False, u'villainous': False, u'searched': False, u'gardens': False, u'homerian': False}
('Naive Bayes Algo accuracy percent:', 87.78947368421053)
Most Informative Features
               insulting = True              neg : pos    =     10.6 : 1.0
                    sans = True              neg : pos    =      8.4 : 1.0
                 wasting = True              neg : pos    =      8.4 : 1.0
            refreshingly = True              pos : neg    =      8.3 : 1.0
              mediocrity = True              neg : pos    =      7.7 : 1.0
               dismissed = True              pos : neg    =      7.0 : 1.0
             bruckheimer = True              neg : pos    =      6.3 : 1.0
               sumptuous = True              pos : neg    =      6.3 : 1.0
              cronenberg = True              pos : neg    =      6.3 : 1.0
                  fabric = True              pos : neg    =      6.3 : 1.0
                     ugh = True              neg : pos    =      5.8 : 1.0
                  doubts = True              pos : neg    =      5.8 : 1.0
                  bounce = True              neg : pos    =      5.7 : 1.0
                   wires = True              neg : pos    =      5.7 : 1.0
                    wits = True              pos : neg    =      5.7 : 1.0
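
A side note on the accuracy figure: 87.78947368421053 is exactly 1668/1900, which suggests the score was computed over 1,900 reviews, most likely the training slice itself, rather than over a held-out 100. The spellings `[:1900:]` and `[1900:]` are easy to confuse, so here is a minimal sketch of how they differ. Plain integers stand in for the real feature sets (an assumption made only to keep the example self-contained).

```python
# Stand-in for the 2,000 (features, label) pairs; integers keep the sketch self-contained.
featuresets = list(range(2000))

training_set = featuresets[:1900]
same_slice = featuresets[:1900:]   # the trailing colon is an empty step, so this is [:1900] again
held_out = featuresets[1900:]      # the 100 items after the training cut

print(same_slice == training_set)  # True
print(len(held_out))               # 100

# The reported accuracy matches 1668 correct answers out of 1900.
print(abs(1668 / 1900 * 100 - 87.78947368421053) < 1e-9)  # True
```

Scoring on a slice that overlaps the training data tends to overstate accuracy, so a held-out slice gives a more honest number.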
Reply
#2
(Oct-21-2016, 09:22 PM)pythlang Wrote: I want to be able to retain the function of Naive Bayes without the insane amount of time it takes to process.
What do you mean by a long time? That code takes 9 seconds for me.
Reply
#3
(Oct-21-2016, 09:48 PM)snippsat Wrote: What do you mean by a long time? That code takes 9 seconds for me.

It takes like 5 minutes for me.

EDIT: What could be causing this to happen?
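
With a gap as large as 9 seconds versus 5 minutes, it may help to time each stage separately before guessing at a cause. The following is a minimal sketch using only the standard library; the `stage` helper and the toy workload are hypothetical stand-ins, and in the real script the `with` blocks would wrap the corpus loading, feature extraction, and training steps.

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(label, timings):
    """Record how long the enclosed block takes, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

timings = {}

with stage("build word list", timings):
    # Stand-in for: all_words = [w.lower() for w in movie_reviews.words()]
    words = [w.lower() for w in ("A", "b", "C") * 1000]

with stage("count words", timings):
    # Stand-in for: nltk.FreqDist(all_words)
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1

for label, seconds in timings.items():
    print("%-16s %.4f s" % (label, seconds))
```

If one stage accounts for nearly all of the time, that is the place to look first; if every stage is uniformly slow, the machine itself (memory pressure, disk) becomes the more likely suspect.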
Reply
#4
(Oct-21-2016, 09:52 PM)pythlang Wrote: It takes like 5 minutes for me. EDIT: What could be causing this to happen?

Not enough memory, causing swapping? Check what your process monitor displays.
Reply
#5
(Oct-21-2016, 10:03 PM)Ofnuts Wrote: Not enough memory causing swapping? See your process monitor displays....

How would I be able to view or change this, and what are acceptable standards for these types of processes?
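
Besides a graphical process monitor (Activity Monitor on a Mac), memory use can be read from inside the script. This is a rough sketch using the standard-library `resource` module, which is available on macOS and Linux but not on Windows; the `peak_memory_mb` helper and the dummy data are hypothetical and only illustrate the idea.

```python
import resource
import sys

def peak_memory_mb():
    """Peak resident set size of this process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor

before = peak_memory_mb()
# Stand-in for featuresets: 200 dictionaries with 3000 boolean entries each.
data = [{"w%d" % i: False for i in range(3000)} for _ in range(200)]
after = peak_memory_mb()

print("peak RSS before: %.1f MiB, after: %.1f MiB" % (before, after))
```

There is no universal "acceptable" number; what matters is whether the peak approaches the machine's physical RAM, since that is when swapping starts and run times can balloon.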
Reply
#6
(Oct-21-2016, 09:52 PM)pythlang Wrote: EDIT: What could be causing this to happen?
Have you downloaded all the NLTK data?
>>> import nltk
>>> nltk.download()
Quote: A new window should open, showing the NLTK Downloader.
Click on the File menu and select Change Download Directory.
For central installation, set this to C:\nltk_data (Windows),
/usr/local/share/nltk_data (Mac), or /usr/share/nltk_data (Unix).
Next, select the packages or collections you want to download.
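
For a quick check without opening the downloader window, the data directories can be inspected directly. The sketch below assumes the per-user default `~/nltk_data` plus the two Unix-style central locations quoted above; `find_corpus` is a hypothetical helper, and the authoritative search list is whatever `nltk.data.path` contains on the machine in question.

```python
import os

# Candidate data directories: the per-user default plus the central locations quoted above.
CANDIDATE_DIRS = [
    os.path.expanduser("~/nltk_data"),
    "/usr/local/share/nltk_data",
    "/usr/share/nltk_data",
]

def find_corpus(name, search_dirs=CANDIDATE_DIRS):
    """Return the first path (directory or .zip) holding corpus `name`, or None."""
    for base in search_dirs:
        for candidate in (os.path.join(base, "corpora", name),
                          os.path.join(base, "corpora", name + ".zip")):
            if os.path.exists(candidate):
                return candidate
    return None

location = find_corpus("movie_reviews")
print(location or "movie_reviews not found; nltk.download('movie_reviews') would fetch it")
```

Note that a missing corpus normally raises a `LookupError` right away rather than slowing the script down, so this check rules a cause out more than it explains the timing.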
Reply
#7
(Oct-21-2016, 10:18 PM)snippsat Wrote: Have you downloaded all the NLTK data?

Jordans-MBP:~ jordan$ which python
/usr/bin/python
Jordans-MBP:~ jordan$ python
Python 2.7.10 (default, Jul 30 2016, 18:31:42) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path
['', '/Library/Python/2.7/site-packages/pip-8.1.2-py2.7.egg', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python27.zip', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-old', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload', '/Users/jordanXXX/Library/Python/2.7/lib/python/site-packages', '/Library/Python/2.7/site-packages', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/PyObjC']
>>> quit()
Jordans-MBP:~ jordan$ python3
Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 26 2016, 10:47:25) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path
['', '/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk-3.2.1-py3.5.egg', '/Library/Frameworks/Python.framework/Versions/3.5/lib/python35.zip', '/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5', '/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/plat-darwin', '/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/lib-dynload', '/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages']
>>> import os, os.path
>>> path = os.path.expanduser('~/nltk_data')
>>> if not os.path.exists(path):
...     os.mkdir(path)
...     os.path.exists(path)
... 
>>> import nltk.data
>>> path in nltk.data.path
True
>>> 
As far as I know, I've downloaded all the NLTK data; otherwise I probably wouldn't be able to use these tools and would run into something like this, which happened when I tried to use matplotlib for the first time:
Error:
no module named "X"
Are the install paths for python3 and nltk_data OK?
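
In the session above, the NLTK egg appears only on the Python 3.5 `sys.path`, not on the 2.7 one, so which interpreter runs the script matters. A small sketch for checking this from inside any Python 3 interpreter follows; `where_is` is a hypothetical helper built on `importlib.util.find_spec`.

```python
import importlib.util
import sys

def where_is(module_name):
    """Return the file a top-level module would be imported from, or None if it is not installed."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec is not None else None

print("interpreter:", sys.executable)
print("version:    ", sys.version.split()[0])
for name in ("nltk", "sklearn", "json"):
    print("%-8s -> %s" % (name, where_is(name)))
```

Running the same lines under each interpreter shows at a glance which one can see NLTK and scikit-learn, and from which site-packages directory.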
Reply
#8
As I'm going along, I have run into a problem with scikit-learn.

Can anyone shed some light on this? I have scoured Google to no avail for an explanation that I can understand:

import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB


documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

# print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))

featuresets = [(find_features(rev), category) for (rev, category) in documents]

training_set = featuresets[:1900]
# hold out the last 100 reviews for testing
testing_set = featuresets[1900:]

# classifier = nltk.NaiveBayesClassifier.train(training_set)

classifier_f = open("naivebayes.pickle", "rb")
classifier = pickle.load(classifier_f)
classifier_f.close()

print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

# save_classifier = open("naivebayes.pickle", "wb")
# pickle.dump(classifier, save_classifier)
# save_classifier.close()

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

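# GaussianNB needs dense input, so turn off the wrapper's default sparse vectorization
GaussianNB_classifier = SklearnClassifier(GaussianNB(), sparse=False)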
GaussianNB_classifier.train(training_set)
print("GaussianNB_classifier:", (nltk.classify.accuracy(GaussianNB_classifier, testing_set))*100)

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)
Error:
Traceback (most recent call last):
  File "/Users/jordanXXX/Documents/NLP/scikitlearn", line 6, in <module>
    from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
  File "/Library/Python/2.7/site-packages/sklearn/__init__.py", line 56, in <module>
    from . import __check_build
  File "/Library/Python/2.7/site-packages/sklearn/__check_build/__init__.py", line 46, in <module>
    raise_build_error(e)
  File "/Library/Python/2.7/site-packages/sklearn/__check_build/__init__.py", line 41, in raise_build_error
    %s""" % (e, local_dir, ''.join(dir_content).strip(), msg))
ImportError: No module named _check_build
___________________________________________________________________________
Contents of /Library/Python/2.7/site-packages/sklearn/__check_build:
__init__.py               __init__.pyc              __pycache__
_check_build.cpython-35m-darwin.so  setup.py
___________________________________________________________________________
It seems that scikit-learn has not been built correctly.

If you have installed scikit-learn from source, please do not forget
to build the package before using it: run `python setup.py install` or
`make` in the source directory.

If you have used an installer, please check that it is suited for your
Python version, your operating system and your platform.
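
One detail in the directory listing stands out: the compiled file is named `_check_build.cpython-35m-darwin.so`, yet it sits under `/Library/Python/2.7/site-packages`. That suggests the package was built for CPython 3.5 while the script ran under Python 2.7, which would explain why `_check_build` cannot be found. The sketch below shows how an interpreter decides whether a compiled file can serve as a given module; `importable_as` is a hypothetical helper, and the code targets Python 3.

```python
import importlib.machinery

def importable_as(module_name, filename):
    """True if `filename` is a name this interpreter would try when importing `module_name`."""
    accepted = {module_name + suffix for suffix in importlib.machinery.EXTENSION_SUFFIXES}
    return filename in accepted

# The suffixes the running interpreter accepts for compiled extension modules.
print(importlib.machinery.EXTENSION_SUFFIXES)

# The file from the traceback only matches a CPython 3.5 build on macOS.
print(importable_as("_check_build", "_check_build.cpython-35m-darwin.so"))
```

On any interpreter other than CPython 3.5 on macOS the second line prints `False`, which is the same mismatch the `ImportError` is reporting.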
Reply
#9
There was a build problem; see here for a workaround.
Reply
#10
(Oct-22-2016, 02:42 AM)Larz60+ Wrote: There was a build problem; see here for a workaround.


Thanks for replying.

I read that, but I'm still unsure what it means or how to work around it. Could you clarify?

Is there no way to "rebuild" scikit-learn in the proper manner?

Thanks.
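
On the "rebuild" question: one common approach, offered here as a sketch rather than a confirmed fix for this machine, is to reinstall the package through `pip` invoked by the very interpreter that will run the script, so the compiled files match it. The `reinstall_command` helper is hypothetical; `--upgrade` and `--force-reinstall` are standard `pip install` options.

```python
import sys

def reinstall_command(package):
    """Build a pip command bound to the interpreter that is running this script."""
    return [sys.executable, "-m", "pip", "install",
            "--upgrade", "--force-reinstall", package]

cmd = reinstall_command("scikit-learn")
print(" ".join(cmd))

# To actually run it (this downloads and installs packages, so it is left commented out):
# import subprocess
# subprocess.check_call(cmd)
```

Launching this with `python3` targets the 3.5 installation, and launching it with `python` targets 2.7, which avoids mixing builds between the two.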
Reply

