Python Forum
NLTK create corpora
#1
Hi guys,

A pretty straightforward and most likely easy question for you:

I'm trying to create and use my own corpus, saved as .txt files; however, the files are not being found.

The two files are located at:

/jordanxxx/nltk_data/corpora/short_reviews/neg/neg.txt
/jordanxxx/nltk_data/corpora/short_reviews/pos/pos.txt

import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle

from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

from nltk.classify import ClassifierI
from statistics import mode

from nltk import word_tokenize

class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

short_pos = open("short_reviews/pos.txt", "r").read()
short_neg = open("short_reviews/neg.txt", "r").read()

documents = []

for r in short_pos.split('\n'):
    documents.append((r, "pos"))

for r in short_neg.split('\n'):
    documents.append((r, "neg"))

all_words = []

short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w.lower())

for w in short_neg_words:
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
Error:
Traceback (most recent call last):
  File "/Users/jordanXXX/Documents/NLP/bettertrainingdata", line 37, in <module>
    short_pos = open("short_reviews/pos.txt", "r").read
IOError: [Errno 2] No such file or directory: 'short_reviews/pos.txt'
I have already tried:

f=open('neg.txt', 'rU')
Error:
>>> f=open('neg.txt','rU')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'neg.txt'
I'm not really looking to add a lot of path-manipulation code unless I have to.



Any input would be great, as I'd really like to use my own bodies of text in the future, ideally with something as simple as converting them to .txt files and copy+pasting them into the appropriate spot.




EDIT: I am using Homebrew, if that is of any significance.
#2
You could look at the downloader.py file (source available here)
There are probably some hooks that you have to set within nltk itself so it knows about your corpus.
#3
(Oct-26-2016, 03:31 AM)Larz60+ Wrote: You could look at the downloader.py file (source available here) There are probably some hooks that you have to set within nltk itself so it knows about your corpus.

thanks,

so do you think something like:


1.9   Loading your own Corpus

If you have your own collection of text files that you would like to access using the above methods, you can easily load them with the help of NLTK's PlaintextCorpusReader. Check the location of your files on your file system; in the following example, we have taken this to be the directory /usr/share/dict. Whatever the location, set this to be the value of corpus_root [1]. The second parameter of the PlaintextCorpusReader initializer [2] can be a list of fileids, like ['a.txt', 'test/b.txt'], or a pattern that matches all fileids, like '[abc]/.*\.txt' (see 3.4 for information about regular expressions).


>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict'  # [1]
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')  # [2]
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
would work, and if so how would I write that?

I can post my attempt with traceback if needed
#4
That looks like what you need. I did this a couple of years ago,
and not since. I'm afraid you're going to have to dig into the book (also available on github)
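For the pos/neg layout in post #1, NLTK's CategorizedPlaintextCorpusReader (the categorized cousin of the reader quoted above) may be the shortest route. A rough sketch, not tested against your exact tree — the tiny temp-directory corpus here stands in for /jordanxxx/nltk_data/corpora/short_reviews so the example is self-contained:

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Stand-in for /jordanxxx/nltk_data/corpora/short_reviews: build a tiny
# pos/neg tree in a temp dir so the sketch runs anywhere.
corpus_root = tempfile.mkdtemp()
for cat, line in [("pos", "a fine film"), ("neg", "a dull film")]:
    os.makedirs(os.path.join(corpus_root, cat))
    with open(os.path.join(corpus_root, cat, cat + ".txt"), "w") as f:
        f.write(line + "\n")

reader = CategorizedPlaintextCorpusReader(
    corpus_root,
    r'.*\.txt',                   # fileid pattern: every .txt under root
    cat_pattern=r'(pos|neg)/.*',  # category = leading directory name
)

print(sorted(reader.categories()))        # ['neg', 'pos']
print(list(reader.words(categories="pos")))
```

Pointing corpus_root at the real short_reviews directory should then give you labelled words per category without any manual file handling.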
#5
(Oct-26-2016, 06:37 AM)Larz60+ Wrote: That looks like what you need. I did this a couple of years ago,
and not since. I'm afraid you're going to have to dig into the book (also available on github)

Thanks for pointing me in the right direction, Larz; sorry for the delayed gratitude. I'm working on a successful version to post back here for others once I've read through the book you've kindly provided.
#6
Quote:I've read through the book you've kindly provided

Correction - the book link I provided


