Python Forum
NLTK create corpora
#1
Hi guys,

A pretty straightforward and most likely easy question for you:

I'm trying to create and use my own corpus, saved as .txt files; however, the files are not being found.

The two files are located at:

/jordanxxx/nltk_data/corpora/short_reviews/neg/neg.txt
/jordanxxx/nltk_data/corpora/short_reviews/pos/pos.txt

import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle

from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

from nltk.classify import ClassifierI
from statistics import mode

from nltk import word_tokenize

class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

short_pos = open("short_reviews/pos.txt", "r").read()
short_neg = open("short_reviews/neg.txt", "r").read()

documents = []

for r in short_pos.split('\n'):
    documents.append((r, "pos"))

for r in short_neg.split('\n'):
    documents.append((r, "neg"))

all_words = []

short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w.lower())

for w in short_neg_words:
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
Error:
Traceback (most recent call last):
  File "/Users/jordanXXX/Documents/NLP/bettertrainingdata", line 37, in <module>
    short_pos = open("short_reviews/pos.txt", "r").read
IOError: [Errno 2] No such file or directory: 'short_reviews/pos.txt'
I have already tried:

f=open('neg.txt', 'rU')
Error:
>>> f=open('neg.txt','rU')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'neg.txt'
I'm not really looking to add a lot of path-manipulation code unless I have to.



Any input would be great, as I'd really like to use my own bodies of text in the future, ideally with something as simple as converting them to .txt files and copy+pasting them into the appropriate spot.




EDIT: I am using Homebrew, if that is of any significance.
#2
You could look at the downloader.py file (source available here)
There are probably some hooks that you have to set within nltk itself so it knows about your corpus.
#3
(Oct-26-2016, 03:31 AM)Larz60+ Wrote: You could look at the downloader.py file (source available here) There are probably some hooks that you have to set within nltk itself so it knows about your corpus.

thanks,

so do you think something like:


1.9   Loading your own Corpus

If you have your own collection of text files that you would like to access using the above methods, you can easily load them with the help of NLTK's PlaintextCorpusReader. Check the location of your files on your file system; in the following example, we have taken this to be the directory /usr/share/dict. Whatever the location, set this to be the value of corpus_root [1]. The second parameter of the PlaintextCorpusReader initializer [2] can be a list of fileids, like ['a.txt', 'test/b.txt'], or a pattern that matches all fileids, like '[abc]/.*\.txt' (see 3.4 for information about regular expressions).


>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict'  # [1]
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')  # [2]
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
would work, and if so how would I write that?

I can post my attempt with traceback if needed
#4
That looks like what you need. I did this a couple of years ago,
and not since. I'm afraid you're going to have to dig into the book (also available on github)
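For the pos/neg layout in post #1, NLTK's CategorizedPlaintextCorpusReader (the categorized cousin of the reader quoted above) may be the shortest route. A rough sketch, not tested against your exact tree — the tiny temp-directory corpus here stands in for /jordanxxx/nltk_data/corpora/short_reviews so the example is self-contained:

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Stand-in for /jordanxxx/nltk_data/corpora/short_reviews: build a tiny
# pos/neg tree in a temp dir so the sketch runs anywhere.
corpus_root = tempfile.mkdtemp()
for cat, line in [("pos", "a fine film"), ("neg", "a dull film")]:
    os.makedirs(os.path.join(corpus_root, cat))
    with open(os.path.join(corpus_root, cat, cat + ".txt"), "w") as f:
        f.write(line + "\n")

reader = CategorizedPlaintextCorpusReader(
    corpus_root,
    r'.*\.txt',                   # fileid pattern: every .txt under root
    cat_pattern=r'(pos|neg)/.*',  # category = leading directory name
)

print(sorted(reader.categories()))        # ['neg', 'pos']
print(list(reader.words(categories="pos")))
```

Pointing corpus_root at the real short_reviews directory should then give you labelled words per category without any manual file handling.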
#5
(Oct-26-2016, 06:37 AM)Larz60+ Wrote: That looks like what you need. I did this a couple of years ago,
and not since. I'm afraid you're going to have to dig into the book (also available on github)

Thanks for pointing me in the right direction, Larz; sorry for the delayed gratitude. I'm working on a successful version to post back here for others once I've read through the book you've kindly provided.
#6
Quote:I've read through the book you've kindly provided

Correction - the book link I provided


