Python Forum

Hi guys,

I'm having a bit of trouble understanding the whole bag-of-words concept. I need to make two vectors for a sentiment analysis, each populated with the number of times each word appears in the positive vs. negative movie reviews. Each position in the vectors needs to correspond to the same word, e.g. positive_reviews(22) refers to the same word as negative_reviews(22). Also, both vectors need to cover the entire vocabulary across negative and positive reviews (so it's possible that some positions are zero).

The following bag-of-words explanation makes sense to me (found this online), but I don't know how to apply it in code:

D1 : cat sat mat
D2 : dog hate cat

Vocabulary: cat, dog, hate, mat, sat

You have got 5 words. I sorted them in lexicographical order. This is what I meant by the first line of this answer. Your vocabulary size is 5. So the vector will have 5 dimensions...
-------------------------------------------------
| - cat - | - dog - | - hate - | - mat - | - sat - |
-------------------------------------------------

Let's fit our documents in the transform:-
D1 : [ 1, 0, 0, 1, 1]
D2 : [ 1, 1, 1, 0, 0]

How would I do that with my movie reviews?

I did the following so far:

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

# Clean one raw review: strip HTML, keep letters only, lowercase, drop stop words
def review_cleanup(raw_review):
    review_text = BeautifulSoup(raw_review, "lxml").get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)

#Number of reviews
num_reviews = data["review"].size
print(num_reviews)

#Create positive review and negative review list and clean them with the function review_cleanup
positive_reviews = []
negative_reviews = []

for i in range(num_reviews):
    if data["sentiment"][i] == 1:
        positive_reviews.append(review_cleanup(data["review"][i]))
    elif data["sentiment"][i] == 0:
        negative_reviews.append(review_cleanup(data["review"][i]))

Now I was trying to apply the bag of words but I don't think I'm doing it correctly...

print ("Creating the bag of words...\n")
from sklearn.feature_extraction.text import CountVectorizer

#Create a bag of words

vectorizer = CountVectorizer(analyzer="word", tokenizer=None,
                             preprocessor=None, stop_words=None)

#Transform data to vectors
data_features_positive = vectorizer.fit_transform(positive_reviews)
data_features_positive = data_features_positive.toarray()

data_features_negative = vectorizer.fit_transform(negative_reviews)
data_features_negative = data_features_negative.toarray()

How do I get the vectors to look like my example above?

D1 : [ 1, 0, 0, 1, 1]
D2 : [ 1, 1, 1, 0, 0]

Thank you!!!

from sklearn.feature_extraction.text import CountVectorizer

sentences = ['dont iterate over rows of dataframe',
             'try to use dataframe indexing']

vec = CountVectorizer()
vectors = vec.fit_transform(sentences).toarray()

print(sorted(((v, k) for k,v in vec.vocabulary_.items())))
print(vectors[0])
print(vectors[1])

Output:
[(0, 'dataframe'), (1, 'dont'), (2, 'indexing'), (3, 'iterate'), (4, 'of'), (5, 'over'), (6, 'rows'), (7, 'to'), (8, 'try'), (9, 'use')]
[1 1 0 1 1 1 1 0 0 0]
[1 0 1 0 0 0 0 1 1 1]
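
To get two count vectors whose positions line up, one possible approach is to fit a single CountVectorizer on the positive and negative reviews together, then call transform() on each list separately. Calling fit_transform() once per list learns a separate vocabulary each time, so the columns of the two results would not match. This is only a minimal sketch: the two short lists below are toy stand-ins for the cleaned review lists in the question, and the names positive_counts, negative_counts and vocabulary are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins (assumed data) for the cleaned review lists from the question.
positive_reviews = ["cat sat mat", "cat sat"]
negative_reviews = ["dog hate cat"]

# Fit once on all reviews so both classes share one vocabulary.
vec = CountVectorizer()
vec.fit(positive_reviews + negative_reviews)

# transform() reuses the fitted vocabulary. Summing over the rows
# (one row per review) gives one total-count vector per class.
positive_counts = np.asarray(vec.transform(positive_reviews).sum(axis=0)).ravel()
negative_counts = np.asarray(vec.transform(negative_reviews).sum(axis=0)).ravel()

# Words listed in column order, so vocabulary[i] labels position i in both vectors.
vocabulary = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

print(vocabulary)
print(positive_counts)
print(negative_counts)
```

With the toy data this should print ['cat', 'dog', 'hate', 'mat', 'sat'], then [2 0 0 1 2] and [1 1 1 0 0]: position i refers to the same word in both vectors, and words that never occur in one class simply get a zero there.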

fancy_panther

zivoni