Mar-23-2017, 10:32 AM
Hi guys,
I'm having a bit of trouble understanding the whole bag-of-words concept. I need to make two vectors for a sentiment analysis, each populated with the number of times each word appears in the positive vs. negative movie reviews. Each position in the two vectors needs to correspond to the same word, e.g. positive_reviews(22) refers to the same word as negative_reviews(22). Also, both vectors need to cover the entire vocabulary across the negative and positive reviews (i.e. it's possible that some positions are zero).
The following bag-of-words explanation makes sense to me (found this online), but I don't know how to apply it code-wise:
D1 : cat sat mat
D2 : dog hate cat
Vocabulary: cat, dog, hate, mat, sat
You have got 5 words. I sorted them in lexicographical order. This is what I meant by the first line of this answer. Your vocabulary size is 5. So the vector will have 5 dimensions...
-------------------------------------------------
| - cat - | - dog - | - hate - | - mat - | - sat - |
-------------------------------------------------
Let's fit our documents to this vocabulary:
D1 : [ 1, 0, 0, 1, 1]
D2 : [ 1, 1, 1, 0, 0]
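In plain Python, the toy example above might look roughly like this. It's only a minimal sketch: it assumes words are separated by whitespace, and `to_vector` is just an illustrative helper name, not something from a library.

```python
# Toy documents from the explanation above.
documents = {"D1": "cat sat mat", "D2": "dog hate cat"}

# Shared vocabulary: every distinct word across all documents, sorted.
vocabulary = sorted({word for text in documents.values() for word in text.split()})


def to_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in text (hypothetical helper)."""
    words = text.split()
    return [words.count(term) for term in vocabulary]


# One vector per document; position j always refers to vocabulary[j].
vectors = {name: to_vector(text, vocabulary) for name, text in documents.items()}

print(vocabulary)
for name, vector in vectors.items():
    print(name, vector)
```

Because both vectors are built from the same sorted vocabulary, a given position means the same word in D1 and D2, and a word that is missing from a document simply gets a 0.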
How would I do that with my movie reviews?
I did the following so far:
```python
import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords

# Build the stop-word set once instead of on every call
stops = set(stopwords.words("english"))

# Create function to clean all reviews
def review_cleanup(raw_review):
    # Strip HTML markup
    review_text = BeautifulSoup(raw_review, "lxml").get_text()
    # Keep letters only
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    words = letters_only.lower().split()
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)

# Number of reviews
num_reviews = data["review"].size
print(num_reviews)

# Create positive and negative review lists and clean them with review_cleanup
positive_reviews = []
negative_reviews = []
for i in range(num_reviews):
    if data["sentiment"][i] == 1:
        positive_reviews.append(review_cleanup(data["review"][i]))
    elif data["sentiment"][i] == 0:
        negative_reviews.append(review_cleanup(data["review"][i]))
```

Now I was trying to apply the bag of words, but I don't think I'm doing it correctly...
```python
from sklearn.feature_extraction.text import CountVectorizer

print("Creating the bag of words...\n")

# Create a bag of words
vectorizer = CountVectorizer(analyzer="word", tokenizer=None,
                             preprocessor=None, stop_words=None)

# Transform data to vectors
data_features_positive = vectorizer.fit_transform(positive_reviews)
data_features_positive = data_features_positive.toarray()
data_features_negative = vectorizer.fit_transform(negative_reviews)
data_features_negative = data_features_negative.toarray()
```

How do I get the vectors to look like my example above?

D1 : [ 1, 0, 0, 1, 1]
D2 : [ 1, 1, 1, 0, 0]
Thank you!!!
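One thing that stands out in the snippet above: each call to `fit_transform` learns a fresh vocabulary, so after the second call the columns of the positive and negative arrays no longer refer to the same words. A possible way around that, sketched below, is to fit the vectorizer once on both lists together and then only `transform` each list. The review strings here are made-up stand-ins for the cleaned reviews, and summing over rows assumes that one total count per word per class is what's wanted.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for the cleaned review lists (hypothetical data).
positive_reviews = ["great movie great cast", "loved plot"]
negative_reviews = ["boring movie", "hated plot hated cast"]

# Fit once on both classes so they share a single vocabulary.
vectorizer = CountVectorizer(analyzer="word")
vectorizer.fit(positive_reviews + negative_reviews)

# transform() reuses the fitted vocabulary, so column j is the same word
# in both matrices.
pos_matrix = vectorizer.transform(positive_reviews)
neg_matrix = vectorizer.transform(negative_reviews)

# Sum over reviews (rows) to get one count per word for each class.
positive_counts = np.asarray(pos_matrix.sum(axis=0)).ravel()
negative_counts = np.asarray(neg_matrix.sum(axis=0)).ravel()

# vocabulary_ maps each word to its column index; order the words by index.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

for word, pos, neg in zip(vocabulary, positive_counts, negative_counts):
    print(word, pos, neg)
```

With this setup `positive_counts[j]` and `negative_counts[j]` both describe `vocabulary[j]`, and a word that only shows up in one class gets a 0 in the other. Calling `.toarray()` on `pos_matrix` or `neg_matrix` instead of summing would give one row per review, like D1 and D2 in the example.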