Python Forum

Full Version: Statistics: Two histograms based on word frequency vectors
Hi guys!

I am stuck on a statistics question. I have two vectors, vector_negative and vector_positive, populated with the number of times certain words appear in movie reviews. The wordlist consists of about 17,000 words, and each position in the vectors refers to the same word in the wordlist. For example, wordlist[22] = 'cat', vector_negative[22] = 12, vector_positive[22] = 3 would mean that the word 'cat' appears 12 times in my negative and 3 times in my positive movie reviews.

So far so good... I need to make two histograms based on vector_negative and vector_positive and then implement a statistical method that tests whether these histograms are statistically significant.

I have no idea how to go about that. I'm not very familiar with histograms and how they work.

Many thanks!
Neither are we. This is a Python forum, not a statistics forum.

Now, in your histograms, what would be the values, and what would be the variable?
You can plot a histogram with matplotlib.pyplot.hist. Since a histogram shows the distribution of numerical data, you will need to convert your vector by replacing each count with repeated values (instead of v[22]=3 you need 22,22,22); numpy.repeat can do that. If you want "comparable" histograms, use the same binning (the bins parameter) for both the negative and the positive vector.

Example histogram:
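import matplotlib.pyplot as plt
from random import random

# 50 random floats in [0, 1) as stand-in data
data = [random() for _ in range(50)]
# default binning (10 bins); edgecolor outlines each bar
plt.hist(data, edgecolor='black')
plt.show()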
[Image: k1Tf0Ra.png]
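For the numpy.repeat conversion and the shared binning mentioned above, here is a rough sketch. The counts are made up and the vectors are shortened to six words; the real ones would have about 17,000 entries. The choice of three bins is also just an assumption for illustration. Whether the word index is a meaningful x-axis at all is a separate question.

```python
import numpy as np

# Toy stand-ins for the real ~17,000-entry vectors (made-up counts).
vector_negative = np.array([12, 0, 5, 1, 7, 3])
vector_positive = np.array([3, 4, 0, 9, 2, 6])

# Expand counts into repeated word indices: a count of 5 at index 2
# becomes 2, 2, 2, 2, 2, which is the raw-data form a histogram expects.
indices = np.arange(len(vector_negative))
expanded_negative = np.repeat(indices, vector_negative)
expanded_positive = np.repeat(indices, vector_positive)

# One set of bin edges shared by both histograms (here 3 bins of 2 words each).
edges = np.linspace(0, len(indices), num=4)

# np.histogram returns (counts per bin, bin edges).
binned_negative, _ = np.histogram(expanded_negative, bins=edges)
binned_positive, _ = np.histogram(expanded_positive, bins=edges)

print(binned_negative)
print(binned_positive)
```

Passing the same edges as bins=edges to plt.hist for both expanded arrays should draw the two histograms on identical bins.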
"Testing significance of histograms" does not make much sense, maybe it means some test used on binned data?