Python Forum

I need to create a word co-occurrence matrix that shows, for a given corpus, how many times each word in the vocabulary immediately precedes every other word in the vocabulary.

The input sentence can be tokenized or not. The method has to scale to a sentence that is millions of words long, so it must be efficient.

test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']

I would want this to give an output of:

Output:
[[0. 2. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 2.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]]

For example, the 2 in (row1, col2) shows that 'i' follows 'hello' twice.

How can I implement something like this using sklearn?
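As a starting point, here is a minimal sketch of the counting step using only the Python standard library rather than sklearn. The function name cooccurrence_matrix and the choice to index the vocabulary in order of first appearance are assumptions made for illustration, so the row and column order need not match the matrix shown above.

```python
from collections import Counter
from itertools import islice


def cooccurrence_matrix(tokens):
    """Count how often each word is immediately followed by each other word.

    Hypothetical helper: vocabulary order is order of first appearance.
    """
    # Map each distinct word to a row/column index.
    vocab = {}
    for word in tokens:
        vocab.setdefault(word, len(vocab))

    # Count adjacent (previous, next) pairs in one pass over the tokens.
    pair_counts = Counter(zip(tokens, islice(tokens, 1, None)))

    # Fill a dense matrix: rows are the preceding word, columns the following word.
    size = len(vocab)
    matrix = [[0] * size for _ in range(size)]
    for (prev_word, next_word), count in pair_counts.items():
        matrix[vocab[prev_word]][vocab[next_word]] = count
    return list(vocab), matrix


test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']
words, matrix = cooccurrence_matrix(test_sent)
print(words)
for row in matrix:
    print(row)
```

The counting itself is a single pass over the input. The dense matrix grows with the square of the vocabulary size, so for a large vocabulary it may be preferable to keep only pair_counts, which stores just the pairs that actually occur.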

Take a look at NLTK: https://www.nltk.org/
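For instance, a small sketch of how NLTK could count the adjacent pairs, assuming the input is already tokenized as in test_sent. It uses nltk.bigrams and nltk.ConditionalFreqDist and produces a nested count lookup rather than the exact matrix layout from the question.

```python
import nltk

test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']

# nltk.bigrams yields adjacent (previous, next) pairs from a token sequence.
# ConditionalFreqDist groups the counts by the first word of each pair.
cfd = nltk.ConditionalFreqDist(nltk.bigrams(test_sent))

# cfd[previous][next] is the number of times `next` directly follows `previous`.
print(cfd['hello']['i'])
print(cfd['i']['dont'])
print(cfd['i']['am'])
```

Pairs that never occur simply report a count of zero, so the structure stays sparse even when the vocabulary is large.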

Here's something that might help: https://stackoverflow.com/questions/3733...t-words-in

JoeB

Larz60+
