Python Forum

Full Version: Word co-occurrence matrix for a string (NLP)
I need to create a word co-occurrence matrix that shows, for a given corpus, how many times each word in the vocabulary immediately precedes every other word in the vocabulary.

The input sentence can be tokenized or not. The method has to scale to a sentence that is millions of words long, so it must be efficient.

test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']
I would want this to give an output of:

Output:
[[0. 2. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 2.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]]
For example, the 2 in (row 1, col 2) shows that 'i' follows 'hello' twice.

How can I implement something like this using sklearn?
Take a look at NLTK: https://www.nltk.org/
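If you'd rather see the idea without any library, here is a minimal sketch using only the standard library. It counts each adjacent (previous word, next word) pair in one pass and then fills a matrix from those counts. The function name bigram_matrix is made up for illustration, and ordering the vocabulary by first appearance is an assumption; the question does not say which row/column order it expects, so the layout may differ from the matrix posted above.

```python
from collections import Counter
from itertools import islice


def bigram_matrix(tokens):
    """Return (vocab, matrix), where matrix[r][c] counts how often
    vocab[c] immediately follows vocab[r] in tokens."""
    # Assumption: vocabulary order is the order of first appearance.
    vocab = list(dict.fromkeys(tokens))
    index = {word: i for i, word in enumerate(vocab)}

    # Count every adjacent (previous, next) pair in a single pass.
    # islice avoids copying the token list, which matters for long inputs.
    pair_counts = Counter(zip(tokens, islice(tokens, 1, None)))

    # Fill a dense matrix from the pair counts.
    matrix = [[0] * len(vocab) for _ in vocab]
    for (prev_word, next_word), count in pair_counts.items():
        matrix[index[prev_word]][index[next_word]] = count
    return vocab, matrix


test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']
vocab, matrix = bigram_matrix(test_sent)
print(vocab)
for row in matrix:
    print(row)
```

With this ordering, matrix[0][1] is 2, i.e. 'i' follows 'hello' twice. The counting step is linear in the number of tokens. The dense matrix grows with the square of the vocabulary size, so for a large vocabulary it is probably better to keep the pair counts themselves, which already hold the same information sparsely.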