Feb-27-2018, 07:15 PM
I need to create a word co-occurrence matrix that shows how many times one word in a vocabulary precedes all other words in the vocabulary for a given corpus.
The input sentence can be tokenized or not. The method has to be scalable to a sentence that is millions of words long, so it must be efficient.
How can I implement something like this using sklearn?
test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']
I would want this to give an output of:
Output:
[[0. 2. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 2.]
[0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0.]
[1. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0.]]
For example, the 2 in (row1, col2) shows that 'i' follows 'hello' twice.
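To make the desired computation concrete, here is a brute-force sketch of what I mean, counting bigrams with a plain dict and NumPy (not the scalable sklearn solution I'm asking for). The vocabulary ordering here is first-appearance order, which is arbitrary, so the rows and columns may come out permuted relative to the matrix above:

```python
import numpy as np

def cooccurrence(tokens):
    # Build vocabulary in order of first appearance (ordering is arbitrary)
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    n = len(vocab)
    mat = np.zeros((n, n))
    # Count how often each word is immediately followed by each other word
    for prev, cur in zip(tokens, tokens[1:]):
        mat[vocab[prev], vocab[cur]] += 1
    return mat, vocab

test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']
mat, vocab = cooccurrence(test_sent)
print(mat)
```

This loops over every bigram in Python, which is too slow for a sentence millions of words long; presumably a vectorized or sparse approach (e.g. scipy sparse matrices, or sklearn's CountVectorizer over bigrams) is what's needed instead.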