Python Forum
Word co-occurrence matrix for a string (NLP)



Word co-occurrence matrix for a string (NLP) - JoeB - Feb-27-2018

I need to create a word co-occurrence matrix that shows how many times each word in a vocabulary immediately precedes each other word in the vocabulary, for a given corpus.

The input sentence can be tokenized or not. The method has to scale to input that is millions of words long, so it must be efficient.

test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']
I would want this to give an output of:

Output:
[[0. 2. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 2.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]]
For example, the 2 in (row1, col2) shows that 'i' follows 'hello' twice.

How can I implement something like this using sklearn?


RE: Word co-occurrence matrix for a string (NLP) - Larz60+ - Feb-27-2018

Take a look at NLTK: https://www.nltk.org/
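
If you'd rather see the counting itself before reaching for a library, here is a minimal plain-Python sketch of the "word A is immediately followed by word B" matrix described in the question. It does not use NLTK or sklearn. The function name bigram_matrix and the choice to index the vocabulary by first appearance are my own assumptions, so the row/column order may not match the matrix posted above.

```python
from collections import Counter

def bigram_matrix(tokens):
    """Count how often each word is immediately followed by each other word."""
    # Index the vocabulary in order of first appearance (an arbitrary choice).
    vocab = list(dict.fromkeys(tokens))
    index = {word: i for i, word in enumerate(vocab)}

    # Count ordered adjacent pairs in a single pass over the tokens.
    pair_counts = Counter(zip(tokens, tokens[1:]))

    # Dense matrix for display only; with a large vocabulary it is
    # probably better to keep pair_counts and skip this step.
    matrix = [[0] * len(vocab) for _ in vocab]
    for (first, second), count in pair_counts.items():
        matrix[index[first]][index[second]] = count
    return vocab, matrix

test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']
vocab, matrix = bigram_matrix(test_sent)
print(vocab)
for row in matrix:
    print(row)
```

With test_sent, the row for 'hello' has a 2 in the column for 'i'. For input that is millions of words long, the dense list of lists is likely the bottleneck rather than the counting, so one option is to load the pair counts into a sparse matrix (scipy.sparse, for example) instead.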


RE: Word co-occurrence matrix for a string (NLP) - Larz60+ - Feb-27-2018

Here's something that might help: https://stackoverflow.com/questions/37331708/nltk-find-occurrences-of-a-word-within-5-words-left-right-of-context-words-in
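
That question is about words occurring within 5 words to the left or right of a context word. A rough sketch of that idea in plain Python is below. This is not the code from the linked answer; the function name window_counts and its parameters are hypothetical, chosen only for illustration.

```python
from collections import Counter

def window_counts(tokens, target, window=5):
    """Count words appearing within `window` positions of each `target` occurrence."""
    counts = Counter()
    for pos, word in enumerate(tokens):
        if word != target:
            continue
        start = max(0, pos - window)
        # Neighbours on the left and right, excluding the target position itself.
        neighbours = tokens[start:pos] + tokens[pos + 1:pos + 1 + window]
        counts.update(neighbours)
    return counts

tokens = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']
near_dont = window_counts(tokens, 'dont', window=2)
print(near_dont)
```

Unlike the matrix in the original post, this counts neighbours on both sides, so it ignores word order. Restricting the slice to the right-hand side with window=1 would bring it back to the "immediately follows" case.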