Python Forum

Creating matrix counting words in list of strings
I have a dataframe (over 1m rows) where one of the columns contains a different sentence in each row.

I would like to create a 2-column array where the first column contains every word that appears in any sentence and the second column is a count of the number of times it appears in total.

I've written the following functions (pulling the relevant column in as 'data'), which do work but are very slow if I take in more than about 100,000 rows. Is there a more efficient way to do what I want?

def TextToArray(Q):
    return Q.split()

def CreateWordList(data):
    
    m=len(data)
    
    import numpy as np
    words = np.empty(shape=0,dtype=str)
    temp_string = ''
    for i in range(m):
        temp_string = temp_string + ' ' + data[i]
    words=TextToArray(temp_string)

    words, counts = np.unique(words,return_counts=True)
    
    result = np.column_stack((words,counts.astype(str)))

    return words
First, you have some extra code that doesn't actually do anything (line 9, for instance, creates an empty array that is overwritten before it is ever used). Lines 10 through 13 can be done on a single line using str.join(). Here's a rewritten version:

import numpy as np

def CreateWordList(data):
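    # Join all sentences, split into words, then count each unique word
    words, counts = np.unique(' '.join(data).split(),return_counts=True)
    # Pair each word with its count (as text) in a two-column array
    result = np.column_stack((words,counts.astype(str)))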

    return words
The function returns words, but I believe it should be returning result. If that's correct, then we can do this:

import numpy as np

def CreateWordList(data):
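    # Join all sentences, split into words, then count each unique word
    words, counts = np.unique(' '.join(data).split(),return_counts=True)
    # Pair each word with its count (as text) in a two-column array
    return np.column_stack((words,counts.astype(str)))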
Now, this approach will always struggle at scale because it joins 1M+ strings into one huge string and then splits and sorts the whole thing at once. You may get better performance with collections.Counter instead of numpy, since each string can be fed through the Counter on its own and we still get the desired counts:

import collections

def create_word_list(data):
    count = collections.Counter()
    for words in data:
        count.update(words.split())

    return count
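Since the original question asked for a two-column result, here is a rough sketch of flattening the Counter into (word, count) rows. It assumes data is any iterable of strings; the sample sentences and the helper name counter_to_rows are made up for illustration:

```python
import collections

def create_word_list(data):
    # Tally words one sentence at a time, so no giant joined string is built.
    count = collections.Counter()
    for words in data:
        count.update(words.split())
    return count

def counter_to_rows(count):
    # Hypothetical helper: list of (word, count) pairs, most frequent first.
    return count.most_common()

# Made-up sample data standing in for the dataframe column.
sample = ["the cat sat", "the dog sat", "the end"]
rows = counter_to_rows(create_word_list(sample))
print(rows[0])  # ('the', 3)
```

If a NumPy array is really needed, np.array(rows, dtype=object) should turn those rows into a two-column array while keeping the counts as integers.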
You could also write the script to employ multithreading and update a master counter with each return from create_word_list(). The master counter would need a lock around its updates for thread safety.
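A minimal sketch of that threaded idea, assuming data is a list (or anything that supports len() and slicing). The wrapper name count_words_threaded and the n_chunks parameter are hypothetical, not part of the code above:

```python
import collections
import threading
from concurrent.futures import ThreadPoolExecutor

def create_word_list(data):
    count = collections.Counter()
    for words in data:
        count.update(words.split())
    return count

def count_words_threaded(data, n_chunks=4):
    # Hypothetical wrapper: count each chunk in a worker thread and merge
    # the partial results into one master counter guarded by a lock.
    master = collections.Counter()
    lock = threading.Lock()
    chunk_size = max(1, -(-len(data) // n_chunks))  # ceiling division

    def worker(chunk):
        partial = create_word_list(chunk)
        with lock:  # only one thread updates the master counter at a time
            master.update(partial)

    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        list(pool.map(worker, chunks))  # consume results to surface exceptions
    return master

# Made-up sample data; the threaded result should match the plain version.
sample = ["the cat sat", "the dog sat", "the end"]
print(count_words_threaded(sample, n_chunks=2) == create_word_list(sample))
```

In standard CPython the global interpreter lock usually keeps threads from speeding up CPU-bound work like this, so it is worth timing against the single-threaded version; a process pool that returns partial Counters to be summed may be the better fit.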
Thank you.

Yes, I did mean to return result. To be honest, this was an edit of the actual code for the purposes of this question and I just overlooked changing this.

Merry Christmas!