Python Forum
Creating matrix counting words in list of strings
#1
I have a dataframe (over 1m rows) where one of the columns contains a different sentence in each row.

I would like to create a 2-column array where the first column contains every word that appears in any sentence and the second column is a count of the number of times it appears in total.

I've written the following functions (pulling the relevant column in as 'data'), which do work but are very slow if I take in more than about 100,000 rows. Is there a more efficient way to do what I want?

def TextToArray(Q):
    return Q.split()

def CreateWordList(data):
    
    m=len(data)
    
    import numpy as np
    words = np.empty(shape=0,dtype=str)
    temp_string = ''
    for i in range(m):
        temp_string = temp_string + ' ' + data[i]
    words=TextToArray(temp_string)

    words, count = np.unique(words,return_counts=True)
    
    result = np.append(words,counts,axis=1)

    return words
#2
First, you have some extra code that doesn't actually do anything (line 9, for instance, creates an empty container that is overwritten before use). Lines 10 through 13 can be done on a single line using str.join(). Here's a rewritten version:

import numpy as np

def CreateWordList(data):
    words, counts = np.unique(' '.join(data).split(), return_counts=True)
    # Stack the two 1-D arrays as columns; counts are cast to str to match words.
    result = np.column_stack((words, counts.astype(str)))

    return words
The function returns words, but I believe it should be returning result. If that's correct, then we can do this:

import numpy as np

def CreateWordList(data):
    words, counts = np.unique(' '.join(data).split(), return_counts=True)
    # Stack the two 1-D arrays as columns; counts are cast to str to match words.
    return np.column_stack((words, counts.astype(str)))
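For reference, here is a minimal runnable sketch of the NumPy approach on a tiny input. The sample sentences and the name create_word_array are made up for illustration, and np.column_stack is used instead of np.append because np.append with axis=1 raises an error on 1-D arrays:

```python
import numpy as np

def create_word_array(data):
    # Join all sentences into one string, split on whitespace,
    # then count each distinct word (np.unique returns them sorted).
    words, counts = np.unique(' '.join(data).split(), return_counts=True)
    # Cast counts to str so both columns share one dtype.
    return np.column_stack((words, counts.astype(str)))

# Hypothetical sample data standing in for the dataframe column.
sample = ["the cat sat", "the dog sat down"]
result = create_word_array(sample)
print(result)
```

The result is a 2-column array of strings with one row per distinct word, so the counts would need converting back to int before doing arithmetic on them.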
Now, this approach will always struggle at scale because it combines 1M+ strings into one huge string before processing it. You may get better performance with collections.Counter instead of numpy, since each string can be fed to a Counter on its own and still produce the desired result:

import collections

def create_word_list(data):
    count = collections.Counter()
    for words in data:
        count.update(words.split())

    return count
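If the two-column layout from the original question is still wanted, the Counter can be turned into (word, count) rows afterwards. A small sketch, assuming data is any iterable of strings; the sample sentences are made up for illustration:

```python
import collections

def create_word_list(data):
    count = collections.Counter()
    for sentence in data:
        # Update the running totals with the words of one sentence.
        count.update(sentence.split())
    return count

# Hypothetical sample data standing in for the dataframe column.
sample = ["the cat sat", "the dog sat down"]
count = create_word_list(sample)

# most_common() returns a list of (word, count) tuples,
# ordered from the most frequent word to the least frequent.
rows = count.most_common()
print(rows)
```

Each row keeps the count as an int, which avoids the string conversion needed to hold words and counts in a single NumPy array.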
You could also write the script to employ multithreading and update a master counter with each return from create_word_list(). That master count would need a lock added to it for thread safety.
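A rough sketch of that idea, assuming data is a list of strings. The name count_words_threaded and the n_chunks parameter are hypothetical, and because of CPython's global interpreter lock, threads may give little or no speedup for this kind of CPU-bound work, so it is worth timing before relying on it:

```python
import collections
import threading
from concurrent.futures import ThreadPoolExecutor

def create_word_list(data):
    count = collections.Counter()
    for sentence in data:
        count.update(sentence.split())
    return count

def count_words_threaded(data, n_chunks=4):
    # n_chunks is a hypothetical tuning knob, not something from the thread.
    master = collections.Counter()
    lock = threading.Lock()

    def work(chunk):
        partial = create_word_list(chunk)
        # Only one thread at a time may merge into the master counter.
        with lock:
            master.update(partial)

    # Ceiling division so every item lands in some chunk.
    size = max(1, -(-len(data) // n_chunks))
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # list() forces completion and surfaces any worker exception.
        list(pool.map(work, chunks))
    return master

# Hypothetical sample data standing in for the dataframe column.
sample = ["the cat sat", "the dog sat down", "a cat ran"]
totals = count_words_threaded(sample, n_chunks=2)
print(totals)
```

Each worker builds its own partial Counter and takes the lock only for the merge, which keeps the time spent holding the lock short.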
#3
Thank you.

Yes, I did mean to return result. To be honest, this was an edit of the actual code for the purposes of this question and I just overlooked changing this.

Merry Christmas!