Dec-23-2018, 08:08 AM
I have a dataframe (over 1m rows) where one of the columns contains a different sentence in each row.
I would like to create a 2-column array where the first column contains every word that appears in any sentence and the second column is a count of the number of times it appears in total.
I've written the following functions (pulling the relevant column in as 'data'), which do work but are very slow if I take in more than about 100,000 rows. Is there a more efficient way to do what I want?
import numpy as np

def TextToArray(Q):
    # Split a string into a list of words on whitespace
    return Q.split()

def CreateWordList(data):
    m = len(data)
    temp_string = ''
    # Join every sentence into one long string (this is the slow part)
    for i in range(m):
        temp_string = temp_string + ' ' + data[i]
    words = TextToArray(temp_string)
    words, counts = np.unique(words, return_counts=True)
    # Pair each unique word with its count as a 2-column array
    result = np.column_stack((words, counts))
    return result
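
For comparison, here is a minimal sketch of one alternative that is likely to scale better. It assumes the slowdown comes mainly from rebuilding one ever-growing string inside the loop, so it counts words sentence by sentence with collections.Counter from the standard library instead. The names count_words and to_rows are made up for this illustration, and words are still split on whitespace only, as in the code above.

```python
from collections import Counter


def count_words(sentences):
    """Count whitespace-separated words across an iterable of strings."""
    counter = Counter()
    for sentence in sentences:
        # update() adds one to the tally of every word in this sentence
        counter.update(sentence.split())
    return counter


def to_rows(counter):
    """Return (word, count) pairs sorted alphabetically by word."""
    return sorted(counter.items())


if __name__ == "__main__":
    sample = ["the cat sat", "the dog sat down"]
    for word, count in to_rows(count_words(sample)):
        print(word, count)
```

Any iterable of strings should work as input, so a dataframe column could be passed in directly, and the resulting pairs could be handed to np.array or a DataFrame constructor if a 2-column array is still needed. I have not timed this on the full 1m rows.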