Dec-23-2018, 08:08 AM
I have a dataframe (over 1m rows) where one of the columns contains a different sentence in each row.
I would like to create a 2-column array where the first column contains every word that appears in any sentence and the second column is a count of the number of times it appears in total.
I've written the following functions (pulling the relevant column in as 'data'), which do work but are very slow if I take in more than about 100,000 rows. Is there a more efficient way to do what I want?
import numpy as np

def TextToArray(Q):
    # Split a string into a list of whitespace-separated words.
    return Q.split()

def CreateWordList(data):
    m = len(data)
    temp_string = ''
    # Join every sentence into one long string (this is the slow part).
    for i in range(m):
        temp_string = temp_string + ' ' + data[i]
    words = TextToArray(temp_string)
    # Unique words and the number of times each one appears.
    words, counts = np.unique(words, return_counts=True)
    # Pair them up as a 2-column array: word, count (counts are stored as strings).
    result = np.column_stack((words, counts))
    return result
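One likely cause of the slowdown is the loop that keeps growing a single string, since each step may copy everything built so far. A minimal sketch of an alternative, assuming 'data' can be iterated as plain strings (for example a pandas column with no missing values): tally the words sentence by sentence with collections.Counter from the standard library. The names count_words and sample are hypothetical and only used for illustration.

```python
from collections import Counter

def count_words(sentences):
    """Return a list of (word, count) pairs for all whitespace-separated words."""
    counts = Counter()
    for sentence in sentences:
        # Update the tally one sentence at a time; no giant string is built.
        counts.update(sentence.split())
    # most_common() gives (word, count) tuples, highest count first.
    return counts.most_common()

# Hypothetical sample standing in for the real column.
sample = ["the cat sat", "the dog sat down", "a cat"]
word_counts = count_words(sample)
```

If a 2-column NumPy array is still needed, np.array(word_counts) should convert the result, though the counts then become strings, as they do with a stacked array of words and counts.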