Efficient method to find phrase in string

Tuxedo · Feb-22-2021, 10:13 PM

I'm creating an application to automatically categorize personal financial transactions based on the description (I know some institutions do that already but not any of the ones I use). My thought on categorization is to use a preassigned list of key phrases and categories. Let's say there are n=6000 transactions and p=300 key phrase / category combinations. I don't want to have a nested loop that performs n*p phrase-in-phrase operations.

My current thinking using SQL terminology is to have a single loop of size p containing a query using a WHERE DESCRIPTION LIKE phrase. That way I only have a single loop that is only as big as the size of my key phrase list. Pandas probably has a similar capability.

Q1: Is there a more efficient approach than what I describe?

This morning I started creating the key phrase list. My goal is to be able to automatically categorize 90% of the transactions. After an hour I was only through the D's. This is turning out to be a far more difficult problem than I first thought. Example: AMERICAN could be Travel as in airlines or payment to AMERICAN EXPRESS credit card. For each phrase I'm contemplating I perform a sample search to see how good it captures the desired transactions, then adjust based on the result. This is turning out to be monumental task.

Q2: Any thoughts on how to create a good key phrase list?

**Gribouillis** · (This post was last modified: Feb-23-2021, 10:16 AM by Gribouillis.)

My approach would be to tokenize things. For each key phrase, create the set of words that it contains. Then create a tree based on these sets of words. For example if 70 of the 300 key phrases contain the word 'FOO', then a transaction containing that word need only search a key phrase among these 70. A transaction not containing that word need only search a key phrase among the 230 remaining key phrases. Thus you can build a classifying tree for the key phrases prior to examining the transactions. This tree can be compiled once for all.

By looping over the transactions, you first create the set of words contained in the transaction and you find its position in the tree of key phrases. Then you have only a small set of key phrases to check for each transaction.

Tuxedo · Feb-23-2021, 06:24 PM

Hmmm, interesting idea. If I understand correctly, using my simple example of American, the tree node might have two branches, Air or Express. Or LUMBER might have two branches, LUMBER and PLUMBER. I can see that the hard work is coming up with good choices for the tokens and tree structure. If dealing with a large number of tokens, it might be worthwhile to implement the structure as a hash table. Much contemplating to do. Thanks!

**Gribouillis** · (This post was last modified: Feb-23-2021, 07:57 PM by Gribouillis.)

PLUMBER is not a branch of LUMBER because I take only full words. This is necessary to keep good performance in the tokenization of every transaction.

Also it occurs to me that another issue in this algorithm comes from the words that transactions may contain and that are not part of the key phrase. These words could entail mis-routings in the tree.

Tuxedo · Feb-24-2021, 07:03 PM

Ah, I think I understand now. Your suggestion is to tokenize the words in the description field to get a handle on how big the keyword/phrase list needs to be.

**Gribouillis** · Feb-24-2021, 07:55 PM

Without actual data, it is difficult to go farther.

Tuxedo · Feb-25-2021, 07:23 PM

I actually have 6k transactions over a 5 year period. What I did was clean up each line of text then tokenize. I wrote the cleaned up lines to a csv file as well as the tokenized list (1700 unique items). Between the two I think I'll be able to come up with a good keyword/phrase list. Thanks for your help!

# clean strings of unwanted characters, then sort and write to csv
fname = 'Description'
df[fname] = [x.replace('&','') for x in df[fname]] # make sure AT&T ==> ATT
df[fname] = [sub('[^A-Z9]+', ' ', x) for x in df[fname]]
df[fname] = df[fname].str.strip()
df['length'] = df.Description.str.len()
df = df[df.length > 1]
df.sort_values(by=[fname], inplace=True)
df.to_csv('Data\\cleaned.csv', index=False)
print(df)

# tokenize the strings
df['Description'] = df['Description'].apply(word_tokenize)
print(df)


# combine all tokens into a single list
big_list = []
for x in df['Description']:
    big_list.extend(x)


list2 = set(big_list)
print(len(list2))
df = pd.DataFrame(list2,columns=['KeyWords'])
df.sort_values(by=['KeyWords'], inplace=True)

df['length'] = df.KeyWords.str.len()
df = df[df.length > 1]

df.to_csv('Data\\cleaned2.csv', index=False)

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Regex find string then return 10 character after it	pyStund	6	1,572	Aug-04-2022, 11:26 PM Last Post: Pedroski55
	Architecting Efficient Plot	blipton	0	1,321	Jan-03-2021, 07:44 PM Last Post: blipton
	material for OOP and efficient numrical programming	paul18fr	0	2,087	Sep-11-2019, 08:36 PM Last Post: paul18fr

Efficient method to find phrase in string

User Panel Messages

Announcements