Python Forum

Full Version: Determining string intersection (embeddings) between two narrative (text) fields
Hello. I am trying to solve a text analytics problem.

I have two data sets (DS1 and DS2). DS1 contains a narrative text field (Description_S), and DS2 contains a narrative text field (Description_C). The two fields come from completely different systems and should NEVER have common content. However, it has been discovered that content from Description_S is being either typed or copied and pasted into the Description_C field, which is a major issue.

So, I am trying to use Python to determine whether there are common strings (an intersection) between the two fields. Both narrative fields can get quite lengthy; Description_S is actually of type CLOB in Teradata. Any ideas on how to solve this issue? I had looked at the set intersection method in Python, but wanted to get advice from the Forum first.

Thank you in advance.
Have you looked at difflib? I think it would give better answers than sets, but it would be a lot slower. If you decided to go with set intersections, I would look at removing stop words before comparing.
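
For anyone trying the difflib route, here is a minimal sketch using difflib.SequenceMatcher to pull out passages that appear in both fields. The function name shared_passages, the min_chars threshold, and the sample strings are made up for illustration and are not from the original poster's data. It compares character by character, so expect it to be slow on long CLOB values.

```python
from difflib import SequenceMatcher


def shared_passages(text_s, text_c, min_chars=20):
    """Return substrings of at least min_chars characters found in both texts."""
    # autojunk=False turns off the heuristic that ignores very frequent
    # characters in long sequences, which could otherwise hide matches.
    matcher = SequenceMatcher(None, text_s, text_c, autojunk=False)
    return [
        text_s[block.a:block.a + block.size]
        for block in matcher.get_matching_blocks()
        if block.size >= min_chars
    ]


# Hypothetical sample values standing in for Description_S and Description_C.
description_s = "Customer reported a billing error on invoice 4471 and requested a refund."
description_c = "Note: reported a billing error on invoice 4471 and was told to wait."

for passage in shared_passages(description_s, description_c):
    print(repr(passage))
```

Raising min_chars cuts down on coincidental matches such as common short phrases; the right value depends on how long the copied passages usually are.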
(Jun-24-2019, 02:32 PM) ichabod801 wrote: Have you looked at difflib? I think it would give better answers than sets, but it would be a lot slower. If you decided to go with set intersections, I would look at removing stop words before comparing.

I had not considered difflib, but I've been researching it since your post. Thank you for the suggestion.
I think it could be a viable option, so I'm going to try both methods.
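
For the set intersection method with stop words removed, a rough sketch might look like the following. The STOP_WORDS list here is a tiny hand-written placeholder (a real run would likely use a fuller list), and the function names and sample sentences are hypothetical. Note that this only reports shared words, not shared passages, so it works best as a quick first filter before a slower comparison.

```python
import re

# Placeholder stop-word list for illustration only; substitute a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "is", "was", "for", "it"}


def content_words(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {word for word in words if word not in STOP_WORDS}


def shared_words(text_s, text_c):
    """Return the non-stop words that occur in both texts."""
    return content_words(text_s) & content_words(text_c)


def overlap_ratio(text_s, text_c):
    """Share of the smaller word set that also appears in the other text."""
    words_s, words_c = content_words(text_s), content_words(text_c)
    smaller = min(len(words_s), len(words_c))
    return len(words_s & words_c) / smaller if smaller else 0.0


# Hypothetical sample values standing in for Description_S and Description_C.
print(shared_words("The invoice was sent to the customer", "Customer disputes an invoice"))
print(overlap_ratio("The invoice was sent to the customer", "Customer disputes an invoice"))
```

Record pairs with a high overlap ratio could then be passed to a slower passage-level check.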