Python Forum
Determining string intersection (embeddings) between two narrative (text) fields
#1
Hello. I am trying to solve a text analytics problem.

I have two data sets (DS1 and DS2). DS1 contains a narrative text field (Description_S).
DS2 contains a narrative text field (Description_C). The two narrative fields (Description_S and Description_C) come from completely different systems and should NEVER have common content. However, it has been discovered that content from Description_S is being either typed or copied and pasted into the Description_C field, which is a major issue. So I am trying to use Python to determine whether there are common strings, or an intersection, between the two fields. Both narrative fields can get quite lengthy; Description_S is actually of type CLOB in Teradata. Any ideas on how to solve this issue? I had looked at the set intersection method in Python, but wanted to get advice from the forum first.

Thank you in advance.
#2
Have you looked at difflib? I think it would give better answers than sets, but it would be a lot slower. If you decide to go with set intersections, I would look at removing stop words before comparing.
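A minimal sketch of what the difflib route could look like, assuming each pair of descriptions has already been pulled into Python as plain strings. The function name `shared_passages` and the `min_words` threshold are made up for illustration; comparing word lists rather than raw characters is just one possible choice to keep the sequences shorter.

```python
from difflib import SequenceMatcher


def shared_passages(text_s, text_c, min_words=5):
    """Return runs of at least min_words consecutive words found in both texts.

    Hypothetical helper: text_s stands in for a Description_S value and
    text_c for a Description_C value.
    """
    words_s = text_s.lower().split()
    words_c = text_c.lower().split()
    # autojunk=False stops difflib from ignoring very frequent words,
    # a heuristic that otherwise applies to sequences of 200+ items.
    matcher = SequenceMatcher(None, words_s, words_c, autojunk=False)
    return [
        " ".join(words_s[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_words  # also drops the zero-length sentinel block
    ]


if __name__ == "__main__":
    s = "the customer reported that the device overheated during normal use and shut down"
    c = "agent notes: customer reported that the device overheated during normal use today"
    print(shared_passages(s, c))
```

The threshold matters: a low `min_words` will flag ordinary phrases that happen to occur in both systems, so it would need tuning against known copy-and-paste cases.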
Craig "Ichabod" O'Brien - xenomind.com
I wish you happiness.
#3
(Jun-24-2019, 02:32 PM)ichabod801 Wrote: Have you looked at difflib? I think it would give better answers than sets, but it would be a lot slower. If you decided to go with set intersections, I would look at removing stop words before comparing.

I had not considered difflib, but I've been researching it since your post. Thank you for the suggestion.
I think it could be a viable option, so I'm going to try both methods.
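For the set-based method, a rough sketch might compare sets of short word sequences (shingles) instead of single words, so that only shared phrases count. The stop-word list here is a tiny placeholder, and the names `shingles` and `common_shingles` are invented for this example; a real run would likely use a fuller stop-word list.

```python
import re

# Placeholder stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "was", "that"}


def shingles(text, n=3):
    """Return the set of n-word tuples in text, after dropping stop words."""
    words = [
        w for w in re.findall(r"[a-z0-9']+", text.lower())
        if w not in STOP_WORDS
    ]
    # When there are fewer than n words the range is empty, giving an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def common_shingles(text_s, text_c, n=3):
    """Return the n-word sequences that occur in both texts."""
    return shingles(text_s, n) & shingles(text_c, n)


if __name__ == "__main__":
    s = "The customer reported that the device overheated"
    c = "Notes: customer reported the device overheated today"
    print(common_shingles(s, c))
```

This is fast because it is just hashing and set intersection, but it only reports that phrases overlap, not where or how long the copied passage is, which is where difflib may give better answers.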