Determining string intersection (embeddings) between two narrative (text) fields
Python Forum (https://python-forum.io), Forum: Data Science (https://python-forum.io/forum-44.html), Thread: /thread-19351.html
Determining string intersection (embeddings) between two narrative (text) fields - twinpiques - Jun-24-2019

Hello. I am trying to solve a text analytics problem. I have two data sets (DS1 and DS2). DS1 contains a narrative text field (Description_S). DS2 contains a narrative text field (Description_C). The two narrative fields (Description_S and Description_C) are from completely different systems and should NEVER have common content.

Specifically, it has been discovered that content from Description_S is being either typed or copied-and-pasted into the Description_C field, which is a major issue. So, I am trying to use Python to determine if there are common strings or an intersection between the two fields. Both narrative fields can get quite lengthy. Description_S is actually type CLOB in Teradata.

Any ideas on how to solve this issue? I had looked at the set/intersection method in Python, but wanted to get advice from the Forum first. Thank you in advance.

RE: Determining string intersection (embeddings) between two narrative (text) fields - ichabod801 - Jun-24-2019

Have you looked at difflib? I think it would give better answers than sets, but it would be a lot slower. If you decided to go with set intersections, I would look at removing stop words before comparing.

RE: Determining string intersection (embeddings) between two narrative (text) fields - twinpiques - Jun-30-2019

(Jun-24-2019, 02:32 PM)ichabod801 Wrote: Have you looked at difflib? I think it would give better answers than sets, but it would be a lot slower. If you decided to go with set intersections, I would look at removing stop words before comparing.

I had not considered difflib, but I've been researching it since your post. Thank you for the suggestion. I think it could be a viable option, so I'm going to try both methods.
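
For anyone comparing the two approaches discussed above, here is a minimal sketch of both, not a tested solution for the original data. It assumes each description has already been loaded into a plain Python string. The function names, the tiny stop-word list, and the `min_words` threshold are all hypothetical choices made for illustration; a real stop-word list and threshold would need tuning against the actual narratives.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny stop-word list (assumption, not from the thread).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
              "is", "was", "on", "for", "with"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def shared_words(text_s, text_c):
    """Set-intersection approach: words common to both texts, minus stop words.

    Fast, but ignores word order, so it cannot tell a pasted passage
    from two texts that merely share vocabulary.
    """
    words_s = set(tokenize(text_s)) - STOP_WORDS
    words_c = set(tokenize(text_c)) - STOP_WORDS
    return words_s & words_c


def shared_passages(text_s, text_c, min_words=4):
    """difflib approach: runs of at least min_words identical consecutive words.

    Slower than sets, but the runs it reports are candidate copied passages.
    """
    words_s = tokenize(text_s)
    words_c = tokenize(text_c)
    # autojunk=False turns off difflib's heuristic that ignores very frequent
    # items in long sequences, which could otherwise hide matches in long text.
    matcher = SequenceMatcher(None, words_s, words_c, autojunk=False)
    return [
        " ".join(words_s[block.a:block.a + block.size])
        for block in matcher.get_matching_blocks()
        if block.size >= min_words
    ]
```

Calling `shared_passages(description_s, description_c)` on one pair of records returns the word runs that appear in both, which can then be reviewed by hand. Comparing on words rather than characters keeps the sequences shorter, which matters because `SequenceMatcher` can be slow on lengthy fields.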