Python Forum
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Language Processing
#1
I have a python/django project I had spent a lot of money on with an upwork developer that has come to naught. I am trying to figure out what direction to turn to. The natural language processing requirements seem simple, but the more tinkering the developer did, the worse the results were.

Basically, the app involves taking the name of a medical source like a doctor or hospital from an hmtl document and finding the address for that medical source in a public database. Works like a charm where there is a match between the name found in the html and that found in the database. But there is frequent disparity.

The name of the medical source found in the html doc has been entered by a government worker. There are no data integrity rules, so the operator is often very sloppy in calling the doctor or hospital by their correct name. But the operator is not totally sloppy, and usually the text they enter is identifiable to those reading the html table. In my experience, it seems the operator tries to at least use one or two words that are key, and would lead someone familiar with hospitals in the area to the right hospital. For example, here in Dallas Fort Worth, where I am, the word "Parkland" is all you need to lead you to the county hospital for Dallas County. So an operator feels free to enter "Parkland Medical Center" or "Parkland Hospital" or for that matter "Parkland will Kill You" and know they have more or less identified the correct hospital.

So it seems to me there ought to be a way for tag or mark somehow words that are unique and identifying such that accurate matches could be made. The code that has been created limits the match attempts to those in a given geographical area that matches where the individual concerned lives.

So it seems to me there ought to be a way to classify words or word sets as being unique and identifying for a given medical source. I just don't know what tools to be looking at. ANY help would be appreciated.
Reply
#2
I don't know how much time you're willing to spend with your project, but there are a few things worth mentioning.

Quick and dirty:
There's an algorithm that was written by the US Census Bureau quite some time ago that is capable of grouping similar words based on their phonetic spelling, called Soundex. It's actually a very simple algorithm, and you can get a pretty good match. There are lots of google references to this, here's one: https://ourcodeworld.com/articles/read/2...-languages and on PyPi, several hits: https://pypi.org/search/?q=Soundex

Fun but more involved:
If you want to get a really high hit rate, you can use a package like NLTK (Natural Language Toolkit) https://www.nltk.org/
Much slower learning curve, but spectacular results can be had.

Probably most time consuming:
Also, you can roll your own Natural Language Processing, to do this, I would recommend a good book on the subject.
One I am reading now, is excellent 'Natural Language Processing in action', Manning publishing: https://www.manning.com/books/natural-la...n%20action

Probably the first option will work for you.
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Suggestion needed for Natural Language Processing Project Jojo1268 3 2,273 Jun-14-2019, 09:24 AM
Last Post: Jojo1268

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020