Python Forum
Newbie Help with NLU and Searching
#1
Hi,

While on my winter break I am going to work on improving my Python development skills. I have a solid Java background, but I want to branch into Python for its machine learning capabilities.

So what I want to do is work on prototyping a solution for a problem I have at work.

I work for a bank, and I want to prototype a solution that OCRs documents and extracts metadata from them to enable searching. We have millions and millions of documents sitting in repositories that are currently unsearchable.

I want the solution to be document agnostic: one solution working against all documents.

Hopefully someone with machine learning and search skills can set me in the right direction.

So I was thinking of something like this:

First Method:
1. Take all the words in a document.
2. Get rid of the superfluous words: and, the, then, etc.
3. For each remaining word, count how many times it occurs. For example: house 23, car 15, London 10, etc.
4. Collect the counts for all the words and work out what the 80th percentile is.
5. Any word whose count is at or above the 80th percentile becomes a search term; the rest of the words are deleted.

So a mortgage document might end up with the words: mortgage, rate, and a person's name.
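
A minimal sketch of the first method, using only the standard library. The stop-word list, the function names, and the nearest-rank percentile rule are assumptions made for illustration; the steps above do not say which percentile definition to use.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real list would be much longer).
STOP_WORDS = {"and", "the", "then", "a", "an", "of", "to", "in", "is", "for", "on", "at"}


def percentile_cutoff(counts, pct):
    """Return the value at the given percentile using the nearest-rank rule."""
    ordered = sorted(counts)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def extract_search_terms(text, pct=80):
    """Steps 1 to 5: tokenize, drop stop words, count, keep the most frequent words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    if not counts:
        return []
    cutoff = percentile_cutoff(counts.values(), pct)
    # Keep every word whose count is at or above the cutoff.
    return sorted(w for w, c in counts.items() if c >= cutoff)
```

Lowercasing means "House" and "house" are counted together. With very short documents most words occur once, so the cutoff can end up at 1 and keep everything; a minimum count may be worth adding.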

Second Method:
1. Take all the words in a document.
2. Run a named entity recognition algorithm against the document, pulling out people, places and things.
3. Do something similar to steps 4 and 5 above: count the entities and keep those at or above the 80th percentile.
4. The remaining words become search terms.
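
A rough sketch of the second method's filtering step. The entity extractor is passed in as a function, because real named entity recognition would come from an NLP library such as spaCy or NLTK. The `naive_entities` placeholder here (every capitalized word counts as an entity) is only a stand-in so that the example runs, not an actual NER algorithm, and all names are hypothetical.

```python
import math
import re
from collections import Counter


def naive_entities(text):
    """Placeholder only, NOT real NER: returns every capitalized word."""
    return re.findall(r"\b[A-Z][a-z]+\b", text)


def entity_search_terms(text, extract=naive_entities, pct=80):
    """Count the extracted entities and keep those at or above the percentile cutoff."""
    counts = Counter(entity.lower() for entity in extract(text))
    if not counts:
        return []
    ordered = sorted(counts.values())
    # Nearest-rank percentile (an assumption; other definitions exist).
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    cutoff = ordered[rank - 1]
    return sorted(e for e, c in counts.items() if c >= cutoff)
```

Swapping in a proper extractor only requires a function that takes the document text and returns a list of entity strings.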

As for searching, I am keeping it simple, something like this:
1. A database table with two columns: Document Link and Document Metadata (a list of search words: house, mortgage, London).
2. A search is a simple select returning a list of Document Links where the search term is in the Document Metadata field.
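
A small sketch of that lookup with Python's built-in `sqlite3` module. It varies the design slightly: instead of one metadata column holding a list of words, it stores one row per (link, term) pair, so the search is an exact match rather than a substring match. The table name `doc_terms` and the function names are made up for illustration.

```python
import sqlite3


def build_index(conn, docs):
    """Store one row per (document link, search term) pair.

    `docs` maps a document link to its list of search terms.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_terms ("
        "link TEXT NOT NULL, term TEXT NOT NULL, PRIMARY KEY (link, term))"
    )
    rows = [(link, term.lower()) for link, terms in docs.items() for term in terms]
    conn.executemany("INSERT OR IGNORE INTO doc_terms (link, term) VALUES (?, ?)", rows)
    conn.commit()


def search(conn, term):
    """Return the links of all documents tagged with the given term."""
    cur = conn.execute(
        "SELECT link FROM doc_terms WHERE term = ? ORDER BY link", (term.lower(),)
    )
    return [row[0] for row in cur]
```

With a single list-valued column, the select would need something like a `LIKE` pattern, which would also match "household" when searching for "house". One row per term avoids that and lets the database index the term column.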

I'm not sure if the above approaches are the right way to go about this.

Hopefully someone with a background in natural language understanding can set me on the right path.

Thanks

Alex
#2
I have the impression you have a good idea of how to do the job. But you will run into lots of details where you will need to make choices, and lots of issues to solve.
So my advice would be to have a look at commercial products, where these details have already been identified and solved.
One commercial product that I know of is Oracle Text. It builds indexes of texts and can handle multiple languages.