Python Forum
Strategy for data extraction
#1
I am trying to come up with a strategy for extracting key data from generic letters for different clients. The attached PDF shows the format of the letter I want to parse. It should look the same for every client, although there may be minor layout differences. First, I want to extract the addressee of the letter, which is redacted in the attachment. Second, I want to extract the name of the client, which appears at the right-hand margin after "Re:". Then there are two items I want from the main body of the letter: the time period of the records requested (in the first sentence after the heading "What We Need From You") and the response date in the first sentence of the third paragraph under that heading ("Please respond by May 26, 2023").

I have considered a regex approach, but would an NLP tool like spaCy be a better fit? Thanks for any advice, I really appreciate it!

Attached Files

.pdf   MedRequestTemplate_Redacted-min.pdf (Size: 176.14 KB / Downloads: 5)
#2
(Feb-22-2024, 10:52 PM)standenman Wrote: I am trying to come up with a strategy for extracting key data from generic letters for different clients. [...]
I'd use pypdf to read the PDF files and extract the text. If your PDF files are scanned images, like the one you attached to your previous post, you will need to OCR them to get the text, using something like pytesseract.
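To decide between the two routes, you could try the embedded text layer first and only fall back to OCR when it comes back empty. This is a minimal sketch, not a definitive implementation: extract_text_layer and needs_ocr are hypothetical helper names, and the 20-character threshold is an arbitrary assumption you would want to tune on your own letters.

```python
def extract_text_layer(pdf_path: str) -> str:
    """Return the embedded text of every page, joined with newlines."""
    # Imported here so the rest of the module loads even without pypdf installed.
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty result for image-only pages.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Heuristic (assumption): almost no extracted text means a scanned PDF."""
    return len(text.strip()) < min_chars
```

If needs_ocr(extract_text_layer(path)) is true, the file is most likely a scan and needs the OCR route instead.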

Once you get the text, obtaining the information you need might be trivial. Have you tried something? Do you have any code to show?

import pytesseract
from pdf2image import convert_from_path


PDF_FILE = r"C:\Users\user\Desktop\MedRequestTemplate_Redacted-min.pdf"

# This is the location of the folder containing the Poppler executables
# that pdf2image needs in order to work.
POPPLER_LOCATION = r"C:\poppler\Library\bin"

# This is the location of the Tesseract-OCR executable.
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def generate_texts_from_image_pdf(pdf_path: str, lang: str = "eng") -> str:
    """Perform OCR on an image-only PDF file and return its text content."""
    # convert_from_path returns one PIL image per page of the PDF.
    pages = convert_from_path(pdf_path, poppler_path=POPPLER_LOCATION)
    # OCR every page (not only the first) and join the results.
    return "\n".join(pytesseract.image_to_string(page, lang=lang) for page in pages)

text = generate_texts_from_image_pdf(PDF_FILE)
print(text)
Output:
P O BOX 149198 AUSTIN TX 78714-9198 Date: May 12, 2023 Case [D: gaily RE: DOB: Vendor Number: | We are the office that makes disability decisions for the Social Security Administration. ie — 7,1 is applying for or is receiving disability benefits due to the following conditions: Lumbar Disfunction. This is not an authorization to perform an examination. What We Need From You To help us evaluate this claim, please send records covering the period of: 08/03/2021 to Present. Include the following information: medical history, psychiatric history, clinical findings, laboratory findings, imaging reports, treatment prescribed and the response, diagnosis, and prognosis. Please respond by May 26, 2023. We are enclosing a signed HIPAA compliant authorization for the release of medical records and information. Please provide a statement based on your findings. Your statement should express your opinion about your patient’s ability to do work-related physical and/or mental activities despite the limitations or restrictions imposed by his medical condition(s). For physical impairments, these activities include sitting, standing, walking, lifting, carrying, pushing, pulling, or other physical activities (including manipulative or postural activities, such as reaching, handling, stooping, or crouching); other activities, such as seeing, hearing, or using other senses; and ability to adapt to environmenta! conditions, such as temperature extremes or fumes, For mental impairments, these activities include understanding; remembering; maintaining concentration, persistence, or pace; carrying out instructions; and responding appropriately to supervision, coworkers, and work pressures. If it is determined that we need additional information regarding your patient's impairments, would you be willing to perform an examination to provide additional findings? Please contact us if you would be willing to perform this examination. 
We will assume that you do not wish to perform the examination if you do not respond. Tf You Have Any Questions If you have any questions or wish to provide more information, please call us at the number(s) shown below Monday - Friday between 7:00 am and 5:00 pm. When you call or leave a message, please provide the Case [Dy our —_ | a. and a call back number. Thank you for your help. Texas Disability Determination Services/Texas Disability Determination Services (800) 252-7009 (866) 892-9281 (FAX) 67884 18/ Assigned 0643 U15/ DCPS / DCM61025842 / OMB No, 0960-0555 / 98022133
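With OCR text like the output above, the two body fields from the original question (the records period and the respond-by date) can probably be pulled out with plain regular expressions. The following is only a sketch under assumptions: it relies on the phrases "covering the period of:" and "Please respond by" surviving OCR intact, and SAMPLE, PERIOD_RE, RESPOND_RE and extract_fields are hypothetical names made up for illustration.

```python
import re
from datetime import datetime

# Shortened excerpt of the OCR output, used only for illustration.
SAMPLE = (
    "What We Need From You To help us evaluate this claim, please send records "
    "covering the period of: 08/03/2021 to Present. Include the following "
    "information: medical history. Please respond by May 26, 2023. We are enclosing"
)

# \s+ between words tolerates line breaks and extra spaces introduced by OCR.
PERIOD_RE = re.compile(
    r"covering\s+the\s+period\s+of:?\s*(\d{1,2}/\d{1,2}/\d{4})"
    r"\s+to\s+(Present|\d{1,2}/\d{1,2}/\d{4})",
    re.IGNORECASE,
)
RESPOND_RE = re.compile(
    r"Please\s+respond\s+by\s+([A-Za-z]+)\s+(\d{1,2}),\s*(\d{4})"
)


def extract_fields(text: str) -> dict:
    """Return the records period and respond-by date, or None where not found."""
    fields = {"period_start": None, "period_end": None, "respond_by": None}

    match = PERIOD_RE.search(text)
    if match:
        fields["period_start"] = match.group(1)
        fields["period_end"] = match.group(2)

    match = RESPOND_RE.search(text)
    if match:
        month, day, year = match.groups()
        try:
            # %B expects a full English month name in the default locale.
            parsed = datetime.strptime(f"{month} {day} {year}", "%B %d %Y")
            fields["respond_by"] = parsed.date()
        except ValueError:
            # OCR may garble the month name; leave the field as None.
            pass

    return fields


fields = extract_fields(SAMPLE)
print(fields)
```

The addressee and the client name after "RE:" are redacted in the sample, so they are left out here; they depend on layout more than on fixed phrases and may need a different approach.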