Python Forum
Scrape medical information from MedlinePlus
Hi friends,
I am trying to accomplish the following tasks:

1. Scrape the list of diseases from the MedlinePlus Medical Encyclopedia page.
2. For each disease, navigate to its page and extract the relevant information (name, symptoms, treatment).
3. Store this information in a structured format (e.g., a dictionary or a DataFrame) for later use in the chatbot.

I would highly appreciate any comments on the following program, which I put together with ChatGPT's help, to achieve the above tasks:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# URL of the MedlinePlus Medical Encyclopedia
base_url = "https://medlineplus.gov/encyclopedia.html"

# Function to get the list of disease links from the main page
def get_disease_links(base_url):
    response = requests.get(base_url, timeout=30)
    response.raise_for_status()  # stop early on a bad HTTP status
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all the disease links
    disease_links = []
    for link in soup.find_all('a', href=True):
        href = link['href']
        if href.startswith('/ency/article'):
            disease_links.append("https://medlineplus.gov" + href)
    
    # Drop duplicate links while preserving order
    return list(dict.fromkeys(disease_links))

# Function to extract disease information from a given disease page
def extract_disease_info(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on a bad HTTP status
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Extract disease name
    name = soup.find('h1').text.strip()
    
    # Extract symptoms and treatment
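    # Note (assumption): on many MedlinePlus articles the "Symptoms" and
    # "Treatment" sections are bulleted <ul> lists rather than a single <p>,
    # so find_next('p') below may only capture an introductory sentence.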
    symptoms, treatment = "", ""
    for header in soup.find_all('h2'):
        if 'Symptoms' in header.text:
            symptoms = header.find_next('p').text.strip()
        if 'Treatment' in header.text:
            treatment = header.find_next('p').text.strip()
    
    return {"name": name, "symptoms": symptoms, "treatment": treatment}

# Main script
if __name__ == "__main__":
    disease_links = get_disease_links(base_url)
    all_disease_info = []

    for link in disease_links:
        try:
            disease_info = extract_disease_info(link)
            all_disease_info.append(disease_info)
            print(f"Extracted info for: {disease_info['name']}")
            time.sleep(1)  # Be polite and don't overload the server
        except Exception as e:
            print(f"Failed to extract info from {link}: {e}")

    # Save the extracted information to a CSV file
    df = pd.DataFrame(all_disease_info)
    df.to_csv('diseases_info.csv', index=False)
    print("Saved disease information to 'diseases_info.csv'.")
Larz60+ wrote Jun-21-2024, 08:06 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
Tags have been added for you this time. Please use BBCode tags on future posts.