Hi friends
I am trying to achieve the following tasks:
1-Scrape the list of diseases from the MedlinePlus Medical Encyclopedia page.
2-For each disease, navigate to its page and extract the relevant information (name, symptoms, treatment).
3-Store this information in a structured format (e.g., a dictionary or a DataFrame) for later use in the chatbot.
I would highly appreciate any comments on the following program (generated with ChatGPT's help) to achieve the above tasks:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# URL of the MedlinePlus Medical Encyclopedia
base_url = "https://medlineplus.gov/encyclopedia.html"

# Function to get the list of disease links from the main page
def get_disease_links(base_url):
    response = requests.get(base_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Find all the disease links
    disease_links = []
    for link in soup.find_all('a', href=True):
        href = link['href']
        if href.startswith('/ency/article'):
            disease_links.append("https://medlineplus.gov" + href)
    return disease_links

# Function to extract disease information from a given disease page
def extract_disease_info(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract disease name
    name = soup.find('h1').text.strip()
    # Extract symptoms and treatment
    symptoms, treatment = "", ""
    for header in soup.find_all('h2'):
        if 'Symptoms' in header.text:
            symptoms = header.find_next('p').text.strip()
        if 'Treatment' in header.text:
            treatment = header.find_next('p').text.strip()
    return {"name": name, "symptoms": symptoms, "treatment": treatment}

# Main script
if __name__ == "__main__":
    disease_links = get_disease_links(base_url)
    all_disease_info = []
    for link in disease_links:
        try:
            disease_info = extract_disease_info(link)
            all_disease_info.append(disease_info)
            print(f"Extracted info for: {disease_info['name']}")
            time.sleep(1)  # Be polite and don't overload the server
        except Exception as e:
            print(f"Failed to extract info from {link}: {e}")
    # Save the extracted information to a CSV file
    df = pd.DataFrame(all_disease_info)
    df.to_csv('diseases_info.csv', index=False)
    print("Saved disease information to 'diseases_info.csv'.")
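One thing to note about the extraction step: on many encyclopedia-style pages the "Symptoms" section is a bulleted list rather than a single paragraph, so `header.find_next('p')` would capture only the first paragraph and miss the list items. A more robust approach (a sketch, assuming sections are laid out as an `<h2>` followed by sibling `<p>`/`<ul>`/`<ol>` elements until the next `<h2>` -- the hypothetical helper name `extract_section_text` is mine) is to collect everything between the matching heading and the next one:

```python
from bs4 import BeautifulSoup

def extract_section_text(soup, heading):
    """Collect the text of all <p>, <ul>, and <ol> siblings between the
    <h2> whose text contains `heading` and the next <h2>."""
    for header in soup.find_all('h2'):
        if heading in header.get_text():
            parts = []
            for sib in header.find_next_siblings():
                if sib.name == 'h2':          # reached the next section
                    break
                if sib.name in ('p', 'ul', 'ol'):
                    parts.append(sib.get_text(' ', strip=True))
            return ' '.join(parts)
    return ""

# Quick local check with a simplified page fragment
html = """
<h2>Symptoms</h2>
<p>Common symptoms include:</p>
<ul><li>Fever</li><li>Cough</li></ul>
<h2>Treatment</h2>
<p>Rest and fluids.</p>
"""
soup = BeautifulSoup(html, 'html.parser')
print(extract_section_text(soup, 'Symptoms'))
print(extract_section_text(soup, 'Treatment'))
```

Inside `extract_disease_info` you could then replace the two `find_next('p')` calls with `extract_section_text(soup, 'Symptoms')` and `extract_section_text(soup, 'Treatment')`. Whether this matches the real MedlinePlus markup would need checking against a live article page.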
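Since the main loop makes hundreds of sequential requests, it may also be worth using a single `requests.Session` with an identifying User-Agent and automatic retries instead of bare `requests.get` calls. This is a sketch of one way to set that up (the User-Agent string and retry settings are my own illustrative choices, not anything MedlinePlus requires):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    """Build a Session that identifies itself and retries transient failures."""
    session = requests.Session()
    # Identify the scraper politely; value here is just an example
    session.headers.update({'User-Agent': 'disease-chatbot-scraper/0.1'})
    # Retry up to 3 times with backoff on common transient status codes
    retry = Retry(total=3, backoff_factor=1,
                  status_forcelist=[429, 500, 502, 503, 504])
    session.mount('https://', HTTPAdapter(max_retries=retry))
    return session
```

The two scraping functions would then take the session as an argument and call `session.get(url)`; a session also reuses the underlying TCP connection, which speeds up a long crawl.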
Larz60+ wrote Jun-21-2024, 08:06 PM:
Please post all code, output and errors (in its entirety) between their respective tags. Refer to BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
Tags have been added for you this time. Please use BBCode tags on future posts.