Python Forum
Can't figure out how to scrape grid
#1
I have the below code to scrape this site (https://oig.hhs.gov/reports-and-publicat...lications/) and extract to a CSV.

It works well for scraping each title and URL. I also want it to scrape the content under "audit", "HHS agency", and "date" for each title, but I can't seem to get it right since all three elements are in a grid.

Any suggestions? Thanks

import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin

base_url = "https://oig.hhs.gov"
url = "https://oig.hhs.gov/reports-and-publications/all-reports-and-publications/?page={}"

titles_with_urls = []
page = 1
max_pages = 100

while page <= max_pages:
    print(f"Scraping page {page}...")
    response = requests.get(url.format(page))
    soup = BeautifulSoup(response.content, "html.parser")

    # Check if there are titles on the page
    page_titles = soup.select("h2 a")
    if not page_titles:
        print("No more pages to scrape. Exiting...")
        break  # Exit the loop if no titles are found, assuming no more pages

    for title in page_titles:
        title_text = title.text.strip()
        url_link = title.get('href')  # Get the URL link from the 'href' attribute

        # Ensure the URL starts with base_url if it's a relative URL
        if url_link.startswith("/"):
            full_url = urljoin(base_url, url_link)
        else:
            full_url = url_link

        titles_with_urls.append([title_text, full_url])
        print(f"Scraped title: {title_text}, URL: {full_url}")

    page += 1

# Write titles and URLs to a CSV file
with open('titles_with_urls.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "URL"])  # Write header
    writer.writerows(titles_with_urls)

print("All titles and URLs have been scraped and saved to titles_with_urls.csv.")
Reply
#2
You will need to use a scraper that can recognize and click on the audit checkbox, then wait until the new page loads before downloading it. Here are some links that will help:

how to locate elements
Click on Checkbox
pageLoadStrategy
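
A minimal Selenium sketch of that approach (the calls used here are the ones the links above document) could look like the code below. The checkbox locator "filter-audit" is a placeholder, not the real id on the OIG page; inspect the page to find the actual id or name of the audit checkbox before running it.

# Rough sketch: open the listing page, tick the "Audit" filter checkbox,
# wait for results to appear, then parse the rendered HTML with BeautifulSoup.
# The checkbox id below is a placeholder and must be replaced with the real one.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

url = "https://oig.hhs.gov/reports-and-publications/all-reports-and-publications/"

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

try:
    driver.get(url)

    # Locate and click the audit filter checkbox (placeholder id).
    checkbox = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.ID, "filter-audit"))
    )
    checkbox.click()

    # Wait until result titles are present before scraping. In practice you
    # may need to wait for the old results to go stale instead, since the
    # unfiltered titles also match this selector.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "h2 a"))
    )

    # The rendered page source can now be parsed the same way as before.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for link in soup.select("h2 a"):
        print(link.get_text(strip=True), link.get("href"))
finally:
    driver.quit()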
Reply
#3
(Jun-13-2024, 09:54 PM)Larz60+ Wrote: You will need to use a scraper that can recognize and click on the audit checkbox, then wait until the new page loads before downloading it. Here are some links that will help:

how to locate elements
Click on Checkbox
pageLoadStrategy

Thanks! So I used those sources to create the code below. I'm getting a blank CSV though. Not sure what I'm doing wrong.

import requests
from bs4 import BeautifulSoup
import csv

url = "https://oig.hhs.gov/reports-and-publications/all-reports-and-publications/"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

reports = soup.find_all("div", class_="media")

report_data = []

for report in reports:
    title = report.find("h3").get_text(strip=True)
    audit = report.find("span", class_="audit").get_text(strip=True) if report.find("span", class_="audit") else "N/A"
    agency = report.find("span", class_="agency").get_text(strip=True) if report.find("span", class_="agency") else "N/A"
    date = report.find("span", class_="date").get_text(strip=True) if report.find("span", class_="date") else "N/A"
    
    report_data.append({
        "Title": title,
        "Audit": audit,
        "Agency": agency,
        "Date": date
    })

# Export to CSV
csv_file = "reports_data.csv"
with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=["Title", "Audit", "Agency", "Date"])
    writer.writeheader()
    for data in report_data:
        writer.writerow(data)

print(f"Data exported to {csv_file}")
Reply
#4
I don't see a single div with class 'media' in the source.

If you look at just one report on the page, say the first one whose title starts with "CMS Could Strengthen Program", you will note the surrounding HTML has:

  1. a section that starts with the div
    <div class="USA-card__container">
  2. followed by a header, an h2 heading, and a link

This pattern repeats for each report on the page, thus:

1. find all divs with class="USA-card__container"
2. within each, get the link at div.header.h2.a
3. get the report page URL from the link's 'href' attribute
4. wait for that page to load
5. locate the PDF and download it (see the sketch below)

I do not see any csv files.
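
Putting those steps together, a rough requests/BeautifulSoup sketch might look like the code below. The card class is copied from the description above (verify the exact spelling and casing against the live page source), and the PDF-link selector is a guess; adjust both as needed.

# Sketch of steps 1-5: find each report card, follow its title link,
# then look for a PDF link on the report page and download it.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = "https://oig.hhs.gov"
listing_url = "https://oig.hhs.gov/reports-and-publications/all-reports-and-publications/"

soup = BeautifulSoup(requests.get(listing_url).content, "html.parser")

# Step 1: find every report card on the listing page.
for card in soup.find_all("div", class_="USA-card__container"):
    # Step 2: the title link sits in the card's header, inside an h2.
    link = card.select_one("h2 a")
    if link is None:
        continue
    title = link.get_text(strip=True)

    # Step 3: build an absolute URL for the report page from the href.
    report_url = urljoin(base_url, link.get("href", ""))
    print(f"{title} -> {report_url}")

    # Steps 4-5: fetch the report page and look for a PDF link to download.
    report_soup = BeautifulSoup(requests.get(report_url).content, "html.parser")
    pdf_link = report_soup.select_one('a[href$=".pdf"]')
    if pdf_link:
        pdf_url = urljoin(base_url, pdf_link["href"])
        pdf_name = pdf_url.rsplit("/", 1)[-1]
        with open(pdf_name, "wb") as f:
            f.write(requests.get(pdf_url).content)
        print(f"  downloaded {pdf_name}")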
Reply

