Can't figure out how to scrape grid - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: Can't figure out how to scrape grid (/thread-42304.html)
Can't figure out how to scrape grid - templeowls - Jun-13-2024

I have the code below to scrape this site (https://oig.hhs.gov/reports-and-publications/all-reports-and-publications/) and extract the results to a CSV. It works well for scraping each title and URL. I want it to also scrape the content under "audit", "HHS agency", and "date" for each title, but I can't seem to get the code right given that all three elements are in a grid. Any suggestions? Thanks

import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin

base_url = "https://oig.hhs.gov"
url = "https://oig.hhs.gov/reports-and-publications/all-reports-and-publications/?page={}"

titles_with_urls = []
page = 1
max_pages = 100

while page <= max_pages:
    print(f"Scraping page {page}...")
    response = requests.get(url.format(page))
    soup = BeautifulSoup(response.content, "html.parser")

    # Check if there are titles on the page
    page_titles = soup.select("h2 a")
    if not page_titles:
        print("No more pages to scrape. Exiting...")
        break  # Exit the loop if no titles are found, assuming no more pages

    for title in page_titles:
        title_text = title.text.strip()
        url_link = title.get('href')  # Get the URL link from the 'href' attribute
        # Ensure the URL starts with base_url if it's a relative URL
        if url_link.startswith("/"):
            full_url = urljoin(base_url, url_link)
        else:
            full_url = url_link
        titles_with_urls.append([title_text, full_url])
        print(f"Scraped title: {title_text}, URL: {full_url}")

    page += 1

# Write titles and URLs to a CSV file
with open('titles_with_urls.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "URL"])  # Write header
    writer.writerows(titles_with_urls)

print("All titles and URLs have been scraped and saved to titles_with_urls.csv.")

RE: Can't figure out how to scrape grid - Larz60+ - Jun-13-2024

You will need to use a scraper that can recognize and click on the audit checkbox, then wait until the new page loads before downloading it. Here are some links that will help:

- how to locate elements
- Click on Checkbox
- pageLoadStrategy
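A minimal sketch of the approach Larz60+ describes, assuming Chrome and the selenium package; the checkbox locator "filter-audit" is hypothetical and must be replaced with whatever id or name the live page actually uses:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
driver.get("https://oig.hhs.gov/reports-and-publications/all-reports-and-publications/")

wait = WebDriverWait(driver, 10)
# Hypothetical locator -- inspect the page for the checkbox's real id or name.
checkbox = wait.until(EC.element_to_be_clickable((By.ID, "filter-audit")))
checkbox.click()

# Wait for the filtered result links to be present before reading the page source.
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "h2 a")))
html = driver.page_source  # hand this to BeautifulSoup as before
driver.quit()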
RE: Can't figure out how to scrape grid - templeowls - Jun-18-2024

(Jun-13-2024, 09:54 PM)Larz60+ Wrote: you will need to use a scraper that can recognize and click on the audit checkbox, then wait until the new page loads before downloading it. Here are some links that will help:

Thanks! So I used those sources to create the code below. I'm getting a blank CSV, though. Not sure what I'm doing wrong.

import requests
from bs4 import BeautifulSoup
import csv

url = "https://oig.hhs.gov/reports-and-publications/all-reports-and-publications/"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

reports = soup.find_all("div", class_="media")

report_data = []
for report in reports:
    title = report.find("h3").get_text(strip=True)
    audit = report.find("span", class_="audit").get_text(strip=True) if report.find("span", class_="audit") else "N/A"
    agency = report.find("span", class_="agency").get_text(strip=True) if report.find("span", class_="agency") else "N/A"
    date = report.find("span", class_="date").get_text(strip=True) if report.find("span", class_="date") else "N/A"
    report_data.append({
        "Title": title,
        "Audit": audit,
        "Agency": agency,
        "Date": date
    })

# Export to CSV
csv_file = "reports_data.csv"
with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=["Title", "Audit", "Agency", "Date"])
    writer.writeheader()
    for data in report_data:
        writer.writerow(data)

print(f"Data exported to {csv_file}")
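A blank CSV here means reports came back empty, so the loop never appends a row. A quick diagnostic (a sketch that assumes nothing about this site's markup) is to count the matches and dump the div classes the fetched HTML actually contains:

import requests
from bs4 import BeautifulSoup

url = "https://oig.hhs.gov/reports-and-publications/all-reports-and-publications/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# If this prints 0, no <div class="media"> exists in the fetched HTML,
# so the scraping loop has nothing to iterate over.
print(len(soup.find_all("div", class_="media")))

# List every class that appears on a div, to find a selector that matches.
classes = sorted({c for div in soup.find_all("div") for c in div.get("class", [])})
print(classes)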
RE: Can't figure out how to scrape grid - Larz60+ - Jun-19-2024

I don't see a single div with class 'media' in the source. If you just choose one sub-page, say the first one, whose title starts with "CMS Could Strengthen Program", you will note that the surrounding html wraps each report in its own card container. This pattern repeats for each report on the page, thus:

1. find all divs with class="usa-card__container"
2. then get the link div.header.h2.a
3. get the download page from the 'href' tag
4. wait for the page to load
5. locate the PDF and download it

(See the sketch after this post.) I do not see any CSV files.
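Steps 1-3 and 5 can be sketched with requests and BeautifulSoup, assuming the listing is reachable without JavaScript (otherwise step 4 calls for Selenium-style waits as shown earlier); the lowercase usa-card__container class and the a[href$=".pdf"] selector are assumptions to verify against the live page:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = "https://oig.hhs.gov"
listing_url = "https://oig.hhs.gov/reports-and-publications/all-reports-and-publications/"
soup = BeautifulSoup(requests.get(listing_url).content, "html.parser")

# Step 1: each report sits in its own card container.
for card in soup.find_all("div", class_="usa-card__container"):
    # Step 2: the title link lives in an h2 inside the card header.
    link = card.select_one("h2 a")
    if link is None:
        continue
    # Step 3: resolve the report page URL from the href attribute.
    report_url = urljoin(base_url, link["href"])
    report_soup = BeautifulSoup(requests.get(report_url).content, "html.parser")
    # Step 5: take the first link ending in .pdf (an assumed selector).
    pdf = report_soup.select_one('a[href$=".pdf"]')
    if pdf is not None:
        pdf_url = urljoin(base_url, pdf["href"])
        filename = pdf_url.rsplit("/", 1)[-1]
        with open(filename, "wb") as f:
            f.write(requests.get(pdf_url).content)
        print(f"Downloaded {filename} from {report_url}")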