Need help opening pages when web scraping - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: Need help opening pages when web scraping (/thread-41668.html)
Need help opening pages when web scraping - templeowls - Feb-26-2024

I have the below code, which scrapes this page: https://www.eeoc.gov/newsroom/search. It works well, but I also want it to open each URL and scrape the full text of the page for each. Any suggestions on how to modify this code to achieve that?

import csv
import requests
from bs4 import BeautifulSoup


def scrape_eec_news():
    base_url = "https://www.eeoc.gov/newsroom/search?page="
    results = []
    page_number = 0
    while True:
        page_number += 1
        url = base_url + str(page_number)
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "html.parser")
        entries = soup.find_all("div", class_="views-row")
        if not entries:
            break
        print("Scraping page", page_number)  # Print the page number
        for entry in entries:
            title_elem = entry.h2
            description_elem = entry.p
            date_elem = entry.find("div", class_="field--type-datetime")
            url_elem = entry.a
            title = title_elem.text.strip()
            description = description_elem.text.strip() if description_elem else ""
            date = date_elem.text.strip() if date_elem else ""  # Check if date_elem is not None
            url = url_elem["href"]
            # Add the 'agency' column with the value
            # "United States Equal Employment Opportunity Commission"
            results.append(
                {
                    "title": title,
                    "description": description,
                    "date": date,
                    "url": url,
                    "agency": "United States Equal Employment Opportunity Commission",
                }
            )
    return results


def export_to_csv(data, filename):
    with open(filename, "w", newline="", encoding="utf-8") as csvfile:
        # Include 'agency' in the fieldnames
        fieldnames = ["title", "description", "date", "url", "agency"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for entry in data:
            writer.writerow(entry)


if __name__ == "__main__":
    news_entries = scrape_eec_news()
    export_to_csv(news_entries, "eec_news.csv")
    print("Data exported to eec_news.csv")

RE: Need help opening pages when web scraping - snippsat - Feb-29-2024

(Feb-26-2024, 08:16 PM)templeowls Wrote: It works well, but I also want it to open each url and scrape the full text on the page for each. Any suggestions on how to modify this code to achieve?

It would be messy to try to integrate this into the code you already have. Tip: make it work separately first, or keep everything separate and add it all to the CSV at the end. So, as a starting point, here is an example that builds complete links for all the articles:

import requests
from bs4 import BeautifulSoup

page_nr = 1
url = f"https://www.eeoc.gov/newsroom/search?page={page_nr}"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
all_link = soup.select('article > h2 > a')
all_href = [a['href'] for a in all_link]

# Make complete links
base_url = 'https://www.eeoc.gov'
news_links = []
for link in all_href:
    print(f'{base_url}{link}')
    news_links.append(f'{base_url}{link}')

Now that this is done, you can iterate over news_links, open each one with requests/BS, and parse the articles.
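Continuing from snippsat's snippet, a minimal sketch of that last step might look like the function below. It takes the downloaded HTML as a string (so it can be tested without hitting the site), and the `div.node__content` selector is only an assumption about where the EEOC article pages keep their body text; inspect a real article page and adjust the selector as needed.

```python
from bs4 import BeautifulSoup


def extract_full_text(html):
    """Return the visible paragraph text of an article page as one string.

    NOTE: 'div.node__content' is a guess at the article body container;
    fall back to the whole page if the selector matches nothing.
    """
    soup = BeautifulSoup(html, "html.parser")
    body = soup.select_one("div.node__content") or soup
    paragraphs = [p.get_text(strip=True) for p in body.find_all("p")]
    return "\n".join(p for p in paragraphs if p)


# Tiny offline sample standing in for a downloaded article page
sample = """
<html><body>
  <div class="node__content">
    <p>First paragraph of the press release.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""
print(extract_full_text(sample))
```

With that in place, the loop over news_links is just `requests.get(link)` on each link, passing `response.content` to `extract_full_text`, and storing the result in a new "full_text" column of the CSV rows (ideally with a short `time.sleep` between requests to be polite to the server).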