(Feb-26-2024, 08:16 PM)templeowls Wrote: It works well but I also want it to open each url and scrape the full text on the page for each. Any suggestions on how to modify this code to achieve?
It would be messy to try to integrate this into the code you already have.
Tip: make it work separately first, or keep everything separate and then add it all to a CSV at the end.
So as a first example, I start here by making complete links for all the articles.
import requests
from bs4 import BeautifulSoup

page_nr = 1
url = f"https://www.eeoc.gov/newsroom/search?page={page_nr}"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
all_link = soup.select('article > h2 > a')
all_href = [a['href'] for a in all_link]
# Make complete links
base_url = 'https://www.eeoc.gov'
news_links = []
for link in all_href:
    print(f'{base_url}{link}')
    news_links.append(f'{base_url}{link}')
Output:
https://www.eeoc.gov/newsroom/tc-wheelers-pay-25000-settle-eeoc-sex-harassment-lawsuit
https://www.eeoc.gov/newsroom/trinity-health-michigan-pay-50000-settle-eeoc-religious-discrimination-lawsuit
https://www.eeoc.gov/newsroom/cash-depot-pays-55000-settle-eeoc-disability-discrimination-lawsuit
https://www.eeoc.gov/newsroom/nebraska-court-orders-trucking-company-pay-deaf-driver-punitive-damages-lost-wages-after
......
Now that this is done, you can iterate over news_links,
open each link with requests/BeautifulSoup, and parse the articles.
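A minimal sketch of that next step could look like this. It assumes the article text sits inside the page's <main> tag (inspect a real article page first to confirm the right container), and the eeoc_news.csv filename is just an example; everything is collected first and written to CSV at the end, as mentioned above.

import csv
import time

import requests
from bs4 import BeautifulSoup

articles = []
for news_url in news_links:
    response = requests.get(news_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # 'main' is an assumption, check the page structure and adjust selector
    content = soup.find('main')
    text = content.get_text(separator=' ', strip=True) if content else ''
    articles.append((news_url, text))
    # Be polite to the server between requests
    time.sleep(1)

# Keep it all separate, then add to csv at the end
with open('eeoc_news.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'text'])
    writer.writerows(articles)

Test it on a couple of links first (e.g. news_links[:2]) before running it over all pages.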