Sep-16-2021, 03:37 PM
(This post was last modified: Sep-16-2021, 03:41 PM by deanhystad.)
I would refactor like this:
import requests
from bs4 import BeautifulSoup  # Or could import bs4 as bs

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://google.com',
    'DNT': '1',
}
BASE_URL = 'https://quotes.toscrape.com/page/{}/'

def find_tag(tag, url, headers):
    '''Iterator to get text for matching tags from url'''
    base_session = requests.Session()
    page = 1
    while True:
        page_session = base_session.get(url.format(page), headers=headers)  # Does not need its own function
        if page_session.status_code != 200:
            break  # Guard clause avoids walking code off the page with multiple levels of indentation
        soup = BeautifulSoup(page_session.text, 'lxml')  # Avoid embedding package version info in code
        for name in soup.select(tag):
            yield name.text
        if not soup.select('li.next'):
            break  # Found last page. All done
        page += 1

authors = [author for author in find_tag('.author', BASE_URL, HEADERS)]
for author in sorted(set(authors)):
    print(author)

I don't have Beautiful Soup or requests installed, so this is completely untested. It may also miss important steps, and this may not be the right way to extract tags; I just refactored your code. Note that I changed the non-200 guard from continue to break: retrying the same page without incrementing the counter would loop forever on a bad response.
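Since the refactor is untested, one way to sanity-check the paging/generator logic without installing requests or Beautiful Soup is to swap in fake stand-ins for the session and response. This is a minimal sketch with hypothetical FakeSession and FakePage classes (not part of any library); the scraping details are stubbed out so only the loop shape is exercised:

```python
class FakePage:
    """Stands in for a requests.Response (hypothetical test double)."""
    def __init__(self, status_code, items, has_next):
        self.status_code = status_code
        self.items = items        # stands in for the parsed tag texts
        self.has_next = has_next  # stands in for soup.select('li.next')

class FakeSession:
    """Serves a canned sequence of pages, like Session.get would."""
    def __init__(self, pages):
        self.pages = pages

    def get(self, page_number):
        return self.pages[page_number - 1]

def find_items(session):
    """Same loop shape as find_tag: yield items until the last page."""
    page = 1
    while True:
        response = session.get(page)
        if response.status_code != 200:
            break  # stop on error rather than retrying the same page forever
        yield from response.items
        if not response.has_next:
            break  # found last page, all done
        page += 1

session = FakeSession([
    FakePage(200, ['Albert Einstein', 'J.K. Rowling'], True),
    FakePage(200, ['Albert Einstein', 'Jane Austen'], False),
])
print(sorted(set(find_items(session))))
# ['Albert Einstein', 'J.K. Rowling', 'Jane Austen']
```

This confirms the generator stops at the last page, deduplicates via set(), and exits cleanly on a non-200 status instead of looping.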