Sep-17-2021, 05:48 PM
I cleaned it up a little based on some of your recommendations and my own:
This was not necessary because I had a break before it:
    if not(soup.select_one('li.next')):
        break
    else:
        current_page += 1

I also returned a list instead of a set from the get_content function.
The get_content function returns a list. I then decided to de-duplicate it with set and sort the result before saving. I was not aware of the built-in 'sorted' function.
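A minimal sketch of what sorted(set(...)) does, using made-up author names rather than scraped data. Note that sorted is a built-in function that always returns a new list, so the result is a sorted list of unique names rather than a set:

```python
# Made-up sample data; the real list comes from the scraped pages.
authors = ['Mark Twain', 'Jane Austen', 'Mark Twain', 'Albert Einstein', 'Jane Austen']

# set() drops the duplicates, sorted() returns a new alphabetically ordered list.
unique_sorted = sorted(set(authors))

print(type(unique_sorted).__name__)  # list
print(unique_sorted)  # ['Albert Einstein', 'Jane Austen', 'Mark Twain']
```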
    import requests
    import bs4 as bs
    import csv

    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'en-US,en;q=0.9',
        'Referer': 'https://google.com',
        'DNT': '1',
    }

    BASE_URL = 'https://quotes.toscrape.com/page/{}/'


    def get_html(BASE_URL, current_page, base_session):
        # GET request for the given page number
        response = base_session.get(BASE_URL.format(current_page), headers=HEADERS)
        return response


    def get_soup(soup, list_authors, selector):
        # Search for all of the authors and append them to the shared list
        for name in soup.select(selector):
            list_authors.append(name.text)
        return list_authors


    def save_csv(list_authors, filename):
        # Remove duplicates and sort alphabetically
        list_sorted = sorted(set(list_authors))
        # Save to CSV (disabled for now; the names are printed instead)
        # with open(filename, 'w', encoding='utf-8', newline='') as csvfile:
        #     writer = csv.writer(csvfile, delimiter=',')
        #     writer.writerow(['Author'])
        for author in list_sorted:
            # writer.writerow([author])
            print(author)


    def parse():
        # One session object shared by all requests
        base_session = requests.Session()
        list_authors = []
        current_page = 1
        while True:
            page_session = get_html(BASE_URL, current_page, base_session)
            if page_session.status_code != 200:
                print('error')
                break
            soup = bs.BeautifulSoup(page_session.text, 'lxml')
            get_soup(soup, list_authors, '.author')
            if not(soup.select_one('li.next')):
                # I had a corresponding else statement (else: current_page += 1) that was redundant.
                # If we don't break, it's safe to increment the current page.
                break
            current_page += 1
        # list_authors is always defined, even if the very first request fails
        save_csv(list_authors, 'example.csv')


    if __name__ == '__main__':
        parse()

Your way works great, but I wanted to break it up into functions even if it didn't warrant it. Thanks for all of your help. You made me think of a more abstract approach.
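The point about the redundant else can be checked offline. The sketch below is only an illustration: FAKE_PAGES, crawl_with_else and crawl_without_else are hypothetical names, and a plain dictionary stands in for the real pages (the 'has_next' flag plays the role of the li.next element). Both loop shapes visit the same pages and collect the same authors:

```python
# Hypothetical stand-in for the site: each page lists its authors and
# whether a "next" link (li.next in the real markup) is present.
FAKE_PAGES = {
    1: {'authors': ['Albert Einstein', 'Jane Austen'], 'has_next': True},
    2: {'authors': ['Jane Austen', 'Mark Twain'], 'has_next': True},
    3: {'authors': ['Albert Einstein'], 'has_next': False},
}


def crawl_with_else(pages):
    """Earlier shape: break, plus an explicit else for the increment."""
    authors = []
    current_page = 1
    while True:
        page = pages[current_page]
        authors.extend(page['authors'])
        if not page['has_next']:
            break
        else:
            current_page += 1
    return authors, current_page


def crawl_without_else(pages):
    """Cleaned-up shape: the increment is only reached when we did not break."""
    authors = []
    current_page = 1
    while True:
        page = pages[current_page]
        authors.extend(page['authors'])
        if not page['has_next']:
            break
        current_page += 1
    return authors, current_page


with_else = crawl_with_else(FAKE_PAGES)
without_else = crawl_without_else(FAKE_PAGES)

print(with_else == without_else)  # True
print(without_else[1])  # 3, the last page visited
```

Because break leaves the loop immediately, any statement after the if block runs only on the "no break" path, which is exactly what the else branch expressed.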