Sep-15-2021, 06:02 PM
Greetings,
I am writing a small program that will save authors to a CSV file. However, in the function get_soup I had to return a boolean because I had no way to break out of the while loop otherwise. If I could have, I would have just returned the set_authors set.
Is what I did optimal or is there a better way?
Keep in mind that there could be any number of pages. So I checked for the "next" button; when it was missing, I knew I had reached the last page.
I suppose I could have searched the html from the response for the last page with regex. I tried making a variable for:
bool_break = True
while bool_break:
Then I tried to change it from the get_soup function, but that didn't work.
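For what it's worth, the reason the bool_break attempt can't work is that assignment inside a function rebinds a *local* name, so nothing get_soup assigns will ever change the caller's bool_break. A minimal illustration (the function names here are made up for the demo):

```python
def stop_loop():
    # Assigning here creates a LOCAL name; the caller's bool_break is untouched.
    bool_break = False  # deliberately unused

def stop_loop_returned():
    # Returning the new value and reassigning at the call site does work.
    return False

bool_break = True
stop_loop()
print(bool_break)   # still True: the function only changed its own local

bool_break = stop_loop_returned()
print(bool_break)   # False: the caller reassigned from the return value
```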
import requests
import bs4

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://google.com',
    'DNT': '1',
}

BASE_URL = 'https://quotes.toscrape.com/page/{}/'


def get_html(base_url, current_page, ses):
    res = ses.get(base_url.format(current_page), headers=HEADERS)
    return res


def get_soup(res_text, set_authors):
    soup = bs4.BeautifulSoup(res_text, 'lxml')
    # Search for all of the authors on this page.
    for name in soup.select('.author'):
        # Add each author's link text to a set to remove duplicates.
        set_authors.add(name.text)
    if not soup.select('li.next'):
        # Found the last page.
        return True  # Need to break out of the outer while loop, or I would have just returned set_authors.


def parse():
    ses = requests.Session()
    set_authors = set()
    current_page = 1
    while True:
        res = get_html(BASE_URL, current_page, ses)
        if res.status_code == 200:
            if get_soup(res.text, set_authors):
                break
            current_page += 1
        else:
            print('error')
            break
    for author in set_authors:
        print(author)


parse()

# list_sort = list(set_authors)
# list_sort.sort()
# for author in list_sort:
#     print(author)
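One way to avoid returning a boolean sentinel from get_soup is to split the responsibilities: have the per-page function return both the author names and a has_next flag, and let the loop in the caller decide when to stop. That way nothing needs to "break the outer loop from inside another function". A rough sketch of the idea (collect_authors and authors_on_page are my names, not from the original; I've used html.parser and dropped the headers for brevity, so adjust as needed):

```python
import bs4
import requests

BASE_URL = 'https://quotes.toscrape.com/page/{}/'

def authors_on_page(html):
    """Return (author names on this page, whether a 'next' button exists)."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    names = {tag.text for tag in soup.select('.author')}
    has_next = bool(soup.select('li.next'))
    return names, has_next

def collect_authors(ses):
    """Walk the pages and return the full set of authors."""
    authors = set()
    page = 1
    while True:
        res = ses.get(BASE_URL.format(page))
        if res.status_code != 200:
            break
        names, has_next = authors_on_page(res.text)
        authors.update(names)
        if not has_next:
            break  # last page reached: no 'next' button
        page += 1
    return authors

# Usage (hits the network, so commented out):
# for author in sorted(collect_authors(requests.Session())):
#     print(author)
```

Since the page-parsing function no longer touches the loop, it is also much easier to test on a canned HTML string.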