BeautifulSoup: 6k records - but stops after parsing 20 lines - Printable Version (Python Forum, https://python-forum.io/thread-33597.html)
BeautifulSoup: 6k records - but stops after parsing 20 lines - apollo - May-10-2021

Dear Python experts, dear community - dear snippsat and Larz60+,

To create a quick overview of the opportunities for free volunteering in Europe, the goal is to fetch all of the roughly 6,000 target pages, for example:

https://europa.eu/youth/volunteering/organisation/48592
https://europa.eu/youth/volunteering/organisation/50162
https://europa.eu/youth/volunteering/organisation/50163

and so forth. The data wanted from each page are the name, address, website, phone, description, scope and so on (see the CSV header in the code below). Although there are more than 6,000 records, the script only gives back 20 records, i.e. 20 lines. This is my current approach:

import requests
from bs4 import BeautifulSoup
import re
import csv
from tqdm import tqdm

first = "https://europa.eu/youth/volunteering/organisations_en?page={}"
second = "https://europa.eu/youth/volunteering/organisation/{}_en"


def catch(url):
    with requests.Session() as req:
        pages = []
        print("Loading All IDS\n")
        for item in tqdm(range(0, 347)):
            r = req.get(url.format(item))
            soup = BeautifulSoup(r.content, 'html.parser')
            numbers = [item.get("href").split("/")[-1].split("_")[0]
                       for item in soup.findAll(
                           "a",
                           href=re.compile("^/youth/volunteering/organisation/"),
                           class_="btn btn-default")]
            pages.append(numbers)  # appends one page's IDs as a nested list
        return numbers  # returns only the last page's IDs


def parse(url):
    links = catch(first)
    with requests.Session() as req:
        with open("Data.csv", 'w', newline="", encoding="UTF-8") as f:
            writer = csv.writer(f)
            writer.writerow(["Name", "Address", "Site", "Phone", "Description",
                             "Scope", "Rec", "Send", "PIC", "OID", "Topic"])
            print("\nParsing Now... \n")
            for link in tqdm(links):
                r = req.get(url.format(link))
                soup = BeautifulSoup(r.content, 'html.parser')
                task = soup.find("section", class_="col-sm-12").contents
                name = task[1].text
                add = task[3].find(
                    "i", class_="fa fa-location-arrow fa-lg").parent.text.strip()
                try:
                    site = task[3].find("a", class_="link-default").get("href")
                except AttributeError:
                    site = "N/A"
                try:
                    phone = task[3].find(
                        "i", class_="fa fa-phone").next_element.strip()
                except AttributeError:
                    phone = "N/A"
                desc = task[3].find(
                    "h3", class_="eyp-project-heading underline").find_next("p").text
                scope = task[3].findAll("span", class_="pull-right")[1].text
                rec = task[3].select("tbody td")[1].text
                send = task[3].select("tbody td")[-1].text
                pic = task[3].select("span.vertical-space")[0].text.split(" ")[1]
                oid = task[3].select("span.vertical-space")[-1].text.split(" ")[1]
                topic = [item.next_element.strip()
                         for item in task[3].select("i.fa.fa-check.fa-lg")]
                writer.writerow([name, add, site, phone, desc, scope, rec,
                                 send, pic, oid, "".join(topic)])


parse(second)

But this stops after parsing 20 results. Note: I want to return pages, not numbers:

            pages.append(numbers)
        return numbers

I guess the catch function has a subtle mistake in it: it returns numbers, but I am pretty sure we intend to return pages, which means that when we iterate over the result of catch(first) in the other function, we are not iterating over everything that was collected. So I guess the fix is to return pages at the bottom of that function instead of return numbers. That said, if I simply change return numbers to return pages, I get no better results (see the small demo below). Any idea how to get the parser to give out all 6,000 results? A sketch of the fix I have in mind follows at the end of this post.
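To illustrate what I mean: pages.append(numbers) nests each page's list of IDs inside pages, so even after return pages the loop in parse iterates over lists rather than individual IDs, and url.format(link) then builds a malformed URL. A tiny stand-alone demo, with made-up IDs:

pages = []
pages.append(["50162", "50163"])  # append nests the whole page list
pages.append(["50164"])
print(pages)  # [['50162', '50163'], ['50164']]
print("https://europa.eu/youth/volunteering/organisation/{}_en".format(pages[0]))
# https://europa.eu/youth/volunteering/organisation/['50162', '50163']_en

pages = []
pages.extend(["50162", "50163"])  # extend keeps the list flat
pages.extend(["50164"])
print(pages)  # ['50162', '50163', '50164']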
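And here is the sketch of the corrected catch function I have in mind - untested, and assuming the listing really spans pages 0 to 346 and that every organisation link on a listing page carries the "btn btn-default" class, as in my script above:

import re
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

first = "https://europa.eu/youth/volunteering/organisations_en?page={}"


def catch(url):
    with requests.Session() as req:
        pages = []  # one flat list holding every organisation ID
        print("Loading All IDS\n")
        for page in tqdm(range(0, 347)):
            r = req.get(url.format(page))
            soup = BeautifulSoup(r.content, 'html.parser')
            numbers = [a.get("href").split("/")[-1].split("_")[0]
                       for a in soup.findAll(
                           "a",
                           href=re.compile("^/youth/volunteering/organisation/"),
                           class_="btn btn-default")]
            # extend, not append: append would nest each page's list,
            # so iterating over the result would yield lists, not IDs
            pages.extend(numbers)
        return pages  # return everything, not just the last page's IDs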