May-10-2021, 05:08 PM
Dear Python experts, dear community - dear Snippsat and Larsz60+,
To create a quick overview of the opportunities for free volunteering in Europe, the aim is to fetch all ~6,000 target pages - e.g. https://europa.eu/youth/volunteering/organisation/48592 (see below) - and from each one:
- the images, the explanation and description of the stated goals, and the other data that is wanted.
We fetch
- https://europa.eu/youth/volunteering/organisation/50162
- https://europa.eu/youth/volunteering/organisation/50163
and so on.
We have more than 6,000 records, and I do get results - but the script only gives back 20 records, i.e. 20 lines. See my current approach:
import requests
from bs4 import BeautifulSoup
import re
import csv
from tqdm import tqdm

first = "https://europa.eu/youth/volunteering/organisations_en?page={}"
second = "https://europa.eu/youth/volunteering/organisation/{}_en"

def catch(url):
    """Collect the organisation IDs from all 347 listing pages."""
    with requests.Session() as req:
        pages = []
        print("Loading All IDS\n")
        for item in tqdm(range(0, 347)):
            r = req.get(url.format(item))
            soup = BeautifulSoup(r.content, 'html.parser')
            # pull the numeric ID out of each organisation link on this page
            numbers = [item.get("href").split("/")[-1].split("_")[0]
                       for item in soup.findAll(
                           "a",
                           href=re.compile("^/youth/volunteering/organisation/"),
                           class_="btn btn-default")]
            pages.append(numbers)
        return numbers  # <- only the last page's IDs; see my note below

def parse(url):
    """Fetch each organisation page and write one CSV row per organisation."""
    links = catch(first)
    with requests.Session() as req:
        with open("Data.csv", 'w', newline="", encoding="UTF-8") as f:
            writer = csv.writer(f)
            writer.writerow(["Name", "Address", "Site", "Phone", "Description",
                             "Scope", "Rec", "Send", "PIC", "OID", "Topic"])
            print("\nParsing Now... \n")
            for link in tqdm(links):
                r = req.get(url.format(link))
                soup = BeautifulSoup(r.content, 'html.parser')
                task = soup.find("section", class_="col-sm-12").contents
                name = task[1].text
                add = task[3].find(
                    "i", class_="fa fa-location-arrow fa-lg").parent.text.strip()
                try:
                    site = task[3].find("a", class_="link-default").get("href")
                except:
                    site = "N/A"
                try:
                    phone = task[3].find(
                        "i", class_="fa fa-phone").next_element.strip()
                except:
                    phone = "N/A"
                desc = task[3].find(
                    "h3", class_="eyp-project-heading underline").find_next("p").text
                scope = task[3].findAll("span", class_="pull-right")[1].text
                rec = task[3].select("tbody td")[1].text
                send = task[3].select("tbody td")[-1].text
                pic = task[3].select("span.vertical-space")[0].text.split(" ")[1]
                oid = task[3].select("span.vertical-space")[-1].text.split(" ")[1]
                topic = [item.next_element.strip()
                         for item in task[3].select("i.fa.fa-check.fa-lg")]
                writer.writerow([name, add, site, phone, desc, scope, rec,
                                 send, pic, oid, "".join(topic)])

parse(second)

But this stops after parsing 20 results.
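For what it's worth, a quick sanity check (a hypothetical snippet, reusing catch and first from the script above) seems to confirm that only one listing page's worth of IDs survives:

links = catch(first)
print(len(links))  # ~20: the organisation buttons of the last listing page,
                   # not the ~6,000 IDs collected across all 347 pages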
Note: I want to return pages, not numbers. Here I seem to have some mistakes:

    pages.append(numbers)
    return numbers
I guess the catch function has a subtle mistake in it: we return numbers there, but I'm pretty sure that is wrong - we intend to return pages. That means that when we iterate over the result of catch(first) in the other function, we are not iterating over everything that is wanted.
So I guess I need a fix: return pages at the bottom of that function instead of return numbers.
That said, since I want to iterate over pages, not numbers: even if I change return numbers to return pages, I get no better results.
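My guess as to why: pages is a list of lists (one sub-list per listing page), so url.format(link) in parse would receive a whole list instead of a single ID. Here is a sketch of what I think catch should look like - flattening with extend and returning the flat list at the bottom. Same names as above, but I have not verified this:

import re
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

def catch(url):
    with requests.Session() as req:
        pages = []                 # one flat list of all organisation IDs
        print("Loading All IDS\n")
        for page in tqdm(range(0, 347)):
            r = req.get(url.format(page))
            soup = BeautifulSoup(r.content, 'html.parser')
            numbers = [a.get("href").split("/")[-1].split("_")[0]
                       for a in soup.findAll(
                           "a",
                           href=re.compile("^/youth/volunteering/organisation/"),
                           class_="btn btn-default")]
            pages.extend(numbers)  # extend, not append: keep the list flat
        return pages               # return all pages' IDs, not just the last batch

With that, parse(second) should iterate over every collected ID instead of only the last ~20 - but maybe I am missing something?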
Any idea how to get the parser to give out all the 6,000 results?