May-10-2021, 05:08 PM
Dear Python experts, dear community, dear Snippsat and Larsz60+,
I would like to create a quick overview of the opportunities for free volunteering in Europe.
The aim is to fetch all of the roughly 6,000 target pages. See below for the images and for the description of the goal and of the data that is wanted.
We fetch ..
and so forth.
There are more than 6,000 records. I do get results, but the script only gives back 20 records, i.e. 20 lines. See my current approach, a small script that I run:
```python
import requests
from bs4 import BeautifulSoup
import re
import csv
from tqdm import tqdm

first = "{}"
second = "{}_en"


def catch(url):
    with requests.Session() as req:
        pages = []
        print("Loading All IDS\n")
        for item in tqdm(range(0, 347)):
            r = req.get(url.format(item))
            soup = BeautifulSoup(r.content, 'html.parser')
            numbers = [item.get("href").split("/")[-1].split("_")[0] for item in soup.findAll(
                "a", href=re.compile("^/youth/volunteering/organisation/"), class_="btn btn-default")]
            pages.append(numbers)
        return numbers


def parse(url):
    links = catch(first)
    with requests.Session() as req:
        with open("Data.csv", 'w', newline="", encoding="UTF-8") as f:
            writer = csv.writer(f)
            writer.writerow(["Name", "Address", "Site", "Phone",
                             "Description", "Scope", "Rec", "Send", "PIC", "OID", "Topic"])
            print("\nParsing Now... \n")
            for link in tqdm(links):
                r = req.get(url.format(link))
                soup = BeautifulSoup(r.content, 'html.parser')
                task = soup.find("section", class_="col-sm-12").contents
                name = task[1].text
                add = task[3].find(
                    "i", class_="fa fa-location-arrow fa-lg").parent.text.strip()
                try:
                    site = task[3].find("a", class_="link-default").get("href")
                except:
                    site = "N/A"
                try:
                    phone = task[3].find(
                        "i", class_="fa fa-phone").next_element.strip()
                except:
                    phone = "N/A"
                desc = task[3].find(
                    "h3", class_="eyp-project-heading underline").find_next("p").text
                scope = task[3].findAll("span", class_="pull-right")[1].text
                rec = task[3].select("tbody td")[1].text
                send = task[3].select("tbody td")[-1].text
                pic = task[3].select(
                    "span.vertical-space")[0].text.split(" ")[1]
                oid = task[3].select(
                    "span.vertical-space")[-1].text.split(" ")[1]
                topic = [item.next_element.strip() for item in task[3].select(
                    "i.fa.fa-check.fa-lg")]
                writer.writerow([name, add, site, phone, desc,
                                 scope, rec, send, pic, oid, "".join(topic)])


parse(second)
```
But this stops after parsing 20 results.
Note: I want to return pages, not numbers.
```python
pages.append(numbers)
return numbers
```
Here I seem to have some mistakes:
I guess that the catch function has a mistake in it: it returns numbers, but I'm pretty sure that is wrong. We intend to return pages, which means that when we iterate over the results of catch(first) in the other function, we are not iterating over everything that is wanted.
I guess the fix is to return pages at the bottom of that function instead of return numbers.
That said, even if I change return numbers to return pages, I get no better results.
Any idea how to get the parser to give out all 6k results?
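To make the suspicion about return numbers versus return pages concrete, here is a minimal offline sketch. It does not touch the real site: fake_page_ids is a hypothetical stand-in for one scraped listing page, the IDs are made up, and 20 IDs per page is an assumption taken from the 20 lines I get back. It only compares what the loop would hand to parse in three cases: the last numbers list, the nested pages list, and a flattened list built with itertools.chain.

```python
from itertools import chain


def fake_page_ids(page_number, per_page=20):
    """Hypothetical stand-in for one listing page: returns made-up ID strings."""
    start = page_number * per_page
    return [str(start + i) for i in range(per_page)]


def collect(num_pages):
    pages = []
    numbers = []
    for page_number in range(num_pages):
        numbers = fake_page_ids(page_number)  # rebound on every iteration
        pages.append(numbers)

    last_only = numbers  # what `return numbers` yields: IDs of the last page only
    nested = pages       # what `return pages` yields: a list of lists
    flat = list(chain.from_iterable(pages))  # one flat list with every ID
    return last_only, nested, flat


last_only, nested, flat = collect(347)
print(len(last_only), len(nested), len(flat))

# With the nested result, each `link` in the parse loop would be a whole list,
# so formatting it into the URL template gives something that is not a valid ID.
bad_url = "{}_en".format(nested[0][:2])
print(bad_url)
```

If this reading is right, return numbers would explain the 20 rows, and return pages would explain why nothing improves: each link would be a list rather than a single ID. Returning a flattened list (or using pages.extend(numbers) instead of pages.append(numbers)) might be the missing piece, though I have not verified this against the live pages.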