Python Forum
BeautifulSoup: 6k records - but stops after parsing 20 lines
#1
Dear Python experts, dear community - dear snippsat and Larz60+,


For the sake of creating a quick overview of a set of opportunities for free volunteering in Europe:

The aim is to get all of the ~6,000 target pages, e.g. https://europa.eu/youth/volunteering/organisation/48592 (see below)
- the images, the explanation and description of the aims, and the data that is wanted.

So we fetch
- https://europa.eu/youth/volunteering/organisation/50162
- https://europa.eu/youth/volunteering/organisation/50163

and so forth.
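
Just to make the intended flow explicit, here is a tiny illustration - using the detail-URL template from the script below and the two example IDs above:

second = "https://europa.eu/youth/volunteering/organisation/{}_en"
for org_id in ["50162", "50163"]:      # example IDs taken from the links above
    print(second.format(org_id))       # builds the detail-page URL for each ID

The real IDs are supposed to come from the paginated overview pages, collected by the catch function.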

Since there are more than 6,000 records, I do get results - but the script only gives back 20 records, i.e. 20 lines.

My current approach: I run this mini-script here:

import requests
from bs4 import BeautifulSoup
import re
import csv
from tqdm import tqdm


first = "https://europa.eu/youth/volunteering/organisations_en?page={}"
second = "https://europa.eu/youth/volunteering/organisation/{}_en"


def catch(url):
    # walk the paginated overview pages and collect the organisation IDs
    with requests.Session() as req:
        pages = []
        print("Loading All IDS\n")
        for item in tqdm(range(0, 347)):
            r = req.get(url.format(item))
            soup = BeautifulSoup(r.content, 'html.parser')
            numbers = [item.get("href").split("/")[-1].split("_")[0] for item in soup.findAll(
                "a", href=re.compile("^/youth/volunteering/organisation/"), class_="btn btn-default")]
            pages.append(numbers)
        return numbers


def parse(url):
    # fetch each organisation's detail page and write the wanted fields to Data.csv
    links = catch(first)
    with requests.Session() as req:
        with open("Data.csv", 'w', newline="", encoding="UTF-8") as f:
            writer = csv.writer(f)
            writer.writerow(["Name", "Address", "Site", "Phone",
                             "Description", "Scope", "Rec", "Send", "PIC", "OID", "Topic"])
            print("\nParsing Now... \n")
            for link in tqdm(links):
                r = req.get(url.format(link))
                soup = BeautifulSoup(r.content, 'html.parser')
                task = soup.find("section", class_="col-sm-12").contents
                name = task[1].text
                add = task[3].find(
                    "i", class_="fa fa-location-arrow fa-lg").parent.text.strip()
                try:
                    site = task[3].find("a", class_="link-default").get("href")
                except:
                    site = "N/A"
                try:
                    phone = task[3].find(
                        "i", class_="fa fa-phone").next_element.strip()
                except:
                    phone = "N/A"
                desc = task[3].find(
                    "h3", class_="eyp-project-heading underline").find_next("p").text
                scope = task[3].findAll("span", class_="pull-right")[1].text
                rec = task[3].select("tbody td")[1].text
                send = task[3].select("tbody td")[-1].text
                pic = task[3].select(
                    "span.vertical-space")[0].text.split(" ")[1]
                oid = task[3].select(
                    "span.vertical-space")[-1].text.split(" ")[1]
                topic = [item.next_element.strip() for item in task[3].select(
                    "i.fa.fa-check.fa-lg")]
                writer.writerow([name, add, site, phone, desc,
                                 scope, rec, send, pic, oid, "".join(topic)])


parse(second)
But this stops after parsing 20 results.

Note: I want to return pages, not numbers:

            pages.append(numbers)
        return numbers
 
Here I seem to have made a mistake:

I guess the catch function has a subtle error in it: it returns numbers, but I'm pretty sure we intend to return pages - which means that when we iterate over the result of catch(first) in the other function, we are not iterating over everything we want.
So I guess the fix is to return pages at the bottom of that function instead of return numbers.


That said, I do want to iterate over pages, not numbers - but if I just change return numbers to return pages, I get no better results.
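
My guess now: even with return pages, pages is a list of lists (one sub-list appended per overview page), so url.format(link) in parse gets a whole list instead of a single ID. Here is a minimal, untested sketch of what I think catch should look like - flattening with extend and returning pages - assuming the link extraction itself is fine:

def catch(url):
    with requests.Session() as req:
        pages = []
        print("Loading All IDS\n")
        for item in tqdm(range(0, 347)):
            r = req.get(url.format(item))
            soup = BeautifulSoup(r.content, 'html.parser')
            numbers = [a.get("href").split("/")[-1].split("_")[0] for a in soup.findAll(
                "a", href=re.compile("^/youth/volunteering/organisation/"), class_="btn btn-default")]
            # extend instead of append, so all IDs end up in one flat list
            pages.extend(numbers)
        # return the whole flat list, not just the numbers of the last overview page
        return pages

That way parse(second) would iterate over every single organisation ID, and url.format(link) would always get a plain ID like 50162.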

Any idea how to get the parser to give back all 6k results - is this guess going in the right direction?

