Python Forum
BeautifulSoup: 6k records - but stops after parsing 20 lines
#1
Dear Python experts, dear community - dear snippsat and Larz60+,


For the sake of creating a quick overview of a set of opportunities for free volunteering in Europe:

The aim is to get all of the ~6,000 target pages, e.g. https://europa.eu/youth/volunteering/organisation/48592 (see below)
- the images, the explanation and description of the aims, and the data that is wanted.

So we fetch
- https://europa.eu/youth/volunteering/organisation/50162
- https://europa.eu/youth/volunteering/organisation/50163

and so forth.
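
Just to make the intended flow explicit, here is a tiny illustration - using the detail-URL template from the script below and the two example IDs above:

second = "https://europa.eu/youth/volunteering/organisation/{}_en"
for org_id in ["50162", "50163"]:      # example IDs taken from the links above
    print(second.format(org_id))       # builds the detail-page URL for each ID

The real IDs are supposed to come from the paginated overview pages, collected by the catch function.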

Since there are more than 6,000 records, I do get results - but the script only gives back 20 records, i.e. 20 lines.

My current approach: I run this mini-script here:

import requests
from bs4 import BeautifulSoup
import re
import csv
from tqdm import tqdm


first = "https://europa.eu/youth/volunteering/organisations_en?page={}"
second = "https://europa.eu/youth/volunteering/organisation/{}_en"


def catch(url):
    # walk the paginated overview pages and collect the organisation IDs
    with requests.Session() as req:
        pages = []
        print("Loading All IDS\n")
        for item in tqdm(range(0, 347)):
            r = req.get(url.format(item))
            soup = BeautifulSoup(r.content, 'html.parser')
            numbers = [item.get("href").split("/")[-1].split("_")[0] for item in soup.findAll(
                "a", href=re.compile("^/youth/volunteering/organisation/"), class_="btn btn-default")]
            pages.append(numbers)
        return numbers


def parse(url):
    # fetch each organisation's detail page and write the wanted fields to Data.csv
    links = catch(first)
    with requests.Session() as req:
        with open("Data.csv", 'w', newline="", encoding="UTF-8") as f:
            writer = csv.writer(f)
            writer.writerow(["Name", "Address", "Site", "Phone",
                             "Description", "Scope", "Rec", "Send", "PIC", "OID", "Topic"])
            print("\nParsing Now... \n")
            for link in tqdm(links):
                r = req.get(url.format(link))
                soup = BeautifulSoup(r.content, 'html.parser')
                task = soup.find("section", class_="col-sm-12").contents
                name = task[1].text
                add = task[3].find(
                    "i", class_="fa fa-location-arrow fa-lg").parent.text.strip()
                try:
                    site = task[3].find("a", class_="link-default").get("href")
                except:
                    site = "N/A"
                try:
                    phone = task[3].find(
                        "i", class_="fa fa-phone").next_element.strip()
                except:
                    phone = "N/A"
                desc = task[3].find(
                    "h3", class_="eyp-project-heading underline").find_next("p").text
                scope = task[3].findAll("span", class_="pull-right")[1].text
                rec = task[3].select("tbody td")[1].text
                send = task[3].select("tbody td")[-1].text
                pic = task[3].select(
                    "span.vertical-space")[0].text.split(" ")[1]
                oid = task[3].select(
                    "span.vertical-space")[-1].text.split(" ")[1]
                topic = [item.next_element.strip() for item in task[3].select(
                    "i.fa.fa-check.fa-lg")]
                writer.writerow([name, add, site, phone, desc,
                                 scope, rec, send, pic, oid, "".join(topic)])


parse(second)
But this stops after parsing 20 results.

Note: I want to return pages, not numbers:

            pages.append(numbers)
        return numbers
 
Here I seem to have made a mistake:

I guess the catch function has a subtle error in it: it returns numbers, but I'm pretty sure we intend to return pages - which means that when we iterate over the result of catch(first) in the other function, we are not iterating over everything we want.
So I guess the fix is to return pages at the bottom of that function instead of return numbers.


That said, I do want to iterate over pages, not numbers - but if I just change return numbers to return pages, I get no better results.
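
My guess now: even with return pages, pages is a list of lists (one sub-list appended per overview page), so url.format(link) in parse gets a whole list instead of a single ID. Here is a minimal, untested sketch of what I think catch should look like - flattening with extend and returning pages - assuming the link extraction itself is fine:

def catch(url):
    with requests.Session() as req:
        pages = []
        print("Loading All IDS\n")
        for item in tqdm(range(0, 347)):
            r = req.get(url.format(item))
            soup = BeautifulSoup(r.content, 'html.parser')
            numbers = [a.get("href").split("/")[-1].split("_")[0] for a in soup.findAll(
                "a", href=re.compile("^/youth/volunteering/organisation/"), class_="btn btn-default")]
            # extend instead of append, so all IDs end up in one flat list
            pages.extend(numbers)
        # return the whole flat list, not just the numbers of the last overview page
        return pages

That way parse(second) would iterate over every single organisation ID, and url.format(link) would always get a plain ID like 50162.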

Any idea how to get the parser to give back all 6k results - is this guess going in the right direction?

