Python Forum
BeautifulSoup pagination using href
#1
I am trying to scrape all the events from https://www.onthisday.com/events/february/5. I am getting all the events from the first page. How can I get the events from the second page and merge them into one list?

I tried to grab the next-page link and parse it, but it didn't work: I am still getting the results from the first page.

Here is my code:

from typing import List
import requests as _requests
import bs4 as _bs4

def _generate_url(month: str, day: int) -> str:
    url = f'https://www.onthisday.com/events/{month}/{day}'
    return url

def _get_page(url: str) -> _bs4.BeautifulSoup:
    _page = _requests.get(url)
    soup = _bs4.BeautifulSoup(_page.content, 'html.parser')
    return soup

def events_of_the_day(month: str, day: int) -> List[str]:
    """
    Return the events of a given day
    """
    
    url = _generate_url(month, day)
    page = _get_page(url)
    next_link = page.select_one("a.pag__next")
    raw_events = [event.text for event in page.select("li.event")]
    if next_link:
        next_url = 'https://www.onthisday.com/events'+next_link['href']
        page_next = _get_page(next_url)
        for eve in page_next.select("li.event"):
            print(eve.text)
    
    #print(raw_events)
    

events_of_the_day("february", 5)
Note:

Some pages have a next page and some don't, so I need to handle both situations.
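
One possible cause, offered as a guess rather than a confirmed diagnosis: if the href on a.pag__next is already root-relative, then prepending 'https://www.onthisday.com/events' repeats the /events segment and the request may not reach the second page. The href value below is a hypothetical example, not taken from the site. urllib.parse.urljoin from the standard library resolves an href against the URL of the page it was found on, whether the href is relative or absolute:

```python
from urllib.parse import urljoin

page_url = "https://www.onthisday.com/events/february/5"
# Hypothetical href value; the real attribute on the site may differ.
href = "/events/february/5?p=2"

# Plain concatenation repeats the "/events" path segment.
concatenated = "https://www.onthisday.com/events" + href

# urljoin resolves the href against the URL of the current page.
resolved = urljoin(page_url, href)

print(concatenated)
print(resolved)
```

Printing next_link['href'] on the real page is the quickest way to check which form the site actually uses.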
#2
Here is an idea: since the day is just an integer, there is no need to parse it from the page; just generate it outside the function when you call it.
yield is often useful for this: calling a generator function only creates a generator object, which uses little or no memory until it is consumed.
So it could look like this:
from typing import Iterator, List
import requests as _requests
import bs4 as _bs4

def _generate_url(month: str, day: int) -> str:
    url = f'https://www.onthisday.com/events/{month}/{day}'
    return url

def _get_page(url: str) -> _bs4.BeautifulSoup:
    _page = _requests.get(url)
    soup = _bs4.BeautifulSoup(_page.content, 'html.parser')
    return soup

def events_of_the_day(month: str, day: int) -> Iterator[List[str]]:
    """
    Yield the events found on the first page for a given day.
    Nothing is requested until the generator is consumed.
    """
    url = _generate_url(month, day)
    page = _get_page(url)
    # Yields one list, so list(generator) gives a list containing that list
    yield [event.text for event in page.select("li.event")]

if __name__ == '__main__':
    days = []
    for day in range(5, 9):
        days.append(events_of_the_day("february", day))
Test.
# Just generator objects; nothing has been fetched yet
>>> days
[<generator object events_of_the_day at 0x00000135CFF3F890>,
 <generator object events_of_the_day at 0x00000135D0A3ED60>,
 <generator object events_of_the_day at 0x00000135D0A3EDD0>,
 <generator object events_of_the_day at 0x00000135D0A3EE40>]

# Consuming a generator fetches the page and produces the content
>>> list(days[0])
[['816 Frankish emperor Louis grants archbishop Salzburg immunity',
  '1488 Roman Catholic German Emperor Maximilian I caught in Belgium',
  '1512 French troops under Gaston de Foix rescue Bologna, which was under '
  'siege from a combined Papal-Spanish army',
  '1556 Kings Henri I and Philip II sign Treaty of Vaucelles',
  '1572 Beggars assault Oisterwijk Neth, drive nuns out',
  .....

>>> list(days[1])
[['337 St Julius I begins his reign as Catholic Pope',
  '1189 Riots in Lynn, Norfolk (England) spread to Norwich',
  '1508 Maximilian I proclaimed Holy Roman Emperor, 1st Emperor in centuries '
  'not to be crowned by the Pope',
  '1577 King Henri de Bourbon of Navarra becomes leader of the Huguenots',
  '1626 Huguenot rebels & French sign Peace of La Rochelle',
  .....
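
The code above reads only the first page of each day. A minimal sketch of following the next-page link until there is none, and merging everything into one flat list, could look like this. It reuses the selectors from the thread (li.event and a.pag__next); the names fetch_html and iter_events and the fetch and max_pages parameters are hypothetical, added here for illustration. This is one way to do it under those assumptions, not a tested solution for the live site.

```python
from typing import Callable, Iterator, List
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def iter_events(start_url: str,
                fetch: Callable[[str], str] = fetch_html,
                max_pages: int = 20) -> Iterator[str]:
    """Yield event texts from start_url and each page reached via a.pag__next."""
    url = start_url
    seen = set()
    # Stop on a missing next link, a repeated URL, or the page cap
    # (max_pages is an assumed safety limit, not something the site defines).
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        for event in soup.select("li.event"):
            yield event.text.strip()
        next_link = soup.select_one("a.pag__next")
        href = next_link.get("href") if next_link else None
        # urljoin handles relative and absolute hrefs alike.
        url = urljoin(url, href) if href else None


def events_of_the_day(month: str, day: int) -> List[str]:
    """Return all events for the day, across pages, as one flat list."""
    return list(iter_events(f"https://www.onthisday.com/events/{month}/{day}"))
```

A day without a next page simply ends after its first page, so both situations go through the same loop. Passing a different fetch callable (for example one that reads saved HTML files) lets the loop be tried without hitting the site.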