Python Forum

BeautifulSoup pagination using href
I am trying to scrape all the events from https://www.onthisday.com/events/february/5. I am getting all the events from the first page. How can I get the events from the second page as well and merge them into one list?

Right now I try to catch the next-page link and parse it, but it doesn't work: I am still only getting the results from the first page.

Here is my code:

from typing import List
import requests as _requests
import bs4 as _bs4

def _generate_url(month: str, day: int) -> str:
    url = f'https://www.onthisday.com/events/{month}/{day}'
    return url

def _get_page(url: str) -> _bs4.BeautifulSoup:
    _page = _requests.get(url)
    soup = _bs4.BeautifulSoup(_page.content, 'html.parser')
    return soup

def events_of_the_day(month: str, day: int) -> List[str]:
    """
    Return the events of a given day
    """
    
    url = _generate_url(month, day)
    page = _get_page(url)
    next_link = page.select_one("a.pag__next")   # link to the next page of events, if there is one
    raw_events = [event.text for event in page.select("li.event")]
    if next_link:
        next_url = 'https://www.onthisday.com/events' + next_link['href']
        page_next = _get_page(next_url)
        for eve in page_next.select("li.event"):
            print(eve.text)
    
    #print(raw_events)
    

events_of_the_day("february", 5)
Note:

Some pages have a next page and some don't, so I am looking to handle both situations.
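
For reference, a minimal sketch of one way to handle both situations, assuming the li.event and a.pag__next selectors used above still match the page markup; urljoin is used so the exact form of the href (relative or absolute) does not matter:

from typing import List, Optional
from urllib.parse import urljoin

import requests as _requests
import bs4 as _bs4

def _get_page(url: str) -> _bs4.BeautifulSoup:
    page = _requests.get(url)
    return _bs4.BeautifulSoup(page.content, 'html.parser')

def events_of_the_day(month: str, day: int) -> List[str]:
    """Return the events of a given day, merged across all of its pages."""
    events: List[str] = []
    url: Optional[str] = f'https://www.onthisday.com/events/{month}/{day}'
    while url:
        page = _get_page(url)
        events.extend(event.text for event in page.select("li.event"))
        next_link = page.select_one("a.pag__next")
        # Resolve the href against the current URL; works for relative and absolute links
        url = urljoin(url, next_link['href']) if next_link else None
    return events

events_of_the_day("february", 5) then returns a single merged list whether or not the day has a second page.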
Here's an idea: since day is just an integer, there is no need to parse it; just generate it outside and pass it in when calling the function.
Often yield can be useful for this: a generator just creates a generator object, using no/little memory before you iterate it.
It could then look like this.
from typing import Iterator, List
import requests as _requests
import bs4 as _bs4

def _generate_url(month: str, day: int) -> str:
    url = f'https://www.onthisday.com/events/{month}/{day}'
    return url

def _get_page(url: str) -> _bs4.BeautifulSoup:
    _page = _requests.get(url)
    soup = _bs4.BeautifulSoup(_page.content, 'html.parser')
    return soup

def events_of_the_day(month: str, day: int) -> Iterator[List[str]]:
    """
    Return the events of a given day
    """
    url = _generate_url(month, day)
    page = _get_page(url)
    next_link = page.select_one("a.pag__next")   # next-page link, not followed here yet
    yield [event.text for event in page.select("li.event")]

if __name__ == '__main__':
    days = []
    for day in range(5, 9):
        days.append(events_of_the_day("february", day))
Test.
# Just generator objects, nothing fetched yet
>>> days
[<generator object events_of_the_day at 0x00000135CFF3F890>,
 <generator object events_of_the_day at 0x00000135D0A3ED60>,
 <generator object events_of_the_day at 0x00000135D0A3EDD0>,
 <generator object events_of_the_day at 0x00000135D0A3EE40>]

# Iterating one of them fetches and generates the content
>>> list(days[0])
[['816 Frankish emperor Louis grants archbishop Salzburg immunity',
  '1488 Roman Catholic German Emperor Maximilian I caught in Belgium',
  '1512 French troops under Gaston de Foix rescue Bologna, which was under '
  'siege from a combined Papal-Spanish army',
  '1556 Kings Henri I and Philip II sign Treaty of Vaucelles',
  '1572 Beggars assault Oisterwijk Neth, drive nuns out',
  .....

>>> list(days[1])
[['337 St Julius I begins his reign as Catholic Pope',
  '1189 Riots in Lynn, Norfolk (England) spread to Norwich',
  '1508 Maximilian I proclaimed Holy Roman Emperor, 1st Emperor in centuries '
  'not to be crowned by the Pope',
  '1577 King Henri de Bourbon of Navarra becomes leader of the Huguenots',
  '1626 Huguenot rebels & French sign Peace of La Rochelle',
  .....
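
The generator above does not follow a.pag__next yet; a sketch of how the pagination from the first post could be folded in, yielding one list of events per page (same assumptions about the selectors as before):

from typing import Iterator, List, Optional
from urllib.parse import urljoin

import requests as _requests
import bs4 as _bs4

def _get_page(url: str) -> _bs4.BeautifulSoup:
    page = _requests.get(url)
    return _bs4.BeautifulSoup(page.content, 'html.parser')

def events_of_the_day(month: str, day: int) -> Iterator[List[str]]:
    """Yield one list of events per page for a given day."""
    url: Optional[str] = f'https://www.onthisday.com/events/{month}/{day}'
    while url:
        page = _get_page(url)
        yield [event.text for event in page.select("li.event")]
        next_link = page.select_one("a.pag__next")
        # Follow the next-page link when present, otherwise stop
        url = urljoin(url, next_link['href']) if next_link else None

A merged list for one day is then e.g. [event for events in events_of_the_day("february", 5) for event in events], and the lazy behaviour shown in the test above is unchanged.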