BeautifulSoup pagination using href

rhat398 · Jun-29-2021, 10:22 PM

I am trying to scrape all the events from https://www.onthisday.com/events/february/5. I am getting all the events from the first page. How can I get the events from the second page and merge everything into one list?

I tried to grab the next-page link and parse that page, but it didn't work: I am still getting the results from the first page.

Here is my code:

from typing import List
import requests as _requests
import bs4 as _bs4

def _generate_url(month: str, day: int) -> str:
    url = f'https://www.onthisday.com/events/{month}/{day}'
    return url

def _get_page(url: str) -> _bs4.BeautifulSoup:
    _page = _requests.get(url)
    soup = _bs4.BeautifulSoup(_page.content, 'html.parser')
    return soup

def events_of_the_day(month: str, day: int) -> List[str]:
    """
    Return the events of a given day
    """
    
    url = _generate_url(month, day)
    page = _get_page(url)
    next_link = page.select_one("a.pag__next")
    raw_events = [event.text for event in page.select("li.event")]
    if next_link:
        next_url = 'https://www.onthisday.com/events'+next_link['href']
        page_next = _get_page(next_url)
        for eve in page_next.select("li.event"):
            print(eve.text)
    
    #print(raw_events)
    

events_of_the_day("february", 5)

Note:

Some pages contain a next-page link and some don't, so I want to handle both situations.

***snippsat*** · (This post was last modified: Jun-30-2021, 10:57 AM by snippsat.)

Here is an idea: since the day is just an integer, there is no need to parse it from the page; just generate it outside, when you call the function.
yield is often useful for this. A generator function only creates a generator object, which uses little or no memory until you iterate over it.
It could then look like this.

from typing import Iterator, List
import requests as _requests
import bs4 as _bs4

def _generate_url(month: str, day: int) -> str:
    url = f'https://www.onthisday.com/events/{month}/{day}'
    return url

def _get_page(url: str) -> _bs4.BeautifulSoup:
    _page = _requests.get(url)
    soup = _bs4.BeautifulSoup(_page.content, 'html.parser')
    return soup

def events_of_the_day(month: str, day: int) -> Iterator[List[str]]:
    """
    Yield the events of a given day as one list (first page only)
    """
    url = _generate_url(month, day)
    page = _get_page(url)
    # Nothing is downloaded until the generator is iterated
    yield [event.text for event in page.select("li.event")]

if __name__ == '__main__':
    days = []
    for day in range(5, 9):
        days.append(events_of_the_day("february", day))

Test.

# Just generator object links
>>> days
[<generator object events_of_the_day at 0x00000135CFF3F890>,
 <generator object events_of_the_day at 0x00000135D0A3ED60>,
 <generator object events_of_the_day at 0x00000135D0A3EDD0>,
 <generator object events_of_the_day at 0x00000135D0A3EE40>]

# Iterate over it and it will generate the content
>>> list(days[0])
[['816 Frankish emperor Louis grants archbishop Salzburg immunity',
  '1488 Roman Catholic German Emperor Maximilian I caught in Belgium',
  '1512 French troops under Gaston de Foix rescue Bologna, which was under '
  'siege from a combined Papal-Spanish army',
  '1556 Kings Henri I and Philip II sign Treaty of Vaucelles',
  '1572 Beggars assault Oisterwijk Neth, drive nuns out',
  .....

>>> list(days[1])
[['337 St Julius I begins his reign as Catholic Pope',
  '1189 Riots in Lynn, Norfolk (England) spread to Norwich',
  '1508 Maximilian I proclaimed Holy Roman Emperor, 1st Emperor in centuries '
  'not to be crowned by the Pope',
  '1577 King Henri de Bourbon of Navarra becomes leader of the Huguenots',
  '1626 Huguenot rebels & French sign Peace of La Rochelle',
  .....
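
To also follow the next-page link from the question and merge everything into one list, a minimal sketch could look like the following. It reuses the selectors li.event and a.pag__next from the code above. The names get_page, iter_events, the fetch parameter and the max_pages limit are made up for this example. The exact form of the href on the site is not shown in the thread, so the link is resolved with urljoin against the current page URL rather than glued onto a fixed prefix; urljoin handles both site-relative and page-relative hrefs.

```python
from typing import Callable, Iterator, List, Optional
from urllib.parse import urljoin


def get_page(url: str):
    """Download a URL and parse it, same idea as _get_page above."""
    # Imported here so the loop below can be tried offline with a stub fetcher.
    import requests
    import bs4
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return bs4.BeautifulSoup(response.content, "html.parser")


def iter_events(start_url: str,
                fetch: Callable = get_page,
                max_pages: int = 20) -> Iterator[str]:
    """Yield event texts from start_url and every following page."""
    url: Optional[str] = start_url
    seen = set()  # guards against a next link that loops back
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = fetch(url)
        for event in page.select("li.event"):
            yield event.text
        next_link = page.select_one("a.pag__next")
        href = next_link.get("href") if next_link is not None else None
        # Resolve the href against the page it came from; stop when there is none.
        url = urljoin(url, href) if href else None


def events_of_the_day(month: str, day: int,
                      fetch: Callable = get_page) -> List[str]:
    """Return the events of all pages for a given day as one flat list."""
    start = f"https://www.onthisday.com/events/{month}/{day}"
    return list(iter_events(start, fetch))
```

Calling events_of_the_day("february", 5) should then return a single list. A day without a next-page link simply ends after its first page, so both situations go through the same loop.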
