Python Forum
Advancing Page Numbers
#1
Hi all,

I want to be able to scrape different pages and categories within the same website. I have this code so far:

from bs4 import BeautifulSoup
import requests

cats = ['romance_8', 'childrens_11']
page_number = 1

for cat in cats:
    while True:
        url = f'https://books.toscrape.com/catalogue/category/books/{cat}/page-{page_number}.html'

        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        active_page = soup.find('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')
        pages = soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')

        if active_page is None:
            break
        for page in pages:
            price_color = page.find('p', class_='price_color').text.strip()
        print(url)
        page_number = page_number + 1


The output I get from this code is:
Output:
https://books.toscrape.com/catalogue/category/books/romance_8/page-1.html
https://books.toscrape.com/catalogue/category/books/romance_8/page-2.html
This is partly correct, but I wanted the code to advance to the next item in the list, "childrens_11", and get a list of URLs for that category too. It coincidentally also has 2 pages, so if the code were working correctly, it would show:

Output:
https://books.toscrape.com/catalogue/category/books/romance_8/page-1.html
https://books.toscrape.com/catalogue/category/books/romance_8/page-2.html
https://books.toscrape.com/catalogue/category/books/childrens_11/page-1.html
https://books.toscrape.com/catalogue/category/books/childrens_11/page-2.html
Could someone please enlighten me how to fix the code to enable this?

Thank you.
Reply
#2
Something like this.
import requests
from bs4 import BeautifulSoup

cats = ['romance_8', 'childrens_11', 'travel_2']
pages = []
for book in cats:
    url = f'https://books.toscrape.com/catalogue/category/books/{book}'
    #url = 'http://books.toscrape.com/catalogue/category/books/travel_2' # one page
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    singel_page = soup.select_one('li.next > a')
    if singel_page is None:
        pages.append(response.url)
    else:
            # Just set a high upper bound and break out when the status is not 200
        for page_nr in range(1, 100):
            gen_url = f'{url}/page-{page_nr}.html'
            page = requests.get(gen_url)
            if page.status_code == 200:
                pages.append(page.url)
            else:
                break
>>> pages
['https://books.toscrape.com/catalogue/category/books/romance_8/page-1.html',
 'https://books.toscrape.com/catalogue/category/books/romance_8/page-2.html',
 'https://books.toscrape.com/catalogue/category/books/childrens_11/page-1.html',
 'https://books.toscrape.com/catalogue/category/books/childrens_11/page-2.html',
 'http://books.toscrape.com/catalogue/category/books/travel_2/']
A couple of points: I also check for single-page categories that have no next-page link, so when the list expands it will add those pages as well.
It should work for any kind of list, e.g. here is one with a three-page category and two single-page categories.
cats = ['fantasy_19', 'travel_2', 'science_22']
>>> pages
['https://books.toscrape.com/catalogue/category/books/fantasy_19/page-1.html',
 'https://books.toscrape.com/catalogue/category/books/fantasy_19/page-2.html',
 'https://books.toscrape.com/catalogue/category/books/fantasy_19/page-3.html',
 'http://books.toscrape.com/catalogue/category/books/travel_2/',
 'http://books.toscrape.com/catalogue/category/books/science_22/']
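For what it's worth, the reason your original loop stopped after romance_8 is that page_number was initialized once, before the for loop, so the childrens_11 walk started at page-3 and broke immediately. Here is a minimal sketch of that fix on its own; the page_exists argument is a hypothetical hook standing in for the requests.get(url).status_code == 200 check, so the walk can be tried without hitting the network:

```python
def category_page_urls(cat, page_exists):
    """Collect page URLs for one category.

    page_exists is a callable (a stand-in for the real check,
    lambda u: requests.get(u).status_code == 200) that reports
    whether a given URL returns a real page.
    """
    urls = []
    page_number = 1  # reset per category -- the missing step in the original loop
    while True:
        url = f'https://books.toscrape.com/catalogue/category/books/{cat}/page-{page_number}.html'
        if not page_exists(url):
            break
        urls.append(url)
        page_number += 1
    return urls

# In the real script you would call it per category, e.g.:
# urls = category_page_urls('romance_8',
#                           lambda u: requests.get(u).status_code == 200)
```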
knight2000 likes this post
Reply
#3
Thank you so much snippsat. I'm going to read that a few times to study and try and learn it.

Pardon my ignorance, but on line 11 of your code:

singel_page = soup.select_one('li.next > a')

could you please enlighten me as to what that's doing? It's looking for the "next" tag, I think, but I'm not sure about the > a?
Reply
#4
(May-23-2023, 11:15 AM)knight2000 Wrote: could you please enlighten me what that's doing? It's looking for the "next" tag I think, but I'm not sure about the >a?
It's a CSS selector; BS4 supports these through select and select_one.
The > is the child combinator, so li.next > a matches an <a> tag that is a direct child of an <li> tag with class next.
This is a powerful way to get at the exact data you want to scrape, and you can also copy a selector directly from the browser, which makes it easy to get the correct tag.
Most people use find and find_all, but it's good to know that CSS selectors are supported in BS4.
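To illustrate, here is a minimal sketch against a made-up HTML snippet shaped like the site's pager markup:

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking the pager on books.toscrape.com
html = '''
<ul class="pager">
  <li class="current">Page 1 of 2</li>
  <li class="next"><a href="page-2.html">next</a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# li.next > a: an <a> that is a *direct child* of an <li> with class "next"
next_link = soup.select_one('li.next > a')
print(next_link['href'])  # page-2.html

# On the last page of a category there is no <li class="next">,
# so select_one returns None -- which is exactly what my code tests for.
```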

Let's say I just copy the category header tag's selector from the browser (Inspect → hover over the wanted tag → right-click → Copy → Copy Selector):
#default > div > div > div > div > div.page-header.action > h1
Then we can also organize the code a little better, and use this selector to find the header tag.
import requests
from bs4 import BeautifulSoup

def book_urls(books):
    pages = []
    for book in books:
        url = f'https://books.toscrape.com/catalogue/category/books/{book}'
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')
        singel_page = soup.select_one('li.next > a')
        if singel_page is None:
            pages.append(response.url)
        else:
            for page_nr in range(1, 100):
                gen_url = f'{url}/page-{page_nr}.html'
                page = requests.get(gen_url)
                if page.status_code == 200:
                    pages.append(page.url)
                else:
                    break
    return pages

def scrape_books(book_urls):
    for book in book_urls:
        response = requests.get(book)
        soup = BeautifulSoup(response.content, 'lxml')
        print(soup.select_one('#default > div > div > div > div > div.page-header.action > h1').text)

if __name__ == '__main__':
    books = ['fantasy_19', 'travel_2', 'science_22']
    #print(book_urls(books))
    urls = book_urls(books)
    scrape_books(urls)
Output:
Fantasy
Fantasy
Fantasy
Travel
Science
knight2000 likes this post
Reply
#5
That's very cool. Thank you for taking the time to explain more of it to me. I've only dabbled with find/find_all with my basic knowledge, but I'll try learning more about this and trying it out as well; it's super interesting. Again, thank you for your time and your direction with this.
Reply

