Advancing Page Numbers

knight2000 · May-22-2023, 07:09 AM

Hi all,

I want to be able to scrape different pages and categories within the same website. I have this code so far:

from bs4 import BeautifulSoup
import requests

cats = ['romance_8', 'childrens_11']
page_number = 1

for cat in cats:
    while True:
        url = f'https://books.toscrape.com/catalogue/category/books/{cat}/page-{page_number}.html'

        r = requests.get(url, headers=headers)
        soup = BeautifulSoup(r.text, 'html.parser')
        active_page = soup.find('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')
        pages = soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')

        if active_page is None:
            break
        for page in pages:
            price_color = page.find('p', class_='price_color').text.strip()
        print(url)
        page_number = page_number + 1

The output I get from this code is:

Output:
https://books.toscrape.com/catalogue/category/books/romance_8/page-1.html
https://books.toscrape.com/catalogue/category/books/romance_8/page-2.html

This is partly correct, but I wanted the code to advance to the next item in the list, "childrens_11", and get the list of URLs for that category too. Coincidentally, it also has 2 pages, so if the code were working correctly, it would show:

Output:
https://books.toscrape.com/catalogue/category/books/romance_8/page-1.html
https://books.toscrape.com/catalogue/category/books/romance_8/page-2.html
https://books.toscrape.com/catalogue/category/books/childrens_11/page-1.html
https://books.toscrape.com/catalogue/category/books/childrens_11/page-2.html

Could someone please enlighten me how to fix the code to enable this?

Thank you.

***snippsat*** · (This post was last modified: May-22-2023, 12:40 PM by snippsat.)

Something like this.

import requests
from bs4 import BeautifulSoup
 
cats = ['romance_8', 'childrens_11', 'travel_2']
pages = []
for book in cats:
    url = f'https://books.toscrape.com/catalogue/category/books/{book}'
    #url = 'http://books.toscrape.com/catalogue/category/books/travel_2' # one page
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    singel_page = soup.select_one('li.next > a')
    if singel_page is None:
        pages.append(response.url)
    else:
        # Here just set a high number and break out if not 200
        for page_nr in range(1, 100):
            gen_url = f'{url}/page-{page_nr}.html'
            page = requests.get(gen_url)
            if page.status_code == 200:
                pages.append(page.url)
            else:
                break

>>> pages
['https://books.toscrape.com/catalogue/category/books/romance_8/page-1.html',
 'https://books.toscrape.com/catalogue/category/books/romance_8/page-2.html',
 'https://books.toscrape.com/catalogue/category/books/childrens_11/page-1.html',
 'https://books.toscrape.com/catalogue/category/books/childrens_11/page-2.html',
 'http://books.toscrape.com/catalogue/category/books/travel_2/']

A note: I also check for single-page categories, which have no next link, so those pages are added to the list as well.
It should work for any list of categories; here is one with a three-page category and two single-page categories.

cats = ['fantasy_19', 'travel_2', 'science_22']
>>> pages
['https://books.toscrape.com/catalogue/category/books/fantasy_19/page-1.html',
 'https://books.toscrape.com/catalogue/category/books/fantasy_19/page-2.html',
 'https://books.toscrape.com/catalogue/category/books/fantasy_19/page-3.html',
 'http://books.toscrape.com/catalogue/category/books/travel_2/',
 'http://books.toscrape.com/catalogue/category/books/science_22/']
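
As for why the original loop stopped after the first category: page_number is set once before the for loop and never reset, so childrens_11 starts at page 3, which does not exist, and the while loop breaks at once. A minimal sketch of that fix follows. To keep it runnable without network access, page_exists is a hypothetical callable standing in for the requests.get check; it is not part of the original code.

```python
from typing import Callable, Iterable, List

BASE = 'https://books.toscrape.com/catalogue/category/books'


def collect_page_urls(cats: Iterable[str],
                      page_exists: Callable[[str], bool]) -> List[str]:
    """Return every existing page URL for each category, in order."""
    urls = []
    for cat in cats:
        page_number = 1  # reset for every category (the missing step)
        while True:
            url = f'{BASE}/{cat}/page-{page_number}.html'
            if not page_exists(url):
                break
            urls.append(url)
            page_number += 1
    return urls


if __name__ == '__main__':
    # Stand-in for the live site: pretend each category has two pages.
    fake_site = {
        f'{BASE}/{cat}/page-{n}.html'
        for cat in ('romance_8', 'childrens_11')
        for n in (1, 2)
    }
    for url in collect_page_urls(['romance_8', 'childrens_11'],
                                 fake_site.__contains__):
        print(url)
```

Against the live site, page_exists could be something like lambda url: requests.get(url).status_code == 200. Single-page categories are handled separately in the code above, so this sketch only covers the paginated case.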

knight2000 · May-23-2023, 11:15 AM

Thank you so much snippsat. I'm going to read that a few times to study and try and learn it.

Pardon my ignorance, but on line 11 of your code:

    singel_page = soup.select_one('li.next > a')

could you please enlighten me what that's doing? It's looking for the "next" tag I think, but I'm not sure about the > a?

***snippsat*** · (This post was last modified: May-23-2023, 02:14 PM by snippsat.)

(May-23-2023, 11:15 AM)knight2000 Wrote: could you please enlighten me what that's doing? It's looking for the "next" tag I think, but I'm not sure about the >a?

It's a CSS selector; BS4 supports these through select and select_one. Here li.next > a matches an a tag that is a direct child of an li tag with the class next.
This is a powerful way to get the exact data you want to scrape, and you can also copy a selector directly from the browser, which makes it easy to target the correct tag.
Most people use find and find_all, but it's good to know that CSS selectors are supported in BS4.
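
To make the > a part concrete, here is a small sketch on an inline HTML fragment. The fragment is made up for illustration and only resembles a pager like the one on the site; html.parser is used so no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, loosely modelled on a pager; not copied from the site.
html = '''
<ul class="pager">
  <li class="current">Page 1 of 2</li>
  <li class="next"><a href="page-2.html">next</a></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')

# 'li.next' -> an <li> element whose class list contains "next"
# '>'       -> child combinator: the right side must be a direct child
# 'a'       -> an <a> element
next_link = soup.select_one('li.next > a')
print(next_link['href'])

# select_one returns None when nothing matches, e.g. no "previous" link here.
no_prev = soup.select_one('li.previous > a')
print(no_prev)

# Roughly the same lookup written with find.
li = soup.find('li', class_='next')
same_link = li.find('a', recursive=False) if li else None
print(same_link is not None and same_link['href'] == next_link['href'])
```

The None result is what the single-page check in the earlier code relies on: no next link means there is only one page.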

Let's say I just copy the selector for the category header tag from the browser (Inspect -> hover over the wanted tag -> right click -> Copy -> Copy selector):

        
              #default > div > div > div > div > div.page-header.action > h1

Then we can also organize the code a little better and use this selector to find the header tag.

import requests
from bs4 import BeautifulSoup

def book_urls(books):
    """Return the page URLs for every category in books."""
    pages = []
    for book in books:
        url = f'https://books.toscrape.com/catalogue/category/books/{book}'
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')
        # None here means there is no "next" link, so the category has one page
        singel_page = soup.select_one('li.next > a')
        if singel_page is None:
            pages.append(response.url)
        else:
            # Try page numbers until one does not return 200
            for page_nr in range(1, 100):
                gen_url = f'{url}/page-{page_nr}.html'
                page = requests.get(gen_url)
                if page.status_code == 200:
                    pages.append(page.url)
                else:
                    break
    return pages

def scrape_books(urls):
    """Print the category header found on each page."""
    for url in urls:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')
        print(soup.select_one('#default > div > div > div > div > div.page-header.action > h1').text)

if __name__ == '__main__':
    books = ['fantasy_19', 'travel_2', 'science_22']
    #print(book_urls(books))
    # Use a separate name so the book_urls function is not overwritten
    urls = book_urls(books)
    scrape_books(urls)

Output:
Fantasy
Fantasy
Fantasy
Travel
Science
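
Instead of probing up to a fixed range(1, 100), another option is to follow the li.next > a link until it disappears. This is only a sketch: fetch is a hypothetical callable that returns the HTML for a URL (requests.get(url).text would be one way to write it), and the example.com pages below are invented so the example runs offline.

```python
from typing import Callable, List
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def follow_next_links(start_url: str, fetch: Callable[[str], str]) -> List[str]:
    """Collect page URLs by following 'li.next > a' until it is missing."""
    urls = []
    seen = set()  # guards against a pager that links back to a visited page
    url = start_url
    while url and url not in seen:
        seen.add(url)
        urls.append(url)
        soup = BeautifulSoup(fetch(url), 'html.parser')
        next_link = soup.select_one('li.next > a')
        # urljoin resolves a relative href against the current page URL.
        url = urljoin(url, next_link['href']) if next_link else None
    return urls


if __name__ == '__main__':
    base = 'https://example.com/books/fantasy_19/'
    fake_pages = {
        base + 'index.html': '<li class="next"><a href="page-2.html">next</a></li>',
        base + 'page-2.html': '<li class="next"><a href="page-3.html">next</a></li>',
        base + 'page-3.html': '<p>last page, no next link</p>',
    }
    for page_url in follow_next_links(base + 'index.html', fake_pages.__getitem__):
        print(page_url)
```

This costs one request per page and needs no special case for single-page categories, since a page without a next link simply ends the loop.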

knight2000 · May-24-2023, 09:14 AM

That's very cool. Thank you for taking the time to explain more to me about it. I've only dabbled with find/find_all with my basic knowledge, but I'll try learning more about this and trying it out as well. Super interesting. Again, thank you for your time and your direction with this.
