Pagination for non-standard pages
#1
Hi guys,

I am learning scraping, and I have currently got stuck at the point of adding pagination to the script below:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import urllib

headers = {
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.pararius.com/apartments/amsterdam',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Content-Type': 'text/plain',
}

data = '{"tags":[{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"5f5a2718d3aa6d","id":11247563,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true},{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"66526a063a1a8c","id":11247564,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true}],"sdk":{"source":"pbjs","version":"2.19.0-pre"},"gdpr_consent":{"consent_string":"BOmDsv2OmDsv2BQABBENCN-AAAAmd7_______9______5uz_Ov_v_f__33e8__9v_l_7_-___u_-3zd4-_1vf99yfm1-7etr3tp_87ues2_Xur__59__3z3_9phPrsk89ryw","consent_required":true},"referrer_detection":{"rd_ref":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam","rd_top":true,"rd_ifs":1,"rd_stk":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam,https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam"}}'

page_number = 2
page = 'https://www.pararius.com/apartments/amsterdam/page-' + str(page_number)
    
r = requests.get(page, headers=headers, data=data)
content = (r.text)
soup = BeautifulSoup(content, 'html.parser')



for section in soup.find_all(class_='property-list-item-container'):
    dlink = section.find('a').get('href')
    type = section.find('span', {'class': 'type'}).text
    neighborhood = section.find('a').text.strip().split()[1]
    size = section.find('li', {'class': 'surface'}).text.strip().split()[0]
    bedrooms = section.find('li', {'class': 'surface'}).text.strip().split()[2]
    furniture = section.find('li', {'class': 'surface'}).text.strip().split()[4]
    if furniture == 'upholstered':
        furniture = "Unfurnished"
    elif furniture == 'furnished or upholstered':
        furniture = "Furnished & Unfurnished"
    availablefrom = section.find('li', {'class': 'surface'}).text.strip().split()[6]
    price = section.find('p', {'class': 'price '}).text.strip().split()[0]
    curr = "EUR" if "€" in price else "other"
    print(curr)
    break

I should add that a search result might have, say, 50 pages, but it can also happen that it has only 30... how do I deal with that?
What should my next step be?

I would appreciate any kind of help/tip!
#2
In this particular case there is a <ul class="pagination"> element from which you can get the number of pages.

Another approach is to keep track of how many results you have scraped and compare that with the total number shown at the top of the page, in the <p class="count"> element.

In other cases it may happen that all the data is available e.g. as JSON, and it is easier to make the respective request directly instead of screen scraping by parsing the HTML, etc.
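To make the first two approaches concrete, here is an untested sketch against a made-up HTML fragment. The selectors ('p.count', 'ul.pagination') come from this thread; verify them against the live page source before relying on them.

```python
from bs4 import BeautifulSoup

# Made-up fragment mimicking the structure described in this thread
html = """
<p class="count">572 rental apartments found</p>
<ul class="pagination">
  <li><a href="/page-1">1</a></li>
  <li><a href="/page-2">2</a></li>
  <li class="next"><a href="/page-2">Next</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# Approach A: total result count from the <p class="count"> element
total_results = int(soup.find('p', {'class': 'count'}).text.split()[0])

# Approach B: highest page number among the pagination links,
# skipping non-numeric links such as "Next"
links = soup.find('ul', {'class': 'pagination'}).find_all('a')
num_pages = max(int(a.text) for a in links if a.text.strip().isdigit())

print(total_results, num_pages)  # 572 2
```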
#3
Thank you! :)

Do you know why I can't get the text from the code below?

for section2 in soup.find('ul', {'class': 'pagination'}):
    pages_total1 = section2.find('li')
    print(pages_total1)
It clearly has text inside it (an integer, of course). I tried converting it to str() and then getting the text, but with no results.


What I want now is to get the text first, then strip and split it, and then use len() - 1 to find the max number of pages.
#4
soup.find('ul', {'class': 'pagination'}) returns a single Tag; iterating over it yields its child nodes (including whitespace), not the <li> items you expect.
Not tested, but something like:
pagination = soup.find('ul', {'class': 'pagination'})
pages = pagination.find_all('li')
num_pages = int(pages[-2].text)
print(num_pages)
If you inspect the ul element you will notice there are 4 types of li items (no class attribute, class="empty", class="next", class="is-active"). You can play with a custom function that returns just the li items without a class.
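Untested sketch of that custom-function idea, against a made-up fragment of the pagination markup (the class names are assumptions based on this thread):

```python
from bs4 import BeautifulSoup

html = """
<ul class="pagination">
  <li class="previous"><a href="#">Prev</a></li>
  <li class="is-active"><a href="/page-1">1</a></li>
  <li><a href="/page-2">2</a></li>
  <li><a href="/page-3">3</a></li>
  <li class="next"><a href="/page-2">Next</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

def plain_li(tag):
    # keep only <li> elements that carry no class attribute at all
    return tag.name == 'li' and not tag.has_attr('class')

# find_all accepts a function; it is called once per tag in the tree
pagination = soup.find('ul', {'class': 'pagination'})
pages = pagination.find_all(plain_li)
num_pages = int(pages[-1].find('a').text)
print(num_pages)  # 3
```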
#5
Thank you for your support, but I am still struggling with how to get the number of pages.

page1 = soup.find('ul', {'class': 'pagination'})
pages = str(page1.find('li', {'class': 'last'}))
print(pages[-16:-14])
Using this, I theoretically get what I want... but suppose there are fewer than 10 pages.

Then the slice above would return the number plus one stray character (a slash or something similar).


Would something like this work:
print(pages[-16:-14]) if < 10 then print(pages[-15:-14])
?
#6
Don't convert HTML tags to str.

pagination = soup.find('ul', {'class': 'pagination'})
pages = pagination.find_all('li')
last_page = pages[-3]
num_pages = last_page.find('a').text
print(num_pages)
I didn't notice there are nested li elements.
#7
Thank you :)

#edit

Can I ask how to structure it properly so Python can pick up 'num_pages'? When I hard-code range(1, 12) it works, i.e. it returns the direct links for pages 1 to 11.

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import urllib

headers = {
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.pararius.com/apartments/amsterdam',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Content-Type': 'text/plain',
}

data = '{"tags":[{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"5f5a2718d3aa6d","id":11247563,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true},{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"66526a063a1a8c","id":11247564,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true}],"sdk":{"source":"pbjs","version":"2.19.0-pre"},"gdpr_consent":{"consent_string":"BOmDsv2OmDsv2BQABBENCN-AAAAmd7_______9______5uz_Ov_v_f__33e8__9v_l_7_-___u_-3zd4-_1vf99yfm1-7etr3tp_87ues2_Xur__59__3z3_9phPrsk89ryw","consent_required":true},"referrer_detection":{"rd_ref":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam","rd_top":true,"rd_ifs":1,"rd_stk":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam,https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam"}}'


for n in range(1, 12):
    page = 'https://www.pararius.com/apartments/amsterdam/page-' + str(n)
    print(num_pages)
    
r = requests.get(page, headers=headers, data=data)
content = (r.text)
soup = BeautifulSoup(content, 'html.parser')


#pagination- find max pages

page1 = soup.find('ul', {'class': 'pagination'})
pages = page1.find_all('li')
last_page = pages[-3]
num_pages = last_page.find('a').text
#print(num_pages)


for section in soup.find_all(class_='property-list-item-container'):
    dlink = section.find('a').get('href')
    type = section.find('span', {'class': 'type'}).text
    neighborhood = section.find('a').text.strip().split()[1]
    size = section.find('li', {'class': 'surface'}).text.strip().split()[0]
    bedrooms = section.find('li', {'class': 'surface'}).text.strip().split()[2]
    furniture = section.find('li', {'class': 'surface'}).text.strip().split()[4]
    if furniture == 'upholstered':
        furniture = "Unfurnished"
    elif furniture == 'furnished or upholstered':
        furniture = "Furnished & Unfurnished"
    availablefrom = section.find('li', {'class': 'surface'}).text.strip().split()[6]
    price = section.find('p', {'class': 'price '}).text.strip().split()[0]
    curr = "EUR" if "€" in price else "other"

    break

What am I doing wrong?
I wanted to do
for n in range(1, num_pages):
but it doesn't work.

I was trying to move the code above/below, but I am doing something wrong...
#8
You need to retrieve num_pages before you can use it.
Retrieve the first page's data and num_pages,
then loop in range(2, num_pages + 1) and get the data you want for the rest of the pages.
Now is a good time to split your code into functions.
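A rough, untested sketch of that structure. The helper names and the body of parse_listings are illustrative, not from the thread; fill in the fields you actually need.

```python
from bs4 import BeautifulSoup
import requests

BASE = 'https://www.pararius.com/apartments/amsterdam/page-'

def get_soup(page_number):
    # fetch one listing page and parse it
    r = requests.get(BASE + str(page_number))
    return BeautifulSoup(r.text, 'html.parser')

def get_num_pages(soup):
    # read the total page count out of the pagination element,
    # following the pages[-3] convention from earlier in this thread
    pagination = soup.find('ul', {'class': 'pagination'})
    pages = pagination.find_all('li')
    return int(pages[-3].find('a').text)

def parse_listings(soup):
    # extract whatever fields you need; links only, as an example
    return [section.find('a').get('href')
            for section in soup.find_all(class_='property-list-item-container')]

def main():
    first = get_soup(1)                      # page 1: data AND num_pages
    num_pages = get_num_pages(first)
    listings = parse_listings(first)
    for n in range(2, num_pages + 1):        # remaining pages: 2 .. num_pages
        listings.extend(parse_listings(get_soup(n)))
    return listings
```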
#9
(Sep-02-2019, 10:52 AM)buran Wrote: you need to retrieve the num_pages before being able to use it.

Could I ask you for some additional help? I got stuck :( I would appreciate any help.


How come I can't get num_pages from this?

#pagination- find max pages
page1 = soup.find('ul', {'class': 'pagination'})
pages = page1.find_all('li')
last_page = pages[-3]
num_pages = last_page.find('a').text
#print(num_pages)

for n in range(1, num_pages):
    page = 'https://www.pararius.com/apartments/amsterdam/page-1' + n
    print(num_pages)
It returns: TypeError: 'str' object cannot be interpreted as an integer
so I tried changing it to int, but that didn't work.

I guess I have to make a function that returns num_pages and then call it? It would be my first function :P Can I get an assist?
#10
You need to convert it to an integer with the int() function. See my previous example; you should be able to fix errors like this by now.
Also note that range(1, 55) will give you the numbers from 1 to 54.
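A tiny illustration of both points:

```python
# .text gives you a str; int() turns it into a number you can use in range()
num_pages = int('55')

# range() stops one short of its second argument, hence the + 1
pages_visited = list(range(1, num_pages + 1))
print(pages_visited[0], pages_visited[-1])  # 1 55
```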
