pagination for non standarded pages - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: pagination for non standarded pages (/thread-20795.html)

Pages: 1 2
pagination for non standarded pages - zarize - Aug-30-2019

Hi guys, I am learning scraping and I am currently stuck on the point of adding pagination to the script below:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import urllib

headers = {
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.pararius.com/apartments/amsterdam',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Content-Type': 'text/plain',
}

data = '{"tags":[{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"5f5a2718d3aa6d","id":11247563,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true},{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"66526a063a1a8c","id":11247564,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true}],"sdk":{"source":"pbjs","version":"2.19.0-pre"},"gdpr_consent":{"consent_string":"BOmDsv2OmDsv2BQABBENCN-AAAAmd7_______9______5uz_Ov_v_f__33e8__9v_l_7_-___u_-3zd4-_1vf99yfm1-7etr3tp_87ues2_Xur__59__3z3_9phPrsk89ryw","consent_required":true},"referrer_detection":{"rd_ref":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam","rd_top":true,"rd_ifs":1,"rd_stk":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam,https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam"}}'

page_number = 2
page = 'https://www.pararius.com/apartments/amsterdam/page-' + str(page_number)
r = requests.get(page, headers=headers, data=data)
content = r.text
soup = BeautifulSoup(content, 'html.parser')

for section in soup.find_all(class_='property-list-item-container'):
    dlink = section.find('a').get('href')
    type = section.find('span', {'class': 'type'}).text
    neighborhood = section.find('a').text.strip().split()[1]
    size = section.find('li', {'class': 'surface'}).text.strip().split()[0]
    bedrooms = section.find('li', {'class': 'surface'}).text.strip().split()[2]
    furniture = section.find('li', {'class': 'surface'}).text.strip().split()[4]
    if furniture == 'upholstered':
        furniture = "Unfurnished"
    elif furniture == 'furnished or upholstered':
        furniture = "Furnished & Unfurnished"
    availablefrom = size = section.find('li', {'class': 'surface'}).text.strip().split()[6]
    price = section.find('p', {'class': 'price '}).text.strip().split()[0]
    curr = "EUR" if "€" in price else "other"
    print(curr)
    break

I have to add that a result set from the site might have, say, 50 pages, or it might happen that it has only 30... How do I deal with that? What should my next step be? I would appreciate any kind of help/tip!

RE: pagination for non standarded pages - buran - Aug-30-2019

In this particular case there is an element <ul class="pagination"> from which you can get information about the number of pages. Another approach is to keep track of how many results you have scraped and compare that with the total number available at the top of the page in the <p class="count"> element. In other cases it may happen that you can get all the data, e.g. as JSON, and it is easier to make the respective request instead of screen-scraping by parsing the HTML, etc.

RE: pagination for non standarded pages - zarize - Aug-30-2019

Thank you! :) Do you know why I can't get the text from the below?
for section2 in soup.find('ul', {'class': 'pagination'}):
    pages_total1 = section2.find('li')
    print(pages_total1)

It clearly has text inside, which is an int() of course, but I tried to convert it to str() and then get the text; however, no results. What I want now is to get the text first, then strip and split it, and then use len - 1 to find the max number of pages.

RE: pagination for non standarded pages - buran - Aug-30-2019

soup.find('ul', {'class': 'pagination'}) will return a single element and you cannot iterate over it. Not tested, but something like:

pagination = soup.find('ul', {'class': 'pagination'})
pages = pagination.find_all('li')
num_pages = int(pages[-2].text)
print(num_pages)

If you inspect the ul element you will notice there are 4 types of li items (without class attribute, class=empty, class=next, class=is-active). You can play with a custom function that will return just the li items without a class.

RE: pagination for non standarded pages - zarize - Aug-30-2019

Thank you for your support, but I am still struggling with how to get the number of pages.

page1 = soup.find('ul', {'class': 'pagination'})
pages = str(page1.find('li', {'class': 'last'}))
print(pages[-16:-14])

By using this, I theoretically get what I want... but let's assume there are fewer than 10 pages; then the code above would return the number plus one random character (like a slash or something like that). Would something like this work: print(pages[-16:-14]), and if it is < 10, then print(pages[-15:-14])?

RE: pagination for non standarded pages - buran - Aug-30-2019

Don't convert HTML tags to str:

pagination = soup.find('ul', {'class': 'pagination'})
pages = pagination.find_all('li')
last_page = pages[-3]
num_pages = last_page.find('a').text
print(num_pages)

I didn't notice there are nested li elements.

RE: pagination for non standarded pages - zarize - Sep-02-2019

Thank you :)

#edit Can I ask you how to structure it properly so Python can read num_pages? If I use range(1, 12) it works; I mean, it returns the direct links from 1 to 11.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import urllib

headers = {
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.pararius.com/apartments/amsterdam',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Content-Type': 'text/plain',
}

data = '{"tags":[{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"5f5a2718d3aa6d","id":11247563,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true},{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"66526a063a1a8c","id":11247564,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true}],"sdk":{"source":"pbjs","version":"2.19.0-pre"},"gdpr_consent":{"consent_string":"BOmDsv2OmDsv2BQABBENCN-AAAAmd7_______9______5uz_Ov_v_f__33e8__9v_l_7_-___u_-3zd4-_1vf99yfm1-7etr3tp_87ues2_Xur__59__3z3_9phPrsk89ryw","consent_required":true},"referrer_detection":{"rd_ref":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam","rd_top":true,"rd_ifs":1,"rd_stk":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam,https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam"}}'

for n in range(1, 12):
    page = 'https://www.pararius.com/apartments/amsterdam/page-' + str(n)
    print(num_pages)
    r = requests.get(page, headers=headers, data=data)
    content = r.text
    soup = BeautifulSoup(content, 'html.parser')

    # pagination - find max pages
    page1 = soup.find('ul', {'class': 'pagination'})
    pages = page1.find_all('li')
    last_page = pages[-3]
    num_pages = last_page.find('a').text
    #print(num_pages)

    for section in soup.find_all(class_='property-list-item-container'):
        dlink = section.find('a').get('href')
        type = section.find('span', {'class': 'type'}).text
        neighborhood = section.find('a').text.strip().split()[1]
        size = section.find('li', {'class': 'surface'}).text.strip().split()[0]
        bedrooms = section.find('li', {'class': 'surface'}).text.strip().split()[2]
        furniture = section.find('li', {'class': 'surface'}).text.strip().split()[4]
        if furniture == 'upholstered':
            furniture = "Unfurnished"
        elif furniture == 'furnished or upholstered':
            furniture = "Furnished & Unfurnished"
        availablefrom = size = section.find('li', {'class': 'surface'}).text.strip().split()[6]
        price = section.find('p', {'class': 'price '}).text.strip().split()[0]
        curr = "EUR" if "€" in price else "other"
        break

What am I doing wrong? I wanted to do for n in range(1, num_pages): but it doesn't work. I was trying to move the code higher/lower, but I am doing something wrong...

RE: pagination for non standarded pages - buran - Sep-02-2019

You need to retrieve num_pages before being able to use it. You can retrieve the first page's data and num_pages, then loop in range(2, num_pages + 1) and get the data you want for the rest of the pages. Now is a good time to split your code into functions.

RE: pagination for non standarded pages - zarize - Sep-02-2019

(Sep-02-2019, 10:52 AM)buran Wrote: you need to retrieve the num_pages before being able to use it.

Could I ask you for some additional help? I got stuck :( I would appreciate any help. How come I can't get num_pages from this?

# pagination - find max pages
page1 = soup.find('ul', {'class': 'pagination'})
pages = page1.find_all('li')
last_page = pages[-3]
num_pages = last_page.find('a').text
#print(num_pages)

for n in range(1, num_pages):
    page = 'https://www.pararius.com/apartments/amsterdam/page-1' + n
    print(num_pages)

It returns: TypeError: 'str' object cannot be interpreted as an integer. So I tried to change it to int but it didn't work. I guess I have to make a function that returns num_pages and then I could call it? It would be my first function :P Can I get an assist?

RE: pagination for non standarded pages - buran - Sep-02-2019

You need to convert to an integer with the int() function.
See my previous example; you should be able to fix errors like this by now. Also note that range(1, 55) will give you the numbers from 1 to 54.
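Putting the thread's advice together (read the page count from the <ul class="pagination"> element, convert it with int(), then build the per-page URLs with range), a minimal function-based sketch might look like the following. The SAMPLE markup is only a guess at the site's structure, and the pages[-3] index comes from buran's post, so both may need adjusting against the live page:

```python
from bs4 import BeautifulSoup

# Mock of the site's pagination markup -- an assumption for illustration;
# inspect the real page before relying on these classes and positions.
SAMPLE = """
<ul class="pagination">
  <li><a href="/page-1">1</a></li>
  <li><a href="/page-2">2</a></li>
  <li><a href="/page-11">11</a></li>
  <li class="next"><a href="/page-2">next</a></li>
  <li class="empty"></li>
</ul>
"""

def get_num_pages(soup):
    """Return the total page count, or 1 when there is no pagination block."""
    pagination = soup.find('ul', {'class': 'pagination'})
    if pagination is None:
        return 1  # a single page of results, so no pagination element at all
    pages = pagination.find_all('li')
    # pages[-3] is the last numbered item (before "next"/"empty"), per the thread
    return int(pages[-3].find('a').text)

def page_urls(base, num_pages):
    """Build the URL of every results page, page-1 .. page-<num_pages>."""
    # range(1, n + 1) because range stops one short of its end value
    return ['{}/page-{}'.format(base, n) for n in range(1, num_pages + 1)]

soup = BeautifulSoup(SAMPLE, 'html.parser')
num_pages = get_num_pages(soup)  # -> 11 for the sample markup
urls = page_urls('https://www.pararius.com/apartments/amsterdam', num_pages)
print(num_pages, urls[0], urls[-1])
```

From here, the scraping loop is just: fetch urls[0], parse it, then iterate over the remaining URLs, exactly as buran's range(2, num_pages + 1) suggestion describes.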