Python Forum
pagination for non-standard pages
#1
Hi guys,

I am learning scraping and I'm currently stuck at the point of adding pagination to the script below:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import urllib

headers = {
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.pararius.com/apartments/amsterdam',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Content-Type': 'text/plain',
}

data = '{"tags":[{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"5f5a2718d3aa6d","id":11247563,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true},{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"66526a063a1a8c","id":11247564,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true}],"sdk":{"source":"pbjs","version":"2.19.0-pre"},"gdpr_consent":{"consent_string":"BOmDsv2OmDsv2BQABBENCN-AAAAmd7_______9______5uz_Ov_v_f__33e8__9v_l_7_-___u_-3zd4-_1vf99yfm1-7etr3tp_87ues2_Xur__59__3z3_9phPrsk89ryw","consent_required":true},"referrer_detection":{"rd_ref":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam","rd_top":true,"rd_ifs":1,"rd_stk":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam,https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam"}}'

page_number = 2
page = 'https://www.pararius.com/apartments/amsterdam/page-' + str(page_number)
    
r = requests.get(page, headers=headers, data=data)
content = (r.text)
soup = BeautifulSoup(content, 'html.parser')



for section in soup.find_all(class_='property-list-item-container'):
    dlink = section.find('a').get('href')
    type = section.find('span', {'class': 'type'}).text
    neighborhood = section.find('a').text.strip().split()[1]
    # the 'surface' line holds size, bedrooms, furnishing and availability, so split it once
    surface = section.find('li', {'class': 'surface'}).text.strip().split()
    size = surface[0]
    bedrooms = surface[2]
    furniture = surface[4]
    if furniture == 'upholstered':
        furniture = "Unfurnished"
    elif furniture == 'furnished or upholstered':
        furniture = "Furnished & Unfurnished"
    availablefrom = surface[6]
    price = section.find('p', {'class': 'price '}).text.strip().split()[0]
    curr = "EUR" if "€" in price else "other"
    print(curr)
    break
I should add that one search might return, say, 50 pages of results while another returns only 30... how do I deal with that?
What should my next step be?

I would appreciate any kind of help/tips!
#2
In this particular case there is a <ul class="pagination"> element from which you can get the number of pages.

Another approach is to keep track of how many results you have scraped and compare that with the total number shown at the top of the page in the <p class="count"> element.

In other cases it may happen that you can get all the data e.g. as JSON, and it is easier to make the respective request instead of screen scraping by parsing the HTML, etc.
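Not tested, but the second approach could look roughly like this, reusing the soup object from your snippet (the exact layout of the text inside <p class="count"> is an assumption):

# untested sketch: compare what you scraped with the total the site reports
count_tag = soup.find('p', {'class': 'count'})
# assumption: the total is the first whitespace-separated token, e.g. "1234 results"
total_results = int(count_tag.text.strip().split()[0])

listings = soup.find_all(class_='property-list-item-container')
print(f'{len(listings)} of {total_results} results on this page')
# keep requesting page-2, page-3, ... until your running total reaches total_results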
#3
Thank you! :)

Do you know why I can't get the text from the element below?

for section2 in soup.find('ul', {'class': 'pagination'}):
    pages_total1 = section2.find('li')
    print(pages_total1)
It clearly has text inside (a number, of course), and I tried converting it to str() and then getting the text, but no results.


What I want to do now is get the text first, then strip and split it, and then use len - 1 to find the maximum number of pages.
#4
soup.find('ul', {'class': 'pagination'}) will return a single element, so iterating over it does not do what you expect.
Not tested, but something like:
pagination = soup.find('ul', {'class': 'pagination'})
pages = pagination.find_all('li')
num_pages = int(pages[-2].text)
print(num_pages)
If you inspect the ul element you will notice there are 4 types of li items (without a class attribute, class=empty, class=next, class=is-active). You can play with a custom function that returns just the li items without a class.
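Not tested either, but such a filter function could be as simple as this (it leans on the four li variants listed above; whether the last class-less item really is the highest page number is an assumption):

def page_number_li(tag):
    # keep only <li> tags that carry no class attribute at all
    return tag.name == 'li' and not tag.has_attr('class')

pagination = soup.find('ul', {'class': 'pagination'})
page_items = pagination.find_all(page_number_li)
num_pages = int(page_items[-1].text.strip())  # last plain item = highest page number (assumed)
print(num_pages)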
#5
Thank you for your support, but I am still struggling to get the number of pages.

page1 = soup.find('ul', {'class': 'pagination'})
pages = str(page1.find('li', {'class': 'last'}))
print(pages[-16:-14])
By using this I theoretically get what I want... but let's assume there are fewer than 10 pages.

Then the slice above would return the number plus one stray character (a slash or something like that).


Would something like this work:
print(pages[-16:-14]) if < 10 then print(pages[-15:-14])
?
#6
Don't convert the HTML tags to str.

pagination = soup.find('ul', {'class': 'pagination'})
pages = pagination.find_all('li')
last_page = pages[-3]
num_pages = last_page.find('a').text
print(num_pages)
I didn't notice there are nested li elements.
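If the nesting is what gets in the way, another option is to restrict find_all to the ul's direct children (untested, and the index may need adjusting to the real markup):

pagination = soup.find('ul', {'class': 'pagination'})
# recursive=False looks only at the ul's direct children and ignores li tags nested deeper
top_level_items = pagination.find_all('li', recursive=False)
num_pages = int(top_level_items[-3].find('a').text)  # index guessed from the snippet above
print(num_pages)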
#7
Thank you :)

#edit

Can I ask how to structure this properly so Python can use num_pages? If I hard-code range(1, 12) it works, i.e. it builds the direct links for pages 1 to 11.

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import urllib

headers = {
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.pararius.com/apartments/amsterdam',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Content-Type': 'text/plain',
}

data = '{"tags":[{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"5f5a2718d3aa6d","id":11247563,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true},{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"66526a063a1a8c","id":11247564,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true}],"sdk":{"source":"pbjs","version":"2.19.0-pre"},"gdpr_consent":{"consent_string":"BOmDsv2OmDsv2BQABBENCN-AAAAmd7_______9______5uz_Ov_v_f__33e8__9v_l_7_-___u_-3zd4-_1vf99yfm1-7etr3tp_87ues2_Xur__59__3z3_9phPrsk89ryw","consent_required":true},"referrer_detection":{"rd_ref":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam","rd_top":true,"rd_ifs":1,"rd_stk":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam,https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam"}}'


for n in range(1, 12):
    page = 'https://www.pararius.com/apartments/amsterdam/page-' + str(n)
    print(num_pages)
    
r = requests.get(page, headers=headers, data=data)
content = (r.text)
soup = BeautifulSoup(content, 'html.parser')


#pagination- find max pages

page1 = soup.find('ul', {'class': 'pagination'})
pages = page1.find_all('li')
last_page = pages[-3]
num_pages = last_page.find('a').text
#print(num_pages)


for section in soup.find_all(class_='property-list-item-container'):
    dlink = section.find('a').get('href')
    type = section.find('span', {'class': 'type'}).text
    neighborhood = section.find('a').text.strip().split()[1]
    # split the 'surface' line once; it holds size, bedrooms, furnishing and availability
    surface = section.find('li', {'class': 'surface'}).text.strip().split()
    size = surface[0]
    bedrooms = surface[2]
    furniture = surface[4]
    if furniture == 'upholstered':
        furniture = "Unfurnished"
    elif furniture == 'furnished or upholstered':
        furniture = "Furnished & Unfurnished"
    availablefrom = surface[6]
    price = section.find('p', {'class': 'price '}).text.strip().split()[0]
    curr = "EUR" if "€" in price else "other"

    break
What am I doing wrong?
I wanted to do
for n in range(1, num_pages):
but it doesn't work.

I tried moving the code blocks above/below each other, but I am still doing something wrong..
#8
You need to retrieve num_pages before being able to use it.
You can retrieve the first page's data and num_pages.
Then loop in range(2, num_pages+1) and get the data you want for the rest of the pages.
Now is a good time to split your code into functions.
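Not tested, but a rough outline of that structure, reusing the selectors from the snippets above, could look like this (the shortened headers dict and the function names are just placeholders):

import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Mozilla/5.0'}  # stand-in for the headers dict you already have
BASE_URL = 'https://www.pararius.com/apartments/amsterdam/page-{}'


def get_soup(page_number):
    # fetch one result page and return it parsed
    r = requests.get(BASE_URL.format(page_number), headers=HEADERS)
    return BeautifulSoup(r.text, 'html.parser')


def get_num_pages(soup):
    # read the highest page number from the pagination block (same index as in post #6)
    pagination = soup.find('ul', {'class': 'pagination'})
    pages = pagination.find_all('li')
    return int(pages[-3].find('a').text)


def parse_listings(soup):
    # pull whatever fields you need from one page; just the link here, as a placeholder
    for section in soup.find_all(class_='property-list-item-container'):
        yield section.find('a').get('href')


first_page = get_soup(1)
num_pages = get_num_pages(first_page)

listings = list(parse_listings(first_page))
for n in range(2, num_pages + 1):  # pages 2 .. num_pages inclusive
    listings.extend(parse_listings(get_soup(n)))

print(len(listings), 'listings across', num_pages, 'pages')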
#9
(Sep-02-2019, 10:52 AM)buran Wrote: You need to retrieve num_pages before being able to use it.

Could I ask you for some additional help? I got stuck :( I would appreciate any help.


How come I can't get num_pages from this?

#pagination- find max pages
page1 = soup.find('ul', {'class': 'pagination'})
pages = page1.find_all('li')
last_page = pages[-3]
num_pages = last_page.find('a').text
#print(num_pages)

for n in range(1, num_pages):
    page = 'https://www.pararius.com/apartments/amsterdam/page-1' + n
    print(num_pages)
It returns: TypeError: 'str' object cannot be interpreted as an integer
so I tried to change it to int, but that didn't work.

I guess I have to make a function that returns num_pages and then call it? It would be my first function :P Can I get an assist?
#10
You need to convert it to an integer with the int() function; see my previous example. You should be able to fix errors like this by now.
Also note that range(1, 55) will give you numbers from 1 to 54, so the upper bound needs to be num_pages + 1.
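For example (untested), building on the earlier snippet:

num_pages = int(last_page.find('a').text)  # e.g. '55' -> 55
for n in range(1, num_pages + 1):          # 1, 2, ..., num_pages inclusive
    page = 'https://www.pararius.com/apartments/amsterdam/page-' + str(n)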

