Pagination for non-standard pages
#1
Hi guys,

I am learning scraping, and I have currently got stuck at the point of adding pagination to the script below:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import urllib

headers = {
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.pararius.com/apartments/amsterdam',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Content-Type': 'text/plain',
}

data = '{"tags":[{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"5f5a2718d3aa6d","id":11247563,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true},{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"66526a063a1a8c","id":11247564,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true}],"sdk":{"source":"pbjs","version":"2.19.0-pre"},"gdpr_consent":{"consent_string":"BOmDsv2OmDsv2BQABBENCN-AAAAmd7_______9______5uz_Ov_v_f__33e8__9v_l_7_-___u_-3zd4-_1vf99yfm1-7etr3tp_87ues2_Xur__59__3z3_9phPrsk89ryw","consent_required":true},"referrer_detection":{"rd_ref":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam","rd_top":true,"rd_ifs":1,"rd_stk":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam,https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam"}}'

page_number = 2
page = 'https://www.pararius.com/apartments/amsterdam/page-' + str(page_number)
    
r = requests.get(page, headers=headers, data=data)
content = (r.text)
soup = BeautifulSoup(content, 'html.parser')



for section in soup.find_all(class_='property-list-item-container'):
    dlink = section.find('a').get('href')
    type = section.find('span', {'class': 'type'}).text
    neighborhood = section.find('a').text.strip().split()[1]
    size = section.find('li', {'class': 'surface'}).text.strip().split()[0]
    bedrooms = section.find('li', {'class': 'surface'}).text.strip().split()[2]
    furniture = section.find('li', {'class': 'surface'}).text.strip().split()[4]
    if furniture == 'upholstered':
        furniture = "Unfurnished"
    elif furniture == 'furnished or upholstered':
        furniture = "Furnished & Unfurnished"
    availablefrom = section.find('li', {'class': 'surface'}).text.strip().split()[6]
    price = section.find('p', {'class': 'price '}).text.strip().split()[0]
    curr = "EUR" if "€" in price else "other"
    print(curr)
    break

I should add that a search result might have, say, 50 pages, but it can also happen that it has only 30... how do I deal with that?
What should my next step be?

I would appreciate any kind of help/tip!
#2
In this particular case there is a <ul class="pagination"> element from which you can get the number of pages.

Another approach is to keep track of how many results you have scraped and compare that with the total number shown at the top of the page, in the <p class="count"> element.

In other cases it may happen that all the data is available e.g. as JSON, and it is easier to make the respective request directly instead of screen scraping by parsing the HTML, etc.
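To make the first two approaches concrete, here is an untested sketch against a made-up HTML fragment. The selectors ('p.count', 'ul.pagination') come from this thread; verify them against the live page source before relying on them.

```python
from bs4 import BeautifulSoup

# Made-up fragment mimicking the structure described in this thread
html = """
<p class="count">572 rental apartments found</p>
<ul class="pagination">
  <li><a href="/page-1">1</a></li>
  <li><a href="/page-2">2</a></li>
  <li class="next"><a href="/page-2">Next</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# Approach A: total result count from the <p class="count"> element
total_results = int(soup.find('p', {'class': 'count'}).text.split()[0])

# Approach B: highest page number among the pagination links,
# skipping non-numeric links such as "Next"
links = soup.find('ul', {'class': 'pagination'}).find_all('a')
num_pages = max(int(a.text) for a in links if a.text.strip().isdigit())

print(total_results, num_pages)  # 572 2
```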
#3
Thank you! :)

Do you know why I can't get the text from the code below?

for section2 in soup.find('ul', {'class': 'pagination'}):
    pages_total1 = section2.find('li')
    print(pages_total1)
It clearly has text inside it (an integer, of course). I tried converting it to str() and then getting the text, but with no results.


What I want now is to get the text first, then strip and split it, and then use len() - 1 to find the max number of pages.
#4
soup.find('ul', {'class': 'pagination'}) returns a single Tag; iterating over it yields its child nodes (including whitespace), not the <li> items you expect.
Not tested, but something like:
pagination = soup.find('ul', {'class': 'pagination'})
pages = pagination.find_all('li')
num_pages = int(pages[-2].text)
print(num_pages)
If you inspect the ul element you will notice there are 4 types of li items (no class attribute, class="empty", class="next", class="is-active"). You can play with a custom function that returns just the li items without a class.
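Untested sketch of that custom-function idea, against a made-up fragment of the pagination markup (the class names are assumptions based on this thread):

```python
from bs4 import BeautifulSoup

html = """
<ul class="pagination">
  <li class="previous"><a href="#">Prev</a></li>
  <li class="is-active"><a href="/page-1">1</a></li>
  <li><a href="/page-2">2</a></li>
  <li><a href="/page-3">3</a></li>
  <li class="next"><a href="/page-2">Next</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

def plain_li(tag):
    # keep only <li> elements that carry no class attribute at all
    return tag.name == 'li' and not tag.has_attr('class')

# find_all accepts a function; it is called once per tag in the tree
pagination = soup.find('ul', {'class': 'pagination'})
pages = pagination.find_all(plain_li)
num_pages = int(pages[-1].find('a').text)
print(num_pages)  # 3
```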
#5
Thank you for your support, but I am still struggling with how to get the number of pages.

page1 = soup.find('ul', {'class': 'pagination'})
pages = str(page1.find('li', {'class': 'last'}))
print(pages[-16:-14])
Using this, I theoretically get what I want... but suppose there are fewer than 10 pages.

Then the slice above would return the number plus one stray character (a slash or something similar).


Would something like this work:
print(pages[-16:-14]) if < 10 then print(pages[-15:-14])
?
#6
Don't convert HTML tags to str.

pagination = soup.find('ul', {'class': 'pagination'})
pages = pagination.find_all('li')
last_page = pages[-3]
num_pages = last_page.find('a').text
print(num_pages)
I didn't notice there are nested li elements.
#7
Thank you :)

#edit

Can I ask how to structure it properly so Python can pick up 'num_pages'? When I hard-code range(1, 12) it works, i.e. it returns the direct links for pages 1 to 11.

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import urllib

headers = {
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.pararius.com/apartments/amsterdam',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Content-Type': 'text/plain',
}

data = '{"tags":[{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"5f5a2718d3aa6d","id":11247563,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true},{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"66526a063a1a8c","id":11247564,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true}],"sdk":{"source":"pbjs","version":"2.19.0-pre"},"gdpr_consent":{"consent_string":"BOmDsv2OmDsv2BQABBENCN-AAAAmd7_______9______5uz_Ov_v_f__33e8__9v_l_7_-___u_-3zd4-_1vf99yfm1-7etr3tp_87ues2_Xur__59__3z3_9phPrsk89ryw","consent_required":true},"referrer_detection":{"rd_ref":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam","rd_top":true,"rd_ifs":1,"rd_stk":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam,https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam"}}'


for n in range(1, 12):
    page = 'https://www.pararius.com/apartments/amsterdam/page-' + str(n)
    print(num_pages)
    
r = requests.get(page, headers=headers, data=data)
content = (r.text)
soup = BeautifulSoup(content, 'html.parser')


#pagination- find max pages

page1 = soup.find('ul', {'class': 'pagination'})
pages = page1.find_all('li')
last_page = pages[-3]
num_pages = last_page.find('a').text
#print(num_pages)


for section in soup.find_all(class_='property-list-item-container'):
    dlink = section.find('a').get('href')
    type = section.find('span', {'class': 'type'}).text
    neighborhood = section.find('a').text.strip().split()[1]
    size = section.find('li', {'class': 'surface'}).text.strip().split()[0]
    bedrooms = section.find('li', {'class': 'surface'}).text.strip().split()[2]
    furniture = section.find('li', {'class': 'surface'}).text.strip().split()[4]
    if furniture == 'upholstered':
        furniture = "Unfurnished"
    elif furniture == 'furnished or upholstered':
        furniture = "Furnished & Unfurnished"
    availablefrom = section.find('li', {'class': 'surface'}).text.strip().split()[6]
    price = section.find('p', {'class': 'price '}).text.strip().split()[0]
    curr = "EUR" if "€" in price else "other"

    break

What am I doing wrong?
I wanted to do
for n in range(1, num_pages):
but it doesn't work.

I was trying to move the code above/below, but I am doing something wrong...
#8
You need to retrieve num_pages before you can use it.
Retrieve the first page's data and num_pages,
then loop in range(2, num_pages + 1) and get the data you want for the rest of the pages.
Now is a good time to split your code into functions.
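A rough, untested sketch of that structure. The helper names and the body of parse_listings are illustrative, not from the thread; fill in the fields you actually need.

```python
from bs4 import BeautifulSoup
import requests

BASE = 'https://www.pararius.com/apartments/amsterdam/page-'

def get_soup(page_number):
    # fetch one listing page and parse it
    r = requests.get(BASE + str(page_number))
    return BeautifulSoup(r.text, 'html.parser')

def get_num_pages(soup):
    # read the total page count out of the pagination element,
    # following the pages[-3] convention from earlier in this thread
    pagination = soup.find('ul', {'class': 'pagination'})
    pages = pagination.find_all('li')
    return int(pages[-3].find('a').text)

def parse_listings(soup):
    # extract whatever fields you need; links only, as an example
    return [section.find('a').get('href')
            for section in soup.find_all(class_='property-list-item-container')]

def main():
    first = get_soup(1)                      # page 1: data AND num_pages
    num_pages = get_num_pages(first)
    listings = parse_listings(first)
    for n in range(2, num_pages + 1):        # remaining pages: 2 .. num_pages
        listings.extend(parse_listings(get_soup(n)))
    return listings
```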
#9
(Sep-02-2019, 10:52 AM)buran Wrote: you need to retrieve the num_pages before being able to use it.

Could I ask you for some additional help? I got stuck :( I would appreciate any help.


How come I can't get num_pages from this?

#pagination- find max pages
page1 = soup.find('ul', {'class': 'pagination'})
pages = page1.find_all('li')
last_page = pages[-3]
num_pages = last_page.find('a').text
#print(num_pages)

for n in range(1, num_pages):
    page = 'https://www.pararius.com/apartments/amsterdam/page-1' + n
    print(num_pages)
It returns: TypeError: 'str' object cannot be interpreted as an integer
so I tried changing it to int, but that didn't work.

I guess I have to make a function that returns num_pages and then call it? It would be my first function :P Can I get an assist?
#10
You need to convert it to an integer with the int() function. See my previous example; you should be able to fix errors like this by now.
Also note that range(1, 55) will give you the numbers from 1 to 54.
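A tiny illustration of both points:

```python
# .text gives you a str; int() turns it into a number you can use in range()
num_pages = int('55')

# range() stops one short of its second argument, hence the + 1
pages_visited = list(range(1, num_pages + 1))
print(pages_visited[0], pages_visited[-1])  # 1 55
```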
