pagination for non standarded pages - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: pagination for non standarded pages (/thread-20795.html)

Pages: 1 2
pagination for non standarded pages - zarize - Aug-30-2019

Hi guys, I am learning scraping and I am currently stuck on the point of adding pagination to the script below:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import urllib

headers = {
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.pararius.com/apartments/amsterdam',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Content-Type': 'text/plain',
}

data = '{"tags":[{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"5f5a2718d3aa6d","id":11247563,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true},{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"66526a063a1a8c","id":11247564,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true}],"sdk":{"source":"pbjs","version":"2.19.0-pre"},"gdpr_consent":{"consent_string":"BOmDsv2OmDsv2BQABBENCN-AAAAmd7_______9______5uz_Ov_v_f__33e8__9v_l_7_-___u_-3zd4-_1vf99yfm1-7etr3tp_87ues2_Xur__59__3z3_9phPrsk89ryw","consent_required":true},"referrer_detection":{"rd_ref":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam","rd_top":true,"rd_ifs":1,"rd_stk":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam,https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam"}}'

page_number = 2
page = 'https://www.pararius.com/apartments/amsterdam/page-' + str(page_number)
r = requests.get(page, headers=headers, data=data)
content = r.text
soup = BeautifulSoup(content, 'html.parser')

for section in soup.find_all(class_='property-list-item-container'):
    dlink = section.find('a').get('href')
    type = section.find('span', {'class': 'type'}).text
    neighborhood = section.find('a').text.strip().split()[1]
    size = section.find('li', {'class': 'surface'}).text.strip().split()[0]
    bedrooms = section.find('li', {'class': 'surface'}).text.strip().split()[2]
    furniture = section.find('li', {'class': 'surface'}).text.strip().split()[4]
    if furniture == 'upholstered':
        furniture = "Unfurnished"
    elif furniture == 'furnished or upholstered':
        furniture = "Furnished & Unfurnished"
    availablefrom = size = section.find('li', {'class': 'surface'}).text.strip().split()[6]
    price = section.find('p', {'class': 'price '}).text.strip().split()[0]
    curr = "EUR" if "€" in price else "other"
    print(curr)
    break

I have to add that a result set from the site might have, say, 50 pages, or it might happen that it has only 30... How do I deal with that? What should my next step be? I would appreciate any kind of help/tip!

RE: pagination for non standarded pages - buran - Aug-30-2019

In this particular case there is an element <ul class="pagination"> from which you can get information about the number of pages. Another approach is to keep track of how many results you have scraped and compare that with the total number available at the top of the page in the <p class="count"> element. In other cases it may happen that you can get all the data, e.g. as JSON, and it is easier to make the respective request instead of screen-scraping by parsing the HTML, etc.

RE: pagination for non standarded pages - zarize - Aug-30-2019

Thank you! :) Do you know why I can't get the text from the below?
for section2 in soup.find('ul', {'class': 'pagination'}):
    pages_total1 = section2.find('li')
    print(pages_total1)

It clearly has text inside, which is an int() of course, but I tried to convert it to str() and then get the text; however, no results. What I want now is to get the text first, then strip and split it, and then use len - 1 to find the max number of pages.

RE: pagination for non standarded pages - buran - Aug-30-2019

soup.find('ul', {'class': 'pagination'}) will return a single element and you cannot iterate over it. Not tested, but something like:

pagination = soup.find('ul', {'class': 'pagination'})
pages = pagination.find_all('li')
num_pages = int(pages[-2].text)
print(num_pages)

If you inspect the ul element you will notice there are 4 types of li items (without class attribute, class=empty, class=next, class=is-active). You can play with a custom function that will return just the li items without a class.

RE: pagination for non standarded pages - zarize - Aug-30-2019

Thank you for your support, but I am still struggling with how to get the number of pages.

page1 = soup.find('ul', {'class': 'pagination'})
pages = str(page1.find('li', {'class': 'last'}))
print(pages[-16:-14])

By using this, I theoretically get what I want... but let's assume there are fewer than 10 pages; then the code above would return the number plus one random character (like a slash or something like that). Would something like this work: print(pages[-16:-14]), and if it is < 10, then print(pages[-15:-14])?

RE: pagination for non standarded pages - buran - Aug-30-2019

Don't convert HTML tags to str:

pagination = soup.find('ul', {'class': 'pagination'})
pages = pagination.find_all('li')
last_page = pages[-3]
num_pages = last_page.find('a').text
print(num_pages)

I didn't notice there are nested li elements.

RE: pagination for non standarded pages - zarize - Sep-02-2019

Thank you :)

#edit Can I ask you how to structure it properly so Python can read num_pages? If I use range(1, 12) it works; I mean, it returns the direct links from 1 to 11.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import urllib

headers = {
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.pararius.com/apartments/amsterdam',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Content-Type': 'text/plain',
}

data = '{"tags":[{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"5f5a2718d3aa6d","id":11247563,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true},{"sizes":[{"width":728,"height":90},{"width":970,"height":250}],"primary_size":{"width":728,"height":90},"ad_types":["banner"],"uuid":"66526a063a1a8c","id":11247564,"allow_smaller_sizes":false,"use_pmt_rule":false,"prebid":true,"disable_psa":true}],"sdk":{"source":"pbjs","version":"2.19.0-pre"},"gdpr_consent":{"consent_string":"BOmDsv2OmDsv2BQABBENCN-AAAAmd7_______9______5uz_Ov_v_f__33e8__9v_l_7_-___u_-3zd4-_1vf99yfm1-7etr3tp_87ues2_Xur__59__3z3_9phPrsk89ryw","consent_required":true},"referrer_detection":{"rd_ref":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam","rd_top":true,"rd_ifs":1,"rd_stk":"https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam,https%3A%2F%2Fwww.pararius.com%2Fapartments%2Famsterdam"}}'

for n in range(1, 12):
    page = 'https://www.pararius.com/apartments/amsterdam/page-' + str(n)
    print(num_pages)
    r = requests.get(page, headers=headers, data=data)
    content = r.text
    soup = BeautifulSoup(content, 'html.parser')

    # pagination - find max pages
    page1 = soup.find('ul', {'class': 'pagination'})
    pages = page1.find_all('li')
    last_page = pages[-3]
    num_pages = last_page.find('a').text
    #print(num_pages)

    for section in soup.find_all(class_='property-list-item-container'):
        dlink = section.find('a').get('href')
        type = section.find('span', {'class': 'type'}).text
        neighborhood = section.find('a').text.strip().split()[1]
        size = section.find('li', {'class': 'surface'}).text.strip().split()[0]
        bedrooms = section.find('li', {'class': 'surface'}).text.strip().split()[2]
        furniture = section.find('li', {'class': 'surface'}).text.strip().split()[4]
        if furniture == 'upholstered':
            furniture = "Unfurnished"
        elif furniture == 'furnished or upholstered':
            furniture = "Furnished & Unfurnished"
        availablefrom = size = section.find('li', {'class': 'surface'}).text.strip().split()[6]
        price = section.find('p', {'class': 'price '}).text.strip().split()[0]
        curr = "EUR" if "€" in price else "other"
        break

What am I doing wrong? I wanted to do for n in range(1, num_pages): but it doesn't work. I was trying to move the code higher/lower, but I am doing something wrong...

RE: pagination for non standarded pages - buran - Sep-02-2019

You need to retrieve num_pages before being able to use it. You can retrieve the first page's data and num_pages, then loop in range(2, num_pages + 1) and get the data you want for the rest of the pages. Now is a good time to split your code into functions.

RE: pagination for non standarded pages - zarize - Sep-02-2019

(Sep-02-2019, 10:52 AM)buran Wrote: you need to retrieve the num_pages before being able to use it.

Could I ask you for some additional help? I got stuck :( I would appreciate any help. How come I can't get num_pages from this?

# pagination - find max pages
page1 = soup.find('ul', {'class': 'pagination'})
pages = page1.find_all('li')
last_page = pages[-3]
num_pages = last_page.find('a').text
#print(num_pages)

for n in range(1, num_pages):
    page = 'https://www.pararius.com/apartments/amsterdam/page-1' + n
    print(num_pages)

It returns: TypeError: 'str' object cannot be interpreted as an integer. So I tried to change it to int but it didn't work. I guess I have to make a function that returns num_pages and then I could call it? It would be my first function :P Can I get an assist?

RE: pagination for non standarded pages - buran - Sep-02-2019

You need to convert to an integer with the int() function.
See my previous example; you should be able to fix errors like this by now. Also note that range(1, 55) will give you the numbers from 1 to 54.
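Putting the thread's advice together (read the page count from the <ul class="pagination"> element, convert it with int(), then build the per-page URLs with range), a minimal function-based sketch might look like the following. The SAMPLE markup is only a guess at the site's structure, and the pages[-3] index comes from buran's post, so both may need adjusting against the live page:

```python
from bs4 import BeautifulSoup

# Mock of the site's pagination markup -- an assumption for illustration;
# inspect the real page before relying on these classes and positions.
SAMPLE = """
<ul class="pagination">
  <li><a href="/page-1">1</a></li>
  <li><a href="/page-2">2</a></li>
  <li><a href="/page-11">11</a></li>
  <li class="next"><a href="/page-2">next</a></li>
  <li class="empty"></li>
</ul>
"""

def get_num_pages(soup):
    """Return the total page count, or 1 when there is no pagination block."""
    pagination = soup.find('ul', {'class': 'pagination'})
    if pagination is None:
        return 1  # a single page of results, so no pagination element at all
    pages = pagination.find_all('li')
    # pages[-3] is the last numbered item (before "next"/"empty"), per the thread
    return int(pages[-3].find('a').text)

def page_urls(base, num_pages):
    """Build the URL of every results page, page-1 .. page-<num_pages>."""
    # range(1, n + 1) because range stops one short of its end value
    return ['{}/page-{}'.format(base, n) for n in range(1, num_pages + 1)]

soup = BeautifulSoup(SAMPLE, 'html.parser')
num_pages = get_num_pages(soup)  # -> 11 for the sample markup
urls = page_urls('https://www.pararius.com/apartments/amsterdam', num_pages)
print(num_pages, urls[0], urls[-1])
```

From here, the scraping loop is just: fetch urls[0], parse it, then iterate over the remaining URLs, exactly as buran's range(2, num_pages + 1) suggestion describes.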