Python Forum

Full Version: Beautifulsoup Scraping
Good day folks,

I started my first web scraping project last week. The code works (at a beginner level), but I am struggling to take it further. It would be great if you could give me some advice or point me in the right direction.




1. looping over different webpages

The listings are spread over several pages that I need to loop over. I have appended "&page=(page)" to the URL, but this does not seem to work, as it only scrapes one page.

2. problem with the id variable

I would like to retrieve the data-id from the HTML code below. This is a little tricky, as it does not seem to be part of the class, so I am not able to retrieve it the way I do for hp.

Thanks for the assistance,
Regards,
(Jun-21-2019, 10:26 AM) PolskaYBZ wrote: I have added to the weblink "&page=(page)", but this does not seem to work as it only scrape one page.
The link is slightly wrong. You can use an f-string to insert the page number into the URL string.
# Build the URL for pages 1 to 3; the page number goes at the end of the query string
for page in range(1, 4):
    link = f"https://www.otodom.pl/wynajem/mieszkanie/krakow/?search%5Bdescription%5D=1&search%5Bsubregion_id%5D=410&search%5Bcity_id%5D=38&page={page}"
    print(link)
Now you can test the output and see that, e.g., link 2 switches to page 2.
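If you would rather not write the percent-encoded brackets by hand, here is a minimal sketch that builds the same links with urllib.parse.urlencode from the standard library. BASE_URL and build_page_url are names made up for this example, and the parameter values are simply the ones from the link above.

```python
from urllib.parse import urlencode

# Base address of the listing, taken from the link in this thread
BASE_URL = "https://www.otodom.pl/wynajem/mieszkanie/krakow/"


def build_page_url(page):
    """Return the listing URL for one page (hypothetical helper)."""
    params = {
        "search[description]": 1,
        "search[subregion_id]": 410,
        "search[city_id]": 38,
        "page": page,
    }
    # urlencode percent-encodes the brackets: [ becomes %5B and ] becomes %5D
    return f"{BASE_URL}?{urlencode(params)}"


page_urls = [build_page_url(page) for page in range(1, 4)]
for url in page_urls:
    print(url)
```

Each URL can then be passed to whatever you already use to download a page, one request per loop iteration.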
Quote: I would like to retrieve the data-id from the below HTML code. But this is a little tricky as it seems it is not part from the class and therefore I am not able to retrieve it like hp
data-id is an attribute of the <a> link tag, so you can use attrs to get it.
from bs4 import BeautifulSoup

html = '''\
<a class="button-observed observe-link favourites-button observed-text svg-heart add-to-favourites" data-statkey="ad.observed.list" rel="nofollow" data-id="45264798" href="#" title="Obserwuj">
<div class="observed-text-container" style="display: flex;">'''

soup = BeautifulSoup(html, 'lxml')
link_a = soup.find('a')
>>> link_a.attrs
{'class': ['button-observed',
           'observe-link',
           'favourites-button',
           'observed-text',
           'svg-heart',
           'add-to-favourites'],
 'data-id': '45264798',
 'data-statkey': 'ad.observed.list',
 'href': '#',
 'rel': ['nofollow'],
 'title': 'Obserwuj'}

>>> link_a.attrs['data-id']
'45264798'
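When looping over many listings, some links may not carry a data-id at all. Here is a small sketch of two ways to handle that, assuming simplified markup: the three <a> tags below are made up for illustration (only the two ids appear in this thread), and 'html.parser' is used instead of 'lxml' just to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup: two links with a data-id and one without
html = """
<a class="button-observed observe-link" data-id="45264798" href="#">first</a>
<a class="button-observed observe-link" href="#">no id here</a>
<a class="button-observed observe-link" data-id="59342283" href="#">second</a>
"""

soup = BeautifulSoup(html, "html.parser")

# attrs={"data-id": True} matches only tags that have the attribute at all
ids = [a["data-id"] for a in soup.find_all("a", attrs={"data-id": True})]
print(ids)  # ['45264798', '59342283']

# .get() returns None for a missing attribute instead of raising KeyError
missing = soup.find_all("a")[1].get("data-id")
print(missing)  # None
```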
1)
You have the wrong link for changing page numbers.
Try:
https://www.otodom.pl/wynajem/mieszkanie/krakow/?search%5Bdescription%5D=1&search%5Bsubregion_id%5D=410&search%5Bcity_id%5D=38&page=5
as opposed to:
https://www.otodom.pl/wynajem/mieszkanie/krakow/?search%5Bdescription%5D=1&search%5Bsubregion_id%5D=410&search%5Bcity_id%5D=38#form&page=5
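That is, omit the #form part. Everything after # is a URL fragment, which is not sent to the server, so the page parameter was being ignored.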
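To see the difference between the two links without making any request, here is a short sketch using urllib.parse from the standard library. The variable names are only for illustration.

```python
from urllib.parse import parse_qs, urlsplit

broken = "https://www.otodom.pl/wynajem/mieszkanie/krakow/?search%5Bdescription%5D=1&search%5Bsubregion_id%5D=410&search%5Bcity_id%5D=38#form&page=5"
fixed = "https://www.otodom.pl/wynajem/mieszkanie/krakow/?search%5Bdescription%5D=1&search%5Bsubregion_id%5D=410&search%5Bcity_id%5D=38&page=5"

broken_parts = urlsplit(broken)
fixed_parts = urlsplit(fixed)

# In the broken link, "&page=5" ends up inside the fragment, not the query
print(broken_parts.fragment)                     # form&page=5
print(parse_qs(broken_parts.query).get("page"))  # None
print(parse_qs(fixed_parts.query).get("page"))   # ['5']
```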

2)
You only have to index the tag with the data-id attribute to obtain that part:
house.find('a',class_="button-observed observe-link favourites-button observed-text")['data-id']
Output:
metulburr@ubuntu:~$ python3.6 forum11.py
<Response [200]>
59342283
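One thing to keep in mind with class_: when you pass a string containing several class names, BeautifulSoup compares it against the exact value of the class attribute, so the lookup stops matching if the site adds or reorders classes. Here is a sketch of a more tolerant lookup with a CSS selector, assuming a made-up one-tag snippet in place of the real page ('html.parser' is used here only to avoid extra dependencies).

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for one listing on the real page
html = '<a class="button-observed observe-link favourites-button" data-id="59342283" href="#">x</a>'
house = BeautifulSoup(html, "html.parser")

# A multi-class string must equal the whole class attribute, so this finds nothing
exact = house.find("a", class_="button-observed observe-link")
print(exact)  # None

# A CSS selector only requires the listed classes to be present, in any order
by_css = house.select_one("a.button-observed.observe-link")
print(by_css["data-id"])  # 59342283
```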
Both comments helped a lot, much appreciated :)