Python Forum
Beautifulsoup Scraping
#1
Good day folks,

I started my first web scraping project last week. The code works (beginner level), but I am struggling to take it further. It would be nice if you could give some advice or point me in the right direction.




1. Looping over different pages

The search results are spread over several pages that I need to loop over. I added "&page=(page)" to the URL, but this does not seem to work, as only one page gets scraped.

2. Problem with the id variable

I would like to retrieve the data-id from the HTML code below. This is a little tricky, as it does not seem to be part of the class, so I am not able to retrieve it the way I retrieve hp.

Thanks for the assistance,
Regards,
#2
(Jun-21-2019, 10:26 AM)PolskaYBZ Wrote: I have added to the weblink "&page=(page)", but this does not seem to work as it only scrape one page.
The link is slightly wrong. You can use an f-string to insert the page number into the URL.
for page in range(1, 4):
    link = f"https://www.otodom.pl/wynajem/mieszkanie/krakow/?search%5Bdescription%5D=1&search%5Bsubregion_id%5D=410&search%5Bcity_id%5D=38&page={page}"
    print(link)
Now you can test the output and check that, for example, the second link really opens page 2.
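If hand-writing the percent-encoded query string feels error-prone, the same links can be built from a plain dict with the standard library. This is only a sketch: the parameter names and values are read off the URL above, while BASE_URL and build_page_url are names made up for this example.

```python
from urllib.parse import urlencode

# Base path taken from the link above; the name BASE_URL is just for this sketch.
BASE_URL = "https://www.otodom.pl/wynajem/mieszkanie/krakow/"


def build_page_url(page):
    """Return the search URL for one result page (hypothetical helper)."""
    params = {
        "search[description]": 1,
        "search[subregion_id]": 410,
        "search[city_id]": 38,
        "page": page,
    }
    # urlencode percent-encodes the square brackets as %5B and %5D.
    return f"{BASE_URL}?{urlencode(params)}"


page_urls = [build_page_url(page) for page in range(1, 4)]
for url in page_urls:
    print(url)
```

Each printed link should match the f-string version, with only the page number changing.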
Quote:I would like to retrieve the data-id from the below HTML code. But this is a little tricky as it seems it is not part from the class and therefore I am not able to retrieve it like hp
data-id is an attribute of the a tag, so you can use attrs to get it.
from bs4 import BeautifulSoup

html = '''\
<a class="button-observed observe-link favourites-button observed-text svg-heart add-to-favourites" data-statkey="ad.observed.list" rel="nofollow" data-id="45264798" href="#" title="Obserwuj">
<div class="observed-text-container" style="display: flex;">'''

soup = BeautifulSoup(html, 'lxml')
link_a = soup.find('a')
>>> link_a.attrs
{'class': ['button-observed',
           'observe-link',
           'favourites-button',
           'observed-text',
           'svg-heart',
           'add-to-favourites'],
 'data-id': '45264798',
 'data-statkey': 'ad.observed.list',
 'href': '#',
 'rel': ['nofollow'],
 'title': 'Obserwuj'}

>>> link_a.attrs['data-id']
'45264798'
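On a real results page there will be many such links, and some a tags will not carry a data-id at all. A minimal sketch of how that might be handled, using a small made-up HTML fragment (the offer/details class names and the ids are invented for the example) and the built-in html.parser so that lxml is not required:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page will look different.
html = '''\
<div class="offer"><a class="add-to-favourites" data-id="111" href="#">A</a></div>
<div class="offer"><a class="add-to-favourites" data-id="222" href="#">B</a></div>
<div class="offer"><a class="details" href="/x">no id here</a></div>'''

soup = BeautifulSoup(html, "html.parser")

# attrs={"data-id": True} matches every <a> that has the attribute, whatever its value.
ids = [a["data-id"] for a in soup.find_all("a", attrs={"data-id": True})]
print(ids)

# .get() returns None instead of raising KeyError when the attribute is missing.
missing = soup.find("a", class_="details").get("data-id")
print(missing)
```

Using tag.get('data-id') rather than tag['data-id'] is the safer choice whenever the attribute might be absent.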
#3
1)
The link you are using to change page numbers is wrong.
Try:
https://www.otodom.pl/wynajem/mieszkanie/krakow/?search%5Bdescription%5D=1&search%5Bsubregion_id%5D=410&search%5Bcity_id%5D=38&page=5
as opposed to:
https://www.otodom.pl/wynajem/mieszkanie/krakow/?search%5Bdescription%5D=1&search%5Bsubregion_id%5D=410&search%5Bcity_id%5D=38#form&page=5
In other words, omit the #form part.
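The reason #form gets in the way is that everything after the # is the URL fragment, and the fragment is normally not sent to the server as part of the request. With #form in front of it, page=5 ends up in the fragment rather than in the query string. A quick way to check this with the standard library (the variable names are just for this sketch):

```python
from urllib.parse import urlsplit, parse_qs

broken = ("https://www.otodom.pl/wynajem/mieszkanie/krakow/"
          "?search%5Bdescription%5D=1&search%5Bsubregion_id%5D=410"
          "&search%5Bcity_id%5D=38#form&page=5")
fixed = ("https://www.otodom.pl/wynajem/mieszkanie/krakow/"
         "?search%5Bdescription%5D=1&search%5Bsubregion_id%5D=410"
         "&search%5Bcity_id%5D=38&page=5")

broken_parts = urlsplit(broken)
fixed_parts = urlsplit(fixed)

# In the broken link, "page=5" is swallowed by the fragment.
print(broken_parts.fragment)

# parse_qs turns the query string into a dict of lists.
broken_query = parse_qs(broken_parts.query)
fixed_query = parse_qs(fixed_parts.query)
print("page" in broken_query)
print(fixed_query["page"])
```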

2)
You only have to index the tag with the data-id attribute to get that value:
house.find('a', class_="button-observed observe-link favourites-button observed-text")['data-id']
Output:
metulburr@ubuntu:~$ python3.6 forum11.py
<Response [200]>
59342283
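One caveat, in case the lookup above returns None for you: when class_ is given a string with several classes, BeautifulSoup compares it against the complete class attribute, so it stops matching as soon as the site adds or reorders a class. The a tag quoted in post #2 has two extra classes (svg-heart and add-to-favourites), and against that markup the four-class string finds nothing. A CSS selector on one class plus the attribute is less brittle. A small sketch, with html.parser used only to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

# The <a> tag as quoted in post #2, closed here so the snippet is self-contained.
html = ('<a class="button-observed observe-link favourites-button '
        'observed-text svg-heart add-to-favourites" '
        'data-id="45264798" href="#" title="Obserwuj"></a>')

soup = BeautifulSoup(html, "html.parser")

# Multi-class string: only matches if it equals the whole class attribute.
exact = soup.find(
    "a", class_="button-observed observe-link favourites-button observed-text"
)
print(exact)

# CSS selector: any <a> with class button-observed that also has a data-id.
by_selector = soup.select_one("a.button-observed[data-id]")
print(by_selector["data-id"])
```

Which class to key on is a judgment call; button-observed is simply the first one in the quoted tag.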
#4
Both comments helped a lot, much appreciated :)