Beautifulsoup Scraping

PolskaYBZ · Jun-21-2019, 10:26 AM

Good day folks,

I started my first web scraping project last week. The code works (beginner level), but I am struggling to continue. It would be nice if you could give some advice or point me in the right direction.


from bs4 import BeautifulSoup
from requests import get
import pandas as pd
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
from time import sleep
from random import randint
sns.set()

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})

for page in range (0,5):
    link = "https://www.otodom.pl/wynajem/mieszkanie/krakow/?search%5Bdescription%5D=1&search%5Bsubregion_id%5D=410&search%5Bcity_id%5D=38#form&page=(page)"
    response = get(link, headers = headers)
    print(response)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    house_containers = html_soup.find_all('div', class_="offer-item-details")
    list_houseprice = []
    idlist = []

    for house in house_containers:
        hp = house.find('li', class_="offer-item-price").text
        hp = hp.replace('\n','')
        hp = hp.replace('/mc','')
        hp = hp.replace(" ", "")
        list_houseprice.append(hp)  # append to the list defined above (houseprice was undefined)

        id = house.find('a',class_="button-observed observe-link favourites-button observed-text")

sleep(randint(1,2))

1. looping over different webpages

The site splits the results across several pages that I need to loop over. I have added "&page=(page)" to the link, but this does not seem to work, as it only scrapes one page.

2. problem with the id variable

I would like to retrieve the data-id from the HTML code below. But this is a little tricky, as it is not part of the class, and therefore I am not able to retrieve it the way I retrieve hp.


<a class="button-observed observe-link favourites-button observed-text svg-heart add-to-favourites" data-statkey="ad.observed.list" rel="nofollow" data-id="45264798" href="#" title="Obserwuj">
<div class="observed-text-container" style="display: flex;">

<span class="icon observed-45264798"></span>
<svg width="20px" height="20px" viewBox="0 0 52 47" version="1.1" xmlns="http://www.w3.org/2000/svg">
<defs></defs>
<g id="14_SubAccounts" stroke="none" stroke-width="1" fill-rule="evenodd">
<g id="#-styleguide" transform="translate(-290.000000, -3431.000000)" fill-rule="nonzero">
<g id="heart-regular" transform="translate(290.000000, 3417.000000)">
<g id="Group-7" transform="translate(0.000000, 14.000000)">
<g id="Group-6">
<path fill="none" d="M44.9749792,5.0327098 C39.5562332,0.472655041 31.1843643,1.15760154 25.9999932,6.44013411 C20.815622,1.15760154 12.4437532,0.463272213 7.0250072,5.0327098 C-0.0249875872,10.9720404 1.00626165,20.6551196 6.03125794,25.7875269 L22.4749958,42.5546419 C23.4124951,43.5116904 24.6687442,44.0465116 25.9999932,44.0465116 C27.3406172,44.0465116 28.5874913,43.5210732 29.5249906,42.5640247 L45.9687284,25.7969098 C50.9843497,20.6645025 52.0343489,10.9814232 44.9749792,5.0327098 Z" id="Path1"></path>
<path fill="black" d="M46.2718892,3.50241459 C53.7431472,9.79821295 53.7229457,20.7330078 47.4057954,27.197245 L30.9628717,43.9635293 C29.6487574,45.3050462 27.8867501,46.04647 26.009144,46.04647 C24.1408152,46.04647 22.3695113,45.2956437 21.0562304,43.9549771 L4.61132027,27.1866659 C-1.70790147,20.7323686 -1.72137807,9.79370898 5.74484956,3.50371964 C11.4787764,-1.33149882 20.0613911,-1.07727526 26.0092656,3.74139162 C31.9542156,-1.07326776 40.5355381,-1.32491546 46.2718892,3.50241459 Z M27.4365577,7.8409816 L26.009144,9.29542206 L24.5817303,7.8409816 C20.0856339,3.25975559 12.8944426,2.70707366 8.32274593,6.56222409 C2.70796719,11.2924442 2.7181226,19.5353679 7.46832495,24.3871085 L23.9128769,41.155054 C24.4776636,41.731618 25.2204435,42.04647 26.009144,42.04647 C26.8109459,42.04647 27.5432365,41.7383341 28.1062252,41.1636062 L44.5474853,24.3990223 C49.2979771,19.5379178 49.3133455,11.2966718 43.6958622,6.56249388 C39.1208731,2.7129779 31.9290821,3.26339526 27.4365577,7.8409816 Z" id="Path2"></path>
</g>
</g>
</g>
</g>
</g>
</svg>
<div class="observed-label">Dodaj do ulubionych</div>
</div>
</a>

Thanks for the assistance,
Regards,

***snippsat*** · (This post was last modified: Jun-21-2019, 11:48 AM by snippsat.)

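(Jun-21-2019, 10:26 AM)PolskaYBZ Wrote: I have added "&page=(page)" to the link, but this does not seem to work, as it only scrapes one page.

The link is slightly wrong. Everything after # is a URL fragment, which is not sent to the server, and (page) inside a normal string is just literal text. Remove #form and use an f-string to insert the page number into the link string.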

for page in range (1,4):
    link = f"https://www.otodom.pl/wynajem/mieszkanie/krakow/?search%5Bdescription%5D=1&search%5Bsubregion_id%5D=410&search%5Bcity_id%5D=38&page={page}"
    print(link)

Now you can test the output and see that, e.g., link 2 switches to page 2.
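A quick way to see why the original link always returned the same page is to split both links with the standard library. This is only an illustrative sketch: `BASE` and `page_url` are names made up here, and nothing is requested from the site.

```python
from urllib.parse import urlsplit, parse_qs

# Search link from the thread, without the "#form" fragment (BASE is a made-up name).
BASE = ("https://www.otodom.pl/wynajem/mieszkanie/krakow/"
        "?search%5Bdescription%5D=1&search%5Bsubregion_id%5D=410"
        "&search%5Bcity_id%5D=38")


def page_url(page):
    """Build the link for one result page (hypothetical helper)."""
    return f"{BASE}&page={page}"


broken = BASE + "#form&page=2"   # shape of the original link
fixed = page_url(2)              # shape of the corrected link

broken_parts = urlsplit(broken)
fixed_parts = urlsplit(fixed)

# In the broken link, "page" ends up in the fragment, not in the query string.
print(broken_parts.fragment)
print(parse_qs(broken_parts.query).get("page"))

# In the fixed link, "page" is a regular query parameter.
print(parse_qs(fixed_parts.query).get("page"))
```

Since the fragment stays on the client side, the server sees the same query for every value of page in the broken version.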

Quote: I would like to retrieve the data-id from the HTML code below. But this is a little tricky, as it is not part of the class, and therefore I am not able to retrieve it the way I retrieve hp.

data-id is an attribute of the a tag, so you can use attrs to get it.

from bs4 import BeautifulSoup

html = '''\
<a class="button-observed observe-link favourites-button observed-text svg-heart add-to-favourites" data-statkey="ad.observed.list" rel="nofollow" data-id="45264798" href="#" title="Obserwuj">
<div class="observed-text-container" style="display: flex;">'''

soup = BeautifulSoup(html, 'lxml')
link_a = soup.find('a')

>>> link_a.attrs
{'class': ['button-observed',
           'observe-link',
           'favourites-button',
           'observed-text',
           'svg-heart',
           'add-to-favourites'],
 'data-id': '45264798',
 'data-statkey': 'ad.observed.list',
 'href': '#',
 'rel': ['nofollow'],
 'title': 'Obserwuj'}

>>> link_a.attrs['data-id']
'45264798'
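As a small addition, here is a sketch of two details that may matter on the real page: what happens when a tag has no data-id, and how class_ behaves with a multi-class string. The HTML below is a shortened, made-up sample, and html.parser is used only so the example needs no extra parser.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup: one link with data-id, one without.
html = (
    '<div>'
    '<a class="button-observed observe-link svg-heart" data-id="45264798" href="#">x</a>'
    '<a class="plain" href="#">y</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# A CSS attribute selector finds the first <a> that has a data-id at all.
with_id = soup.select_one("a[data-id]")
first_id = with_id["data-id"]

# tag["data-id"] raises KeyError when the attribute is missing;
# tag.get("data-id") returns None instead.
plain = soup.find("a", class_="plain")
missing = plain.get("data-id")

# A class_ string containing spaces is compared with the full class value,
# so a partial list of classes does not match, while a single class does.
partial = soup.find("a", class_="button-observed observe-link")
single = soup.find("a", class_="observe-link")

print(first_id, missing, partial is None, single is with_id)
```

If the class list on the site changes, matching on a single class or on the presence of data-id is likely to be less brittle than matching the whole class string.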

***metulburr*** · (This post was last modified: Jun-21-2019, 12:09 PM by metulburr.)

1)
You have the wrong link for changing page numbers. Try:

https://www.otodom.pl/wynajem/mieszkanie/krakow/?search%5Bdescription%5D=1&search%5Bsubregion_id%5D=410&search%5Bcity_id%5D=38&page=5

as opposed to:

https://www.otodom.pl/wynajem/mieszkanie/krakow/?search%5Bdescription%5D=1&search%5Bsubregion_id%5D=410&search%5Bcity_id%5D=38#form&page=5

That is, omit the #form part.

2)
You only have to index the tag with the data-id attribute to get that value:

house.find('a',class_="button-observed observe-link favourites-button observed-text")['data-id']

Output:
metulburr@ubuntu:~$ python3.6 forum11.py
<Response [200]>
59342283
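Putting both fixes together, the loop could be organised roughly as below. This is a sketch under assumptions, not the code that produced the output above: `SAMPLE_PAGE`, `parse_listings`, `scrape_pages` and the `fetch` argument are all made up for illustration, and the sample markup is a stripped-down guess at the page structure. Two things differ from the original script: the result list is created once before the loop so that pages accumulate, and the pause sits inside the loop so it runs between requests.

```python
import time
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup modelled on the snippets in this thread.
SAMPLE_PAGE = """
<div class="offer-item-details">
  <li class="offer-item-price">
    2 500 zł /mc
  </li>
  <a class="button-observed observe-link" data-id="45264798" href="#"></a>
</div>
<div class="offer-item-details">
  <li class="offer-item-price">1 800 zł /mc</li>
</div>
"""


def parse_listings(html_text):
    """Return one dict per listing; missing pieces become None."""
    soup = BeautifulSoup(html_text, "html.parser")
    rows = []
    for house in soup.find_all("div", class_="offer-item-details"):
        price_tag = house.find("li", class_="offer-item-price")
        link_tag = house.find("a", attrs={"data-id": True})

        price = None
        if price_tag is not None:
            # Drop the "/mc" suffix, then remove all whitespace.
            price = "".join(price_tag.get_text().replace("/mc", "").split())

        ad_id = link_tag.get("data-id") if link_tag is not None else None
        rows.append({"price": price, "id": ad_id})
    return rows


def scrape_pages(pages, fetch, pause=0.0):
    """Collect listings from several pages; fetch(page) must return HTML text."""
    all_rows = []                  # created once, so results accumulate
    for page in pages:
        all_rows.extend(parse_listings(fetch(page)))
        time.sleep(pause)          # inside the loop: pause between pages
    return all_rows


# Offline demo: every "page" returns the same sample markup.
rows = scrape_pages(range(1, 3), fetch=lambda page: SAMPLE_PAGE)
print(rows)
```

For real use, fetch would wrap the requests call from the original script, i.e. a function that builds the link for the given page, calls get with the headers, and returns response.text, with pause set to a second or two.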

PolskaYBZ · (This post was last modified: Jun-22-2019, 10:05 AM by PolskaYBZ.)

Both comments helped a lot, much appreciated :)
