Jun-21-2019, 10:26 AM
Good day folks,
I started my first web scraping project last week. The code works (beginner level), but I am struggling to continue. It would be nice if you could give some advice or point me in the right direction.
from bs4 import BeautifulSoup
from requests import get
import pandas as pd
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
from time import sleep
from random import randint

sns.set()

headers = ({'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})

for page in range(0, 5):
    link = "https://www.otodom.pl/wynajem/mieszkanie/krakow/?search%5Bdescription%5D=1&search%5Bsubregion_id%5D=410&search%5Bcity_id%5D=38#form&page=(page)"
    response = get(link, headers=headers)
    print(response)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    house_containers = html_soup.find_all('div', class_="offer-item-details")
    list_houseprice = []
    idlist = []
    for house in house_containers:
        hp = house.find('li', class_="offer-item-price").text
        hp = hp.replace('\n', '')
        hp = hp.replace('/mc', '')
        hp = hp.replace(" ", "")
        list_houseprice.append(hp)
        id = house.find('a', class_="button-observed observe-link favourites-button observed-text")
    sleep(randint(1, 2))
1. Looping over different webpages
The site spreads the results over several pages that I need to loop over. I added "&page=(page)" to the link, but this does not seem to work, as it only scrapes one page.
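A possible sketch for the paging problem, under some assumptions. Inside a plain string, "(page)" is literal text, so the loop variable never reaches the URL. In addition, everything after "#" is a URL fragment, which is not sent to the server, so a parameter appended after "#form" has no effect. The helper name build_page_url is hypothetical, and it is only an assumption that the site reads a "page" query parameter and starts counting at 1.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Search URL from the post, without the "&page=(page)" suffix.
BASE_URL = ("https://www.otodom.pl/wynajem/mieszkanie/krakow/"
            "?search%5Bdescription%5D=1&search%5Bsubregion_id%5D=410"
            "&search%5Bcity_id%5D=38#form")


def build_page_url(base_url, page):
    """Return base_url with page=<page> placed in the query string.

    Hypothetical helper: the parameter name "page" is an assumption
    about the site, not something confirmed by its documentation.
    """
    parts = urlsplit(base_url)
    # Decode the existing query, drop any old "page" entry, add the new one.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "page"]
    query.append(("page", str(page)))
    # The fragment ("form") stays at the very end, after the query.
    return urlunsplit(parts._replace(query=urlencode(query)))


for page in range(1, 4):
    print(build_page_url(BASE_URL, page))
```

Each URL produced this way could then be passed to get(link, headers=headers) inside the page loop. Creating the result lists once before the loop, rather than inside it, would keep the prices from earlier pages.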
2. Problem with the id variable
I would like to retrieve the data-id from the HTML code below. But this is a little tricky, as it seems it is not part of the class, and therefore I am not able to retrieve it the way I did for hp.
<a class="button-observed observe-link favourites-button observed-text svg-heart add-to-favourites" data-statkey="ad.observed.list" rel="nofollow" data-id="45264798" href="#" title="Obserwuj">
<div class="observed-text-container" style="display: flex;">
<span class="icon observed-45264798"></span>
<svg width="20px" height="20px" viewBox="0 0 52 47" version="1.1" xmlns="http://www.w3.org/2000/svg">
<defs></defs>
<g id="14_SubAccounts" stroke="none" stroke-width="1" fill-rule="evenodd">
<g id="#-styleguide" transform="translate(-290.000000, -3431.000000)" fill-rule="nonzero">
<g id="heart-regular" transform="translate(290.000000, 3417.000000)">
<g id="Group-7" transform="translate(0.000000, 14.000000)">
<g id="Group-6">
<path fill="none" d="M44.9749792,5.0327098 C39.5562332,0.472655041 31.1843643,1.15760154 25.9999932,6.44013411 C20.815622,1.15760154 12.4437532,0.463272213 7.0250072,5.0327098 C-0.0249875872,10.9720404 1.00626165,20.6551196 6.03125794,25.7875269 L22.4749958,42.5546419 C23.4124951,43.5116904 24.6687442,44.0465116 25.9999932,44.0465116 C27.3406172,44.0465116 28.5874913,43.5210732 29.5249906,42.5640247 L45.9687284,25.7969098 C50.9843497,20.6645025 52.0343489,10.9814232 44.9749792,5.0327098 Z" id="Path1"></path>
<path fill="black" d="M46.2718892,3.50241459 C53.7431472,9.79821295 53.7229457,20.7330078 47.4057954,27.197245 L30.9628717,43.9635293 C29.6487574,45.3050462 27.8867501,46.04647 26.009144,46.04647 C24.1408152,46.04647 22.3695113,45.2956437 21.0562304,43.9549771 L4.61132027,27.1866659 C-1.70790147,20.7323686 -1.72137807,9.79370898 5.74484956,3.50371964 C11.4787764,-1.33149882 20.0613911,-1.07727526 26.0092656,3.74139162 C31.9542156,-1.07326776 40.5355381,-1.32491546 46.2718892,3.50241459 Z M27.4365577,7.8409816 L26.009144,9.29542206 L24.5817303,7.8409816 C20.0856339,3.25975559 12.8944426,2.70707366 8.32274593,6.56222409 C2.70796719,11.2924442 2.7181226,19.5353679 7.46832495,24.3871085 L23.9128769,41.155054 C24.4776636,41.731618 25.2204435,42.04647 26.009144,42.04647 C26.8109459,42.04647 27.5432365,41.7383341 28.1062252,41.1636062 L44.5474853,24.3990223 C49.2979771,19.5379178 49.3133455,11.2966718 43.6958622,6.56249388 C39.1208731,2.7129779 31.9290821,3.26339526 27.4365577,7.8409816 Z" id="Path2"></path>
</g>
</g>
</g>
</g>
</g>
</svg>
<div class="observed-label">Dodaj do ulubionych</div>
</div>
</a>
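A minimal sketch of one way to read the data-id, assuming the anchor sits inside each "offer-item-details" container (the wrapper div in the sample is an assumption, and the SVG content is left out for brevity). In BeautifulSoup, attributes of a tag can be read like dictionary keys, so data-id does not need to be part of the class. Note also that passing a multi-class string to class_ matches only the exact class attribute value, and the real tag carries two extra classes ("svg-heart add-to-favourites"), so the find call in the code above most likely returns None. The function name extract_data_ids is hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened copy of the anchor from the post; the outer div is assumed.
SAMPLE_HTML = """
<div class="offer-item-details">
  <a class="button-observed observe-link favourites-button observed-text svg-heart add-to-favourites"
     data-statkey="ad.observed.list" rel="nofollow" data-id="45264798" href="#" title="Obserwuj">
    <div class="observed-label">Dodaj do ulubionych</div>
  </a>
</div>
"""


def extract_data_ids(html):
    """Return the data-id value of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    # The CSS selector a[data-id] matches anchors carrying that attribute,
    # regardless of how many classes they have.
    return [a["data-id"] for a in soup.select("a[data-id]")]


print(extract_data_ids(SAMPLE_HTML))  # ['45264798']
```

Inside the existing loop, the same idea could be written as house.find('a', class_="button-observed") followed by .get('data-id') on the result, since a single class name matches tags that have other classes as well.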
Regards,