Python Forum

Full Version: Web Crawler help
Hi folks,

I'm new to Python and to this forum. By way of background: I started coding recently to make my own life easier by automating as much of my life as possible. As a result I don't have much experience, but I am doing my best to catch up.

I made a web crawler to extract info about houses for sale. Each page on the housing site contains 15 houses. 

(Note that I have added some spaces in the URL, otherwise I could not make a forum post.)

import requests
from bs4 import BeautifulSoup
 
 
def fundaSpider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for x in soup.find_all('h3', {"class": "search-result-title"}):
            location = x.get_text(strip=True)
        for y in soup.find_all('div', {'class':'search-result-info search-result-info-price'}):
            price = y.get_text(strip=True)
        for z in soup.find_all('ul', {'class': 'search-result-kenmerken'}):
            size = z.get_text(strip=True)
 
            print(location +";" + price +";"+ size)
        page += 1
 
 
fundaSpider(2)
The output of this code returns the same "location" and "price" for all 15 listings on a page (the values of a single listing, repeated), while the "size" is returned correctly for each of the 15 listings per page.

Output:
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;67 m²/138 m²3 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;135 m²/171 m²5 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;102 m²/102 m²5 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;143 m²/127 m²5 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;90 m²/131 m²4 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;125 m²/270 m²5 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;76 m²/317 m²3 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;245 m²/190 m²6 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;225 m²/709 m²6 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;123 m²/103 m²6 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;125 m²/153 m²5 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;92 m²/101 m²4 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;80 m²/86 m²4 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;160 m²5 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;66 m²3 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;180 m²5 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;76 m²2 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;64 m²3 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;99 m²4 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;46 m²2 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;120 m²4 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;103 m²5 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;83 m²2 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;83 m²3 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;82 m²3 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;68 m²3 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;85 m²3 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;110 m²/160 m²5 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;129 m²/224 m²6 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;125 m²/142 m²5 kamers
Below I have included the HTML code of the webpage for one of the listings.



Output:
</div>
            </a>
    </div>
        <div class="search-result-content">
            <div class="search-result-content-inner">
<div class="search-result-header">
        <a href="/koop/rotterdam/huis-85488249-scottstraat-3/" data-search-result-item-anchor="85488249">
            <h3 class="search-result-title">
                Scottstraat 3
                <small class="search-result-subtitle">
                    3076 GX Rotterdam
                </small>
            </h3>
        </a>
</div>    <div class="search-result-info search-result-info-price">
            <span class="search-result-price">€ 165.000 k.k.</span>
                    </div>
<div class="search-result-info">
    <ul class="search-result-kenmerken ">
            <li>
                    <span title="Woonoppervlakte">67 m²</span>
                                     /
                                    <span title="Perceeloppervlakte">138 m²</span>
            </li>
                                    <li>3 kamers</li>
       </ul>
</div>
My questions:

1. My output is not correct: I should get the location, price, and size for each listing on each page. I suspect my three "for" loops are not correct; I have tried several things, but I keep getting different variants that are not correct.

2. Currently the information for "Location" and "Size" comes out combined:

Location: Slaak 1123061 CZ Rotterdam;
Price: € 175.000 k.k.;
Size: 135 m²/171 m²5 kamers

I would prefer to extract this separately:

Slaak 112
3061 CZ Rotterdam
€ 175.000 k.k.
135 m²/171 m²
5 kamers

Any tips are welcome. All help is much appreciated!
In [73]: for i, item in enumerate(soup.find_all('div', class_='search-result-content-inner'), 1):
    ...:     print(item.find('h3').text.strip().split('\n')[0].strip())
    ...:     print(item.h3.small.text.strip(), '\n')
Scottstraat 3
3076 GX Rotterdam 

Jan Meertensstraat 3
3065 PB Rotterdam 

Yersekestraat 36
3086 SG Rotterdam 

Korendijk 129
3079 PW Rotterdam 

Amer 6
3068 GA Rotterdam 

Port-Saidstraat 150
3067 MV Rotterdam 

John Mottweg 100
3069 VT Rotterdam 

Brandingdijk 276
3059 RB Rotterdam 

Abraham van der Knaapkade 9
3059 SP Rotterdam 

Koenraad van Zwabenstraat 40
3077 WJ Rotterdam 

Jeneverbes 6
3069 LP Rotterdam 

Lamastraat 56
3064 LL Rotterdam 

Zeeuwsestraat 2 + PP
3074 TT Rotterdam 

Oudedijk 187 A
3061 AD Rotterdam 

Slaak 112
3061 CZ Rotterdam 
Quote:
        for x in soup.find_all('h3', {"class": "search-result-title"}):
            location = x.get_text(strip=True)
        for y in soup.find_all('div', {'class':'search-result-info search-result-info-price'}):
            price = y.get_text(strip=True)
        for z in soup.find_all('ul', {'class': 'search-result-kenmerken'}):
            size = z.get_text(strip=True)
  
            print(location +";" + price +";"+ size)
Because you have three separate loops and only print inside the last one, "location" and "price" keep only the value from the final iteration of their own loops, which is why they are all the same. You should grab the main tag of each ad and then iterate over those tags instead.
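To see why, here is a minimal, self-contained illustration (made-up values, not from funda) of the last-value-wins behaviour:

```python
# Three separate loops: each variable keeps only the value
# assigned on the *last* iteration of its own loop.
locations = ['Slaak 112', 'Geelkruid 95', 'Scottstraat 3']
prices = ['175.000', '269.000', '165.000']

for x in locations:
    location = x          # overwritten on every pass
for y in prices:
    price = y             # overwritten on every pass

# By the time we print, only the final values survive:
print(location, price)    # Scottstraat 3 165.000
```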

Also, you should be using the str.format method regardless of your Python version.
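For instance, format() interpolates the page counter without an explicit str() call, which concatenation with + would require:

```python
page = 1
# str.format converts the int to its string form automatically
url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
print(url)  # http://www.funda.nl/koop/rotterdam/p1
```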

I made a quick script extracting the data you requested. Some sections took a bit of back-and-forth tinkering to get the output as expected.

import requests
from bs4 import BeautifulSoup

SEP = '*'*80
  
def fundaSpider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        ads = soup.find_all('li',{'class':'search-result'})
        for ad in ads:
            title = ad.find('h3')
            title = ' '.join(title.get_text(separator='\n', strip=True).split()[:-3])  # separate on newlines, strip whitespace, split into words, drop the last 3 (the subtitle), rejoin
            address = ad.find('small').text.strip()
            price = ad.find('div', {'class':'search-result-info search-result-info-price'}).text.strip()
            size_results = ad.find('ul', {'class': 'search-result-kenmerken'})
            li = size_results.find_all('li')
            size = li[0]
            size = size.get_text(strip=True)
            room = li[1].text.strip()

            print('title: {}\naddress: {}\nprice: {}\nsize: {}\nrooms: {}'.format(title, address, price, size, room))
            print(SEP)


        page += 1
  
  
fundaSpider(2)
Thank you very much for the very helpful replies! Really motivating to get this help, keeps the learning curve going!
Why am I using enumerate here...?
It's a leftover from counting the returned results.
Thanks to your help I have been able to extract the data and export everything to a CSV file. Of course this now tastes like more, and I would like to access the URLs of the individual listings to extract more detailed info for each listing.

In the main source code this means I need to scrape this URL for each of the listings (the href in the anchor tag below):

</div>
            </a>
    </div>
        <div class="search-result-content">
            <div class="search-result-content-inner">
<div class="search-result-header">
        <a href="/koop/rotterdam/huis-85488249-scottstraat-3/" data-search-result-item-anchor="85488249">
            <h3 class="search-result-title">
                Scottstraat 3
                <small class="search-result-subtitle">
                    3076 GX Rotterdam
                </small>
            </h3>
        </a>
</div>    <div class="search-result-info search-result-info-price">
            <span class="search-result-price">€ 165.000 k.k.</span>
                    </div>
<div class="search-result-info">
    <ul class="search-result-kenmerken ">
            <li>
                    <span title="Woonoppervlakte">67 m²</span>
                                     / 
                                    <span title="Perceeloppervlakte">138 m²</span>
            </li>
                                    <li>3 kamers</li>
       </ul>
</div>

I am using the code as posted by metulburr and added this line below "room"
href = ad.find('a', {'class': 'search-result-header'}).link.get('href', {})
but then I get the following error message:

Error:
    href = ad.find('a', {'class': 'search-result-header'}).link.get('href', {}) AttributeError: 'NoneType' object has no attribute 'get'
I have tried several things, for example

href = 'www.funda.nl' + ad.find('a')
but without success so far.

As always, help is much appreciated!
Find the div by its class first, then find the a tag and get its href.
from bs4 import BeautifulSoup

html = '''\
<div class="search-result-header">
 <a href="/koop/rotterdam/huis-85488249-scottstraat-3/" data-search-result-item-anchor="85488249">
   <h3 class="search-result-title">Scottstraat 3
     <small class="search-result-subtitle">3076 GX Rotterdam</small>
   </h3>
 </a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
result = soup.find('div', class_="search-result-header")
link = result.find('a').get('href')
print('www.funda.nl{}'.format(link))
Output:
www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/
You can use
ad.find_all('a')[2]['href']
to get the path, because it's actually the 3rd a tag, not the first, and then join it with
urllib.parse.urljoin
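For example, using the relative href from the listing snippet earlier in the thread:

```python
from urllib.parse import urljoin

base = 'http://www.funda.nl/koop/rotterdam/p1'
href = '/koop/rotterdam/huis-85488249-scottstraat-3/'
# urljoin resolves the root-relative path against the page URL
full = urljoin(base, href)
print(full)  # http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/
```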
Thanks a million!

I have added a new def that searches through the pages of the individual listings:

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for item in soup.findAll('li', {'class': 'breadcrumb-listitem'} ):
        area = item.find('a')
        print(area)
I am trying to get one specific piece of information (the neighbourhood). I succeed in getting the info I am looking for in the output, but with a lot of stuff that I don't want. (I only want the "title", in the first example "Lombardijen".)

The output for each listing is as follows (I only copied the result for the first 3 listings).

Output:
<a href="/koop/" title="Home">Home</a> <a href="/koop/rotterdam/" title="Rotterdam">Rotterdam</a> <a href="/koop/rotterdam/lombardijen/" title="Lombardijen">Lombardijen</a> None
<a href="/koop/" title="Home">Home</a> <a href="/koop/rotterdam/" title="Rotterdam">Rotterdam</a> <a href="/koop/rotterdam/s-gravenland/" title="'s-Gravenland">'s-Gravenland</a> None
<a href="/koop/" title="Home">Home</a> <a href="/koop/rotterdam/" title="Rotterdam">Rotterdam</a> <a href="/koop/rotterdam/pendrecht/" title="Pendrecht">Pendrecht</a> None
The HTML code for each listing looks as follows; what I tried to extract is the neighbourhood link ("Lombardijen" below).

<div class="breadcrumb">
        <ol class="container breadcrumb-list">
                <li class="breadcrumb-listitem">
                        <a href="/koop/" title="Home">Home</a>

                        <span class="icon-arrow-right-grey"></span>
                </li>
                <li class="breadcrumb-listitem">
                        <a href="/koop/rotterdam/" title="Rotterdam">Rotterdam</a>

                        <span class="icon-arrow-right-grey"></span>
                </li>
                <li class="breadcrumb-listitem">
                        <a href="/koop/rotterdam/lombardijen/" title="Lombardijen">Lombardijen</a>

                        <span class="icon-arrow-right-grey"></span>
                </li>
                <li class="breadcrumb-listitem">
                        <span title="Scottstraat 3">Scottstraat 3</span>

                </li>
        </ol>

Hope this part of my puzzle can also be solved. Tnx again for the help
import requests
from bs4 import BeautifulSoup


html = '''
<div class="breadcrumb">
        <ol class="container breadcrumb-list">
                <li class="breadcrumb-listitem">
                        <a href="/koop/" title="Home">Home</a>

                        <span class="icon-arrow-right-grey"></span>
                </li>
                <li class="breadcrumb-listitem">
                        <a href="/koop/rotterdam/" title="Rotterdam">Rotterdam</a>

                        <span class="icon-arrow-right-grey"></span>
                </li>
                <li class="breadcrumb-listitem">
                        <a href="/koop/rotterdam/lombardijen/" title="Lombardijen">Lombardijen</a>

                        <span class="icon-arrow-right-grey"></span>
                </li>
                <li class="breadcrumb-listitem">
                        <span title="Scottstraat 3">Scottstraat 3</span>

                </li>
        </ol>
'''

soup = BeautifulSoup(html, 'html.parser')
li = soup.find_all('li', {'class': 'breadcrumb-listitem'} )
print(li[2].a.text)
EDIT:
Oh whoops, that's the title attribute, not the text:
print(li[2].a['title'])
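A CSS-selector variant of the same idea (just a sketch against the breadcrumb snippet above; the last breadcrumb link is the neighbourhood):

```python
from bs4 import BeautifulSoup

html = '''<ol class="container breadcrumb-list">
  <li class="breadcrumb-listitem"><a href="/koop/" title="Home">Home</a></li>
  <li class="breadcrumb-listitem"><a href="/koop/rotterdam/" title="Rotterdam">Rotterdam</a></li>
  <li class="breadcrumb-listitem"><a href="/koop/rotterdam/lombardijen/" title="Lombardijen">Lombardijen</a></li>
  <li class="breadcrumb-listitem"><span title="Scottstraat 3">Scottstraat 3</span></li>
</ol>'''

soup = BeautifulSoup(html, 'html.parser')
# select() only matches <li> items that actually contain an <a>,
# so the street-name <span> at the end is skipped automatically
links = soup.select('li.breadcrumb-listitem a')
print(links[-1]['title'])  # Lombardijen
```

This avoids hard-coding the index 2, which would break if the breadcrumb depth ever changes.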