Python Forum

Full Version: Web Crawler help
Hi folks,

I'm new to Python and to this forum. By way of background: I started coding recently to make my own life easier by automating as much of my life as possible. As a result I don't have much experience, but I am doing my best to catch up.

I made a web crawler to extract info about houses for sale. Each page on the housing site contains 15 houses. 

(Note that I have added some spaces in the URL, otherwise I could not make a forum post.)

import requests
from bs4 import BeautifulSoup
 
 
def fundaSpider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for x in soup.find_all('h3', {"class": "search-result-title"}):
            location = x.get_text(strip=True)
        for y in soup.find_all('div', {'class':'search-result-info search-result-info-price'}):
            price = y.get_text(strip=True)
        for z in soup.find_all('ul', {'class': 'search-result-kenmerken'}):
            size = z.get_text(strip=True)
 
            print(location +";" + price +";"+ size)
        page += 1
 
 
fundaSpider(2)
The output of this code returns the same "location" and "price" for all 15 listings on a page (the values of a single listing, repeated), while the "size" is returned correctly for each of the 15 listings per page.

Output:
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;67 m²/138 m²3 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;135 m²/171 m²5 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;102 m²/102 m²5 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;143 m²/127 m²5 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;90 m²/131 m²4 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;125 m²/270 m²5 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;76 m²/317 m²3 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;245 m²/190 m²6 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;225 m²/709 m²6 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;123 m²/103 m²6 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;125 m²/153 m²5 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;92 m²/101 m²4 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;80 m²/86 m²4 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;160 m²5 kamers
Slaak 1123061 CZ Rotterdam;€ 175.000 k.k.;66 m²3 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;180 m²5 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;76 m²2 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;64 m²3 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;99 m²4 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;46 m²2 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;120 m²4 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;103 m²5 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;83 m²2 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;83 m²3 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;82 m²3 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;68 m²3 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;85 m²3 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;110 m²/160 m²5 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;129 m²/224 m²6 kamers
Geelkruid 953068 DT Rotterdam;€ 269.000 k.k.;125 m²/142 m²5 kamers
Below I have included the HTML code of the webpage for one of the listings.



Output:
</div>
            </a>
    </div>
        <div class="search-result-content">
            <div class="search-result-content-inner">
<div class="search-result-header">
        <a href="/koop/rotterdam/huis-85488249-scottstraat-3/" data-search-result-item-anchor="85488249">
            <h3 class="search-result-title">
                Scottstraat 3
                <small class="search-result-subtitle">
                    3076 GX Rotterdam
                </small>
            </h3>
        </a>
</div>    <div class="search-result-info search-result-info-price">
            <span class="search-result-price">€ 165.000 k.k.</span>
                    </div>
<div class="search-result-info">
    <ul class="search-result-kenmerken ">
            <li>
                    <span title="Woonoppervlakte">67 m²</span>
                                     /
                                    <span title="Perceeloppervlakte">138 m²</span>
            </li>
                                    <li>3 kamers</li>
       </ul>
</div>
My questions:

1. My output is not correct: I should get the location, price, and size for each listing on each page. I suspect my three "for" loops are not correct; I have tried several things, but I keep getting different variants that are not correct.

2. Currently the information for "Location" and "Size" comes out combined:

Location: Slaak 1123061 CZ Rotterdam;
Price: € 175.000 k.k.;
Size: 135 m²/171 m²5 kamers

I would prefer to extract this separately:

Slaak 112
3061 CZ Rotterdam
€ 175.000 k.k.
135 m²/171 m²
5 kamers

Any tips are welcome. All help is much appreciated!
In [73]: for i, item in enumerate(soup.find_all('div', class_='search-result-content-inner'), 1):
    ...:     print(item.find('h3').text.strip().split('\n')[0].strip())
    ...:     print(item.h3.small.text.strip(), '\n')
Scottstraat 3
3076 GX Rotterdam 

Jan Meertensstraat 3
3065 PB Rotterdam 

Yersekestraat 36
3086 SG Rotterdam 

Korendijk 129
3079 PW Rotterdam 

Amer 6
3068 GA Rotterdam 

Port-Saidstraat 150
3067 MV Rotterdam 

John Mottweg 100
3069 VT Rotterdam 

Brandingdijk 276
3059 RB Rotterdam 

Abraham van der Knaapkade 9
3059 SP Rotterdam 

Koenraad van Zwabenstraat 40
3077 WJ Rotterdam 

Jeneverbes 6
3069 LP Rotterdam 

Lamastraat 56
3064 LL Rotterdam 

Zeeuwsestraat 2 + PP
3074 TT Rotterdam 

Oudedijk 187 A
3061 AD Rotterdam 

Slaak 112
3061 CZ Rotterdam 
Quote:
        for x in soup.find_all('h3', {"class": "search-result-title"}):
            location = x.get_text(strip=True)
        for y in soup.find_all('div', {'class':'search-result-info search-result-info-price'}):
            price = y.get_text(strip=True)
        for z in soup.find_all('ul', {'class': 'search-result-kenmerken'}):
            size = z.get_text(strip=True)
  
            print(location +";" + price +";"+ size)
Because you have three separate loops and only print inside the last one, "location" and "price" keep only the value from the final iteration of their own loops, which is why they are all the same. You should grab the main tag of each ad and then iterate over those tags instead.
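To see why, here is a minimal, self-contained illustration (made-up values, not from funda) of the last-value-wins behaviour:

```python
# Three separate loops: each variable keeps only the value
# assigned on the *last* iteration of its own loop.
locations = ['Slaak 112', 'Geelkruid 95', 'Scottstraat 3']
prices = ['175.000', '269.000', '165.000']

for x in locations:
    location = x          # overwritten on every pass
for y in prices:
    price = y             # overwritten on every pass

# By the time we print, only the final values survive:
print(location, price)    # Scottstraat 3 165.000
```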

Also, you should be using the str.format method regardless of your Python version.
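For instance, format() interpolates the page counter without an explicit str() call, which concatenation with + would require:

```python
page = 1
# str.format converts the int to its string form automatically
url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
print(url)  # http://www.funda.nl/koop/rotterdam/p1
```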

I made a quick script extracting the data you requested. Some sections took a bit of back-and-forth tinkering to get the output as expected.

import requests
from bs4 import BeautifulSoup

SEP = '*'*80
  
def fundaSpider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        ads = soup.find_all('li',{'class':'search-result'})
        for ad in ads:
            title = ad.find('h3')
            title = ' '.join(title.get_text(separator='\n', strip=True).split()[:-3])  # separate on newlines, strip whitespace, split into words, drop the last 3 (the subtitle), rejoin
            address = ad.find('small').text.strip()
            price = ad.find('div', {'class':'search-result-info search-result-info-price'}).text.strip()
            size_results = ad.find('ul', {'class': 'search-result-kenmerken'})
            li = size_results.find_all('li')
            size = li[0]
            size = size.get_text(strip=True)
            room = li[1].text.strip()

            print('title: {}\naddress: {}\nprice: {}\nsize: {}\nrooms: {}'.format(title, address, price, size, room))
            print(SEP)


        page += 1
  
  
fundaSpider(2)
Thank you very much for the very helpful replies! Really motivating to get this help, keeps the learning curve going!
Why am I using enumerate here...?
It's a leftover from counting the returned results.
Thanks to your help I have been able to extract the data and export everything to a CSV file. Of course this now tastes like more, and I would like to access the URLs of the individual listings to extract more detailed info for each listing.

In the main source code this means I need to scrape this URL for each of the listings (the href in the anchor tag below):

</div>
            </a>
    </div>
        <div class="search-result-content">
            <div class="search-result-content-inner">
<div class="search-result-header">
        <a href="/koop/rotterdam/huis-85488249-scottstraat-3/" data-search-result-item-anchor="85488249">
            <h3 class="search-result-title">
                Scottstraat 3
                <small class="search-result-subtitle">
                    3076 GX Rotterdam
                </small>
            </h3>
        </a>
</div>    <div class="search-result-info search-result-info-price">
            <span class="search-result-price">€ 165.000 k.k.</span>
                    </div>
<div class="search-result-info">
    <ul class="search-result-kenmerken ">
            <li>
                    <span title="Woonoppervlakte">67 m²</span>
                                     / 
                                    <span title="Perceeloppervlakte">138 m²</span>
            </li>
                                    <li>3 kamers</li>
       </ul>
</div>

I am using the code as posted by metulburr and added this line below "room"
href = ad.find('a', {'class': 'search-result-header'}).link.get('href', {})
but then I get the following error message:

Error:
    href = ad.find('a', {'class': 'search-result-header'}).link.get('href', {}) AttributeError: 'NoneType' object has no attribute 'get'
I have tried several things, for example

href = 'www.funda.nl' + ad.find('a')
but without success so far.

As always, help is much appreciated!
Find the div by its class first, then find the a tag and get its href.
from bs4 import BeautifulSoup

html = '''\
<div class="search-result-header">
 <a href="/koop/rotterdam/huis-85488249-scottstraat-3/" data-search-result-item-anchor="85488249">
   <h3 class="search-result-title">Scottstraat 3
     <small class="search-result-subtitle">3076 GX Rotterdam</small>
   </h3>
 </a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
result = soup.find('div', class_="search-result-header")
link = result.find('a').get('href')
print('www.funda.nl{}'.format(link))
Output:
www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/
You can use
ad.find_all('a')[2]['href']
to get the path, because it's actually the 3rd a tag, not the first, and then join it with
urllib.parse.urljoin
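For example, using the relative href from the listing snippet earlier in the thread:

```python
from urllib.parse import urljoin

base = 'http://www.funda.nl/koop/rotterdam/p1'
href = '/koop/rotterdam/huis-85488249-scottstraat-3/'
# urljoin resolves the root-relative path against the page URL
full = urljoin(base, href)
print(full)  # http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/
```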
Thanks a million!

I have added a new def that searches through the pages of the individual listings:

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for item in soup.findAll('li', {'class': 'breadcrumb-listitem'} ):
        area = item.find('a')
        print(area)
I am trying to get one specific piece of information (the neighbourhood). I succeed in getting the info I am looking for in the output, but with a lot of stuff that I don't want. (I only want the "title", in the first example "Lombardijen".)

The output for each listing is as follows (I only copied the result for the first 3 listings).

Output:
<a href="/koop/" title="Home">Home</a> <a href="/koop/rotterdam/" title="Rotterdam">Rotterdam</a> <a href="/koop/rotterdam/lombardijen/" title="Lombardijen">Lombardijen</a> None
<a href="/koop/" title="Home">Home</a> <a href="/koop/rotterdam/" title="Rotterdam">Rotterdam</a> <a href="/koop/rotterdam/s-gravenland/" title="'s-Gravenland">'s-Gravenland</a> None
<a href="/koop/" title="Home">Home</a> <a href="/koop/rotterdam/" title="Rotterdam">Rotterdam</a> <a href="/koop/rotterdam/pendrecht/" title="Pendrecht">Pendrecht</a> None
The HTML code for each listing looks as follows; what I tried to extract is the neighbourhood link ("Lombardijen" below).

<div class="breadcrumb">
        <ol class="container breadcrumb-list">
                <li class="breadcrumb-listitem">
                        <a href="/koop/" title="Home">Home</a>

                        <span class="icon-arrow-right-grey"></span>
                </li>
                <li class="breadcrumb-listitem">
                        <a href="/koop/rotterdam/" title="Rotterdam">Rotterdam</a>

                        <span class="icon-arrow-right-grey"></span>
                </li>
                <li class="breadcrumb-listitem">
                        <a href="/koop/rotterdam/lombardijen/" title="Lombardijen">Lombardijen</a>

                        <span class="icon-arrow-right-grey"></span>
                </li>
                <li class="breadcrumb-listitem">
                        <span title="Scottstraat 3">Scottstraat 3</span>

                </li>
        </ol>

Hope this part of my puzzle can also be solved. Tnx again for the help
import requests
from bs4 import BeautifulSoup


html = '''
<div class="breadcrumb">
        <ol class="container breadcrumb-list">
                <li class="breadcrumb-listitem">
                        <a href="/koop/" title="Home">Home</a>

                        <span class="icon-arrow-right-grey"></span>
                </li>
                <li class="breadcrumb-listitem">
                        <a href="/koop/rotterdam/" title="Rotterdam">Rotterdam</a>

                        <span class="icon-arrow-right-grey"></span>
                </li>
                <li class="breadcrumb-listitem">
                        <a href="/koop/rotterdam/lombardijen/" title="Lombardijen">Lombardijen</a>

                        <span class="icon-arrow-right-grey"></span>
                </li>
                <li class="breadcrumb-listitem">
                        <span title="Scottstraat 3">Scottstraat 3</span>

                </li>
        </ol>
'''

soup = BeautifulSoup(html, 'html.parser')
li = soup.find_all('li', {'class': 'breadcrumb-listitem'} )
print(li[2].a.text)
EDIT:
Oh whoops, that's the title attribute, not the text:
print(li[2].a['title'])
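A CSS-selector variant of the same idea (just a sketch against the breadcrumb snippet above; the last breadcrumb link is the neighbourhood):

```python
from bs4 import BeautifulSoup

html = '''<ol class="container breadcrumb-list">
  <li class="breadcrumb-listitem"><a href="/koop/" title="Home">Home</a></li>
  <li class="breadcrumb-listitem"><a href="/koop/rotterdam/" title="Rotterdam">Rotterdam</a></li>
  <li class="breadcrumb-listitem"><a href="/koop/rotterdam/lombardijen/" title="Lombardijen">Lombardijen</a></li>
  <li class="breadcrumb-listitem"><span title="Scottstraat 3">Scottstraat 3</span></li>
</ol>'''

soup = BeautifulSoup(html, 'html.parser')
# select() only matches <li> items that actually contain an <a>,
# so the street-name <span> at the end is skipped automatically
links = soup.select('li.breadcrumb-listitem a')
print(links[-1]['title'])  # Lombardijen
```

This avoids hard-coding the index 2, which would break if the breadcrumb depth ever changes.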