Python Forum
Web Crawler help
#21
            href = ('http://www.funda.nl' + ad.find_all('a')[2]['href'])
            area = get_single_item_data_3(href)
            if not area:
                area = 'None'
            print(title + "," + address + "," + price + "," + size + "," + room + "," + area + "," + href)
A function always returns something, whether it is your value or None. Just check the value and reassign it to a string if it is None.

You should use the format method too. It is much more readable and easier to maintain, and it is now the standard. Concatenation looks horrible. 
print('{},{},{},{},{},{},{}'.format(title, address, price, size, room, area, href))
If you want to change the output to not even show the area value when it is None, then do something like

            area = get_single_item_data_3(href)
            if area:
                print('{},{},{},{},{},{},{}'.format(title, address, price, size, room, area, href))
            else:
                print('{},{},{},{},{},{}'.format(title, address, price, size, room, href))
#22
Expanding my horizon from the properties currently for sale to the properties already sold, I have run into a challenge. 

In the for-sale link "http://www.funda.nl/koop/rotterdam/p1" all information on the properties can be found in
ads = soup.find_all('li', {'class': 'search-result'})
In the link with the houses sold "http://www.funda.nl/nl/koop/verkocht/rotterdam/p1" each property is listed under a different class name (even/uneven, plus the name of the real estate broker, often "nvm" but it can also be another one).

   


What is an elegant solution to search through these different classes for each page?


#23
I'm not really sure what your current code is. Often when you are obtaining sub-URLs it's best to clean the code up so you don't get confused.
Quote:In the link with the houses sold "http://www.funda.nl/nl/koop/verkocht/rotterdam/p1" each property is listed under a different class name (even/uneven, plus the name of the real estate broker, often "nvm" but it can also be another one).
There are a couple ways. The root search for that is going to be 
ul = soup.find('ul', {'class':'object-list'}) 
Now you can list the ul's li tags via ul.find_all('li') and just go through the list of li tags. Or, if you need to, you can go through them one by one via find_next_sibling() to get the next li tag, such as 

from bs4 import BeautifulSoup
import requests

url = 'http://www.funda.nl/koop/verkocht/rotterdam/p1/'


req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')
ul = soup.find('ul', {'class':'object-list'}) 
print(ul.li) #first li ;even nvm sold class
li2 = ul.li.find_next_sibling()
print(li2)  #second li; odd nvm sold class
ul.li is pretty much ul.find('li')

So if you did li2.find_next_sibling().find_next_sibling() it would actually be the ad class. Or, if you used find_all, that would be the fourth li tag in the list (index 3).

EDIT:
If you just wanted to get a list of the li tags with *sold* in their class, you can use a regex:
li_sold = ul.find_all('li',class_=re.compile('sold'))
This will grab everything except the ad class one.

'sold' would have to be the keyword, as everything else changes (if you're trying to get them all). If you're trying to get only the even classes, then swap sold for even in the regex.
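
Putting the two together, a minimal sketch (untested, and assuming the object-list markup is as above):

from bs4 import BeautifulSoup
import re
import requests

url = 'http://www.funda.nl/koop/verkocht/rotterdam/p1/'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')

ul = soup.find('ul', {'class': 'object-list'})
# grab every li whose class contains 'sold', whatever the broker or even/odd part is
for li in ul.find_all('li', class_=re.compile('sold')):
    print(li.get('class'))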
#24
Quote:EDIT:

If you just wanted to get a list of the li tags with *sold* in their class, you can use a regex:

li_sold = ul.find_all('li',class_=re.compile('sold'))

I think this is exactly what I need; all the information I need is included in here. Thanks! Another thing learned.
#25
In this HTML text I want to extract the number of rooms ("4 kamers"). The living area ("90 m²") I extracted successfully, but I have difficulties isolating the second <span title>. I tried several things (for example, see the code below) but I don't get the desired outcome. 

<ul class="properties-list">
            
            <li>
                3067 JH
                Rotterdam
                
<span class="item-sold-label-small" title="Verkocht">Verkocht</span>

            </li>
            <li>
                

<span title="Woonoppervlakte">90 m²</span> 
·


<span title="Aantal&nbsp;kamers">4 kamers</span> 
    
   </li>


from bs4 import BeautifulSoup
import requests
import re
 
def fundaSpider(max_pages):
    page = 1
    while page <= max_pages:
 
        url = 'http://www.funda.nl/koop/verkocht/rotterdam/p{}'.format(page)
 
        req = requests.get(url)
        soup = BeautifulSoup(req.text, 'html.parser')
        ul = soup.find('ul', {'class': 'object-list'})
 
 
        li_sold = ul.find_all('li',class_=re.compile('sold'))
        for ad in li_sold:
 
 
            address = (ad.find('a', {'class': 'object-street'}).text.strip())
            size_results = ad.find('ul', {'class': 'properties-list'})
            li = size_results.find_all('li')
            size = li[1]
            size = size.get_text(strip=True)
            size = size.split(" ")[0]
            room = li[1].text.strip()
            room = room.split(" ")[2]
Output:
·     4   ·     2   ·     2   ·     3   ·     4   ·     4   ·     4   /     155   /     130   /     156   ·     4
Any tips are much appreciated!
#26
Whatever tag gave you the 90, you can call find_next_sibling('span') on that and it will move to the next span tag.
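
For example, a rough sketch (untested), assuming li[1] is the <li> holding the size span as in your snippet above:

# work on the <span> tag itself, not on the string you pulled out of it
size_span = li[1].find('span')                    # <span title="Woonoppervlakte">90 m²</span>
size = size_span.get_text(strip=True).split(' ')[0]
room_span = size_span.find_next_sibling('span')   # <span title="Aantal&nbsp;kamers">4 kamers</span>
if room_span:
    room = room_span.get_text(strip=True).split(' ')[0]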
#27
The object size is giving me the 90.


If I call find_next_sibling('span') on that, I am getting an error:

            size_results = ad.find('ul', {'class': 'properties-list'})
            li = size_results.find_all('li')
            size = li[1]
            size = size.get_text(strip=True)
            size = size.split(" ")[0]
            room = size.find_next_sibling('span')
Error:
    room = size.find_next_sibling('span')
AttributeError: 'str' object has no attribute 'find_next_sibling'
Since for some listings it is the 2nd span and for some it is the 3rd span, it would be best if I could directly address the span with the title "Aantal kamers", but in this I have been unsuccessful so far.

I tried

            size_results = ad.find('ul', {'class': 'properties-list'})
            li = size_results.find_all('li')
            size = li[1]
            size = size.get_text(strip=True)
            size = size.split(" ")[0]
            room = li[1]
            room = room.find_all('span', 'Aantal kamers')
which only returns
Output:
[] [] [] [] [] []
#28
The cause of your error: size is a string.
>>> size = ''
>>> size.find_next_sibling('span')
Error:
Traceback (most recent call last):
  File "<string>", line 301, in runcode
  File "<interactive input>", line 1, in <module>
AttributeError: 'str' object has no attribute 'find_next_sibling'
Try to post better-formatted HTML; both CodePen and JSFiddle have a Tidy HTML function.
One way to do it:
from bs4 import BeautifulSoup

html = '''\
<ul class="properties-list">
  <li>
    3067 JH Rotterdam
    <span class="item-sold-label-small" title="Verkocht">Verkocht</span>
  </li>
  <li>
    <span title="Woonoppervlakte">90 m²</span>
    <span title="Aantal&nbsp;kamers">4 kamers</span>
  </li>'''

soup = BeautifulSoup(html, 'lxml')
p_lst = soup.find(class_="properties-list")
span = p_lst.select('li > span')
print([item.text for item in span[1:]]) 
Output:
['90 m²', '4 kamers']
#29
And one more question.

I also want to get a clean street name, without the numbers and other additions.

"title" returns the street name and house number etc. the code I tried is:

street = title.rpartition(' ')[0]
When the address is "street + number" this gives the desired output, but when the address is "street + number + addition" it gives me the street name AND the number. 

Basically what I need is to get the string before the first space that is followed by a number, but this I have not yet been able to produce.


(Feb-15-2017, 03:48 PM)snippsat Wrote: One way to do it:

Thanks for the new insight. 
Is my idea of just taking the text from each span with the title='Aantal kamers' not practical to code? It seemed so logical to just add the title criterion somehow, but I didn't manage to do it.
#30
There are multiple ways to do the same thing.
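
For example, two rough sketches (untested), reusing the li[1] and title names from your snippets. Searching by the title attribute is practical, but note the title in the HTML is "Aantal&nbsp;kamers", so the attribute value contains a non-breaking space rather than a normal one; matching it with a regex sidesteps that. For the street name, split on the first whitespace that is followed by a digit:

import re

# span lookup by the title attribute; the regex avoids the &nbsp; mismatch
room_span = li[1].find('span', title=re.compile('kamers'))
if room_span:
    room = room_span.get_text(strip=True).split(' ')[0]

# street name: everything before the first whitespace that is followed by a digit
street = re.split(r'\s+(?=\d)', title, maxsplit=1)[0]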

