Posts: 5,150
Threads: 396
Joined: Sep 2016
Feb-10-2017, 01:06 PM
(This post was last modified: Feb-10-2017, 01:06 PM by metulburr.)
href = ('http://www.funda.nl' + ad.find_all('a')[2]['href'])
area = get_single_item_data_3(href)
if not area:
area = 'None'
print(title + "," + address + "," + price + "," + size + "," + room + "," + area + "," + href)
A function always returns something, whether it is your value or None. Just check the value and, if it is None, reassign it to a placeholder string.
You should use the format method too. It is much more readable and easier to maintain, and it is now the standard; concatenation looks horrible.
print('{},{},{},{},{},{},{}'.format(title, address, price, size, room, area, href))
If you want the output to not even show the area value when it is None, then do something like:
area = get_single_item_data_3(href)
if area:
print('{},{},{},{},{},{},{}'.format(title, address, price, size, room, area, href))
else:
print('{},{},{},{},{},{}'.format(title, address, price, size, room, href))
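A more scalable variant (a sketch, not from the post above; the field values here are invented stand-ins for the scraped data) builds the field list first and joins only the values that are present, so you don't need a separate print call for every combination of missing fields:

```python
# Hypothetical field values standing in for the scraped data.
title, address, price, size, room, href = 't', 'a', 'p', 's', 'r', 'h'
area = None  # the lookup function returned nothing

# Collect the fields in order and drop any that are None,
# so the CSV line simply omits missing values.
fields = [title, address, price, size, room, area, href]
line = ','.join(f for f in fields if f is not None)
print(line)  # -> t,a,p,s,r,h
```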
Recommended Tutorials:
Posts: 42
Threads: 10
Joined: Feb 2017
Feb-14-2017, 01:49 PM
(This post was last modified: Feb-14-2017, 02:11 PM by metulburr.)
Expanding my horizon from the properties currently for sale to the properties already sold I have run into a challenge.
In the for-sale link http://www.funda.nl/koop/rotterdam/p1 all information on the properties can be found in
ads = soup.find_all('li', {'class': 'search-result'})
In the link with the sold houses, http://www.funda.nl/nl/koop/verkocht/rotterdam/p1, each property is listed under a different class name (even/uneven plus the name of the real estate broker; often "nvm", but it can also be another one).
What is an elegant solution to search through these different classes for each page?
Posts: 5,150
Threads: 396
Joined: Sep 2016
Feb-14-2017, 02:43 PM
(This post was last modified: Feb-14-2017, 02:45 PM by metulburr.)
I'm not really sure what your current code is. Often when you are obtaining sub-URLs it's best to clean the code up so you don't get confused.
Quote:In the link with the sold houses http://www.funda.nl/nl/koop/verkocht/rotterdam/p1, each property is listed under a different class name (even/uneven and the name of the real estate broker; often "nvm", but it can also be another one).
There are a couple ways. The root search for that is going to be
ul = soup.find('ul', {'class':'object-list'})
Now you can list the ul's li tags via ul.find_all('li') and just go through the list of li tags. Or, if you need to, you can step through them one by one with find_next_sibling() to get the next li tag, such as:
from bs4 import BeautifulSoup
import requests
url = 'http://www.funda.nl/koop/verkocht/rotterdam/p1/'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')
ul = soup.find('ul', {'class':'object-list'})
print(ul.li)  # first li; even nvm sold class
li2 = ul.li.find_next_sibling()
print(li2)  # second li; odd nvm sold class
ul.li is pretty much ul.find('li'), so if you did li2.find_next_sibling().find_next_sibling() it would actually be the ad class. Or, if you used find_all(), it would be the fourth li tag (index 3).
EDIT:
if you just wanted to get a list of li tags with *sold* then you can use regex
import re
li_sold = ul.find_all('li', class_=re.compile('sold'))
This will grab everything except the one with the ad class. 'sold' would have to be the keyword, as everything else changes (if you're trying to get them all). If you're trying to get only the even classes, swap 'sold' for 'even' in the regex.
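Here is a self-contained sketch of that regex class filter; the HTML and class names below are invented stand-ins for the funda markup. BeautifulSoup checks a compiled pattern against each CSS class of a tag, so any li with a class containing 'sold' matches:

```python
import re
from bs4 import BeautifulSoup

# Minimal stand-in for the listing markup (class names assumed).
html = '''
<ul class="object-list">
  <li class="sold-even nvm">house 1</li>
  <li class="sold-odd nvm">house 2</li>
  <li class="ad">advertisement</li>
  <li class="sold-even other">house 3</li>
</ul>'''

soup = BeautifulSoup(html, 'html.parser')
ul = soup.find('ul', {'class': 'object-list'})

# class_ accepts a compiled pattern: keep every li whose
# class list contains a match for 'sold', skipping the ad.
li_sold = ul.find_all('li', class_=re.compile('sold'))
print([li.text for li in li_sold])  # -> ['house 1', 'house 2', 'house 3']
```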
Posts: 42
Threads: 10
Joined: Feb 2017
Quote:EDIT:
if you just wanted to get a list of li tags with *sold* then you can use regex
li_sold = ul.find_all('li',class_=re.compile('sold'))
I think this is exactly what I need; all the information I need is included here. Thanks! Another thing learned.
Posts: 42
Threads: 10
Joined: Feb 2017
In this HTML text I want to extract the red-highlighted text. The green-highlighted text I extracted successfully, but I have difficulty isolating the second <span title>. I tried several things (for example, see the code below) but I don't get the desired outcome.
<ul class="properties-list">
<li>
3067 JH
Rotterdam
<span class="item-sold-label-small" title="Verkocht">Verkocht</span>
</li>
<li>
<span title="Woonoppervlakte"> 90 m²</span>
·
<span title="Aantal kamers"> 4 kamers</span>
</li>
from bs4 import BeautifulSoup
import requests
import re
def fundaSpider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/verkocht/rotterdam/p{}'.format(page)
        req = requests.get(url)
        soup = BeautifulSoup(req.text, 'html.parser')
        ul = soup.find('ul', {'class': 'object-list'})
        li_sold = ul.find_all('li', class_=re.compile('sold'))
        for ad in li_sold:
            address = ad.find('a', {'class': 'object-street'}).text.strip()
            size_results = ad.find('ul', {'class': 'properties-list'})
            li = size_results.find_all('li')
            size = li[1]
            size = size.get_text(strip=True)
            size = size.split(" ")[0]
            room = li[1].text.strip()
            room = room.split(" ")[2]
        page += 1
Output:
·
4
·
2
·
2
·
3
·
4
·
4
·
4
/
155
/
130
/
156
·
4
Any tips are much appreciated!
Posts: 5,150
Threads: 396
Joined: Sep 2016
Whatever tag gave you the 90, you can call find_next_sibling('span') on that and it will move to the next span tag.
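A minimal sketch of that sibling hop, using simplified markup modeled on the thread's properties list. The key point is that find_next_sibling() is called on the Tag object, not on text extracted from it:

```python
from bs4 import BeautifulSoup

# Simplified version of the properties list from the thread.
html = '''
<li>
  <span title="Woonoppervlakte">90 m²</span>
  <span title="Aantal kamers">4 kamers</span>
</li>'''

soup = BeautifulSoup(html, 'html.parser')

# Keep the Tag around; call find_next_sibling on it, not on .text.
size_tag = soup.find('span', title='Woonoppervlakte')
room_tag = size_tag.find_next_sibling('span')
print(size_tag.text, '|', room_tag.text)  # -> 90 m² | 4 kamers
```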
Posts: 42
Threads: 10
Joined: Feb 2017
Feb-15-2017, 01:46 PM
(This post was last modified: Feb-15-2017, 01:57 PM by takaa.)
The object size is giving me the 90. If I add find_next_sibling('span') to that, I am getting an error:
size_results = ad.find('ul', {'class': 'properties-list'})
li = size_results.find_all('li')
size = li[1]
size = size.get_text(strip=True)
size = size.split(" ")[0]
room = size.find_next_sibling('span')
Error:
room = size.find_next_sibling('span')
AttributeError: 'str' object has no attribute 'find_next_sibling'
Since for some listings it is the 2nd span and for others the 3rd, it would be best if I could directly address the span with the title "Aantal kamers", but in this I have been unsuccessful so far.
I tried
size_results = ad.find('ul', {'class': 'properties-list'})
li = size_results.find_all('li')
size = li[1]
size = size.get_text(strip=True)
size = size.split(" ")[0]
room = li[1]
room = room.find_all('span', 'Aantal kamers')
which only returns
Output: []
[]
[]
[]
[]
[]
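The empty lists come from how find_all interprets its arguments: a plain string in the second position filters on the CSS class, not the title attribute, and no span has the class 'Aantal kamers'. To match the title attribute, pass it as a keyword. A sketch using simplified markup from the thread:

```python
from bs4 import BeautifulSoup

html = '''
<li>
  <span title="Woonoppervlakte">90 m²</span>
  <span title="Aantal kamers">4 kamers</span>
</li>'''

soup = BeautifulSoup(html, 'html.parser')

# title= filters on the title attribute; a bare string as the
# second positional argument of find()/find_all() filters on class.
room = soup.find('span', title='Aantal kamers')
print(room.text.strip())  # -> 4 kamers
```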
Posts: 7,086
Threads: 122
Joined: Sep 2016
Feb-15-2017, 03:48 PM
(This post was last modified: Feb-15-2017, 03:50 PM by snippsat.)
That makes your error: size is a string.
>>> size = ''
>>> size.find_next_sibling('span')
Error:
Traceback (most recent call last):
File "<string>", line 301, in runcode
File "<interactive input>", line 1, in <module>
AttributeError: 'str' object has no attribute 'find_next_sibling'
Try to post better-formatted HTML; both CodePen and JSFiddle have a Tidy HTML function.
One way to do it:
from bs4 import BeautifulSoup
html = '''\
<ul class="properties-list">
<li>
3067 JH Rotterdam
<span class="item-sold-label-small" title="Verkocht">Verkocht</span>
</li>
<li>
<span title="Woonoppervlakte">90 m²</span>
<span title="Aantal kamers">4 kamers</span>
</li>'''
soup = BeautifulSoup(html, 'lxml')
p_lst = soup.find(class_="properties-list")
span = p_lst.select('li > span')
print([item.text for item in span[1:]])
Output: ['90 m²', '4 kamers']
Posts: 42
Threads: 10
Joined: Feb 2017
Feb-15-2017, 04:01 PM
(This post was last modified: Feb-15-2017, 04:01 PM by takaa.)
And one more question.
I also want to get a clean street name, without the numbers and other additions.
"title" returns the street name and house number etc. the code I tried is:
street = title.rpartition(' ')[0]
When the address is "street + number" this gives the desired output, but when the address is "street + number + addition" it gives me the street name AND the number.
Basically, what I need is the part of the string before the first space that is followed by a number, but this I have not yet been able to produce.
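That "text before the first space followed by a number" rule can be expressed with a small regex. A sketch (the sample addresses below are invented; a non-greedy match stops at the first space-then-digit boundary):

```python
import re

def street_name(address):
    """Return the text before the first space that is followed by a digit."""
    m = re.match(r'(.+?)\s+\d', address)
    return m.group(1) if m else address

# Hypothetical addresses in "street number [addition]" form.
print(street_name('Coolsingel 42'))        # -> Coolsingel
print(street_name('Coolsingel 42 b'))      # -> Coolsingel
print(street_name('Van der Takstraat 3'))  # -> Van der Takstraat
```

Because the group is non-greedy, multi-word street names survive intact while everything from the house number onward is dropped.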
(Feb-15-2017, 03:48 PM)snippsat Wrote: One way to do it:
Thanks for the new insight.
Is my idea of simply taking the text from each span with title='Aantal kamers' impractical to code? It seemed so logical to just add the title as a filter criterion, but I didn't manage to do it.
Posts: 5,150
Threads: 396
Joined: Sep 2016
There are multiple ways to do the same thing.