Posts: 42
Threads: 10
Joined: Feb 2017
Thanks for the great help. I have everything I want now!
I currently write my output to a csv file which I can then work with. The final piece of my puzzle is to get a header on the first row of the csv file. In the code I have now, I start by completely emptying the csv file.
I tried both only emptying row 2 downwards and printing the headers together with my output, but neither gave a satisfactory result.
See my code below. Does anybody know how I can print headers only in the first row of the csv file? Many thanks!
import requests
import re
from bs4 import BeautifulSoup

open('output.csv', 'w').close()  # empty the csv file

def fundaSpider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        ads = soup.find_all('li', {'class': 'search-result'})
        for ad in ads:
            title = ad.find('h3')
            # sep by newline, strip whitespace, then split and cut off the last 3 elements, then rejoin
            title = ' '.join(title.get_text(separator='\n', strip=True).split()[:-3])
            address = ad.find('small').text.strip()
            price = ad.find('div', {'class': 'search-result-info search-result-info-price'}).text.strip()
            price = re.findall(r'\d', price)
            price = ''.join(price)
            size_results = ad.find('ul', {'class': 'search-result-kenmerken'})
            li = size_results.find_all('li')
            size = li[0]
            size = size.get_text(strip=True)
            size = size.split(" ")[0]
            room = li[1].text.strip()
            room = room.split(" ")[0]
            href = 'http://www.funda.nl' + ad.find_all('a')[2]['href']
            area = get_single_item_data(href)
            print(title + "," + address + "," + price + "," + size + "," + room + "," + area + "," + href)
            saveFile = open('output.csv', 'a')
            saveFile.write(title + "," + address + "," + price + "," + size + "," + room + "," + area + "," + href + '\n')
            saveFile.close()
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    li = soup.find_all('li', {'class': 'breadcrumb-listitem'})
    return li[2].a.text

fundaSpider(1)
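For the header question itself: one common approach (a sketch, not a drop-in replacement for the code above) is to use the csv module, write the header row exactly once when the file is opened in 'w' mode before the scraping loop, and append data rows afterwards. The row values here are illustrative, borrowed from this thread:

```python
import csv

FIELDS = ['title', 'address', 'price', 'size', 'room', 'area', 'href']

# truncate the file and write the header exactly once, before the scraping loop
with open('output.csv', 'w', newline='') as f:
    csv.writer(f).writerow(FIELDS)

# inside the loop, append one row per listing (illustrative values)
row = ['Scottstraat 3', '3076 GX Rotterdam', '165000', '98', '4', 'Rotterdam',
       'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/']
with open('output.csv', 'a', newline='') as f:
    csv.writer(f).writerow(row)
```

As a bonus, csv.writer also takes care of quoting titles or addresses that themselves contain a comma, which plain string concatenation does not.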
Posts: 42
Threads: 10
Joined: Feb 2017
Actually, I ran into a little problem with the code above.
For one listing the number of rooms was not given. The code returned the following error:
Error: room = li[1].text.strip()
IndexError: list index out of range
What is an elegant way of dealing with cases in which the requested info (in this case the rooms) is not listed?
Posts: 2,953
Threads: 48
Joined: Sep 2016
try:
    room = li[1].text.strip()
except IndexError:
    room = 'Unknown'
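If more fields than just the rooms can be missing, a small helper keeps you from repeating the try/except for every field. This is a sketch: safe_text is a made-up name, not part of BeautifulSoup, and FakeTag just stands in for a bs4 Tag to show the behaviour:

```python
def safe_text(items, index, default='Unknown'):
    """Return items[index].text stripped, or default when the element is missing."""
    try:
        return items[index].text.strip()
    except (IndexError, AttributeError):
        return default

# FakeTag is a stand-in for a bs4 Tag, just to demonstrate
class FakeTag:
    text = ' 4 kamers '

li = [FakeTag()]           # a result list with only one element
print(safe_text(li, 0))    # -> 4 kamers
print(safe_text(li, 1))    # -> Unknown (index 1 is missing)
```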
Posts: 7,324
Threads: 123
Joined: Sep 2016
Feb-08-2017, 02:19 PM
(This post was last modified: Feb-08-2017, 02:19 PM by snippsat.)
Quote:See my code below. Does anybody know how i can print headers only in the first row of the csv file? Many thanks!
csv is a little messy and hard to work with for this kind of data.
I would put it in a dictionary structure, then you can use json to serialize it to and from disk.
Eg.
import requests
import re
from bs4 import BeautifulSoup

open('output.csv', 'w').close()

def fundaSpider(max_pages):
    page = 1
    d = {}
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        ads = soup.find_all('li', {'class': 'search-result'})
        for ad in ads:
            title = ad.find('h3')
            title = ' '.join(title.get_text(separator='\n', strip=True).split()[:-3])
            address = ad.find('small').text.strip()
            price = ad.find('div', {'class': 'search-result-info search-result-info-price'}).text.strip()
            price = re.findall(r'\d', price)
            price = ''.join(price)
            size_results = ad.find('ul', {'class': 'search-result-kenmerken'})
            li = size_results.find_all('li')
            size = li[0]
            size = size.get_text(strip=True)
            size = size.split(" ")[0]
            room = li[1].text.strip()
            room = room.split(" ")[0]
            href = 'http://www.funda.nl' + ad.find_all('a')[2]['href']
            area = get_single_item_data(href)
            d[title] = address, price, href
            print(d)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    li = soup.find_all('li', {'class': 'breadcrumb-listitem'})
    return li[2].a.text

fundaSpider(1)
So the structure can be chosen in many ways.
Here I chose the title as key and the rest as a tuple.
Eg:
>>> d = {'Scottstraat 3': ('3076 GX Rotterdam', '165000', 'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/')}
>>> d['Scottstraat 3']
('3076 GX Rotterdam',
'165000',
'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/')
>>> d['Scottstraat 3'][2]
'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/' The advantages now is that can use json and serialize to disk.
import json

d = {'Scottstraat 3': ('3076 GX Rotterdam', '165000', 'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/')}
with open("my_file.json", "w") as j_in:
    json.dump(d, j_in)

with open("my_file.json") as j_out:
    saved_data = json.load(j_out)

# It comes back as a working dictionary (note that json turns the tuple into a list)
print(saved_data)
Output: {'Scottstraat 3': ['3076 GX Rotterdam', '165000', 'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/']}
What @wavic suggests is fine.
Annoying errors in web scraping can also be passed over,
as long as that doesn't throw away data you may need:
try:
    room = li[1].text.strip()
except IndexError:
    pass
Posts: 42
Threads: 10
Joined: Feb 2017
Great help! Everything runs beautifully now.
I do have another general question. I am adding more data to the crawler, but for one item I am not able to extract the piece of info that I want. The HTML code looks as follows:
<div class="object-kenmerken-body" data-object-kenmerken-body="" style="height: 416px;">
<h3 class="object-kenmerken-list-header">Overdracht</h3>
<dl class="object-kenmerken-list">
<dt>Vraagprijs</dt>
<dd>€ 200.000 k.k.</dd>
<dt>Aangeboden sinds</dt>
<dd> 3 maanden
</dd>
<dt>Status</dt>
<dd>Beschikbaar
</dd>
<dt>Aanvaarding</dt>
<dd>In overleg
</dd>
<dt>Bijdrage VvE</dt>
<dd>€ 142 per maand
</dd>
</dl>
The text I want to extract is the value of the second <dd> (it was marked in red above).
dl = soup.find_all('dl', {'class': 'object-kenmerken-list'})
print(dl[0].dd.text)
Output: € 200.000 k.k.
How can I address the second <dd>? I tried the below but that got an error.
print(dl[0].dd[1].text)
Thanks for the help!
Posts: 5,151
Threads: 396
Joined: Sep 2016
Feb-09-2017, 12:07 PM
(This post was last modified: Feb-09-2017, 12:08 PM by metulburr.)
Quote:how can I address the second <dd>?
The same way I have been showing you....
You can do another find_all() on the dd tag.
dl = soup.find('dl', {'class': 'object-kenmerken-list'})
print(dl.find_all('dd')[1].text.strip())
However, to broaden your horizons.... you can also use the next sibling if you don't want to use find_all():
dl = soup.find('dl', {'class': 'object-kenmerken-list'})
print(dl.dd.find_next_sibling('dd').text.strip())
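Another option for this kind of <dt>/<dd> list (a sketch, using the fragment quoted earlier in the thread): zip the labels and values into a dict, then look fields up by name instead of by position. This assumes every <dt> is directly followed by exactly one <dd>, which holds for the fragment above:

```python
from bs4 import BeautifulSoup

html = '''<dl class="object-kenmerken-list">
<dt>Vraagprijs</dt><dd>€ 200.000 k.k.</dd>
<dt>Aangeboden sinds</dt><dd> 3 maanden</dd>
<dt>Status</dt><dd>Beschikbaar</dd>
</dl>'''

dl = BeautifulSoup(html, 'html.parser').find('dl', {'class': 'object-kenmerken-list'})
# pair every <dt> label with the <dd> value that follows it
info = {dt.text.strip(): dd.text.strip()
        for dt, dd in zip(dl.find_all('dt'), dl.find_all('dd'))}
print(info['Aangeboden sinds'])  # -> 3 maanden
```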
Posts: 42
Threads: 10
Joined: Feb 2017
Feb-09-2017, 12:08 PM
(This post was last modified: Feb-09-2017, 12:08 PM by takaa.)
(Feb-09-2017, 11:46 AM)metulburr Wrote:
Quote:how can I address the second <dd>?
you can do another find_all() on the dd tag. So dl.find_all('dd')[1]
**** Ignore the comment below, I see your initial reply has changed. Going to work on that. *****
I don't follow completely.
I tried different variants but I keep getting:
Error: AttributeError: 'ResultSet' object has no attribute 'find_all'
dl = soup.find_all('dd', {'class': 'object-kenmerken-list'})
dl = dl.find_all('dd')[1]
print(dl[0].dd.text)
dl = soup.find_all('dd', {'class': 'object-kenmerken-list'})
print(dl.find_all('dd')[1].text)
It would be great if you could help me understand your suggestion better.
Posts: 12,040
Threads: 487
Joined: Sep 2016
Look closely ... The soup.find should be looking for 'dl' not 'dd'
Posts: 5,151
Threads: 396
Joined: Sep 2016
Feb-09-2017, 12:14 PM
(This post was last modified: Feb-09-2017, 12:15 PM by metulburr.)
check my edited post ^
Yeah, your code has a typo here:
Quote:dl = soup.find_all('dd', {'class': 'object-kenmerken-list'})
dl != dd
Posts: 42
Threads: 10
Joined: Feb 2017
Feb-10-2017, 12:24 PM
(This post was last modified: Feb-10-2017, 12:24 PM by takaa.)
Another question, this time about NoneType.
I made another function in the crawler to get another piece of info.
def get_single_item_data_3(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    uls = soup.find_all('ul', {'class': 'labels'})
    for ul in uls:
        return ul.find('li').text.strip()
If I only print out the result of this function, I get:
Output: Nieuw
Nieuw
None
None
None
None
None
None
None
If I put this output in the string with my other results, I get an error on the first None return:
Error: TypeError: Can't convert 'NoneType' object to str implicitly
I tried several things to only let the function return a result if the result is not None, but without success. For example, if I call the function like this:
if get_single_item_data_3(href) is not None:
    status = get_single_item_data_3(href)
print(status)
the result on every row is "Nieuw" (the output of the first item).
If I put the if statement inside the function like this:
def get_single_item_data_3(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    uls = soup.find_all('ul', {'class': 'labels'})
    for ul in uls:
        if get_single_item_data_3(item_url) is not None:
            return ul.find('li').text.strip()
nothing happens (I interrupted it).
Error:
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/__init__.py", line 228, in __init__
self._feed()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/__init__.py", line 289, in _feed
self.builder.feed(self.markup)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/builder/_htmlparser.py", line 167, in feed
parser.feed(markup)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/html/parser.py", line 111, in feed
self.goahead(0)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/html/parser.py", line 173, in goahead
k = self.parse_endtag(i)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/html/parser.py", line 422, in parse_endtag
self.clear_cdata_mode()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/html/parser.py", line 128, in clear_cdata_mode
self.interesting = interesting_normal
KeyboardInterrupt
How can I print the output of my function (get_single_item_data_3) in a string together with my other outputs, while ignoring the items that return NoneType?
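For what it's worth, a sketch of one way out. Note first that the recursive call in the last version re-downloads the same page on every loop iteration, which is why it seemed to hang. Instead, make the function itself return a default string so the caller never sees None (get_status here is a renamed, hypothetical stand-in for the lookup inside get_single_item_data_3, taking an already-parsed soup instead of a URL):

```python
from bs4 import BeautifulSoup

def get_status(soup, default=''):
    """Text of the first <li> inside <ul class="labels">, or default when absent."""
    ul = soup.find('ul', {'class': 'labels'})
    if ul is not None and ul.find('li') is not None:
        return ul.find('li').text.strip()
    return default  # never None, so it is always safe to concatenate

with_label = BeautifulSoup('<ul class="labels"><li>Nieuw</li></ul>', 'html.parser')
no_label = BeautifulSoup('<div>geen labels</div>', 'html.parser')
print(get_status(with_label))      # -> Nieuw
print(repr(get_status(no_label)))  # -> '' (empty string, not None)
```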