Python Forum
Web Crawler help
#11
Thanks for the great help. I have everything I want now!
I currently write my output to a csv file, which I can then work with. The final piece of my puzzle is to get a header on the first row of the csv file. In the code I have now, I start by completely emptying the csv file.

I tried both emptying only row 2 downwards and printing the headers together with my output, but neither gave a satisfactory result.

See my code below. Does anybody know how I can print headers only in the first row of the csv file? Many thanks!

import requests
import re
from bs4 import BeautifulSoup

open('output.csv', 'w').close()  # start each run with an empty output file
 
def fundaSpider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        ads = soup.find_all('li', {'class': 'search-result'})
        for ad in ads:
            title = ad.find('h3')
            title = ' '.join(title.get_text(separator='\n', strip=True).split()[:-3])
            # strip whitespace, split into tokens, drop the last 3, rejoin
            address = ad.find('small').text.strip()
            price = ad.find('div', {'class': 'search-result-info search-result-info-price'}).text.strip()
            price = re.findall(r'\d', price)
            price = ''.join(price)
            size_results = ad.find('ul', {'class': 'search-result-kenmerken'})
            li = size_results.find_all('li')
            size = li[0]
            size = size.get_text(strip=True)
            size = size.split(" ")[0]
            room = li[1].text.strip()
            room = room.split(" ")[0]
            href = ('http://www.funda.nl' + ad.find_all('a')[2]['href'])
            area = get_single_item_data(href)
            print(title + "," + address + "," + price + "," + size + "," + room + "," + area + "," + href)
            saveFile = open('output.csv', 'a')
            saveFile.write(title + "," + address + "," + price + "," + size + "," + room + "," + area + "," + href + '\n')
            saveFile.close()
 
        page += 1
 
def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    li = soup.find_all('li', {'class': 'breadcrumb-listitem'})
    return (li[2].a.text)
 
 
fundaSpider(1)
#12
Actually, I ran into a little problem with the code above.

For one listing the number of rooms was not given, and the code returned the following error:

Error:
    room = li[1].text.strip()
IndexError: list index out of range
What is an elegant way of dealing with cases in which the requested info (in this case the number of rooms) is not listed?
#13
try:
    room = li[1].text.strip()
except IndexError:
    room = 'Unknown'
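If several fields can be missing, the same try/except can be wrapped in a small helper (a sketch; `safe_text` is a made-up name, not from the thread):

```python
def safe_text(items, index, default='Unknown'):
    """Return the stripped text of items[index], or default if the
    element is missing or has no text."""
    try:
        return items[index].text.strip()
    except (IndexError, AttributeError):
        return default
```

Then `room = safe_text(li, 1)` and `size = safe_text(li, 0)` keep the loop body flat instead of nesting a try/except per field.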
#14
Quote:See my code below. Does anybody know how I can print headers only in the first row of the csv file? Many thanks!
csv is a little messy and hard to work with for this kind of data.
I would put it in a dictionary structure, then use json to serialize to and from disk.
Eg.
import requests
import re
from bs4 import BeautifulSoup

def fundaSpider(max_pages):
    page = 1
    d = {}
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        ads = soup.find_all('li', {'class': 'search-result'})
        for ad in ads:
            title = ad.find('h3')
            title = ' '.join(title.get_text(separator='\n', strip=True).split()[:-3])
            address = ad.find('small').text.strip()
            price = ad.find('div', {'class': 'search-result-info search-result-info-price'}).text.strip()
            price = re.findall(r'\d', price)
            price = ''.join(price)
            size_results = ad.find('ul', {'class': 'search-result-kenmerken'})
            li = size_results.find_all('li')
            size = li[0]
            size = size.get_text(strip=True)
            size = size.split(" ")[0]
            room = li[1].text.strip()
            room = room.split(" ")[0]
            href = ('http://www.funda.nl' + ad.find_all('a')[2]['href'])
            area = get_single_item_data(href)
            d[title] = address,price,href
            print(d)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    li = soup.find_all('li', {'class': 'breadcrumb-listitem'})
    return (li[2].a.text)
fundaSpider(1)
The structure can be chosen in many ways; here I chose the title as the key and the rest as a tuple.
Eg:
>>> d = {'Scottstraat 3': ('3076 GX Rotterdam', '165000', 'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/')}
>>> d['Scottstraat 3']
('3076 GX Rotterdam',
 '165000',
 'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/')

>>> d['Scottstraat 3'][2]
'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/'
The advantage now is that you can use json to serialize to disk.
import json

d = {'Scottstraat 3': ('3076 GX Rotterdam', '165000', 'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/')}
with open("my_file.json", "w") as f:
    json.dump(d, f)
with open("my_file.json") as f:
    saved_data = json.load(f)
Output:
>>> print(saved_data)  # it comes back as the same working dictionary
{'Scottstraat 3': ('3076 GX Rotterdam', '165000', 'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/')}
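That said, if you do want to stay with csv, the header problem from the original question can be handled by writing the header once when the file is created, then appending rows afterwards. A minimal sketch using the stdlib csv module (the column names and helper names are assumed, not from the thread):

```python
import csv

HEADER = ['title', 'address', 'price', 'size', 'room', 'area', 'href']

def start_csv(path='output.csv'):
    # Create the file and write the header exactly once, before the crawl loop
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerow(HEADER)

def append_row(row, path='output.csv'):
    # Called once per ad inside the loop; csv.writer also quotes fields
    # containing commas, which manual string concatenation does not
    with open(path, 'a', newline='') as f:
        csv.writer(f).writerow(row)
```

So instead of `open('output.csv', 'w').close()` at the top, call `start_csv()` once, and replace the `saveFile.write(...)` line with `append_row([title, address, price, size, room, area, href])`.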
What @wavic suggests is fine.
Annoying errors in web scraping can also be passed over,
as long as that doesn't silently drop data you actually need:
try:
   room = li[1].text.strip()
except IndexError:
   pass
#15
Great help! Everything runs beautifully now.

I do have another general question. I am adding more data to the crawler, but for one item I am not able to extract the piece of info that I want. The HTML code looks as follows:

<div class="object-kenmerken-body" data-object-kenmerken-body="" style="height: 416px;">
    <h3 class="object-kenmerken-list-header">Overdracht</h3>
    <dl class="object-kenmerken-list">
        <dt>Vraagprijs</dt>
        <dd>€ 200.000 k.k.</dd>
        <dt>Aangeboden sinds</dt>
        <dd>3 maanden</dd>
        <dt>Status</dt>
        <dd>Beschikbaar</dd>
        <dt>Aanvaarding</dt>
        <dd>In overleg</dd>
        <dt>Bijdrage VvE</dt>
        <dd>€ 142 per maand</dd>
    </dl>

The value I want to extract is the second <dd> ("3 maanden").

    dl = soup.find_all('dl', {'class': 'object-kenmerken-list'})

    print(dl[0].dd.text)
Output:
€ 200.000 k.k.
How can I address the second <dd>? I tried the line below, but that raised an error.

print(dl[0].dd[1].text)
Thanks for the help!
#16
Quote:how can I address the second <dd>?
The same way I have been showing you:
you can do another find_all() on the dd tag.
dl = soup.find('dl', {'class': 'object-kenmerken-list'})
print(dl.find_all('dd')[1].text.strip())
However, to broaden your horizons, you can also use the next sibling if you don't want to use find_all():
dl = soup.find('dl', {'class': 'object-kenmerken-list'})
print(dl.dd.find_next_sibling('dd').text.strip())
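Since the <dt>/<dd> pairs are really a key/value list, a third option (a sketch, not from the thread) is to zip the labels and values into a dict and look values up by name instead of by position:

```python
from bs4 import BeautifulSoup

html = '''<dl class="object-kenmerken-list">
<dt>Vraagprijs</dt><dd>€ 200.000 k.k.</dd>
<dt>Aangeboden sinds</dt><dd>3 maanden</dd>
<dt>Status</dt><dd>Beschikbaar</dd>
</dl>'''

soup = BeautifulSoup(html, 'html.parser')
dl = soup.find('dl', {'class': 'object-kenmerken-list'})
# Pair each <dt> label with the <dd> value that follows it
kenmerken = {dt.get_text(strip=True): dd.get_text(strip=True)
             for dt, dd in zip(dl.find_all('dt'), dl.find_all('dd'))}
print(kenmerken['Aangeboden sinds'])  # 3 maanden
```

This is more robust than indexing, because the lookup still works if the site reorders the list or adds a new pair.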
#17
(Feb-09-2017, 11:46 AM)metulburr Wrote:
Quote:how can I address the second <dd>?
you can do another find_all() on the dd tag, so dl.find_all('dd')[1]

**** Ignore the comment below; I see your initial reply has changed. Going to work on that. ****

I don't completely follow.

I tried different variants, but I keep getting:

Error:
AttributeError: 'ResultSet' object has no attribute 'find_all'
    dl = soup.find_all('dd', {'class': 'object-kenmerken-list'})
    dl = dl.find_all('dd')[1]
 
    print(dl[0].dd.text)
    dl = soup.find_all('dd', {'class': 'object-kenmerken-list'})
 
    print(dl.find_all('dd')[1].text)
It would be great if you can help me understand your suggestion better.
#18
Look closely ... the soup.find_all should be looking for 'dl', not 'dd'.
#19
Check my edited post ^

Yeah, your code has a typo here:
Quote:
dl = soup.find_all('dd', {'class': 'object-kenmerken-list'})
dl != dd
#20
Another question, this time about NoneType.

I made another function in the crawler to get another piece of info:

def get_single_item_data_3(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    uls = soup.find_all('ul', {'class': 'labels'})
    for ul in uls:
        return ul.find('li').text.strip()
If I only print out this function, the result is:
Output:
Nieuw Nieuw None None None None None None None
If I put this output in the string with my other results, I get an error on the first None return:
Error:
TypeError: Can't convert 'NoneType' object to str implicitly
I tried several things to let the function return a result only when it is not None, but without success. For example, if I call the function like this:

            if get_single_item_data_3(href) is not None:
                status = get_single_item_data_3(href)
            print(status)
the result on each row is "Nieuw" (the output of the first item).

If I put the if statement inside the function like this:

def get_single_item_data_3(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    uls = soup.find_all('ul', {'class': 'labels'})
    for ul in uls:
        if get_single_item_data_3(item_url) is not None:
            return(ul.find('li').text.strip())
Nothing happens (I interrupted it).
Error:
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/__init__.py", line 228, in __init__
    self._feed()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/__init__.py", line 289, in _feed
    self.builder.feed(self.markup)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/builder/_htmlparser.py", line 167, in feed
    parser.feed(markup)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/html/parser.py", line 111, in feed
    self.goahead(0)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/html/parser.py", line 173, in goahead
    k = self.parse_endtag(i)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/html/parser.py", line 422, in parse_endtag
    self.clear_cdata_mode()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/html/parser.py", line 128, in clear_cdata_mode
    self.interesting = interesting_normal
KeyboardInterrupt
How can I print the output of my function (get_single_item_data_3) in a string together with my other outputs, while ignoring the items that have a NoneType?
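One way around this (a sketch, not an answer from the thread; `extract_label` is a hypothetical helper name): split parsing from fetching and have the parser return a default string instead of None, so the caller never has to check:

```python
from bs4 import BeautifulSoup

def extract_label(html, default=''):
    """Return the first <li> text inside <ul class="labels">,
    or default (a str, never None) when no label is present."""
    soup = BeautifulSoup(html, 'html.parser')
    ul = soup.find('ul', {'class': 'labels'})
    li = ul.find('li') if ul is not None else None
    return li.text.strip() if li is not None else default
```

Then `status = extract_label(requests.get(href).text)` is always a string, so the later concatenation can never raise the NoneType TypeError.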

