Python Forum
Web Crawler help
#11
Thanks for the great help. I have everything I want now!
I currently write my output to a csv file, which I can then work with. The final piece of my puzzle is to get a header on the first row of the csv file. In the code I have now, I start by completely emptying the csv file.

I tried both emptying only row 2 downwards and printing the headers together with my output, but neither gave a satisfactory result.

See my code below. Does anybody know how I can print headers only in the first row of the csv file? Many thanks!

import requests
import re
from bs4 import BeautifulSoup

open('output.csv', 'w').close()  # start each run with an empty output file
 
def fundaSpider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        ads = soup.find_all('li', {'class': 'search-result'})
        for ad in ads:
            title = ad.find('h3')
            title = ' '.join(title.get_text(separator='\n', strip=True).split()[:-3])
            # strip whitespace, split into tokens, drop the last 3, rejoin
            address = ad.find('small').text.strip()
            price = ad.find('div', {'class': 'search-result-info search-result-info-price'}).text.strip()
            price = re.findall(r'\d', price)
            price = ''.join(price)
            size_results = ad.find('ul', {'class': 'search-result-kenmerken'})
            li = size_results.find_all('li')
            size = li[0]
            size = size.get_text(strip=True)
            size = size.split(" ")[0]
            room = li[1].text.strip()
            room = room.split(" ")[0]
            href = ('http://www.funda.nl' + ad.find_all('a')[2]['href'])
            area = get_single_item_data(href)
            print(title + "," + address + "," + price + "," + size + "," + room + "," + area + "," + href)
            saveFile = open('output.csv', 'a')
            saveFile.write(title + "," + address + "," + price + "," + size + "," + room + "," + area + "," + href + '\n')
            saveFile.close()
 
        page += 1
 
def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    li = soup.find_all('li', {'class': 'breadcrumb-listitem'})
    return (li[2].a.text)
 
 
fundaSpider(1)
#12
Actually, I ran into a little problem with the code above.

For one listing the number of rooms was not given, and the code returned the following error:

Error:
    room = li[1].text.strip()
IndexError: list index out of range
What is an elegant way of dealing with cases in which the requested info (in this case the number of rooms) is not listed?
#13
try:
    room = li[1].text.strip()
except IndexError:
    room = 'Unknown'
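If several fields can be missing, the same try/except can be wrapped in a small helper (a sketch; `safe_text` is a made-up name, not from the thread):

```python
def safe_text(items, index, default='Unknown'):
    """Return the stripped text of items[index], or default if the
    element is missing or has no text."""
    try:
        return items[index].text.strip()
    except (IndexError, AttributeError):
        return default
```

Then `room = safe_text(li, 1)` and `size = safe_text(li, 0)` keep the loop body flat instead of nesting a try/except per field.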
#14
Quote:See my code below. Does anybody know how I can print headers only in the first row of the csv file? Many thanks!
csv is a little messy and hard to work with for this kind of data.
I would put it in a dictionary structure, then use json to serialize to and from disk.
Eg.
import requests
import re
from bs4 import BeautifulSoup

def fundaSpider(max_pages):
    page = 1
    d = {}
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        ads = soup.find_all('li', {'class': 'search-result'})
        for ad in ads:
            title = ad.find('h3')
            title = ' '.join(title.get_text(separator='\n', strip=True).split()[:-3])
            address = ad.find('small').text.strip()
            price = ad.find('div', {'class': 'search-result-info search-result-info-price'}).text.strip()
            price = re.findall(r'\d', price)
            price = ''.join(price)
            size_results = ad.find('ul', {'class': 'search-result-kenmerken'})
            li = size_results.find_all('li')
            size = li[0]
            size = size.get_text(strip=True)
            size = size.split(" ")[0]
            room = li[1].text.strip()
            room = room.split(" ")[0]
            href = ('http://www.funda.nl' + ad.find_all('a')[2]['href'])
            area = get_single_item_data(href)
            d[title] = address,price,href
            print(d)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    li = soup.find_all('li', {'class': 'breadcrumb-listitem'})
    return (li[2].a.text)
fundaSpider(1)
The structure can be chosen in many ways; here I chose the title as the key and the rest as a tuple.
Eg:
>>> d = {'Scottstraat 3': ('3076 GX Rotterdam', '165000', 'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/')}
>>> d['Scottstraat 3']
('3076 GX Rotterdam',
 '165000',
 'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/')

>>> d['Scottstraat 3'][2]
'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/'
The advantage now is that you can use json to serialize to disk.
import json

d = {'Scottstraat 3': ('3076 GX Rotterdam', '165000', 'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/')}
with open("my_file.json", "w") as f:
    json.dump(d, f)
with open("my_file.json") as f:
    saved_data = json.load(f)
Output:
>>> print(saved_data)  # it comes back as the same working dictionary
{'Scottstraat 3': ('3076 GX Rotterdam', '165000', 'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/')}
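That said, if you do want to stay with csv, the header problem from the original question can be handled by writing the header once when the file is created, then appending rows afterwards. A minimal sketch using the stdlib csv module (the column names and helper names are assumed, not from the thread):

```python
import csv

HEADER = ['title', 'address', 'price', 'size', 'room', 'area', 'href']

def start_csv(path='output.csv'):
    # Create the file and write the header exactly once, before the crawl loop
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerow(HEADER)

def append_row(row, path='output.csv'):
    # Called once per ad inside the loop; csv.writer also quotes fields
    # containing commas, which manual string concatenation does not
    with open(path, 'a', newline='') as f:
        csv.writer(f).writerow(row)
```

So instead of `open('output.csv', 'w').close()` at the top, call `start_csv()` once, and replace the `saveFile.write(...)` line with `append_row([title, address, price, size, room, area, href])`.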
What @wavic suggests is fine.
Annoying errors in web scraping can also be passed over,
as long as that doesn't silently drop data you actually need:
try:
   room = li[1].text.strip()
except IndexError:
   pass
#15
Great help! Everything runs beautifully now.

I do have another general question. I am adding more data to the crawler, but for one item I am not able to extract the piece of info that I want. The HTML code looks as follows:

<div class="object-kenmerken-body" data-object-kenmerken-body="" style="height: 416px;">
    <h3 class="object-kenmerken-list-header">Overdracht</h3>
    <dl class="object-kenmerken-list">
        <dt>Vraagprijs</dt>
        <dd>€ 200.000 k.k.</dd>
        <dt>Aangeboden sinds</dt>
        <dd>3 maanden</dd>
        <dt>Status</dt>
        <dd>Beschikbaar</dd>
        <dt>Aanvaarding</dt>
        <dd>In overleg</dd>
        <dt>Bijdrage VvE</dt>
        <dd>€ 142 per maand</dd>
    </dl>

The value I want to extract is the second <dd> ("3 maanden").

    dl = soup.find_all('dl', {'class': 'object-kenmerken-list'})

    print(dl[0].dd.text)
Output:
€ 200.000 k.k.
How can I address the second <dd>? I tried the line below, but that raised an error.

print(dl[0].dd[1].text)
Thanks for the help!
#16
Quote:how can I address the second <dd>?
The same way I have been showing you:
you can do another find_all() on the dd tag.
dl = soup.find('dl', {'class': 'object-kenmerken-list'})
print(dl.find_all('dd')[1].text.strip())
However, to broaden your horizons, you can also use the next sibling if you don't want to use find_all():
dl = soup.find('dl', {'class': 'object-kenmerken-list'})
print(dl.dd.find_next_sibling('dd').text.strip())
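Since the <dt>/<dd> pairs are really a key/value list, a third option (a sketch, not from the thread) is to zip the labels and values into a dict and look values up by name instead of by position:

```python
from bs4 import BeautifulSoup

html = '''<dl class="object-kenmerken-list">
<dt>Vraagprijs</dt><dd>€ 200.000 k.k.</dd>
<dt>Aangeboden sinds</dt><dd>3 maanden</dd>
<dt>Status</dt><dd>Beschikbaar</dd>
</dl>'''

soup = BeautifulSoup(html, 'html.parser')
dl = soup.find('dl', {'class': 'object-kenmerken-list'})
# Pair each <dt> label with the <dd> value that follows it
kenmerken = {dt.get_text(strip=True): dd.get_text(strip=True)
             for dt, dd in zip(dl.find_all('dt'), dl.find_all('dd'))}
print(kenmerken['Aangeboden sinds'])  # 3 maanden
```

This is more robust than indexing, because the lookup still works if the site reorders the list or adds a new pair.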
#17
(Feb-09-2017, 11:46 AM)metulburr Wrote:
Quote:how can I address the second <dd>?
you can do another find_all() on the dd tag, so dl.find_all('dd')[1]

**** Ignore the comment below; I see your initial reply has changed. Going to work on that. ****

I don't completely follow.

I tried different variants, but I keep getting:

Error:
AttributeError: 'ResultSet' object has no attribute 'find_all'
    dl = soup.find_all('dd', {'class': 'object-kenmerken-list'})
    dl = dl.find_all('dd')[1]
 
    print(dl[0].dd.text)
    dl = soup.find_all('dd', {'class': 'object-kenmerken-list'})
 
    print(dl.find_all('dd')[1].text)
It would be great if you can help me understand your suggestion better.
#18
Look closely ... the soup.find_all should be looking for 'dl', not 'dd'.
#19
Check my edited post ^

Yeah, your code has a typo here:
Quote:
dl = soup.find_all('dd', {'class': 'object-kenmerken-list'})
dl != dd
#20
Another question, this time about NoneType.

I made another function in the crawler to get another piece of info:

def get_single_item_data_3(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    uls = soup.find_all('ul', {'class': 'labels'})
    for ul in uls:
        return ul.find('li').text.strip()
If I only print out this function, the result is:
Output:
Nieuw Nieuw None None None None None None None
If I put this output in the string with my other results, I get an error on the first None return:
Error:
TypeError: Can't convert 'NoneType' object to str implicitly
I tried several things to let the function return a result only when it is not None, but without success. For example, if I call the function like this:

            if get_single_item_data_3(href) is not None:
                status = get_single_item_data_3(href)
            print(status)
the result on each row is "Nieuw" (the output of the first item).

If I put the if statement inside the function like this:

def get_single_item_data_3(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    uls = soup.find_all('ul', {'class': 'labels'})
    for ul in uls:
        if get_single_item_data_3(item_url) is not None:
            return(ul.find('li').text.strip())
Nothing happens (I interrupted it).
Error:
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/__init__.py", line 228, in __init__
    self._feed()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/__init__.py", line 289, in _feed
    self.builder.feed(self.markup)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/builder/_htmlparser.py", line 167, in feed
    parser.feed(markup)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/html/parser.py", line 111, in feed
    self.goahead(0)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/html/parser.py", line 173, in goahead
    k = self.parse_endtag(i)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/html/parser.py", line 422, in parse_endtag
    self.clear_cdata_mode()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/html/parser.py", line 128, in clear_cdata_mode
    self.interesting = interesting_normal
KeyboardInterrupt
How can I print the output of my function (get_single_item_data_3) in a string together with my other outputs, while ignoring the items that have a NoneType?
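One way around this (a sketch, not an answer from the thread; `extract_label` is a hypothetical helper name): split parsing from fetching and have the parser return a default string instead of None, so the caller never has to check:

```python
from bs4 import BeautifulSoup

def extract_label(html, default=''):
    """Return the first <li> text inside <ul class="labels">,
    or default (a str, never None) when no label is present."""
    soup = BeautifulSoup(html, 'html.parser')
    ul = soup.find('ul', {'class': 'labels'})
    li = ul.find('li') if ul is not None else None
    return li.text.strip() if li is not None else default
```

Then `status = extract_label(requests.get(href).text)` is always a string, so the later concatenation can never raise the NoneType TypeError.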

