Web Crawler help
Quote: See my code below. Does anybody know how I can print headers only in the first row of the CSV file? Many thanks!
CSV is a little messy and hard to work with for this kind of data.
I would put it in a dictionary structure, then use json to serialize it to and from disk.
E.g.:
import re
import requests
from bs4 import BeautifulSoup

def fundaSpider(max_pages):
    page = 1
    d = {}
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        ads = soup.find_all('li', {'class': 'search-result'})
        for ad in ads:
            title = ad.find('h3')
            # Drop the last three words (postal code and city) from the title
            title = ' '.join(title.get_text(separator='\n', strip=True).split()[:-3])
            address = ad.find('small').text.strip()
            price = ad.find('div', {'class': 'search-result-info search-result-info-price'}).text.strip()
            price = ''.join(re.findall(r'\d', price))  # keep only the digits
            size_results = ad.find('ul', {'class': 'search-result-kenmerken'})
            li = size_results.find_all('li')
            size = li[0].get_text(strip=True).split(' ')[0]
            room = li[1].text.strip().split(' ')[0]
            href = 'http://www.funda.nl' + ad.find_all('a')[2]['href']
            area = get_single_item_data(href)  # fetched from the detail page, not stored in d here
            d[title] = address, price, href
        page += 1
    print(d)

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    li = soup.find_all('li', {'class': 'breadcrumb-listitem'})
    return li[2].a.text

fundaSpider(1)
The structure can be chosen in many ways; here I chose the title as the key and the rest as a tuple.
E.g.:
>>> d = {'Scottstraat 3': ('3076 GX Rotterdam', '165000', 'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/')}
>>> d['Scottstraat 3']
('3076 GX Rotterdam',
 '165000',
 'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/')

>>> d['Scottstraat 3'][2]
'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/'
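If you prefer named fields to tuple indexes, a list of dicts also works; a small sketch, where the key names are just illustrative:

# Alternative structure: one dict per ad, collected in a list.
# The key names ('title', 'address', 'price', 'url') are illustrative, not fixed.
ads = [
    {'title': 'Scottstraat 3',
     'address': '3076 GX Rotterdam',
     'price': '165000',
     'url': 'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/'},
]
print(ads[0]['price'])  # -> '165000'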
The advantage now is that you can use json to serialize the data to disk:
import json

d = {'Scottstraat 3': ('3076 GX Rotterdam', '165000', 'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/')}
with open("my_file.json", "w") as f_out:
    json.dump(d, f_out)
with open("my_file.json") as f_in:
    saved_data = json.load(f_in)

# It comes out as the same working dictionary
print(saved_data)
Output:
{'Scottstraat 3': ('3076 GX Rotterdam', '165000', 'http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/')}
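That said, to answer the quoted header question directly: with the csv module you write the header once, before the row loop. A minimal sketch; the rows list here is a hypothetical stand-in for what the spider collects:

import csv

# Hypothetical rows, standing in for data collected by fundaSpider
rows = [
    {'title': 'Scottstraat 3', 'address': '3076 GX Rotterdam', 'price': '165000'},
]

with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'address', 'price'])
    writer.writeheader()  # header written exactly once
    for row in rows:
        writer.writerow(row)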
Doing it the way @wavic suggests is fine.
Annoying errors in web scraping can also be silenced with try/except,
as long as that doesn't swallow data you actually need:
try:
    room = li[1].text.strip()
except IndexError:
    pass
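If the field can be missing but you still want a complete record, a default value is often safer than pass; a small sketch:

try:
    room = li[1].text.strip()
except IndexError:
    room = ''  # placeholder keeps the record complete instead of dropping the field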