Feb-08-2017, 12:22 PM
Thanks for the great help. I have everything I want now!
I currently write my output to a csv file which I can then work with. The final piece of my puzzle is to get a header on the first row of the csv file. In the code I have now, I start by completely emptying the csv file.
I tried both emptying only row 2 downwards and printing the headers along with my output, but neither gave a satisfactory result.
See my code below. Does anybody know how I can print headers only in the first row of the csv file? Many thanks!
import re

import requests
from bs4 import BeautifulSoup

open('output.csv', 'w').close()  # empty the output file before scraping


def fundaSpider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        ads = soup.find_all('li', {'class': 'search-result'})
        for ad in ads:
            title = ad.find('h3')
            # separate by newline, strip whitespace, drop the last 3 words, rejoin
            title = ' '.join(title.get_text(separator='\n', strip=True).split()[:-3])
            address = ad.find('small').text.strip()
            price = ad.find('div', {'class': 'search-result-info search-result-info-price'}).text.strip()
            price = ''.join(re.findall(r'\d', price))  # keep only the digits
            size_results = ad.find('ul', {'class': 'search-result-kenmerken'})
            li = size_results.find_all('li')
            size = li[0].get_text(strip=True).split(" ")[0]
            room = li[1].text.strip().split(" ")[0]
            href = 'http://www.funda.nl' + ad.find_all('a')[2]['href']
            area = get_single_item_data(href)
            print(title + "," + address + "," + price + "," + size + "," +
                  room + "," + area + "," + href)
            saveFile = open('output.csv', 'a')
            saveFile.write(title + "," + address + "," + price + "," + size +
                           "," + room + "," + area + "," + href + '\n')
            saveFile.close()
        page += 1


def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    li = soup.find_all('li', {'class': 'breadcrumb-listitem'})
    return li[2].a.text


fundaSpider(1)
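One possible approach, sketched below: since the script already truncates output.csv once at the start, that is the natural place to write the header row, and the scraping loop then only appends data rows. This sketch uses the standard-library csv module instead of hand-joined strings (which also protects against commas inside a title or address); the column names and the sample row are assumptions for illustration, not taken from the actual site.

```python
import csv

# Assumed column names matching the fields written in the loop above.
COLUMNS = ['title', 'address', 'price', 'size', 'room', 'area', 'href']


def start_output(path):
    # Truncate the file and write the header as the first (and only) row.
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerow(COLUMNS)


def append_row(path, row):
    # Append one data row; csv.writer quotes fields that contain commas.
    with open(path, 'a', newline='') as f:
        csv.writer(f).writerow(row)


# Hypothetical usage: call start_output once before scraping,
# then append_row for each ad instead of saveFile.write(...).
start_output('output.csv')
append_row('output.csv', ['Sample house', 'Rotterdam', '250000',
                          '120', '4', 'Centrum', 'http://www.funda.nl/x'])
```

In fundaSpider you would replace the open('output.csv', 'a') / saveFile.write(...) lines with a single append_row call, so the header is written exactly once and never repeated.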