Web Crawler help - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: Web Crawler help (/thread-1957.html)
Web Crawler help - takaa - Feb-06-2017

Hi folks, I'm new to Python and to this forum. Background: I started coding recently to make my life easier by automating as much of it as possible. As a result I don't have much experience, but I am doing my best to catch up. I made a web crawler to extract info about houses for sale. Each page on the housing site contains 15 houses.

import requests
from bs4 import BeautifulSoup

def fundaSpider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for x in soup.find_all('h3', {'class': 'search-result-title'}):
            location = x.get_text(strip=True)
        for y in soup.find_all('div', {'class': 'search-result-info search-result-info-price'}):
            price = y.get_text(strip=True)
        for z in soup.find_all('ul', {'class': 'search-result-kenmerken'}):
            size = z.get_text(strip=True)
            print(location + ';' + price + ';' + size)
        page += 1

fundaSpider(2)

The output of this code returns the "location" and the "price" of the first listing for each of the 15 listings per page, and the "size" correctly for each of the 15 listings per page. Below I have included the HTML code of the webpage for one of the listings. My questions:

1. My output is not correct; I should get the location, price and size returned for each listing per page. I know that my 3 "for" loops are probably not correct. I have tried several things, but I keep getting different variants of solutions that are not correct.

2. Currently the information for "Location" and "Size" comes out run together:
Location: Slaak 1123061 CZ Rotterdam; Price: € 175.000 k.k.; Size: 135 m²/171 m²5 kamers

I would prefer to extract this separately:
Slaak 112
3061 CZ Rotterdam
€ 175.000 k.k.
135 m²/171 m²
5 kamers

Any tips are welcome. All help is much appreciated!
RE: Web Crawler help - wavic - Feb-06-2017

In [73]: for i, item in enumerate(soup.find_all('div', class_='search-result-content-inner'), 1):
    ...:     print(item.find('h3').text.strip().split('\n')[0].strip())
    ...:     print(item.h3.small.text.strip(), '\n')
    ...:

Scottstraat 3
3076 GX Rotterdam

Jan Meertensstraat 3
3065 PB Rotterdam

Yersekestraat 36
3086 SG Rotterdam

Korendijk 129
3079 PW Rotterdam

Amer 6
3068 GA Rotterdam

Port-Saidstraat 150
3067 MV Rotterdam

John Mottweg 100
3069 VT Rotterdam

Brandingdijk 276
3059 RB Rotterdam

Abraham van der Knaapkade 9
3059 SP Rotterdam

Koenraad van Zwabenstraat 40
3077 WJ Rotterdam

Jeneverbes 6
3069 LP Rotterdam

Lamastraat 56
3064 LL Rotterdam

Zeeuwsestraat 2 + PP
3074 TT Rotterdam

Oudedijk 187 A
3061 AD Rotterdam

Slaak 112
3061 CZ Rotterdam

RE: Web Crawler help - metulburr - Feb-06-2017

Because you have 3 separate loops and only print in the last one, the print only sees the *last* value left over from each of the previous loops, which is why they are all the same. You should grab the main tag of each ad and then iterate those tags instead:

for x in soup.find_all('h3', {'class': 'search-result-title'}):
    location = x.get_text(strip=True)
for y in soup.find_all('div', {'class': 'search-result-info search-result-info-price'}):
    price = y.get_text(strip=True)
for z in soup.find_all('ul', {'class': 'search-result-kenmerken'}):
    size = z.get_text(strip=True)
    print(location + ';' + price + ';' + size)

Also, you should be using the format method regardless of your Python version. I made a quick script extracting the data you requested; some sections took tinkering back and forth to tweak the output into the expected shape.
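The effect metulburr describes can be reproduced with a minimal sketch (the lists below are made-up stand-ins, not the actual funda data):

```python
locations = ['A', 'B']
prices = ['1', '2']
sizes = ['x', 'y']

# Three independent loops: by the time the last loop runs, the first two
# have already finished, so loc and price are stuck at their final values.
for loc in locations:
    pass
for price in prices:
    pass
broken = [(loc, price, size) for size in sizes]
# broken == [('B', '2', 'x'), ('B', '2', 'y')] -- location/price repeat

# Walking one record at a time keeps the fields paired correctly,
# which is what iterating over each ad's container tag achieves.
paired = list(zip(locations, prices, sizes))
# paired == [('A', '1', 'x'), ('B', '2', 'y')]
```

This is why grabbing the per-ad `li.search-result` tag first, then extracting title, price and size *within* that tag, fixes the repetition.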
import requests
from bs4 import BeautifulSoup

SEP = '*' * 80

def fundaSpider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        ads = soup.find_all('li', {'class': 'search-result'})
        for ad in ads:
            title = ad.find('h3')
            # separate by newline, strip whitespace, then split and drop the
            # last 3 elements (the postcode/city), then rejoin
            title = ' '.join(title.get_text(separator='\n', strip=True).split()[:-3])
            address = ad.find('small').text.strip()
            price = ad.find('div', {'class': 'search-result-info search-result-info-price'}).text.strip()
            size_results = ad.find('ul', {'class': 'search-result-kenmerken'})
            li = size_results.find_all('li')
            size = li[0].get_text(strip=True)
            room = li[1].text.strip()
            print('title: {}\naddress: {}\nprice: {}\nsize: {}\nrooms: {}'.format(
                title, address, price, size, room))
            print(SEP)
        page += 1

fundaSpider(2)

RE: Web Crawler help - takaa - Feb-07-2017

Thank you very much for the very helpful replies! Really motivating to get this help; it keeps the learning curve going!

RE: Web Crawler help - wavic - Feb-07-2017

Why am I using enumerate here...! It's a leftover from counting the returned results.

RE: Web Crawler help - takaa - Feb-07-2017

Thanks to your help I have been able to extract the data and export everything to a CSV file. Of course this leaves me wanting more, and I would now like to access the URLs of the individual listings to extract more detailed info for each of them.
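For the CSV export takaa mentions, a minimal sketch with the standard csv module (the field names and the sample row are assumptions matching the shape of metulburr's output, not takaa's actual script):

```python
import csv

# Hypothetical rows in the shape metulburr's script extracts per ad.
rows = [
    {'title': 'Scottstraat 3', 'address': '3076 GX Rotterdam',
     'price': '€ 165.000 k.k.', 'size': '67 m²/138 m²', 'rooms': '3 kamers'},
]

# newline='' is required on Windows to avoid blank lines between rows.
with open('listings.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'address', 'price', 'size', 'rooms'])
    writer.writeheader()
    writer.writerows(rows)
```

Inside the `for ad in ads:` loop you would append one such dict per listing instead of printing.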
In the main source code this means I need to scrape this URL for each of the listings (see blue):

</div>
</a>
</div>
<div class="search-result-content">
  <div class="search-result-content-inner">
    <div class="search-result-header">
      <a href="/koop/rotterdam/huis-85488249-scottstraat-3/" data-search-result-item-anchor="85488249">
        <h3 class="search-result-title">
          Scottstraat 3
          <small class="search-result-subtitle">
            3076 GX Rotterdam
          </small>
        </h3>
      </a>
    </div>
    <div class="search-result-info search-result-info-price">
      <span class="search-result-price">€ 165.000 k.k.</span>
    </div>
    <div class="search-result-info">
      <ul class="search-result-kenmerken ">
        <li>
          <span title="Woonoppervlakte">67 m²</span> /
          <span title="Perceeloppervlakte">138 m²</span>
        </li>
        <li>3 kamers</li>
      </ul>
    </div>

I am using the code as posted by metulburr and added this line below "room":

href = ad.find('a', {'class': 'search-result-header'}).link.get('href', {})

but then I get an error message. I have tried several things, for example

href = 'www.funda.nl' + ad.find('a')

but without success so far. As always, help is much appreciated!

RE: Web Crawler help - snippsat - Feb-07-2017

It's the div and class first; then find the a tag with the href:

from bs4 import BeautifulSoup

html = '''\
<div class="search-result-header">
  <a href="/koop/rotterdam/huis-85488249-scottstraat-3/" data-search-result-item-anchor="85488249">
    <h3 class="search-result-title">Scottstraat 3
      <small class="search-result-subtitle">3076 GX Rotterdam</small>
    </h3>
  </a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
result = soup.find('div', class_="search-result-header")
link = result.find('a').get('href')
print('www.funda.nl{}'.format(link))
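An alternative to prefixing the host by hand is urllib.parse.urljoin from the standard library; a small sketch using the href from the snippet above:

```python
from urllib.parse import urljoin

page_url = 'http://www.funda.nl/koop/rotterdam/p1'
href = '/koop/rotterdam/huis-85488249-scottstraat-3/'

# urljoin resolves the root-relative path against the base URL's
# scheme and host, so the result is a complete, fetchable URL.
full_url = urljoin(page_url, href)
print(full_url)  # http://www.funda.nl/koop/rotterdam/huis-85488249-scottstraat-3/
```

This also handles relative paths correctly if the site ever emits them without a leading slash.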
RE: Web Crawler help - metulburr - Feb-07-2017

You can use

ad.find_all('a')[2]['href']

because it's actually the 3rd a tag, not the first, that holds the path; then join it with urllib.parse.urljoin.

RE: Web Crawler help - takaa - Feb-07-2017

Thanks a million! I have added a new def that searches through the pages of the individual listings:

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for item in soup.findAll('li', {'class': 'breadcrumb-listitem'}):
        area = item.find('a')
        print(area)

I am trying to get one specific piece of information (the neighbourhood). I succeeded in getting the info I am looking for in the output, but along with a lot of stuff that I don't want; I only want the "title" (in the example below, "Lombardijen"). The HTML code for each listing looks as follows; I have marked in red what I tried to extract:

<div class="breadcrumb">
  <ol class="container breadcrumb-list">
    <li class="breadcrumb-listitem">
      <a href="/koop/" title="Home">Home</a>
      <span class="icon-arrow-right-grey"></span>
    </li>
    <li class="breadcrumb-listitem">
      <a href="/koop/rotterdam/" title="Rotterdam">Rotterdam</a>
      <span class="icon-arrow-right-grey"></span>
    </li>
    <li class="breadcrumb-listitem">
      <a href="/koop/rotterdam/lombardijen/" title="Lombardijen">Lombardijen</a>
      <span class="icon-arrow-right-grey"></span>
    </li>
    <li class="breadcrumb-listitem">
      <span title="Scottstraat 3">Scottstraat 3</span>
    </li>
  </ol>

Hope this part of my puzzle can also be solved.
Thanks again for the help.

RE: Web Crawler help - metulburr - Feb-08-2017

import requests
from bs4 import BeautifulSoup

html = '''
<div class="breadcrumb">
  <ol class="container breadcrumb-list">
    <li class="breadcrumb-listitem">
      <a href="/koop/" title="Home">Home</a>
      <span class="icon-arrow-right-grey"></span>
    </li>
    <li class="breadcrumb-listitem">
      <a href="/koop/rotterdam/" title="Rotterdam">Rotterdam</a>
      <span class="icon-arrow-right-grey"></span>
    </li>
    <li class="breadcrumb-listitem">
      <a href="/koop/rotterdam/lombardijen/" title="Lombardijen">Lombardijen</a>
      <span class="icon-arrow-right-grey"></span>
    </li>
    <li class="breadcrumb-listitem">
      <span title="Scottstraat 3">Scottstraat 3</span>
    </li>
  </ol>
'''

soup = BeautifulSoup(html, 'html.parser')
li = soup.find_all('li', {'class': 'breadcrumb-listitem'})
print(li[2].a.text)

EDIT: oh whoops, that's the tag's text, not its title attribute:

print(li[2].a['title'])
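Indexing li[2] works for this exact markup; if the breadcrumb depth ever varies per listing, a slightly more defensive variant (a sketch under the same assumptions about the markup) is to take the last breadcrumb item that is still a link, since the neighbourhood is the deepest linked crumb before the address:

```python
from bs4 import BeautifulSoup

html = '''
<ol class="container breadcrumb-list">
  <li class="breadcrumb-listitem"><a href="/koop/" title="Home">Home</a></li>
  <li class="breadcrumb-listitem"><a href="/koop/rotterdam/" title="Rotterdam">Rotterdam</a></li>
  <li class="breadcrumb-listitem"><a href="/koop/rotterdam/lombardijen/" title="Lombardijen">Lombardijen</a></li>
  <li class="breadcrumb-listitem"><span title="Scottstraat 3">Scottstraat 3</span></li>
</ol>
'''

soup = BeautifulSoup(html, 'html.parser')
# Keep only the breadcrumb items that contain an <a>; the final item
# (the address itself) is a <span>, so it is filtered out.
links = [li.a for li in soup.find_all('li', {'class': 'breadcrumb-listitem'}) if li.a]
neighbourhood = links[-1]['title']
print(neighbourhood)  # Lombardijen
```

This way the code does not break if the site adds or removes a level in the breadcrumb trail.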