Jan-14-2017, 01:01 AM
hey guys, i just finished the crawler project for the romanian ads website, here's the source code for anyone who wants to play with it; you guys helped me a great deal with it so i feel it belongs to you as much as it belongs to me:
thanks again
import os
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver


def crawler(max_pages):
    # walk the listing pages and process every ad link found on each one
    page = 1
    while page <= max_pages:
        url = 'http://www.publi24.ro/anunturi/imobiliare/bucuresti/?pag=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'itemprop': 'name'}):
            rawhref = 'http://www.publi24.ro' + link.get('href')
            # some hrefs are already absolute, so drop the doubled prefix
            href = rawhref.replace('http://www.publi24.rohttp://www.publi24.ro', 'http://www.publi24.ro')
            ad_title(href)
            views(href)
            print(href)
        page += 1


def ad_title(item_url):
    # print the title of a single ad page
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for item_name in soup.findAll('h1', {'itemprop': 'name'}):
        print(item_name.string)


def get_adid(item_url):
    # the ad id sits in the ng-init attribute of the "eye" icon, after the '='
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for link in soup.findAll('span', {'class': 'fa fa-eye'}):
        adid = link.get('ng-init')
        adid_num = adid.split('=')[1]
        return adid_num
    return None  # no icon found on this page


def views(item_url):
    # the view counter is filled in by javascript, so load the page in a real browser
    # browser = webdriver.PhantomJS(r'D:\phantomjs-2.1.1-windows\bin\phantomjs.exe')
    chromedriver = r'D:\chromedriver_win32\chromedriver.exe'
    os.environ['webdriver.chrome.driver'] = chromedriver
    browser = webdriver.Chrome(chromedriver)
    browser.get(item_url)
    time.sleep(1)  # give the counter a moment to render
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    # renamed from 'dict' so the builtin is not shadowed
    attrs = {'add-view': get_adid(item_url)}
    tag = soup.find('span', attrs)
    if tag is not None:
        print(tag.text)
    browser.quit()


crawler(1)

now i'm working on a project based on this one where i can add persistence to the output and store it into something (i'm leaning towards excel cuz i've been working with it 'manually' so far), but i need to learn more and play with it before i have any real questions. for now i think this topic can be closed.
thanks again
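
for anyone wondering what the persistence step mentioned above could look like, here's a rough sketch, not part of the crawler itself. it assumes the crawler is changed to collect (url, title, views) tuples instead of printing them. the function name save_ads, the column names and the file name are made up for illustration. it writes a plain csv file with the standard library, which excel can open directly; the utf-8-sig encoding is there so excel picks up romanian diacritics correctly.

```python
import csv


def save_ads(rows, path):
    """Write (url, title, views) rows to a CSV file that Excel can open.

    'rows' is assumed to be an iterable of 3-item tuples collected by the
    crawler; the header names below are illustrative, not from the site.
    """
    # newline='' stops the csv module from writing blank lines on Windows;
    # utf-8-sig adds a BOM so Excel detects the encoding
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(['url', 'title', 'views'])
        writer.writerows(rows)
```

usage would be something like save_ads(collected_rows, 'ads.csv') once the crawl is done, where collected_rows is whatever list the crawler filled in. a real .xlsx file would need a third-party library, so csv seemed like the simplest starting point.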