web crawler that retrieves data not stored in source code

edithegodfather · Jan-14-2017, 01:01 AM

Hey guys, I just finished the crawler project for the Romanian ads website. Here's the source code for anyone who wants to play with it. You helped me a great deal with it, so I feel it belongs to you as much as it belongs to me:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from urllib.parse import urljoin
import time
import os

BASE_URL = 'http://www.publi24.ro'

def crawler(max_pages):
    # Walk the listing pages and process every ad linked from each one.
    for page in range(1, max_pages + 1):
        url = BASE_URL + '/anunturi/imobiliare/bucuresti/?pag=' + str(page)
        source_code = requests.get(url)
        soup = BeautifulSoup(source_code.text, 'html.parser')
        for link in soup.find_all('a', {'itemprop':'name'}):
            # hrefs may be relative or already absolute; urljoin handles both,
            # so the doubled-prefix cleanup is no longer needed
            href = urljoin(BASE_URL, link.get('href'))
            ad_title(href)
            views(href)
            print(href)

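def ad_title(item_url):
    # Print the ad's title, taken from the <h1 itemprop="name"> heading.
    source_code = requests.get(item_url)
    soup = BeautifulSoup(source_code.text, 'html.parser')
    for item_name in soup.find_all('h1', {'itemprop':'name'}):
        # get_text() also works when the heading contains nested tags
        print(item_name.get_text(strip=True))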

def get_adid(item_url):
    # Return the ad id read from the ng-init attribute of the "eye" icon span
    # (the part after '='), or None if the page has no usable span.
    source_code = requests.get(item_url)
    soup = BeautifulSoup(source_code.text, 'html.parser')
    for link in soup.find_all('span', {'class':'fa fa-eye'}):
        adid = link.get('ng-init')
        if adid and '=' in adid:
            return adid.split('=')[1]
    return None

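def views(item_url):
    # The view counter is not in the HTML that requests downloads, so the page
    # is rendered in a real browser first and the result is parsed afterwards.
    adid = get_adid(item_url)
    if adid is None:
        return
    # browser = webdriver.PhantomJS(r'D:\phantomjs-2.1.1-windows\bin\phantomjs.exe')
    chromedriver = r'D:\chromedriver_win32\chromedriver.exe'
    os.environ['webdriver.chrome.driver'] = chromedriver
    browser = webdriver.Chrome(chromedriver)
    try:
        browser.get(item_url)
        time.sleep(1)  # crude wait for the page scripts to fill in the counter
        soup = BeautifulSoup(browser.page_source, 'html.parser')
    finally:
        browser.quit()  # close the browser even if loading or parsing fails
    # 'attrs' instead of 'dict', which shadowed the built-in type
    attrs = {'add-view': adid}
    tag = soup.find('span', attrs)
    if tag is not None:
        print(tag.text)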

crawler(1)
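
A note on the time.sleep(1) in views(): a fixed one-second pause assumes the counter has always rendered by then, which may not hold on a slow connection. Below is a minimal sketch of polling instead, using only the standard library. The helper name wait_until and its parameters are made up for illustration and are not part of the original code. Selenium ships its own version of this idea as WebDriverWait in selenium.webdriver.support.ui, which is probably the better fit when a browser is already in use.

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.25):
    """Call predicate() until it returns a truthy value or the timeout elapses.

    Returns the truthy value, or None if the timeout was reached first.
    (Hypothetical helper, not part of the original crawler.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)

# Possible use inside views(), in place of time.sleep(1) (untested against the site):
# tag = wait_until(lambda: BeautifulSoup(browser.page_source, 'html.parser')
#                  .find('span', {'add-view': adid}))
```

The predicate is always tried at least once, so a page that is ready immediately costs no waiting at all.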

Now I'm working on a project based on this one that adds persistence, so the output gets stored somewhere instead of just printed (I'm leaning towards Excel because I've been working with it 'manually' so far). I need to learn more and play with it before I have any real questions, so for now I think this topic can be closed.
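
For the persistence part, one simple starting point might be a CSV file, since Excel opens those directly. This is only a sketch under assumptions: the function name save_ads, the column names, and the idea of collecting (url, title, views) tuples instead of printing them are all made up here and are not in the code above. A real .xlsx file would need a third-party library, which is not shown.

```python
import csv

def save_ads(rows, path):
    """Write (url, title, views) tuples to a CSV file that Excel can open.

    Hypothetical helper; the crawler above prints its results and would
    need to collect them into a list before calling this.
    """
    # newline='' is what the csv module documentation recommends for output files.
    # utf-8-sig writes a BOM, which helps Excel detect UTF-8 (Romanian diacritics).
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(['url', 'title', 'views'])  # header row (assumed columns)
        writer.writerows(rows)

# Example call with placeholder data (not real crawl output):
# save_ads([('http://www.publi24.ro/example-ad', 'Apartament 2 camere', '15')], 'ads.csv')
```

Each run overwrites the file; opening it with mode 'a' and skipping the header row would append instead.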

thanks again
