Python Forum
web crawler that retrieves data not stored in source code
#15
Hey guys, I just finished the crawler project for the Romanian ads website. Here's the source code for anyone who wants to play with it. You helped me a great deal with it, so I feel it belongs to you as much as it belongs to me:

import os
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

BASE_URL = 'http://www.publi24.ro'

def crawler(max_pages):
   # Walk the listing pages and process every ad linked from each one.
   for page in range(1, max_pages + 1):
       url = BASE_URL + '/anunturi/imobiliare/bucuresti/?pag=' + str(page)
       source_code = requests.get(url)
       soup = BeautifulSoup(source_code.text, 'html.parser')
       for link in soup.find_all('a', {'itemprop': 'name'}):
           # urljoin copes with both relative and absolute hrefs,
           # so no prefix clean-up is needed afterwards.
           href = urljoin(BASE_URL, link.get('href'))
           ad_title(href)
           views(href)
           print(href)
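def ad_title(item_url):
   # Print the ad's headline, taken from the <h1 itemprop="name"> element.
   source_code = requests.get(item_url)
   soup = BeautifulSoup(source_code.text, 'html.parser')
   for item_name in soup.find_all('h1', {'itemprop': 'name'}):
       # get_text() also works when the <h1> contains nested tags,
       # where .string would be None.
       print(item_name.get_text(strip=True))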

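def get_adid(item_url):
   # The ad id is read from the ng-init attribute of the "eye" icon span;
   # the part after '=' is the id. Returns None if no usable span is found.
   source_code = requests.get(item_url)
   soup = BeautifulSoup(source_code.text, 'html.parser')
   for link in soup.find_all('span', {'class': 'fa fa-eye'}):
       adid = link.get('ng-init')
       if adid and '=' in adid:
           return adid.split('=')[1]
   return None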

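def views(item_url):
   # The view counter is filled in by JavaScript, so a real browser renders the page.
   # browser = webdriver.PhantomJS(r'D:\phantomjs-2.1.1-windows\bin\phantomjs.exe')
   chromedriver = r'D:\chromedriver_win32\chromedriver.exe'
   os.environ['webdriver.chrome.driver'] = chromedriver
   # Passing the driver path positionally is the Selenium 3 style;
   # newer Selenium releases may need a Service object instead.
   browser = webdriver.Chrome(chromedriver)
   try:
       browser.get(item_url)
       time.sleep(1)  # give the page's scripts time to fill in the counter
       soup = BeautifulSoup(browser.page_source, 'html.parser')
   finally:
       browser.quit()  # always close the browser, even if loading fails
   adid = get_adid(item_url)
   if adid is None:
       print('no ad id found for', item_url)
       return
   # The counter lives in a <span add-view="<ad id>"> element.
   tag = soup.find('span', {'add-view': adid})
   if tag is not None:
       print(tag.text)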

crawler(1)
Now I'm working on a follow-up project based on this one that adds persistence, so the output gets stored somewhere instead of just printed. I'm leaning towards Excel because I've been working with it manually so far, but I need to learn more and experiment before I have any real questions. For now, I think this topic can be closed.
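
For anyone following along, here is a rough sketch of what that persistence step could look like. It is not part of the crawler above: the `save_ads` and `load_ads` helpers and the `title`, `url` and `views` column names are made up for illustration, and it writes a plain CSV file (which Excel opens) with the standard `csv` module instead of a real .xlsx workbook.

```python
import csv

# Hypothetical column names; the crawler above only prints these values.
FIELDNAMES = ['title', 'url', 'views']

def save_ads(rows, path):
    """Write a list of ad dicts to a CSV file that Excel can open."""
    # utf-8-sig writes a BOM so Excel recognises the encoding,
    # which matters for Romanian diacritics in ad titles.
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)

def load_ads(path):
    """Read the CSV back as a list of dicts (all values come back as strings)."""
    with open(path, newline='', encoding='utf-8-sig') as f:
        return list(csv.DictReader(f))
```

To use it, the crawler functions would have to return their values instead of printing them, collect one dict per ad in a list, and pass that list to `save_ads(rows, 'ads.csv')` at the end of the run.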

thanks again


Messages In This Thread
RE: web crawler that retrieves data not stored in source code - by edithegodfather - Jan-14-2017, 01:01 AM
