Python Forum
web crawler that retrieves data not stored in source code
#11
Yeah, I checked the page source and it works just fine; I don't think there's going to be much variation in there. Unless they change the whole layout of the website, it won't be just one tag that doesn't match. :D

Anyway, now I've got the link part, the title part and the ad id part; all I need to do is convert the ad id into views.

I'm using the code you guys gave me:

# adid is 18238521
# views is 4
import requests
import time
from bs4 import BeautifulSoup
from selenium import webdriver

href = 'http://www.publi24.ro/anunturi/imobiliare/de-vanzare/apartamente/garsoniera/anunt/Garsoniera-Sector-1/7b006674706c6156.html'

def get_adid(item_url):
    # pull the ad id out of the 'ng-init' attribute of the eye-icon span
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for link in soup.findAll('span', {'class': 'fa fa-eye'}):
        adid = link.get('ng-init')
        num = adid.split('=')[1]
        print(num)

def views(item_url):
    # the view counter is rendered by JavaScript, so load the page in PhantomJS
    browser = webdriver.PhantomJS(r'D:\phantomjs-2.1.1-windows\bin\phantomjs.exe')
    browser.get(item_url)
    time.sleep(1)  # give the page's JavaScript a moment to render the counter
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    tag = soup.find('span', {'add-view': get_adid(href)})
    print(tag.text)
    browser.quit()
But I'm not sure how to pass the ad id from the get_adid() function into the dictionary that goes into the 'tag' variable in the views() function. I tried putting it in there, but it just prints the ad id instead.
I thought about zipping two lists together, but that doesn't produce a dictionary, and both lists have to be the same length, whereas here I'm just trying to pair 'add-view' with whatever id I can get.
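What I'm trying to end up with is roughly this (just a sketch of the idea, not working yet, and it assumes get_adid() is changed to return the id instead of printing it):

def views(item_url):
    browser = webdriver.PhantomJS(r'D:\phantomjs-2.1.1-windows\bin\phantomjs.exe')
    browser.get(item_url)
    time.sleep(1)
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    # build the attribute filter from the value returned by get_adid()
    attrs = {'add-view': get_adid(item_url)}  # get_adid() would need to return adid_num
    tag = soup.find('span', attrs)
    print(tag.text)
    browser.quit()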
#12
Hey guys, I was wondering if you can help me out with something on this crawler.

I'm almost done with it, but I'm trying to figure out a way to filter out some of the results it prints when it gathers all the links on a page.

On the standard ad listing page you see something like this:

http://www.publi24.ro/anunturi/imobiliar...sti/?pag=2

The top 3 ads are always promoted ads, a sort of special ad that people pay for so it stays at the top of the page all the time.
Then you have the rest of the ads, which are not paid for.
The thing is that both premium and free ads have an 'a' tag with the 'itemprop' attribute set to 'name', and the href attribute is the one that holds the link.

The paid ads have an href that looks like this:
http://www.publi24.ro/anunturi/imobiliar...e-baba-nov...
and the free ones look like this:
/anunturi/imobiliare/de-inchiriat/apartamente/apartamente-3-camere/anunt/Apartament-3-camere-Bucuresti-250ronzi-Regi...

The paid ones have the http... string in front of them, which makes my function unhappy :))
def crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.publi24.ro/anunturi/imobiliare/bucuresti/?pag=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'itemprop': 'name'}):
            # paid ads already start with http://www.publi24.ro, so this doubles the domain for them
            href = 'http://www.publi24.ro' + link.get('href')
            # ad_title(href)
            # views(href)
            print(href)
        page += 1
Problem is that my output looks something like this:
Output:
http://www.publi24.rohttp://www.publi24.ro/anunturi/imobiliare/de-inchiriat/case/vila/anunt/vila-pentru-petrece...
http://www.publi24.rohttp://www.publi24.ro/anunturi/imobiliare/de-inchiriat/apartamente/apartamente-2-camere/anunt/victoriei-calea-apartame...
http://www.publi24.rohttp://www.publi24.ro/anunturi/imobiliare/de-inchiriat/spatii-comerciale/spatiu-comercial/anunt/inchiriez-spatiu-comerc...
http://www.publi24.ro/anunturi/imobiliare/de-vanzare/apartamente/apartamente-2-camere/anunt/apartament-2-Camere-Drumul-Taberei...
http://www.publi24.ro/anunturi/imobiliare/de-inchiriat/spatii-comerciale/birou/anunt/Inchiriere-birou-219mp-in-cladire-birouri-Cal...
[...]
So the first 3 ads get messed up because of how the function is set up, while most of the other ads have an href that doesn't include the domain.

My question is: is there any way I can filter out the first 3 results every time? I checked, and they show up on all the pages.
I was thinking maybe I could store the hrefs in a list and start printing from element 3, but I'm not sure how to do that.
Can you help me with that?
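What I had in mind is roughly this, though I haven't gotten it working yet (just a sketch):

import requests
from bs4 import BeautifulSoup

def crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.publi24.ro/anunturi/imobiliare/bucuresti/?pag=' + str(page)
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        # collect every href on the page first...
        hrefs = ['http://www.publi24.ro' + link.get('href')
                 for link in soup.findAll('a', {'itemprop': 'name'})]
        # ...then skip the first 3, which are always the promoted ads
        for href in hrefs[3:]:
            print(href)
        page += 1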
#13
Check if link.get('href') already contains http.
If it does, use it as-is; otherwise prepend the base URL.
E.g.:
from bs4 import BeautifulSoup

html = '''\
<a href="http://www.publi24.ro/anunturi/"></a>
<a href="/anunturi/imobiliare/de-vanzare/"></a>'''

soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
    if 'http' in link.get('href'):
        print(link.get('href'))
    else:
        print('http://www.publi24.ro{}'.format(link.get('href')))
Output:
http://www.publi24.ro/anunturi/
http://www.publi24.ro/anunturi/imobiliare/de-vanzare/
Use code tags in your posts; I have added them for you this time.
#14
Hey, thanks for the reply. Sorry I couldn't get back to you sooner; I started work again and I don't have that much time on my hands anymore.
Also sorry about the code tag thing; I kept noticing the post was being formatted, but I thought it was happening on its own. I'll be sure to use the tags in further replies. :)

I'm also getting some errors on this code right now and I'm trying to figure out how to make it work.

[edit]

OK, I figured it out; I made a new variable where I used the replace function:

def crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.publi24.ro/anunturi/imobiliare/bucuresti/?pag=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'itemprop': 'name'}):
            href = 'http://www.publi24.ro' + link.get('href')
            # paid ads already carry the domain, so collapse the doubled prefix
            href2 = href.replace('http://www.publi24.rohttp://www.publi24.ro', 'http://www.publi24.ro')
            # ad_title(href)
            # views(href)
            print(href2)
        page += 1
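Looking at it again, the check suggested in the previous reply would probably be cleaner than replace(); something like this (untested sketch):

import requests
from bs4 import BeautifulSoup

def crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.publi24.ro/anunturi/imobiliare/bucuresti/?pag=' + str(page)
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        for link in soup.findAll('a', {'itemprop': 'name'}):
            raw = link.get('href')
            # paid ads already carry the full domain; free ads are relative paths
            href = raw if raw.startswith('http') else 'http://www.publi24.ro' + raw
            print(href)
        page += 1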
#15
Hey guys, I just finished the crawler project for the Romanian ads website. Here's the source code for anyone who wants to play with it; you guys helped me a great deal with it, so I feel it belongs to you as much as it belongs to me:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import os

def crawler(max_pages):
    # walk the listing pages and visit every ad link on each page
    page = 1
    while page <= max_pages:
        url = 'http://www.publi24.ro/anunturi/imobiliare/bucuresti/?pag=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'itemprop': 'name'}):
            rawhref = 'http://www.publi24.ro' + link.get('href')
            # promoted ads already contain the domain, so strip the doubled prefix
            href = rawhref.replace('http://www.publi24.rohttp://www.publi24.ro', 'http://www.publi24.ro')
            ad_title(href)
            views(href)
            print(href)
        page += 1

def ad_title(item_url):
    # print the ad title from the <h1 itemprop="name"> tag
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for item_name in soup.findAll('h1', {'itemprop': 'name'}):
        print(item_name.string)

def get_adid(item_url):
    # the ad id sits in the 'ng-init' attribute of the eye-icon span
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for link in soup.findAll('span', {'class': 'fa fa-eye'}):
        adid = link.get('ng-init')
        adid_num = adid.split('=')[1]
        return adid_num

def views(item_url):
    # the view counter is filled in by JavaScript, so render the page with a real browser
    # browser = webdriver.PhantomJS(r'D:\phantomjs-2.1.1-windows\bin\phantomjs.exe')
    chromedriver = r'D:\chromedriver_win32\chromedriver.exe'
    os.environ['webdriver.chrome.driver'] = chromedriver
    browser = webdriver.Chrome(chromedriver)
    browser.get(item_url)
    time.sleep(1)  # give the page's JavaScript a moment to render the counter
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    attrs = {'add-view': get_adid(item_url)}  # match the span whose 'add-view' attribute is the ad id
    tag = soup.find('span', attrs)
    print(tag.text)
    browser.quit()

crawler(1)
Now I'm working on a follow-up project where I add persistence to the output and store it somewhere (I'm leaning towards Excel because I've been working with it 'manually' so far), but I need to learn more and play with it before I have any real questions. For now I think this topic can be closed.
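As a first step I'll probably just dump the results into a CSV file, which Excel can open directly; this is the rough idea (the column names are just placeholders I made up):

import csv

def save_rows(rows, filename='ads.csv'):
    # rows would be a list of (title, link, views) tuples collected by the crawler
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'link', 'views'])
        writer.writerows(rows)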

Thanks again.