Python Forum
web crawler that retrieves data not stored in source code
#1
hey guys,

so i'm working on a small program to help me gather ads from an ad website in my country (i'm gonna post a screenshot here; i'm not sure if i can post the website link, but if that's ok i'll post that as well).

image link: https://i.redd.it/u0g9bjzb7l7y.png

the thing about this website is that it's a pretty basic craigslist-type site with all the familiar categories you'd expect on an ads page.
now, the reason i'm working on this: the site lets you post ads, but it doesn't give you the option of sorting them by number of views, which would be handy for figuring out which ads are newest. it does let you sort by date, but it orders ads by the date they were last updated, not the date they were created. so if a 10-year-old ad gets updated tomorrow, it will rank above an ad you posted today, even though yours is newer. for this particular scenario, the view count is a good proxy for an ad's age.

so i've been following thenewboston's youtube python 3.5 tutorials and i've managed to make the crawler grab the links of all the ads in a certain category (i've used beautifulsoup4 and requests), and it works like a beauty. but when i try the exact same thing on each individual ad page to get the view count, i get nothing. initially i figured i was doing something wrong, and it turned out i was: what bs4 and requests do is fetch all the data in the page's source code and hand it to python. now, if you inspect the page of an ad in the browser you'll see the number of views inside a span tag with an "add-view=[ad id]" attribute, but python still doesn't show the number of views of any ad; it just returns nothing. so i went into the source code itself and sure enough the number of views wasn't there either. it seems the view count is not stored in the page source but comes from somewhere else, and that's what i'm trying to figure out.

any ideas on how to access the view count? let me know if you need links to the page or my source code.

thanks
#2
At the very least we would need the website and your source code. Does this site have an RSS feed? 

We really need to see it for ourselves; the page shows the problem better than a description of it can.

Quote:so it seems that the viewcount is not stored on the page source but it's somewhere else and that's what i'm trying to figure out.
There is a chance that it is rendered by javascript, which would make it even harder to obtain.
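A quick way to check: fetch the raw HTML with requests and see whether the number is there at all. A minimal sketch (the URL is a placeholder for one of the ad pages):

import requests
from bs4 import BeautifulSoup

url = 'http://example.com/some-ad-page.html'  # placeholder: one of the ad pages

html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

# if the view-count span is empty or missing in the raw HTML,
# the number is being filled in by javascript after the page loads
print(soup.find('span', attrs={'add-view': True}))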
#3
sure, the website is publi24.ro
it doesn't really have an english version but what you're looking for is:

publi24.ro/anunturi/locuri-de-munca/bucuresti/
or with page increment
publi24.ro/anunturi/locuri-de-munca/bucuresti/?pag=2
that's the main page with ads from a section (i picked the job seeking section)

and an ad page would look like this:
publi24.ro/anunturi/locuri-de-munca/anunt/Echipa-Tehnician-Alpinist-Telecom/7b00667478616b51.html

if you inspect the element here and hover over the view count number (in this case "1") you'll see a tag that says: <span add-view="18230886">1</span> (it's not 1 anymore because i refreshed a bunch of times, but you get the idea) :)
but in the page source the number doesn't show up at all.
i also ran my program and it brings up None as well.

i'm thinking that maybe if i could make python load the page the way a browser does, instead of just grabbing the source code, it would pull up the view number and print it, but i'm not sure. :-/
#4
If you inspect the element you are viewing the source through the browser's eyes... and if it's not there with python, then it means it's generated by javascript. You would have to get the rendered source with selenium first, before handing it off to BeautifulSoup.


for example
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import os

url = 'http://www.publi24.ro/anunturi/locuri-de-munca/anunt/Echipa-Tehnician-Alpinist-Telecom/7b00667478616b51.html'

def setup():
    '''
    setup webdriver and create browser
    '''
    #https://chromedriver.storage.googleapis.com/index.html
    #https://chromedriver.storage.googleapis.com/index.html?path=2.25/ ##latest
    chromedriver = "/home/metulburr/chromedriver" #the path to the chromedriver
    os.environ["webdriver.chrome.driver"] = chromedriver
    browser = webdriver.Chrome(chromedriver)
    return browser
    
browser = setup()
browser.get(url) 
time.sleep(2)

soup = BeautifulSoup(browser.page_source, 'lxml')
tag = soup.find('span', {'add-view':'18230886'})
print(tag.text)
browser.quit()
Output:
$ python test.py
16
Although this will pop a browser window up for a couple of seconds. If you want, you can use a headless browser to keep it in the background.
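For example, newer Chrome builds have a headless mode you can enable through ChromeOptions (a sketch, assuming your Chrome and chromedriver versions support it):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run chrome without opening a window
chromedriver = "/home/metulburr/chromedriver"  # the path to the chromedriver, as above
browser = webdriver.Chrome(chromedriver, chrome_options=options)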
#5
hey, thanks a lot, that actually worked :D
i did make some changes though to adapt the code to my system:
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import os

url = 'http://www.publi24.ro/anunturi/locuri-de-munca/anunt/Echipa-Tehnician-Alpinist-Telecom/7b00667478616b51.html'

def setup():
    '''
    setup webdriver and create browser
    '''
    # https://chromedriver.storage.googleapis.com/index.html
    # https://chromedriver.storage.googleapis.com/index.html?path=2.25/ ##latest
    chromedriver = r"D:\chromedriver_win32\chromedriver.exe"  # raw string so backslashes aren't escapes
    os.environ["webdriver.chrome.driver"] = chromedriver
    browser = webdriver.Chrome(chromedriver)
    return browser

browser = setup()
browser.get(url)
time.sleep(0)

soup = BeautifulSoup(browser.page_source, 'html.parser')
tag = soup.find('span', {'add-view': '18230886'})
print(tag.text)
browser.quit()
i changed the path from "/home/metulburr/chromedriver" to where i have chromedriver.exe,
changed the time.sleep from 2 to 0 to see if i could make it run faster, and it worked,
and i also changed the parser from lxml to html.parser because lxml was giving me some errors and that got rid of them.

now i'm gonna figure out how to extract all the ad ids, run them through soup.find, and print each ad's link with its number of views.
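something like this sketch is what i have in mind (the 'article-title' class for the ad links is a guess on my part; i'll adapt it to the site's real markup):

from bs4 import BeautifulSoup
from selenium import webdriver

chromedriver = r"D:\chromedriver_win32\chromedriver.exe"
browser = webdriver.Chrome(chromedriver)

# grab the ad links from a category page
browser.get('http://www.publi24.ro/anunturi/locuri-de-munca/bucuresti/')
soup = BeautifulSoup(browser.page_source, 'html.parser')
links = [a.get('href') for a in soup.find_all('a', {'class': 'article-title'})]

# visit each ad and print its link together with its view count
for link in links:
    browser.get(link)
    ad_soup = BeautifulSoup(browser.page_source, 'html.parser')
    tag = ad_soup.find('span', attrs={'add-view': True})
    if tag:
        print(link, tag.text)

browser.quit()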

say, is there any way of doing this without having to run a browser window?
#6
Quote:say, is there any way of doing this without having to run a browser window?
PhantomJS
chromedriver = "D:\chromedriver_win32\chromedriver.exe"  # the path to the chromedriver
os.environ["webdriver.chrome.driver"] = chromedriver
browser = webdriver.Chrome(chromedriver)
# Change to
browser = webdriver.PhantomJS(path_to_phantomjs.exe)
#7
Quote:changed the time.sleep from 2 to 0 to see if i could make it run faster and it worked
Be aware that you might get different results based on loading times. For me, if i don't give it some delay, the page loads without the value in both chrome and phantomjs.
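If you don't want a fixed sleep, you can also wait explicitly until the span actually has text. A sketch using selenium's WebDriverWait (browser and url as in the earlier example; the 10-second timeout is arbitrary):

from selenium.webdriver.support.ui import WebDriverWait

browser.get(url)
# retry until the add-view span has non-empty text, up to 10 seconds;
# WebDriverWait ignores NoSuchElementException while polling
WebDriverWait(browser, 10).until(
    lambda b: b.find_element_by_css_selector('span[add-view]').text.strip() != '')
soup = BeautifulSoup(browser.page_source, 'html.parser')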
#8
no problem, i'll keep an eye out for it in case i miss any values.
i'm gonna take that code now and put it together with the rest so it picks up the view count from every link and prints it alongside each ad's link.
i'll keep you guys updated and thanks a lot for the help. :)
#9
ok, so i managed to make the crawler pick up the links and the titles; now all i'm missing is the view count.

now for the view count there are 2 types of ads:
the ones that have been reposted automatically:

http://www.publi24.ro/anunturi/imobiliar...b6256.html

(notice the red "Repostat automat" ("reposted automatically") under the date)

and the ones that have not:
http://www.publi24.ro/anunturi/imobiliar...f6755.html

now, not all of them have been reposted automatically, so we'll have to focus on the ones that have not.
for the second ad the id is: 18135642
you can find the number in the following lines of html:

<section class="s-ad-details" ng-controller="DetailView" ng-init="articleId=18135642; userId='780967757f696a'; uniqueAdId='d9aa70f0-2781-475b-8d66-af542ee5e8a1';logged=false; articlePrice = '32.700 EUR'">
<li><a rel="nofollow" href="/statistica-anunt-18135642.html"><span class="fa fa-eye" ng-init="ad.Id=18135642"></span>Vizualizari: <span add-view="18135642"></span></a></li>

the best way to go about this i figured was:

import requests
from bs4 import BeautifulSoup

href = 'http://www.publi24.ro/anunturi/imobiliare/de-vanzare/apartamente/apartamente-2-camere/anunt/Giurgiului-Drumul-Gazarului-ideal-investitie/7b0065747d6f6755.html'

def get_adid(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for link in soup.find_all('span', {'class': 'fa fa-eye'}):
        adid = link.get('ng-init')
        # adid_f = filter(adid, int)
        print(adid)

get_adid(href)


but it doesn't print just the number, it prints the whole thing, which is: 'ad.Id=18135642'.

do you have any ideas on how to filter out the string part of it? i tried the filter function but i don't think i'm using it properly.
#10
I am not sure there is a BeautifulSoup method for this, to be honest. You can split on the obvious delimiter there, =, but that assumes the value is always in that format; otherwise it may break the code. (As an aside, filter takes a function first and an iterable second, so filter(adid, int) has the arguments reversed.)
num = adid.split('=')[1]
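An alternative that doesn't rely on the = always being there is a regex that pulls out just the digits:

import re

adid = 'ad.Id=18135642'
num = re.search(r'\d+', adid).group()  # '18135642'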