web crawler that retrieves data not stored in source code
#1
Hey guys,

So I'm working on a small program that can help me gather ads from a classified-ads website in my country (I'm going to post some screenshots here; I'm not sure if I can post the website link, but if that's OK I'll post that as well).

image link: https://i.redd.it/u0g9bjzb7l7y.png

The site is a pretty basic Craigslist-type site with all the familiar categories you would expect on an ads page.
The reason I'm working on this: the site lets you post ads, and it lets you sort them by date, but that sort uses the date an ad was last updated rather than the date it was created. So if an ad is 10 years old, you post a new one today, and the old one gets updated tomorrow, your ad would come second to the one that just updated even though it's newer. The site doesn't offer sorting by number of views, and for this particular scenario the view count is a perfect proxy for an ad's age.

I've been following thenewboston's Python 3.5 YouTube tutorials, and I've managed to make the crawler grab the links of all the ads running in a certain category (using beautifulsoup4 and requests), and it works like a beauty. But when I try the exact same thing on each individual ad page, I get nothing. Initially I figured I was doing something wrong, and it turned out I was: requests and bs4 only see the data that's in the page source. If you inspect an ad page in the browser, you'll see the number of views inside a span tag with an attribute like add-view="[ad id]", but Python still doesn't show the view count for any ad; it just returns nothing. So I went into the raw page source itself, and sure enough the number of views wasn't there either. It seems the view count is not stored in the page source but comes from somewhere else, and that's what I'm trying to figure out.
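To illustrate the link-grabbing step against a static snippet (the markup and class names below are made up; the real site's will differ):

```python
from bs4 import BeautifulSoup

# hypothetical listing markup; the real site's tags and classes will differ
html = """
<div class="listing">
  <a href="/anunturi/locuri-de-munca/anunt/example-ad-1.html">Example ad 1</a>
  <a href="/anunturi/locuri-de-munca/anunt/example-ad-2.html">Example ad 2</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
links = [a.get('href') for a in soup.find_all('a')]
print(links)
```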

Any ideas how to access the view count? Let me know if you need links to the page or my source code.

Thanks
#2
At the very least we would need the website and your source code. Does this site have an RSS feed?

We really need to see it for ourselves; that tells us more than a description would.

Quote:so it seems that the viewcount is not stored on the page source but it's somewhere else and that's what i'm trying to figure out.
There is a chance that it is JavaScript, which would make it even harder to obtain.
*Describe the environment in which it occurs.
*Describe the symptoms of your problem clearly.
*Describe the research you did to try and understand the problem.
*Describe the goal, not the step.
*Use meaningful, specific subject headers.
*Write in clear, grammatical, correctly-spelled language.
*Describe the problem's symptoms, not your guesses.
*Describe your problem's symptoms in chronological order.
*Describe the diagnostic steps you took to try and pin down the problem yourself.
*Describe any possibly relevant recent changes in your computer or software configuration.
*Provide a way to reproduce the problem in a controlled environment.
#3
Sure, the website is publi24.ro.
It doesn't really have an English version, but what you're looking for is:

publi24.ro/anunturi/locuri-de-munca/bucuresti/
or, with a page increment:
publi24.ro/anunturi/locuri-de-munca/bucuresti/?pag=2
That's the main page with ads from a section (I picked the job-seeking section).
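As a sketch, the page-increment pattern above makes it easy to build the list of listing URLs to crawl (this assumes the ?pag=N pattern holds for every page after the first):

```python
# build listing URLs for the first few pages, using the ?pag=N
# pattern shown above (an assumption for pages beyond those tested)
base = 'http://www.publi24.ro/anunturi/locuri-de-munca/bucuresti/'
pages = [base] + ['{}?pag={}'.format(base, n) for n in range(2, 5)]
for p in pages:
    print(p)
```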

and an ad page would look like this:
publi24.ro/anunturi/locuri-de-munca/anunt/Echipa-Tehnician-Alpinist-Telecom/7b00667478616b51.html

If you inspect element here and hover over the view-count number (in this case "1"), you'll see a tag that says: <span add-view="18230886">1</span> (it's not 1 anymore because I refreshed a bunch of times, but you get the idea) :)
But in the page source it doesn't show anything.
I also ran my program and it returns None as well.

I'm thinking that if I could make Python load the page the way a user's browser does, instead of just grabbing the raw source, maybe it would pull up the view number and print it, but I'm not sure. :-/
#4
If you inspect the element, you are seeing the source through the browser's eyes... and if it's not there with Python, that means it's generated by JavaScript. You would have to get the rendered source with Selenium first before handing it off to BeautifulSoup.


for example
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import os

url = 'http://www.publi24.ro/anunturi/locuri-de-munca/anunt/Echipa-Tehnician-Alpinist-Telecom/7b00667478616b51.html'

def setup():
    '''
    setup webdriver and create browser
    '''
    #https://chromedriver.storage.googleapis.com/index.html
    #https://chromedriver.storage.googleapis.com/index.html?path=2.25/ ##latest
    chromedriver = "/home/metulburr/chromedriver" #the path to the chromedriver
    os.environ["webdriver.chrome.driver"] = chromedriver
    browser = webdriver.Chrome(chromedriver)
    return browser
    
browser = setup()
browser.get(url) 
time.sleep(2)

soup = BeautifulSoup(browser.page_source, 'lxml')
tag = soup.find('span', {'add-view':'18230886'})
print(tag.text)
browser.quit()

Output:
$ python test.py
16
Although this will pop a browser up for a couple of seconds. If you want, you can use a headless browser to keep it in the background.
#5
Hey, thanks a lot, that actually worked :D
I did make some changes, though, to adapt the code to my system:
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import os

url = 'http://www.publi24.ro/anunturi/locuri-de-munca/anunt/Echipa-Tehnician-Alpinist-Telecom/7b00667478616b51.html'

def setup():
    '''
    setup webdriver and create browser
    '''
    # https://chromedriver.storage.googleapis.com/index.html
    # https://chromedriver.storage.googleapis.com/index.html?path=2.25/ ##latest
    chromedriver = r"D:\chromedriver_win32\chromedriver.exe"  # raw string, so backslashes aren't treated as escapes
    os.environ["webdriver.chrome.driver"] = chromedriver
    browser = webdriver.Chrome(chromedriver)
    return browser

browser = setup()
browser.get(url)
time.sleep(0)

soup = BeautifulSoup(browser.page_source, 'html.parser')
tag = soup.find('span', {'add-view': '18230886'})
print(tag.text)
browser.quit()
I changed the location from "/home/metulburr/chromedriver" to where I had chromedriver.exe,
changed the time.sleep from 2 to 0 to see if I could make it run faster (and it still worked),
and I also changed the parser from lxml to html.parser because I was getting some errors and that got rid of them.
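As a sanity check, the same find call can be tried against a static snippet without launching a browser at all (the ID and count here are just stand-ins):

```python
from bs4 import BeautifulSoup

# stand-in for the rendered markup; the real ID and count will differ
html = '<span add-view="18230886">16</span>'
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('span', {'add-view': '18230886'})
print(tag.text)
```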

Now I'm going to figure out how to extract all the ad IDs, run them through the soup.find lookup, and print each ad link with its number of views.

Say, is there any way of doing this without having to run a browser window?
#6
Quote:say, is there any way of doing this without having to run a browser window?
PhantomJS
chromedriver = r"D:\chromedriver_win32\chromedriver.exe"  # the path to the chromedriver
os.environ["webdriver.chrome.driver"] = chromedriver
browser = webdriver.Chrome(chromedriver)
# Change to
browser = webdriver.PhantomJS('path/to/phantomjs.exe')  # pass the path as a string
#7
Quote:changed the time.sleep from 2 to 0 to see if i could make it run faster and it worked
Be aware that you might get different results based on loading times. For me, if I don't give it some delay, the page loads without the value in both Chrome and PhantomJS.
#8
No problem, I'll keep an eye out for it in case I miss any values.
I'm going to take that code now and put it together with the rest, so it picks up the view count from every link and prints it alongside each ad's link.
I'll keep you guys updated, and thanks a lot for the help. :)
#9
OK, so I managed to make the crawler pick up the links and the titles; now all I'm missing is the view count.

Now, for the view count, there are two types of ads.
The ones that have been reposted automatically:

http://www.publi24.ro/anunturi/imobiliar...b6256.html

(notice the red "Repostat automat", i.e. "reposted automatically", under the date)

and the ones that have not:
http://www.publi24.ro/anunturi/imobiliar...f6755.html

Since not all of them have been reposted automatically, we'll focus on the ones that have not.
For the second ad, the ID is 18135642.
You can find the number in the following lines of HTML:

<section class="s-ad-details" ng-controller="DetailView" ng-init="articleId=18135642; userId='780967757f696a'; uniqueAdId='d9aa70f0-2781-475b-8d66-af542ee5e8a1';logged=false; articlePrice = '32.700 EUR'">
<li><a rel="nofollow" href="/statistica-anunt-18135642.html"><span class="fa fa-eye" ng-init="ad.Id=18135642"></span>Vizualizari: <span add-view="18135642"></span></a></li>

The best way to go about this, I figured, was:

import requests
from bs4 import BeautifulSoup

href = 'http://www.publi24.ro/anunturi/imobiliare/de-vanzare/apartamente/apartamente-2-camere/anunt/Giurgiului-Drumul-Gazarului-ideal-investitie/7b0065747d6f6755.html'

def get_adid(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for link in soup.findAll('span', {'class': 'fa fa-eye'}):
        adid = link.get('ng-init')
        # adid_f = filter(adid, int)
        print(adid)

get_adid(href)


But it doesn't print just the number; it prints the whole attribute value: 'ad.Id=18135642'.

Do you have any ideas how to filter out the string part of it? I tried the filter function, but I don't think I'm using it properly.
#10
I am not sure there is a BeautifulSoup method for this, to be honest. You can split on the obvious delimiter, =, but that assumes the value is always in that format; otherwise it may break the code.
num = adid.split('=')[1]
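If the format might vary, a slightly more defensive option is to pull out just the digits with a regular expression (a sketch; it assumes the ID is the first run of digits in the value):

```python
import re

adid = 'ad.Id=18135642'

# split works while the value is always 'something=number'
num = adid.split('=')[1]

# a regex grabs the digits regardless of what surrounds them
match = re.search(r'\d+', adid)
if match:
    num = match.group()
print(num)
```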
