web crawler that retrieves data not stored in source code

edithegodfather · Jan-07-2017, 12:44 AM

Ok, so I managed to make the crawler pick up the links and the titles; now all I'm missing is the view count.

Now, for the view count there are 2 types of ads.
The ones that have been reposted automatically:

http://www.publi24.ro/anunturi/imobiliar...b6256.html

(notice the red "Repostat automat" label under the date)

and the ones that have not:
http://www.publi24.ro/anunturi/imobiliar...f6755.html

Not all of them have been reposted automatically, so we'll have to focus on the ones that have not.
For the second ad the id is 18135642.
You can find the number in the following lines of html:

<section class="s-ad-details" ng-controller="DetailView" ng-init="articleId=18135642; userId='780967757f696a'; uniqueAdId='d9aa70f0-2781-475b-8d66-af542ee5e8a1';logged=false; articlePrice = '32.700 EUR'">
<li><a rel="nofollow" href="/statistica-anunt-18135642.html"><span class="fa fa-eye" ng-init="ad.Id=18135642"></span>Vizualizari: <span add-view="18135642"></span></a></li>
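
Side note: the second span in that line carries the bare number in its add-view attribute, so reading that attribute gives the id with nothing to strip. A minimal sketch, assuming the attribute always holds only digits. It uses the standard library's html.parser on a copy of the line above so it runs offline, and the class name AddViewFinder is made up for illustration:

```python
from html.parser import HTMLParser

# The <li> line quoted above, copied here so the sketch runs without a request.
SNIPPET = (
    '<li><a rel="nofollow" href="/statistica-anunt-18135642.html">'
    '<span class="fa fa-eye" ng-init="ad.Id=18135642"></span>'
    'Vizualizari: <span add-view="18135642"></span></a></li>'
)


class AddViewFinder(HTMLParser):
    """Collects the value of every add-view attribute found on a span."""

    def __init__(self):
        super().__init__()
        self.ad_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != 'span':
            return
        for name, value in attrs:
            # assumption: add-view holds only the numeric ad id
            if name == 'add-view' and value:
                self.ad_ids.append(int(value))


finder = AddViewFinder()
finder.feed(SNIPPET)
print(finder.ad_ids)
```

Note that this span is empty in the page source, so this only recovers the ad id, not the view count itself.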

The best way I figured to get the id was to read the ng-init attribute of the fa-eye span:

import requests
from bs4 import BeautifulSoup

href = 'http://www.publi24.ro/anunturi/imobiliare/de-vanzare/apartamente/apartamente-2-camere/anunt/Giurgiului-Drumul-Gazarului-ideal-investitie/7b0065747d6f6755.html'

def get_adid(item_url):
    # download the ad page and parse the html
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    # the eye-icon span carries the ad id in its ng-init attribute
    for link in soup.find_all('span', {'class': 'fa fa-eye'}):
        adid = link.get('ng-init')
        # adid_f = filter(adid, int)
        print(adid)

get_adid(href)

But it doesn't print just the number; it prints the whole attribute value, which is 'ad.Id=18135642'.

Do you have any ideas how to filter out the string part of it? I tried the filter function but I don't think I'm using it properly.
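
For reference, here is a rough sketch of two ways the number could be pulled out, assuming the value always looks like 'ad.Id=' followed by digits. The helper names extract_ad_id and digits_only are made up for illustration. The first uses a regular expression; the second uses filter, which expects the test function first and the iterable second (the reverse of my commented-out attempt):

```python
import re


def extract_ad_id(ng_init):
    """Return the digits after 'ad.Id=' as an int, or None if they are absent."""
    if not ng_init:
        return None
    match = re.search(r'ad\.Id=(\d+)', ng_init)
    return int(match.group(1)) if match else None


def digits_only(text):
    """Keep only the digit characters of text, joined back into one string."""
    # filter(function, iterable): str.isdigit is applied to each character
    return ''.join(filter(str.isdigit, text))


print(extract_ad_id('ad.Id=18135642'))
print(digits_only('ad.Id=18135642'))
```

The regex version is the safer of the two: digits_only would glue together every digit in the string, so it only works while the id is the only number in the attribute.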
