Python Forum
web crawler that retrieves data not stored in source code
#9
OK, so I managed to make the crawler pick up the links and the titles. Now all I'm missing is the view count.

For the view count, there are 2 types of ads:
the ones that have been reposted automatically:

http://www.publi24.ro/anunturi/imobiliar...b6256.html

(notice the red "Repostat automat" label, meaning "reposted automatically", under the date)

and the ones that have not:
http://www.publi24.ro/anunturi/imobiliar...f6755.html

Not all of them have been reposted automatically, so we'll have to focus on the ones that have not.
For the second ad, the id is 18135642.
You can find the number in the following lines of HTML:

<section class="s-ad-details" ng-controller="DetailView" ng-init="articleId=18135642; userId='780967757f696a'; uniqueAdId='d9aa70f0-2781-475b-8d66-af542ee5e8a1';logged=false; articlePrice = '32.700 EUR'">
<li><a rel="nofollow" href="/statistica-anunt-18135642.html"><span class="fa fa-eye" ng-init="ad.Id=18135642"></span>Vizualizari: <span add-view="18135642"></span></a></li>

The best way to go about this, I figured, was:

import requests
from bs4 import BeautifulSoup

href = 'http://www.publi24.ro/anunturi/imobiliare/de-vanzare/apartamente/apartamente-2-camere/anunt/Giurgiului-Drumul-Gazarului-ideal-investitie/7b0065747d6f6755.html'

def get_adid(item_url):
    # download the ad page and parse its html
    response = requests.get(item_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # the eye icon span carries the ad id in its ng-init attribute
    for span in soup.find_all('span', {'class': 'fa fa-eye'}):
        adid = span.get('ng-init')
        # adid_f = filter(adid, int)
        print(adid)

get_adid(href)


But it doesn't print just the number; it prints the whole attribute value, which is 'ad.Id=18135642'.

Do you have any ideas how to filter out the non-numeric part of the string? I tried the filter function, but I don't think I'm using it properly.
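
For reference, here is a minimal sketch of two ways the number could be pulled out of that string. It assumes the attribute always looks like 'ad.Id=<digits>', which is only what the one ad above shows. The helper names extract_ad_id and digits_only are made up for illustration. Note that the built-in filter takes the function first and the iterable second, so filter(adid, int) has its arguments the wrong way round.

```python
import re

def extract_ad_id(ng_init_value):
    """Return the ad id as an int from a string like 'ad.Id=18135642'.

    Returns None when the value is missing or does not match the
    assumed 'ad.Id=<digits>' pattern.
    """
    if not ng_init_value:
        return None
    match = re.search(r'ad\.Id=(\d+)', ng_init_value)
    return int(match.group(1)) if match else None

def digits_only(text):
    """Keep only the digit characters of text, using filter(function, iterable)."""
    return ''.join(filter(str.isdigit, text))

adid = 'ad.Id=18135642'  # the value printed by get_adid above
print(extract_ad_id(adid))
print(digits_only(adid))
```

The regex version is stricter, since it only accepts digits that directly follow 'ad.Id='. The filter version keeps every digit in the string, so it would give a wrong result on a value that contains more than one number.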
Messages In This Thread
RE: web crawler that retrieves data not stored in source code - by edithegodfather - Jan-07-2017, 12:44 AM
