Jan-07-2017, 12:44 AM
ok, so i managed to make the crawler pick up the links and the titles; now all i'm missing is the view count.
for the view count there are 2 types of ads:
the ones that have been reposted automatically:
http://www.publi24.ro/anunturi/imobiliar...b6256.html
(notice the red Repostat automat under the date)
and the ones that have not:
http://www.publi24.ro/anunturi/imobiliar...f6755.html
not all of them have been reposted automatically, so we'll have to focus on the ones that have not.
for the second ad the id is: 18135642
you can find that number in the following lines of html:
<section class="s-ad-details" ng-controller="DetailView" ng-init="articleId=18135642; userId='780967757f696a'; uniqueAdId='d9aa70f0-2781-475b-8d66-af542ee5e8a1';logged=false; articlePrice = '32.700 EUR'">
<li><a rel="nofollow" href="/statistica-anunt-18135642.html"><span class="fa fa-eye" ng-init="ad.Id=18135642"></span>Vizualizari: <span add-view="18135642"></span></a></li>
the best way i could figure out to get at it was:
import requests
from bs4 import BeautifulSoup

href = 'http://www.publi24.ro/anunturi/imobiliare/de-vanzare/apartamente/apartamente-2-camere/anunt/Giurgiului-Drumul-Gazarului-ideal-investitie/7b0065747d6f6755.html'

def get_adid(item_url):
    # download the ad page and parse it
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    # the "eye" icon span carries the ad id in its ng-init attribute
    for link in soup.find_all('span', {'class': 'fa fa-eye'}):
        adid = link.get('ng-init')
        # adid_f = filter(adid, int)
        print(adid)

get_adid(href)
but it doesn't print just the number, it prints the whole attribute value, which is: 'ad.Id=18135642'.
do you have any ideas how to strip out the non-numeric part of it? i tried the filter function but i don't think i'm using it properly.
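for reference, here is a minimal sketch of two ways the number could be pulled out of a string like 'ad.Id=18135642'. the helper names extract_ad_id and digits_only are made up for illustration and are not part of the crawler above. the first uses a regular expression to grab the first run of digits; the second shows filter with its arguments in the order Python expects (function first, then the iterable), using str.isdigit as the test for each character.

```python
import re


def extract_ad_id(ng_init_value):
    """Return the first run of digits in the string as an int, or None if there is none."""
    match = re.search(r'\d+', ng_init_value)
    return int(match.group()) if match else None


def digits_only(text):
    """Keep only the digit characters of text and return them as one string."""
    # filter(function, iterable): the function comes first, and it must
    # return True/False for each character, which str.isdigit does.
    return ''.join(filter(str.isdigit, text))


if __name__ == '__main__':
    value = 'ad.Id=18135642'  # the attribute value printed by get_adid
    print(extract_ad_id(value))
    print(digits_only(value))
```

note that digits_only glues together every digit in the string, so it only makes sense on a value that contains a single number; on the longer ng-init of the section tag the regex version, which stops at the first number, is probably the safer choice.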