web crawler that retrieves data not stored in source code

edithegodfather · (This post was last modified: Jan-10-2017, 06:50 AM by snippsat.)

Hey guys, I was wondering if you could help me out with something on this crawler.

I'm almost done with it, but I'm trying to figure out a way to filter out some of the results it prints when it gathers all the links on a page.

The standard ad listing page looks something like this:

http://www.publi24.ro/anunturi/imobiliar...sti/?pag=2

The top 3 ads are always promoted ads, which are special ads that people pay for to keep them at the top of the page all the time.
Then you have the rest of the ads, which are not paid for.
The thing is that both the paid and the free ads use an 'a' tag with the attribute itemprop="name", along with an href attribute, which is the one that holds the link.

The paid ads have an href that looks like this:
http://www.publi24.ro/anunturi/imobiliar...e-baba-nov...
and the free ones look like this:
/anunturi/imobiliare/de-inchiriat/apartamente/apartamente-3-camere/anunt/Apartament-3-camere-Bucuresti-250ronzi-Regi...

The paid ones have the http... string in front of them, which makes my function unhappy :))

import requests
from bs4 import BeautifulSoup

def crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.publi24.ro/anunturi/imobiliare/bucuresti/?pag=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        # every ad link, paid or free, is an <a> tag with itemprop="name"
        for link in soup.find_all('a', {'itemprop': 'name'}):
            # the domain is prepended to every href, even one that already has it
            href = 'http://www.publi24.ro' + link.get('href')
            # ad_title(href)
            # views(href)
            print(href)
        page += 1

The problem is that my output looks something like this:

Output:
http://www.publi24.rohttp://www.publi24.ro/anunturi/imobiliare/de-inchiriat/case/vila/anunt/vila-pentru-petrece...
http://www.publi24.rohttp://www.publi24.ro/anunturi/imobiliare/de-inchiriat/apartamente/apartamente-2-camere/anunt/victoriei-calea-apartame...
http://www.publi24.rohttp://www.publi24.ro/anunturi/imobiliare/de-inchiriat/spatii-comerciale/spatiu-comercial/anunt/inchiriez-spatiu-comerc...
http://www.publi24.ro/anunturi/imobiliare/de-vanzare/apartamente/apartamente-2-camere/anunt/apartament-2-Camere-Drumul-Taberei...
http://www.publi24.ro/anunturi/imobiliare/de-inchiriat/spatii-comerciale/birou/anunt/Inchiriere-birou-219mp-in-cladire-birouri-Cal...
[...]

So the first 3 ads get messed up because the function always prepends the domain, while only the free ads (which are most of them) have an href that doesn't include the domain.
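
For what it's worth, the doubled prefix itself might be avoidable without dropping anything. A minimal sketch, assuming the standard library's urllib.parse.urljoin is acceptable here: it leaves an already absolute href alone and resolves a relative one against the base. The helper name absolute_url and the shortened sample hrefs are made up for illustration and are not real ad links.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.publi24.ro'

def absolute_url(href, base=BASE_URL):
    # hypothetical helper: urljoin returns an absolute href unchanged
    # and resolves a relative href against base
    return urljoin(base, href)

# made-up hrefs, shortened for illustration
print(absolute_url('http://www.publi24.ro/anunturi/imobiliare/paid-ad'))
print(absolute_url('/anunturi/imobiliare/free-ad'))
```

Inside the crawler, this would replace the string concatenation, so both kinds of ads would print with a single domain prefix.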

My question is: is there any way I can always filter out the first 3 results? I checked, and they show up on every page.
I was thinking I could store the href values in a list and start printing from element 3, but I'm not sure how to do that.
Can you help me with that?
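
The list idea could look roughly like the sketch below, assuming the promoted ads really are always the first 3 links on every page. The function name skip_promoted and the placeholder hrefs are hypothetical; in the crawler, the list would be built from link.get('href') for each matching 'a' tag on the page.

```python
def skip_promoted(hrefs, promoted_count=3):
    # hypothetical helper: a slice drops the first promoted_count items;
    # slicing past the end of a short list just gives an empty list
    return hrefs[promoted_count:]

# placeholder values standing in for the hrefs collected from one page
hrefs = ['paid-1', 'paid-2', 'paid-3', 'free-1', 'free-2']

for href in skip_promoted(hrefs):
    print(href)
```

The slice has to be applied per page, since each page starts with its own 3 promoted ads.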
