Python Forum
web crawler that retrieves data not stored in source code
#12
Hey guys, I was wondering if you could help me out with something on this crawler.

I'm almost done with it, but I'm trying to figure out how to filter out some of the results it prints when it gathers all the links on a page.

A standard ad listing page looks something like this:

http://www.publi24.ro/anunturi/imobiliar...sti/?pag=2

The top 3 ads are always promoted ads: special ads that people pay for so they stay at the top of the page all the time.
Then you have the rest of the ads, which are not paid for.
The thing is that both the paid and the free ads use an 'a' tag with an 'itemprop' attribute set to 'name', and the link itself sits in that tag's 'href' attribute.

The paid ads have an href that looks like this:
http://www.publi24.ro/anunturi/imobiliar...e-baba-nov...
and the free ones look like this:
/anunturi/imobiliare/de-inchiriat/apartamente/apartamente-3-camere/anunt/Apartament-3-camere-Bucuresti-250ronzi-Regi...

The paid ones already start with the http... part (the domain), which makes my function unhappy :))
import requests
from bs4 import BeautifulSoup


def crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.publi24.ro/anunturi/imobiliare/bucuresti/?pag=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        # every ad title is an <a> tag with itemprop="name"
        for link in soup.find_all('a', {'itemprop': 'name'}):
            # the domain is prepended to every href, even the ones that already contain it
            href = 'http://www.publi24.ro' + link.get('href')
            # ad_title(href)
            # views(href)
            print(href)
        page += 1
problem is that my output looks something like this:
Output:
http://www.publi24.rohttp://www.publi24.ro/anunturi/imobiliare/de-inchiriat/case/vila/anunt/vila-pentru-petrece...
http://www.publi24.rohttp://www.publi24.ro/anunturi/imobiliare/de-inchiriat/apartamente/apartamente-2-camere/anunt/victoriei-calea-apartame...
http://www.publi24.rohttp://www.publi24.ro/anunturi/imobiliare/de-inchiriat/spatii-comerciale/spatiu-comercial/anunt/inchiriez-spatiu-comerc...
http://www.publi24.ro/anunturi/imobiliare/de-vanzare/apartamente/apartamente-2-camere/anunt/apartament-2-Camere-Drumul-Taberei...
http://www.publi24.ro/anunturi/imobiliare/de-inchiriat/spatii-comerciale/birou/anunt/Inchiriere-birou-219mp-in-cladire-birouri-Cal...
[...]
So the first 3 ads get messed up because the function always prepends the domain, but most of the ads have an href that doesn't include the domain, so I still need to add it for those.
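One way to make both kinds of href come out right might be to let urljoin decide whether the domain needs adding. This is only a minimal sketch: absolute_url is a helper name I made up, and the two sample hrefs are invented for illustration (they are not real ads from the site).

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.publi24.ro'


def absolute_url(href, base=BASE_URL):
    """Return a full URL for either a relative or an already absolute href.

    urljoin leaves an absolute href untouched and only adds the base
    to a relative one, so the domain never gets doubled.
    """
    return urljoin(base, href)


# made-up hrefs: one in the paid style (absolute), one in the free style (relative)
sample_hrefs = [
    'http://www.publi24.ro/anunturi/imobiliare/example-paid-ad',
    '/anunturi/imobiliare/example-free-ad',
]

for href in sample_hrefs:
    print(absolute_url(href))
```

With something like this, the paid ads would stay in the results instead of being dropped, which may or may not be what is wanted.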

My question is: is there any way I can always filter out the first 3 results? I checked and they show up on every page.
I was thinking maybe I could store the href values in a list and start printing from element 3, but I'm not sure how to do that.
Can you help me with that?
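A rough sketch of the list idea, in case it helps to show what I mean. It assumes the promoted ads are always the first 3 matches on a page, which is what I saw but can't guarantee. skip_promoted and PROMOTED_COUNT are names I made up, and the sample hrefs are invented, not real ads.

```python
PROMOTED_COUNT = 3  # assumption: the promoted ads are always the first 3 links


def skip_promoted(hrefs, promoted_count=PROMOTED_COUNT):
    """Return the hrefs of one page without the leading promoted ads."""
    # slicing from index promoted_count drops elements 0, 1 and 2
    return hrefs[promoted_count:]


# made-up hrefs standing in for the values collected from one page
page_hrefs = [
    'http://www.publi24.ro/anunturi/example-paid-1',
    'http://www.publi24.ro/anunturi/example-paid-2',
    'http://www.publi24.ro/anunturi/example-paid-3',
    '/anunturi/example-free-1',
    '/anunturi/example-free-2',
]

for href in skip_promoted(page_hrefs):
    # only the free ads are left, so adding the domain is safe here
    print('http://www.publi24.ro' + href)
```

In the crawler itself, the list would be built by appending link.get('href') for each link found on the page before slicing it.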


Messages In This Thread
RE: web crawler that retrieves data not stored in source code - by edithegodfather - Jan-10-2017, 05:58 AM
