Python Forum
Web Crawler help
#31
As of today I have an issue: BeautifulSoup is no longer showing all of the HTML of the requested page.

If I print
print(soup)
I do not get all the code that I see when I use "Inspect source" on the web page. Until today this worked fine and my code ran without problems. If I run the code now, this is the result:

import requests
from bs4 import BeautifulSoup
import re
 
def fundaSpider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        ads = soup.find_all('li', {'class': 'search-result'})
 
        print(ads)
 
        page += 1
 
fundaSpider(1)
Output:
[]
Process finished with exit code 0
In my web browser I have no problem accessing the web page. Is it possible that the website is blocking the crawler, but not me as a person? Is there any way I can keep running the crawler? (just for the record, I use this crawler only for personal use and run it a few times per week).
#32
If you write the HTML to a file and open that file in the browser, you will see what your crawler is actually getting:

import requests
from bs4 import BeautifulSoup
import re
  
def fundaSpider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        with open('test.html', 'w', encoding='utf-8') as f:
            f.write(plain_text)
        soup = BeautifulSoup(plain_text, 'html.parser')
        ads = soup.find_all('li', {'class': 'search-result'})
  
        print(ads)
  
        page += 1
  
fundaSpider(1)
In my case I am getting a captcha verification. I'm not sure what is triggering the captcha, but there isn't really an automated way around it... its whole purpose is to verify a human.
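If you want the script itself to tell you when it got a block page instead of listings, a rough heuristic over the saved file can help. This is only a sketch: the marker strings and the length cutoff are assumptions, so adjust them to what your saved test.html actually contains.

```python
def looks_blocked(html):
    """Rough heuristic: treat the page as a block/captcha page if it is
    suspiciously short or mentions a captcha. The marker strings and the
    2000-character cutoff are guesses; tune them to the real block page."""
    markers = ('captcha', 'robot')
    lowered = html.lower()
    return len(html) < 2000 or any(m in lowered for m in markers)

print(looks_blocked('<html>please solve this captcha</html>'))  # True
print(looks_blocked('<html>' + 'x' * 5000 + '</html>'))         # False
```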
#33
Tnx!

In Dutch it says "We suspect you are a robot; you are visiting our website from a network that visits us a lot."

When I visit the URL directly in my browser I do not get any message, which makes me suspect it checks for something else as well. (If I complete the captcha manually and press verify, I get a 404 error page.)

Do you think it could make a difference if the User-Agent header is changed? I tried the following but got the same outcome.

import requests
from bs4 import BeautifulSoup
 
 
def fundaSpider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
        headers = {'User-Agent': 'Mozilla/5.0 '}
 
        source_code = requests.get(url, headers=headers)
        plain_text = source_code.text
        with open('test.html', 'w', encoding='utf-8') as f:
            f.write(plain_text)
        soup = BeautifulSoup(plain_text, 'html.parser')
        ads = soup.find_all('li', {'class': 'search-result'})
 
        print(ads)
 
        page += 1
 
 
fundaSpider(1)
I am trying to read up on the subject; do you think this could potentially be solved by using a proxy (service)?
#34
I tried giving it headers and using a Tor proxy, but I kept getting the captcha too. It could be that Tor exit nodes are flagged as well.
http://docs.python-requests.org/en/maste...d/#proxies
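For reference, passing a proxy to requests is just a dict mapping scheme to proxy URL. The addresses below are placeholders, not working proxies, and as noted above, widely shared proxies and Tor exit nodes are often flagged themselves.

```python
# Placeholder proxy addresses; substitute a proxy you actually control.
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

# requests then routes matching traffic through the proxy:
# requests.get('http://www.funda.nl/koop/rotterdam/p1', proxies=proxies)
```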

Quote:(just for the record, I use this crawler only for personal use and run it a few times per week).
I had a site that I kept having trouble with, and I just ended up using Selenium to bring up the browser so I could manually enter the captcha, then let my bot automate everything else. If you truly hit a roadblock with JavaScript or captchas, this will always work as a backup.

Your web crawler via BeautifulSoup would be the same; it's just grabbing the HTML with Selenium instead of requests.

Quote:When I visit the URL directly in my browser I am not getting any message, which makes me suspect he looks for something else as well. (if I complete the captcha manually and press verify I get a 404 error page. 
By this I think you mean you are only getting the captcha via the bot, in the saved HTML file? You would be getting a 404 because it's a locally saved file, not the site itself. With Selenium you can do the same on their site.

The best way to not get your bot flagged, is to know how to flag them.
http://www.blogtips.org/web-crawlers-lov...-the-ugly/
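One simple example of not looking like a bot: space your requests out with a randomized delay instead of fetching every page back-to-back. This is just a sketch; the timings are arbitrary, not anything funda specifically requires.

```python
import random
import time

def polite_sleep(base=2.0, jitter=3.0):
    """Sleep for `base` plus up to `jitter` extra seconds between requests.
    Random jitter makes the timing look less mechanical than a fixed wait."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

Calling polite_sleep() at the bottom of the while loop, just before page += 1, is usually enough.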
#35
(Feb-21-2017, 02:18 AM)metulburr Wrote: I had a site that I kept having trouble with, and I just ended up using Selenium to bring up the browser so I could manually enter the captcha, then let my bot automate everything else. If you truly hit a roadblock with JavaScript or captchas, this will always work as a backup.

Your web crawler via BeautifulSoup would be the same; it's just grabbing the HTML with Selenium instead of requests.

Since I don't need to run the script often, it would be a perfect solution for me to just enter the captcha manually. I have tried to open the browser through Selenium using the following code, but nothing happens.

import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
open('output.csv', 'w').close()
import re
browser = webdriver.Firefox()
 
 
def fundaSpider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
        browser.get(url)
        plain_text = browser.page_source
        soup = BeautifulSoup(plain_text, 'html.parser')
        ads = soup.find_all('li', {'class': 'search-result'})
        for ad in ads:
            print(ad)
        page += 1

fundaSpider(1)
Output:
Process finished with exit code 0
I have Firefox Developer Edition installed.

If you can help me in the right direction on how to use selenium that would be very helpful.
#36
It would be something along the lines of....
#import requests
from selenium import webdriver
#from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
#open('output.csv', 'w').close()
import re
import time

#browser = webdriver.Firefox()
browser = webdriver.Chrome('/home/metulburr/chromedriver')
browser.set_window_position(0,0)
  
def fundaSpider(max_pages):
    page = 1
    while page <= max_pages:

        url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
        browser.get(url)
        time.sleep(1)#normal delay to allow browser to load content
        input('Press Enter after bypassing Captcha')  # use raw_input on Python 2
        soup = BeautifulSoup(browser.page_source, 'html.parser')
        ads = soup.find_all('li', {'class': 'search-result'})
        for ad in ads:
            title = ad.find('h3')
            print(title)
        page += 1
        
fundaSpider(2)
A couple of things:
- I used Chrome as I don't have the Firefox driver on my system.
- I repositioned the window to the top left, because with dual monitors it otherwise opens on my TV when I run it.
- You can use PhantomJS to keep it in the background instead of popping up a browser.
- I kept trying to trigger the captcha, and this time I didn't get one, so I don't know exactly what occurs after the captcha is entered, or how often it occurs. It's just an input() that halts the program until the captcha is entered; currently it is placed on every page, on the assumption that the captcha gets triggered on every page. If you only get the captcha the first time, you can move the input out of the while loop; however, you will then need to do a captcha-trigger get(), such as:
browser.get(url) #trigger captcha by going to the first page
input() # halts program until after captcha is entered
...
while page <= max_pages:
    ...
    browser.get(url) # now go to true page as captcha will not be triggered
    ...
#37
Do you have a final version of the code you can share with me?
Thanks!
#38
(Jan-28-2019, 12:39 PM)Stoss Wrote: Do you have a final version of the code you can share with me?
Thanks!

Sure, but it is not working anymore after I used it for about 2 weeks. I only needed it temporarily and didn't check for ways to get it working again.

import requests
from bs4 import BeautifulSoup
open('output.csv', 'w').close()
import re

def fundaSpider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.funda.nl/koop/rotterdam/p{}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        ads = soup.find_all('li', {'class': 'search-result'})
        for ad in ads:
            title = ad.find('h3')
            title = ' '.join(title.get_text(separator='\n', strip=True).split()[
                             :-3])  # sep by newline, strip whitespace, then split to get the last 3 elements to cut out, then rejoin
            street = title.rpartition(' ')[0]
            street = re.sub(r'\d+$', '', street)
            address = ad.find('small').text.strip()
            price = ad.find('div', {'class': 'search-result-info search-result-info-price'})
            price = price.find('span').text.strip()
            price = re.findall(r'\d', price)
            price = ''.join(price)
            size_results = ad.find('ul', {'class': 'search-result-kenmerken'})
            li = size_results.find_all('li')
            try:
                size = li[0]
            except IndexError:
                size = 'Unknown'
            try:
                size = size.get_text(strip=True)
            except AttributeError:
                size = 'Unknown'
            try:
                size = size.split(" ")[0]
            except IndexError:
                size = 'Unknown'
            try:
                room = li[1].text.strip()
            except IndexError:
                room = 'Unknown'
            try:
                room = room.split(" ")[0]
            except IndexError:
                room = 'Unknown'
            try:
                href = ('http://www.funda.nl' + ad.find_all('a')[2]['href'])
            except IndexError:
                continue  # no detail link for this ad, so skip it; otherwise href would be undefined below

            area = get_single_item_data(href)
            if not area:
                area = str('None')
            since = get_single_item_data_2(href)
            if not since:
                since = 'None'
            status = get_single_item_data_3(href)
            if not status:
                status = 'None'

            print('{},{},{},{},{},{},{},{},{},{}'.format(title,address,street,price,size,room,area,since,status,href))
            saveFile = open('output.csv', 'a')
            saveFile.write(title + "," + address + "," + street + "," + price + "," + size + "," + room + "," + area + "," + since + "," + status + "," + href + '\n')
            saveFile.close()
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    li = soup.find_all('li', {'class': 'breadcrumb-listitem'})
    try:
        return (li[2].a.text)
    except AttributeError:
        pass

def get_single_item_data_2(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    dl = soup.find('dl', {'class': 'object-kenmerken-list'})
    try:
        #return (dl.find_all('dd')[1].text.strip())
        return dl.find('dt', text='Aangeboden sinds').find_next_sibling('dd').text.strip()
    except AttributeError:
        pass
def get_single_item_data_3(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    uls = soup.find_all('ul', {'class': 'labels'})
    for ul in uls:
        try:
            return(ul.find('li').text.strip())
        except AttributeError:
            pass

fundaSpider(2)
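One caveat about the script above: it joins the CSV fields with ',' by hand, so any title or address that itself contains a comma will corrupt the row. The stdlib csv module quotes such fields automatically. A minimal sketch (the sample values are made up):

```python
import csv
import io

row = ['Mooie straat 1, Rotterdam', '3011 AB', '250000']  # made-up sample
buf = io.StringIO()
csv.writer(buf).writerow(row)
print(buf.getvalue())
# "Mooie straat 1, Rotterdam",3011 AB,250000
```

In the crawler you would pass the open output.csv file to csv.writer instead of the StringIO buffer.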
#39
(Jan-28-2019, 12:39 PM)Stoss Wrote: Do you have a final version of the code you can share with me? Thanks!
(Jan-30-2019, 08:35 AM)takaa Wrote: Sure, but it is not working anymore after I used it for about 2 weeks.
I will often have to update my web crawlers. Web sites change their code regularly, sometimes purposely to stop such bots. Most of the time it's just a changed class name or id; sometimes it has to be a little more involved.
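A small pattern that softens this kind of breakage: try a list of selectors in order, so when a site renames a class you only add the new name instead of rewriting the crawler. The selector names below are illustrative, not funda's real markup.

```python
def first_match(soup, selectors):
    """Return the first element matched by any selector in `selectors`.
    `soup` only needs a select_one() method (e.g. a BeautifulSoup object)."""
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            return element
    return None

# Old and new class names here are hypothetical examples:
# ad = first_match(soup, ['li.search-result', 'li.search-result-item'])
```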
#40
Anybody find out why it quit working?

I can't find the problem. As far as I can tell, the terms are the same.

