Python Forum
2 approaches to start web scraping - question - Printable Version


2 approaches to start web scraping - question - zarize - Nov-27-2019

Hi,

I found that two approaches to web scraping return totally different output.

What I mean is that when I use the code below:
import requests
from bs4 import BeautifulSoup

# browser-like headers that mimic a real browser
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, sdch, br',
    'accept-language': 'en-GB,en;q=0.8,en-US;q=0.6,ml;q=0.4',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
}

response = requests.get(url, headers=headers)  # url is defined elsewhere in the script
soup = BeautifulSoup(response.content, "html.parser")
print(soup)
I get back the full source code of the website, BUT:

if I use a pattern like:

import requests
from bs4 import BeautifulSoup

r = requests.get(page)  # no headers this time; page is defined elsewhere
content = r.text
soup = BeautifulSoup(content, 'html.parser')
print(soup)

it redirects from the provided URL to a CAPTCHA solver site and then returns the code of the CAPTCHA website instead.


Someone wrote the first approach some time ago and I am wondering how to follow exactly that path. Why does my approach return a CAPTCHA while the first approach avoids it?


RE: 2 approaches to start web scraping - question - buran - Nov-27-2019

The difference is that the first snippet provides headers that mimic the headers of a 'real' browser. In the second snippet you don't provide headers, so it's clearer to the site that the request comes not from a 'real' browser but from a script/robot.
If a site doesn't want to be scraped, it may use different tools and techniques to detect bots scraping its content and blacklist/block them. At the same time, scrapers use different tools to avoid detection, e.g. in this case providing headers. You may also use proxy rotation, random intervals between requests to mimic human behaviour, etc. It's a long topic altogether; see the sketch below.
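For illustration only, here is a minimal sketch of two such tricks: rotating the user-agent header between requests and sleeping a random interval to mimic human behaviour. The USER_AGENTS list and the polite_get helper are made-up names for this example, not anything from the snippets above:

import random
import time

import requests

# hypothetical pool of browser user-agent strings to rotate through
USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
]

def polite_get(url):
    # pick a different user agent for each request
    headers = {'user-agent': random.choice(USER_AGENTS)}
    # wait a random interval so requests don't arrive at machine-like pace
    time.sleep(random.uniform(2.0, 6.0))
    return requests.get(url, headers=headers)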
As a side note, response.text and response.content have different purposes and return different results: https://requests.readthedocs.io/en/master/user/quickstart/#response-content
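A quick illustration of the difference, assuming r is any requests response:

r = requests.get(url)
print(type(r.content))  # <class 'bytes'> - the raw, undecoded body
print(type(r.text))     # <class 'str'>  - body decoded with the encoding requests guessed (r.encoding)

BeautifulSoup accepts either bytes or str, which is why both snippets parse fine.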


RE: 2 approaches to start web scraping - question - zarize - Nov-27-2019

Thank you very much buran!
As always, you made it all clear :P