Python Forum

2 approaches to start web scraping - question
Hi,

I found out that two approaches to web scraping return totally different output.

What I mean is that when I use the code below:
import requests
from bs4 import BeautifulSoup

# Headers copied from a real browser, so the request looks like it comes from Chrome.
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, sdch, br',
    'accept-language': 'en-GB,en;q=0.8,en-US;q=0.6,ml;q=0.4',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}
# url is the address of the page to scrape (defined elsewhere)
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
print(soup)
I get the full HTML of the website back, BUT:

If I use a pattern like this instead:

# No headers passed, so requests sends its own defaults.
r = requests.get(page)
content = r.text
soup = BeautifulSoup(content, 'html.parser')

print(soup)
the request gets redirected from the provided URL to a captcha page, and what I get back is the HTML of the captcha site.


Someone wrote the first approach some time ago and I am wondering how to follow exactly this path. Why does my approach return a captcha while the first approach avoids it?
The difference is that the first snippet provides headers that mimic the ones a 'real' browser sends. In the second snippet you don't provide any headers, so requests falls back to its defaults (including a python-requests user agent), which makes it much clearer to the site that it is dealing with a script/robot and not a 'real' browser.
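To see the redirect for yourself, you can inspect the response instead of only printing the soup. The sketch below is just one way to do it: redirect_chain and looks_like_captcha are hypothetical helper names, and the "captcha" URL marker is an assumption about how the site names its challenge page. The only requests features relied on are the history, status_code and url attributes of the response.

```python
def redirect_chain(response):
    """Return [(status_code, url), ...] for every hop, ending with the final page.

    `response` is expected to be a requests.Response: its `history` list
    holds the intermediate redirect responses, oldest first.
    """
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


def looks_like_captcha(response, markers=("captcha",)):
    """Heuristic only (an assumption, not a requests feature):
    flag the response if the final URL contains one of the markers."""
    final_url = response.url.lower()
    return any(marker in final_url for marker in markers)


# Possible usage with the second snippet from the question:
#   r = requests.get(page)
#   print(redirect_chain(r))
#   print(looks_like_captcha(r))
```

If the chain shows a 3xx hop followed by a page you did not ask for, the site has most likely classified the request as automated.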
If a site doesn't want to be scraped, it may use different tools and techniques to detect bots scraping its content and blacklist/block them. At the same time, scrapers use different tools to avoid detection, e.g. in this case providing headers. You may also use proxy rotation, random intervals between requests to mimic human behaviour, etc. It's a long topic altogether.
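As a rough illustration of two of these ideas (browser-like headers plus random pauses), here is a minimal sketch. fetch_politely, BROWSER_HEADERS and the delay values are made up for this example, and the session argument is assumed to be a requests.Session (anything with .headers and .get would do). It is not a guarantee against detection, just the general shape.

```python
import random
import time

# Browser-like headers; the user-agent string is the one from the first snippet.
BROWSER_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-language": "en-GB,en;q=0.8",
    "user-agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
    ),
}


def fetch_politely(session, urls, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Fetch each URL through `session`, pausing a random interval between requests.

    `session` is assumed to behave like requests.Session.
    `sleep` is a parameter so the pause can be replaced when testing.
    Returns a dict mapping each URL to the decoded response text.
    """
    session.headers.update(BROWSER_HEADERS)
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Random pause so requests do not arrive at a fixed rhythm.
            sleep(random.uniform(min_delay, max_delay))
        response = session.get(url, timeout=10)
        response.raise_for_status()
        pages[url] = response.text
    return pages


# Possible usage (the URL list is a placeholder):
#   import requests
#   with requests.Session() as session:
#       pages = fetch_politely(session, ["https://example.com/"])
```

Using one session for all requests also keeps cookies between them, which is closer to how a browser behaves. Proxy rotation would be an extra step on top of this, e.g. changing session.proxies between requests.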
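As a side note, response.text and response.content serve different purposes and return different results: response.content is the raw body as bytes, while response.text is the body decoded to str using the encoding requests works out for the response. https://requests.readthedocs.io/en/maste...se-content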
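A small sketch of why that matters, using plain bytes as a stand-in for a response body so it runs without any network access. The sample HTML is made up. The ISO-8859-1 case mirrors what requests falls back to for text/* responses whose headers do not declare a charset.

```python
# Stand-in for response.content: the raw bytes of the body as sent by the server.
raw_body = "<p>café</p>".encode("utf-8")
# raw_body is b'<p>caf\xc3\xa9</p>'

# response.text is roughly raw_body.decode(response.encoding).
# With the right encoding the text round-trips cleanly:
text_utf8 = raw_body.decode("utf-8")
# text_utf8 is '<p>café</p>'

# With the wrong encoding the same bytes turn into mojibake:
text_latin1 = raw_body.decode("iso-8859-1")
# text_latin1 is '<p>cafÃ©</p>'
```

When BeautifulSoup is given bytes (response.content) it does its own encoding detection, whereas with response.text it receives whatever requests already decoded, so the two snippets in the question can differ on pages with non-ASCII characters even apart from the headers.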
Thank you very much buran!
As always, you made it all clear :P