2 approaches to start web scraping - question

zarize · Nov-27-2019, 01:49 PM

Hi,

I found that two approaches to web scraping return completely different output.

What I mean is that when I use the code below:

import requests
from bs4 import BeautifulSoup

# url holds the address of the page being scraped (not given in this post)
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, sdch, br',
    'accept-language': 'en-GB,en;q=0.8,en-US;q=0.6,ml;q=0.4',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}
response = requests.get(url, headers=headers)
# response.content is the raw body as bytes; BeautifulSoup accepts bytes or str
soup = BeautifulSoup(response.content, "html.parser")
print(soup)

I get the full HTML of the website back, BUT:

If I use a pattern like this instead:

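import requests
from bs4 import BeautifulSoup

# page holds the same address as before; no headers are passed this time
r = requests.get(page)
content = r.text
soup = BeautifulSoup(content, 'html.parser')

print(soup)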

the request is redirected from the provided URL to a captcha solver site, and what comes back is the code of the captcha page.

Someone wrote the first approach some time ago, and I am wondering how to follow exactly that path. Why does my approach return a captcha while the first approach avoids it?

buran · (This post was last modified: Nov-27-2019, 01:56 PM by buran.)

The difference is that the first snippet sends headers that mimic those of a 'real' browser. In the second snippet you don't provide any headers, so requests sends its own defaults and it is much clearer to the site that the client is a script/robot and not a 'real' browser.
If a site doesn't want to be scraped, it may use various tools and techniques to detect bots scraping its content and blacklist/block them. Scrapers, in turn, use various tools to avoid detection, e.g. in this case providing headers. You may also use proxy rotation, random intervals between requests to mimic human behaviour, etc. It's a long topic altogether.
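To make the point about headers concrete, here is a minimal sketch, not a recipe that is guaranteed to get past any particular site. It prints the User-Agent that requests sends by default, then builds a Session that sends browser-like headers instead. The names BROWSER_HEADERS, make_session and polite_get are made up for this illustration, and the header set is shortened from the first snippet in the question.

```python
import random
import time

import requests

# Browser-like headers, shortened from the first snippet in the question.
BROWSER_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-language": "en-GB,en;q=0.8",
    "user-agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
    ),
}


def make_session(headers=BROWSER_HEADERS):
    """Return a Session that sends the given headers with every request."""
    session = requests.Session()
    session.headers.update(headers)
    return session


def polite_get(session, url, min_wait=1.0, max_wait=3.0):
    """Hypothetical helper: sleep a random interval, then fetch url."""
    time.sleep(random.uniform(min_wait, max_wait))
    return session.get(url, timeout=10)


# With no headers given, requests identifies itself as python-requests/<version>.
print(requests.utils.default_user_agent())
# The session announces itself with the browser-like string instead.
print(make_session().headers["User-Agent"])
```

A Session is used here only so the headers are set once and reused for every request; passing headers=... to each requests.get call, as in the question, has the same effect on a single request.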

As a side note, response.text and response.content have different purposes and return different results: content is the raw body as bytes, while text is that body decoded to a str. https://requests.readthedocs.io/en/maste...se-content
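The bytes/str distinction can be shown without any network access. In this small sketch, raw_body is a stand-in for what response.content would hold, and the two decode calls approximate what response.text does once an encoding has been chosen. The sample HTML is invented for the illustration.

```python
# Stand-in for response.content: the body exactly as it arrived, as bytes.
raw_body = "<p>café</p>".encode("utf-8")

# Roughly what response.text does: decode those bytes with some encoding.
decoded_ok = raw_body.decode("utf-8")

# If the wrong encoding is assumed, the text comes out silently garbled.
decoded_wrong = raw_body.decode("latin-1")

print(type(raw_body).__name__, type(decoded_ok).__name__)
print(decoded_ok)
print(decoded_wrong)
```

BeautifulSoup accepts either form. Given bytes, it works out the encoding itself; given a str, it relies on the decoding that was already done.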

zarize · Nov-27-2019, 02:06 PM

Thank you very much buran!
As always, you made it all clear :P
