Python Forum
2 approaches to start web scraping - question
#1
Hi,

I found out that two approaches to web scraping return totally different output.

What I mean is that when I use the code below:
import requests
from bs4 import BeautifulSoup

# Placeholder: the original post does not give the URL being scraped.
url = "https://example.com"

# Headers copied from a real browser session.
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, sdch, br',
    'accept-language': 'en-GB,en;q=0.8,en-US;q=0.6,ml;q=0.4',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}
response = requests.get(url, headers=headers)
parser = response.content  # raw bytes of the response body
soup = BeautifulSoup(parser, "html.parser")
print(soup)
I get the full HTML of the website returned, BUT:

if I use a pattern like:

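# Same imports as above; page holds the same URL.
r = requests.get(page)  # no headers argument, so requests sends its defaults
content = r.text        # response body decoded to str
soup = BeautifulSoup(content, 'html.parser')

print(soup)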
I get redirected from the provided URL to a captcha solver site, and the code returned is that of the captcha page.


Someone wrote the first approach some time ago, and I am wondering how to follow exactly this path. Why does my approach return a captcha while the first approach avoids it?
#2
The difference is that the first snippet provides headers that mimic those sent by a 'real' browser. In the second snippet you don't provide any headers, so requests sends its own defaults and it is much clearer to the site that the client is a script/robot rather than a 'real' browser.
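To see the difference for yourself, here is a minimal sketch that prepares both requests without sending anything over the network and compares the User-Agent each would carry. The URL and the helper name headers_that_would_be_sent are placeholders of mine, not something from the original scripts, and the header dict is a trimmed copy of the one in the first snippet.

```python
import requests

# Browser-like headers, trimmed from the first snippet in this thread.
BROWSER_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-language": "en-GB,en;q=0.8,en-US;q=0.6,ml;q=0.4",
    "user-agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
    ),
}

# Hypothetical placeholder URL; no request is actually sent below.
URL = "https://example.com/"


def headers_that_would_be_sent(extra_headers=None):
    """Return the headers requests would send, without making a request."""
    session = requests.Session()
    request = requests.Request("GET", URL, headers=extra_headers)
    # prepare_request merges the session defaults with the per-request headers.
    prepared = session.prepare_request(request)
    return prepared.headers  # case-insensitive mapping


default_sent = headers_that_would_be_sent()
browser_sent = headers_that_would_be_sent(BROWSER_HEADERS)

print("default:", default_sent["User-Agent"])
print("browser:", browser_sent["User-Agent"])
```

The default User-Agent starts with python-requests/, which is an easy signal for a site to filter on. Matching headers is only one of the checks a site may run, so this alone is not guaranteed to avoid a captcha.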
If a site doesn't want to be scraped, it may use various tools and techniques to detect bots scraping its content and blacklist/block them. At the same time, scrapers use various techniques to avoid detection, e.g. in this case providing headers. You may also use proxy rotation, random intervals between requests to mimic human behaviour, etc. It's a long topic altogether.
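As an illustration of the random intervals idea only, here is a rough sketch. The function name fetch_with_random_pauses, the delay bounds and the URLs are all made up for the example; fetch is whatever callable you use to download a page (for instance a small wrapper around requests.get), and sleep is injectable so the dry run below does not actually wait.

```python
import random
import time


def fetch_with_random_pauses(urls, fetch, min_delay=1.0, max_delay=4.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random time between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Dry run: a stand-in fetch function and a sleep that only records the pauses.
pauses = []
pages = fetch_with_random_pauses(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    fetch=lambda url: f"<html>{url}</html>",  # stand-in for a real download
    sleep=pauses.append,                      # record pauses instead of waiting
)
print(len(pages), "pages fetched with", len(pauses), "pauses")
```

In real use you would leave sleep at its default and pass a fetch function that performs the actual request.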
As a side note, response.text and response.content have different purposes and return different results: content is the raw bytes of the body, while text is those bytes decoded to a str. https://requests.readthedocs.io/en/maste...se-content
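A small offline sketch of that difference follows. To avoid a network call it builds a Response by hand and fills in the private _content attribute, which is a shortcut for demonstration purposes and not something to do in real code; the body "café" is just sample data.

```python
import requests

# Hand-built response, used here only to avoid a network call.
resp = requests.Response()
resp._content = "café".encode("utf-8")  # private attribute, demo shortcut only
resp.encoding = "utf-8"                  # normally taken from the HTTP headers

raw = resp.content   # bytes, exactly as received
decoded = resp.text  # str, decoded using resp.encoding

print(type(raw).__name__, raw)
print(type(decoded).__name__, decoded)
```

BeautifulSoup accepts either form. Given bytes it tries to work out the encoding itself; given a str it relies on the decoding requests already did.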
#3
Thank you very much buran!
As always, you made it all clear :P