Python Forum

Full Version: Web Scraping with a Bot
I wanted to set up a daily web scraper for this site. The scraper makes about 100 requests/hour every day; however, the site seems to be blocking all bots. This is the URL: '' and it is protected by a company called "Distil Networks". I couldn't find much online about a way around it, so I am asking here. If there is a way around this, any help is appreciated.
When you GET the page do you change the User-agent?
I am using the modules BeautifulSoup, urllib, and requests. Do those have user agents? I thought that was something only modules that mimic browsers carried, like Selenium.
import requests

# Send a browser-like User-Agent instead of the default python-requests one
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
}

html = requests.get('', headers=headers).content
In headers, you can add additional information:

Python represents itself as Python, not as some web client. So it is obvious for any server that your script is not a browser. Also, you can try Selenium. Scroll down to see how you can use it:
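To illustrate the point about Python identifying itself: here is a minimal sketch using urllib from the standard library. It reads the User-Agent that urllib attaches by default and builds a request that overrides it. The helper names `default_urllib_user_agent` and `browser_request` are made up for this example, and nothing is actually sent over the network.

```python
import urllib.request


def default_urllib_user_agent():
    """Return the User-Agent urllib attaches when the caller sets none."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs added to every request
    return dict(opener.addheaders)["User-agent"]


def browser_request(url, user_agent):
    """Build (but do not send) a request with a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Starts with "Python-urllib/", which is easy for a server to spot
print(default_urllib_user_agent())
```

Changing the header only removes the most obvious giveaway. A service that fingerprints clients with JavaScript can still tell a script from a browser.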
I tried exactly this in my IDE:
>>> import requests
>>> proxies = {
...     'http': '',
...     'https': '',
... }
>>> s = requests.Session()
>>> s.proxies = proxies
>>> headers = {
...     'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
... }
>>> html = s.get('', headers=headers).content
This is what I got:
>>> print(html)
b'<!DOCTYPE html>\n<html>\n\n<head>\n  <title>Pardon Our Interruption</title>\n  <link rel="stylesheet" type="text/css" href="//" media="all">\n  <meta http-equiv="content-type" content="text/html; charset=UTF-8" />\n  <meta name="viewport" content="width=1000" />\n  <META NAME="robots" CONTENT="noindex, nofollow">\n  <meta http-equiv="cache-control" content="max-age=0" />\n  <meta http-equiv="cache-control" content="no-cache" />\n  <meta http-equiv="expires" content="0" />\n  <meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />\n  <meta http-equiv="pragma" content="no-cache" />\n</head>\n\n<body class=\'block-page\'>\n  <div class=\'container\'>\n    <div class=\'row\'>\n      <div class=\'sidebar col-lg-4 col-sm-5\'>\n        <img src="//" alt="0">\n      </div>\n      <div class=\'content col-lg-8 col-sm-7\'>\n        <h1>Pardon Our Interruption...</h1>\n        <p>\n          As you were browsing <strong></strong> something about your browser made us think you were a bot. There are a few reasons this might happen:\n        </p>\n        <ul>\n          <li>You\'re a power user moving through this website with super-human speed.</li>\n          <li>You\'ve disabled JavaScript in your web browser.</li>\n          <li>A third-party browser plugin, such as Ghostery or  NoScript, is preventing JavaScript from running. 
Additional information is available in this <a title=\'Third party browser plugins that block javascript\' href=\'\' target=\'_blank\'>support article</a>.</li>\n        </ul>\n        <p>\n          To request an unblock, please fill out the form below and we will review it as soon as possible.\n        </p>\n\n                <form id="auwxdsuebebe" method="POST" action="urffrvbezxxdyxtfwbzqfszwsqys.html" style="display:none"><label>Ignore: <input type="text" name="name" /></label><label>Ignore: <input type="text" name="email" /></label><label>Ignore: <input type="submit" value="Submit" /></label></form><form id="distilUnblockForm" method="post" action="">\n            <div id="dUF_first_name">\n                <label for="dUF_input_first_name">First Name:</label>\n                <input type="text" id="dUF_input_first_name" name="first_name" value="" />\n            </div>\n            <div id="dUF_last_name">\n                <label for="dUF_input_last_name">Last Name:</label>\n                <input type="text" id="dUF_input_last_name" name="last_name" value="" />\n            </div>\n            <div id="dUF_email">\n                <label for="dUF_input_email">E-mail:</label>\n                <input type="text" id="dUF_input_email" name="email" value="" />\n            </div>\n            <div id="dUF_city" style="display: none">\n                <label for="dUF_input_city">City (Leave Blank):</label>\n                <input type="text" id="dUF_input_city" name="city" value="" />\n            </div>\n            <div id="dUF_unblock">\n                <input  id="dUF_input_unblock" name="unblock" type="submit" value="Request Unblock" />\n            </div>\n            <div id="dUF_unblock_text">\n                You reached this page when attempting to access from on 2018-05-06 19:05:12 UTC.<br />\n                Trace: 4e79693c-c5da-408f-af1c-88bad413ca05 via 7eabcc2b-fd4d-429d-ad5b-61001d67db41\n            </div>\n            <div id="dUF_form_fields" 
style="display: none">\n                <input type="hidden" name="Q" value="4e79693c-c5da-408f-af1c-88bad413ca05" />\n                <input type="hidden" name="P" value="1E9B0FF7-9E1F-379F-A90E-F22277DBECF9" />\n                <input type="hidden" name="I" value="" />\n                <input type="hidden" name="U" value="" />\n                <input type="hidden" name="SF" value="" />\n                <input type="hidden" name="F" value="" />\n                <input type="hidden" name="V" value="2097153" />\n                <input type="hidden" name="D" value="12422" />\n                <input type="hidden" name="A" value="2961" />\n                <input type="hidden" name="LOADED" value="2018-05-06 19:05:12" />\n                <input type="hidden" name="H" value=\'\' />\n            </div>\n        </form>\n    \n\n      </div>\n    </div>\n  </div>\n</body>\n\n</html>\n'
I tried a proxy and changing the User-Agent, but still nothing. Interestingly, I tried a proxy in Google Chrome and it only asked me for a simple captcha once before it let the proxy browse through uninterrupted. Can this information help with anything?
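Since the block page comes back as an ordinary HTML body, a scraper can check for it before handing the response to the parser. This is only a rough sketch: the two markers are taken from the output pasted above (the page title and the unblock form id), the name `looks_blocked` is made up for this example, and the service may change its block page at any time.

```python
# Byte strings that appear in the block page shown above
BLOCK_MARKERS = (b"Pardon Our Interruption", b"distilUnblockForm")


def looks_blocked(body: bytes) -> bool:
    """Return True if the response body looks like the bot-block page."""
    return any(marker in body for marker in BLOCK_MARKERS)
```

With the session from the earlier post, this would be called as `looks_blocked(html)` so the scraper can stop or back off instead of parsing a page that has no real content.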

Also, if it matters, here is the site I got proxies from:
If I remember correctly, Distil has very advanced JavaScript bot detection. You'll need to be a very good coder to bypass it. You might want to try macros on your normal browser and slow down your requests; that might work.

Check your PMs, I sent you some details that might help.
They are complaining that JS could be turned off or that you are navigating too fast. Try putting in some random delays. As for JS, try Selenium instead of requests.
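A minimal sketch of the random-delay idea, using only the standard library. The function name `fetch_all`, the 5 to 15 second range, and the injectable `fetch` and `sleep` callables are all assumptions made for this example, not something the site or any library prescribes.

```python
import random
import time


def fetch_all(urls, fetch, low=5.0, high=15.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random time between calls.

    fetch is any callable that takes a URL and returns a result, for
    example a small wrapper around requests.get. low and high bound the
    pause in seconds.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Irregular gaps look less mechanical than a fixed interval
            sleep(random.uniform(low, high))
        results.append(fetch(url))
    return results
```

Passing `sleep` in as a parameter makes it possible to try the loop without actually waiting. In a real run the default `time.sleep` is used, and no delay pattern is guaranteed to get past the detection.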
Selenium doesn't work either. I was thinking of using my actual browser and automating that, but they've banned my IP because they know it's connected to a bot. They haven't really banned it, but loading a page takes forever, so I guess they just don't want to take any chances with my IP. Selenium, requests, and urllib are straight-up blocked, even with proxies. Selenium says it failed to connect, and urllib and requests give error 416.

Is there any way they can track my actual browser if I use an elite proxy, slow down my browsing and sleep randomly, and turn off cookies for that site?
I think they know that you are using a proxy server, and that is suspicious to begin with.
I would use Selenium, slow down the requests per second by a random value, and hope they will respect that I don't want to hammer their server. I can't think of anything else. I am not a networking guy.