Apr-03-2018, 10:54 PM
I've been cycling through a list of 1000 URLs and scraping the source. Everything works great, but every once in a while I hit a problematic URL that keeps timing out over and over.
What is going on here?
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Max retries exceeded with url: /dc-assures-all-help-to-family-of-iraq-victim-balwant-rai/states/news/1188269.html (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x05F512B0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)

Shouldn't it just time out ONCE and then move on?
try:
    scrape = requests.get(df.iloc[list_counter, 0],
                          headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"},
                          timeout=15)
    html = scrape.content
    soup = BeautifulSoup(html, 'html.parser')
except BaseException as e:
    exceptions.append(e)
    print(e)
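For what it's worth, `requests` itself does not retry a read timeout by default, so the repeats most likely come from the surrounding loop (not shown) hitting the same row again. One way to guarantee a single attempt per URL is to isolate the request in a helper that catches `requests.exceptions.RequestException` (rather than `BaseException`) and returns `None` so the loop moves on. This is only a minimal sketch under that assumption; the helper name `fetch_html` and the optional `session` parameter are made up for illustration:

```python
import requests
from requests.exceptions import RequestException

HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) "
                         "AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/60.0.3112.90 Safari/537.36"}

def fetch_html(url, timeout=15, session=None):
    """Try a URL once; return its HTML, or None if it fails.

    Catching RequestException covers timeouts, connection errors,
    and bad status codes without swallowing unrelated exceptions
    the way `except BaseException` does.
    """
    http = session or requests  # `session` is hypothetical, for testability
    try:
        resp = http.get(url, headers=HEADERS, timeout=timeout)
        resp.raise_for_status()  # treat 4xx/5xx as failures too
        return resp.text
    except RequestException as e:
        print(f"skipping {url}: {e}")
        return None
```

The caller would then do `html = fetch_html(df.iloc[list_counter, 0])` and simply `continue` when it gets `None`, so one dead host costs exactly one timeout.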