Request Get Timeout Issue

I've been cycling through a list of 1,000 URLs and scraping the page source. Everything works great, but every once in a while I hit a problematic URL that keeps timing out over and over:

HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Max retries exceeded with url: /dc-assures-all-help-to-family-of-iraq-victim-balwant-rai/states/news/1188269.html (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x05F512B0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
HTTPConnectionPool(host='www.uniindia.com', port=80): Read timed out. (read timeout=15)
Shouldn't it just time out ONCE and then move on?

What is going on here? Here's the relevant code:
import requests
from bs4 import BeautifulSoup

# df, list_counter, and exceptions come from the surrounding loop
try:
    scrape = requests.get(
        df.iloc[list_counter, 0],  # URL for this iteration
        headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"},
        timeout=15,
    )
    html = scrape.content
    soup = BeautifulSoup(html, 'html.parser')
except requests.exceptions.RequestException as e:
    # Catch only requests errors; BaseException would also swallow
    # KeyboardInterrupt and SystemExit
    exceptions.append(e)
    print(e)
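The repeated lines suggest the same URL is being requested again after each failure, and every attempt waits out the full 15-second timeout. As a minimal sketch of one way to bound retries and simply move on (the urls list here is a stand-in for the DataFrame column):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry each request at most twice, with a short backoff in between
adapter = HTTPAdapter(max_retries=Retry(total=2, backoff_factor=1))
session.mount('http://', adapter)
session.mount('https://', adapter)

for url in urls:
    try:
        scrape = session.get(url, timeout=15)
    except requests.exceptions.RequestException as e:
        print(e)
        continue  # give up on this URL and move to the next one
    # parse scrape.content with BeautifulSoup as above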
I tried this and it appears to work:
import requests


class GetPage:
    def __init__(self):
        self.page = None
        self.status_ok = 200

    def get_this_page(self, url):
        try:
            # Bound the wait so an unresponsive host can't hang the call
            response = requests.get(url, timeout=15)
        except requests.exceptions.RequestException as e:
            print(f'Error encountered: {e}')
            return None
        if response.status_code == self.status_ok:
            return response.content
        print(f'Error encountered: {response.status_code}')
        return None


def testit():
    gp = GetPage()
    document = gp.get_this_page('http://www.uniindia.com')
    if document is None:
        print('Error retrieving document')
    else:
        print(document)


if __name__ == '__main__':
    testit()
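Hooked into the original batch scrape, the class might be driven like this sketch (urls again stands in for the DataFrame column in the first post):

from bs4 import BeautifulSoup

# Hypothetical driver loop reusing GetPage over many URLs
urls = ['http://www.uniindia.com']  # stand-in for the full list of 1,000
gp = GetPage()
for url in urls:
    document = gp.get_this_page(url)
    if document is None:
        continue  # one failed attempt, then on to the next URL
    soup = BeautifulSoup(document, 'html.parser')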