Python Forum

Full Version: MaxRetryError while scraping a website multiple times
Hi,

I have been trying to retrieve some data from a website. My initial test, retrieving the data a single time, worked as expected, but when I try to get the data from 2 or more links on the same website I receive the error below.
I am new to web scraping, and I am doing it using BeautifulSoup, Requests, Selenium and Pandas. I also added a timer to sleep between queries. Any idea of the root cause and a possible workaround?


Error:
MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=57192): Max retries exceeded with url: /session/dcba731d4173518f03b593a17afe111c/url (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000001DD4212470>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))
Thanks
Please show your code.
The code that I am using for this is below:

from bs4 import BeautifulSoup
import requests, io
import pandas as pd
from selenium import webdriver
import time

# NOTE: this code works for one link at a time; for more than one
# it fails with the MaxRetryError shown above.

driver = webdriver.Chrome(executable_path=r"myfolder\chromedriver.exe")
uchar=['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','<','>',',']
timestamp = pd.Timestamp.today().strftime('%Y%m%d-%H%M%S')

links_df = pd.read_excel(r'myfolder\myfile.xlsx', sheet_name='Hoja1')
links_df = links_df[(links_df['Country'] == 'PT')]

results = pd.DataFrame(columns=['ISIN', 'N Shares', 'Link'])

for ISIN in links_df.ISIN:
    link='https://www.bolsadelisboa.com.pt/pt-pt/products/equities/' + ISIN + '-XLIS/market-information'
    driver.get(link)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
    r = soup.find_all("strong")[14]
    dirtyresult = str(r)
    # Strip the tag strings first, then remove each unwanted character.
    # (Reassigning from dirtyresult on every pass, as before, would undo
    # the previous replacements and keep only the last one.)
    cleanresult = dirtyresult.replace("<strong>", "").replace("</strong>", "")
    for x in uchar:
        cleanresult = cleanresult.replace(x, "")
    time.sleep(30)
    
    results = results.append({'ISIN': ISIN, 'N Shares': cleanresult, 'Link': link}, ignore_index=True)
    print(ISIN + ": " + cleanresult)
    
results.to_csv(r'myfolder\output' + timestamp + '.csv', index=False)

print('Finish')
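As an aside, the character-stripping loop can be written more compactly. A minimal sketch (the sample `dirty` string here is made up for illustration, not taken from the site): remove the tag strings first, then drop every remaining unwanted character in one pass with `str.translate`:

```python
# Example HTML fragment like the one find_all("strong")[14] might return
dirty = "<strong>1 234 567</strong>"

# Remove the tag strings, then all unwanted characters in a single pass
unwanted = "abcdefghijklmnopqrstuvwxyz<>,"
clean = dirty.replace("<strong>", "").replace("</strong>", "")
clean = clean.translate(str.maketrans("", "", unwanted))
print(clean)  # 1 234 567
```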
The error indicates that the connection was refused, so the session is no longer available.
You probably have to close the first connection before the second is attempted.
Hi Larz,

Isn't the connection closed with the code below?

driver.quit()
Yes, I believe so.
So you need to restart the browser for the next iteration.
OK, it looks like it works when I put the code below inside each iteration:

driver = webdriver.Chrome(executable_path=r"myfolder\chromedriver.exe")
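That works, though recreating the driver on every pass is relatively slow. An alternative, shown here as a minimal sketch (the `scrape_links` helper is my own illustration, not from the thread), is to keep one driver alive for the whole loop and call `quit()` exactly once at the end:

```python
def scrape_links(driver, links):
    """Visit each link with a single driver session and return the page sources.

    `driver` can be any object exposing get(), page_source and quit();
    selenium's webdriver.Chrome fits this interface.
    """
    pages = []
    try:
        for link in links:
            driver.get(link)                 # the session stays alive between links
            pages.append(driver.page_source)
    finally:
        driver.quit()                        # close the browser exactly once
    return pages
```

With this structure the original error cannot occur, because no `driver.get` is ever issued after `driver.quit()`.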
Thanks