Python Forum

Full Version: Intro to WebScraping
Hello,

I am starting to build a web scraper for work. I am very inexperienced with Python and I am running into a strange issue. I am starting with very basic concepts, for example finding and pulling the title from a webpage. When I use an HTTP site like Wikipedia.org I can pull the title with no problem. However, when I use a website starting with HTTPS I get the following error:

ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

Here is my code in case that helps; I have replaced my proxy names with fake values. I am also using Python 3.7. Any guidance is much appreciated!

import requests
from bs4 import BeautifulSoup

# Proxy addresses replaced with placeholder values
http_proxy = 'http://abcd.org:1234'
https_proxy = 'http://abcd.org:1234'

url = 'https://www.stockx.com'

proxies = {
    'http': http_proxy,
    'https': https_proxy
}

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}

r = requests.get(url, headers=headers, proxies=proxies)

soup = BeautifulSoup(r.text, 'html.parser')

results = soup.find('title')
products = results.text.strip()

print(products)

So I've done some experimenting and I have found some HTTPS websites that I can work with. Any leads on why StockX won't work would still be valuable.
I tried the following, which worked:
import requests
from bs4 import BeautifulSoup

def try_scrape():
    url = 'https://www.stockx.com'
    page = None

    response = requests.get(url)
    if response.status_code == 200:
        page = response.content

    if page:
        soup = BeautifulSoup(page, 'lxml')
    else:
        print('Problem fetching page')
        return

    results = soup.find('title')
    print(results)

if __name__ == '__main__':
    try_scrape()
Results:
Output:
(venv) Larz60p@linux-nnem: forum:$ python d1rjr03.py
<title>StockX: Buy and Sell Sneakers, Streetwear, Handbags, Watches</title>
(venv) Larz60p@linux-nnem: forum:$
I vaguely remember having this same issue when using a proxy, but I just can't recall what the solution was. I think I was using SOCKS5 to proxy to my server, though. I am not sure if you are using a public (poor-quality) proxy or what.

If you cannot find an answer here, I would ask on the requests module's GitHub issues. You can scan the related issues there and maybe find an existing solution; otherwise, start a new issue:
https://github.com/requests/requests/iss...s%3Aclosed
Hello,

It seems like you're encountering issues when scraping HTTPS websites through a proxy. The error (ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host) is likely due to how your proxy or the target server handles HTTPS requests. Here is some guidance to help troubleshoot:

Potential Causes:
Proxy Issues: Public proxies often have reliability issues or restrictions with HTTPS connections. Ensure the proxy you're using supports HTTPS and isn't rate-limited or blocked by the target website. For more reliable results, consider a professional proxy service (link removed) that provides stable, high-speed residential and datacenter proxies.
Target Website Blocking: Websites like StockX may implement security measures, such as bot detection or IP blocking, that prevent access through certain proxies or non-standard headers.
SSL/TLS Handshake: Some proxies mishandle secure SSL/TLS connections, leading to abrupt disconnections.
Debugging Steps:
Test Without Proxy: Try running your code without the proxy to confirm if the issue is proxy-specific:

r = requests.get(url, headers=headers)
print(r.status_code)
print(r.text)
If it works without the proxy, the issue is likely related to the proxy configuration or the proxy itself.
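If it fails only when the proxy is in play, it can also help to see exactly which exception requests raises, since the exception class hints at the failing layer. A minimal sketch, reusing the placeholder proxy address from the original post:

import requests

# Placeholder proxy address from the original post
proxies = {
    'http': 'http://abcd.org:1234',
    'https': 'http://abcd.org:1234'
}

try:
    r = requests.get('https://www.stockx.com', proxies=proxies, timeout=10)
    print(r.status_code)
except requests.exceptions.SSLError as e:
    print('TLS problem (often the proxy mishandling HTTPS):', e)
except requests.exceptions.ProxyError as e:
    print('Proxy refused or dropped the tunnel:', e)
except requests.exceptions.ConnectionError as e:
    print('Connection reset or refused:', e)

A ProxyError or SSLError points at the proxy's HTTPS handling; a plain ConnectionError such as WinError 10054 can come from either the proxy or the target site.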

Use a SOCKS5 Proxy: If you're using an HTTP proxy, consider switching to a SOCKS5 proxy. SOCKS5 supports more protocols and is generally better for scraping. For SOCKS5 support, install requests with the socks extra (this pulls in PySocks):

pip install requests[socks]
Then modify your code:

proxies = {
    'http': 'socks5h://abcd.org:1234',
    'https': 'socks5h://abcd.org:1234'
}
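For completeness, here is a minimal end-to-end sketch using the SOCKS5 settings above (the host and port are placeholders; the socks5h:// scheme also routes DNS lookups through the proxy):

import requests
from bs4 import BeautifulSoup

# Placeholder SOCKS5 proxy; requires `pip install requests[socks]` (PySocks)
proxies = {
    'http': 'socks5h://abcd.org:1234',
    'https': 'socks5h://abcd.org:1234'
}

r = requests.get('https://www.stockx.com', proxies=proxies, timeout=10)
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.find('title'))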
Check the Proxy:

Ensure the proxy is active and supports HTTPS by testing it with another HTTPS website.
Use a tool like curl or Postman to manually test your proxy:
curl -x http://abcd.org:1234 https://www.google.com
Rotate User-Agent:
Websites like StockX may block certain User-Agents. Use a random User-Agent for each request to avoid detection:

headers = {'User-Agent': 'Your custom User-Agent string'}
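A simple way to rotate is to pick a User-Agent at random for each request; the strings below are only examples and can be swapped for whichever browsers you want to imitate:

import random
import requests

# Example User-Agent strings (placeholders, not an exhaustive list)
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0'
]

headers = {'User-Agent': random.choice(user_agents)}
r = requests.get('https://www.stockx.com', headers=headers)
print(r.status_code)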
Simplified Code for Testing:
Here’s a minimal version of your script to check connectivity:

import requests
from bs4 import BeautifulSoup

url = 'https://www.stockx.com'

try:
    response = requests.get(url)
    response.raise_for_status()  # Check for HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup.find('title').text)
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")
Why the Alternative Code Worked:
In the alternative code you shared, no proxy was used. It’s possible the issue lies in how the proxy interacts with HTTPS connections or how StockX blocks certain proxies.

If the problem persists, you can explore the Requests GitHub Issues page or use more advanced tools like selenium or scrapy to handle dynamic content and stricter blocking mechanisms.
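As a rough illustration, here is a minimal Selenium sketch that fetches the page title (this assumes Chrome is installed and uses Selenium 4, which can manage the driver itself; the headless flag is optional):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')  # run without opening a browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://www.stockx.com')
    print(driver.title)  # the page <title> after JavaScript has run
finally:
    driver.quit()

Because Selenium drives a real browser, it executes JavaScript and looks less like a bare HTTP client, which can help with sites that block simple requests-based scrapers.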