Python Forum
Intro to WebScraping
#1
Hello,

I am starting to build a web scraper for work. I am very inexperienced with Python and am running into a strange issue. I am starting with very basic concepts, for example finding and pulling the title from a webpage. When I use an HTTP site like Wikipedia.org, I can pull the title with no problem. However, when I use a website starting with HTTPS, I get the following error:

ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

Here is my code, if that helps. I have replaced my proxy names with fake values. I am using Python 3.7. Any guidance is much appreciated!

import requests
from bs4 import BeautifulSoup


http_proxy = 'http://abcd.org:1234'
https_proxy = 'http://abcd.org:1234'

url = 'https://www.stockx.com'


proxies = {
    'http': http_proxy,
    'https': https_proxy
}

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}



r = requests.get(url, headers = headers, proxies = proxies)



soup = BeautifulSoup(r.text, 'html.parser')


results = soup.find('title')

products = results.text.strip()

print(products)

So I've done some experimenting and I have found some HTTPS websites that I can work with. Any leads on why StockX won't work would still be valuable.
Reply
#2
I tried the following which worked:
import requests
from bs4 import BeautifulSoup

def try_scrape():
    url = 'https://www.stockx.com'
    page = None

    response = requests.get(url)
    if response.status_code == 200:
        page = response.content

    if page:
        soup = BeautifulSoup(page, 'lxml')
    else:
        print('Problem fetching page')
        return  # nothing to parse; avoids a NameError on soup below

    results = soup.find('title')
    print(results)

if __name__ == '__main__':
    try_scrape()
results:
Output:
(venv) Larz60p@linux-nnem: forum:$python d1rjr03.py
<title>StockX: Buy and Sell Sneakers, Streetwear, Handbags, Watches</title>
(venv) Larz60p@linux-nnem: forum:$
Reply
#3
I vaguely remember having this same issue when using a proxy, but I can't recall what the solution was. I think I was using SOCKS5 to proxy to my server, though. I am not sure if you are using a crappy public one or something else.

If you cannot find an answer here, I would ask on the Requests GitHub issues page. You can scan the related issues there and maybe find an existing solution; otherwise, start a new issue.
https://github.com/requests/requests/iss...s%3Aclosed
Reply
#4
Hello,

It seems like you’re encountering issues when scraping HTTPS websites using a proxy. The error ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host is likely due to how your proxy or target server is handling HTTPS requests. Here's some guidance to help troubleshoot:

Potential Causes:
Proxy Issues: Public proxies often have reliability issues or restrictions with HTTPS connections. Ensure the proxy you're using supports HTTPS and isn’t rate-limited or blocked by the target website. For reliable results, consider using a professional proxy service like [link removed], which provides stable, high-speed residential and data center proxies.
Target Website Blocking: Websites like StockX may implement security measures, such as bot detection or IP blocking, that prevent access through certain proxies or non-standard headers.
SSL/TLS Handshake: Some proxies mishandle secure SSL/TLS connections, leading to abrupt disconnections.
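
As a rough way to tell these causes apart, the type of exception that Requests raises can be inspected. The sketch below is an illustration rather than a definitive diagnosis; `describe_failure` is a hypothetical helper name and the returned descriptions are only guesses at where the failure happened. It relies on `ProxyError` and `SSLError` both being subclasses of `requests.exceptions.ConnectionError`, so they have to be checked first.

```python
import requests

def describe_failure(exc):
    """Map a Requests exception to a rough guess at where the failure occurred."""
    # ProxyError and SSLError subclass ConnectionError, so test them first.
    if isinstance(exc, requests.exceptions.ProxyError):
        return "proxy refused or failed the request"
    if isinstance(exc, requests.exceptions.SSLError):
        return "TLS handshake or certificate problem"
    if isinstance(exc, requests.exceptions.Timeout):
        return "no response within the timeout"
    if isinstance(exc, requests.exceptions.ConnectionError):
        return "connection dropped or reset (proxy or target)"
    return "other request error"

# Possible usage, with url, headers and proxies defined as in the first post:
# try:
#     requests.get(url, headers=headers, proxies=proxies, timeout=10)
# except requests.exceptions.RequestException as exc:
#     print(describe_failure(exc), "->", exc)
```

A plain `ConnectionError` does not say which side closed the connection, so it narrows the search but does not end it.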
Debugging Steps:
Test Without Proxy: Try running your code without the proxy to confirm if the issue is proxy-specific:

r = requests.get(url, headers=headers)
print(r.status_code)
print(r.text)
If it works without the proxy, the issue is likely related to the proxy configuration or the proxy itself.
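
One caveat when testing without the proxy: Requests also reads proxy settings from environment variables such as HTTP_PROXY and HTTPS_PROXY, so leaving out the `proxies` argument does not guarantee a direct connection. A minimal sketch that forces a direct attempt by setting `trust_env = False` on a session follows. `try_direct` is a hypothetical name, the 10-second timeout is an arbitrary choice, and `trust_env = False` also switches off other environment-derived settings such as .netrc credentials.

```python
import requests

def try_direct(url, headers=None, timeout=10, session=None):
    """Fetch url while ignoring proxy settings taken from the environment."""
    s = session if session is not None else requests.Session()
    # With trust_env disabled, Requests does not read proxy settings
    # (HTTP_PROXY, HTTPS_PROXY, ...) from the environment.
    s.trust_env = False
    try:
        r = s.get(url, headers=headers, timeout=timeout)
        return f"HTTP {r.status_code}"
    except requests.exceptions.RequestException as exc:
        # Report the exception type so the two runs can be compared.
        return f"{type(exc).__name__}: {exc}"

# Possible usage:
# print(try_direct('https://www.stockx.com'))
```

On a work network the direct attempt may simply fail if outbound traffic is only allowed through the proxy, which is useful information in itself.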

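Use a SOCKS5 Proxy: If you’re using an HTTP proxy, consider switching to a SOCKS5 proxy if one is available to you. SOCKS5 relays arbitrary TCP connections rather than interpreting HTTP, which can sidestep problems in how an HTTP proxy handles HTTPS tunnelling. Requests supports SOCKS through its optional socks extra, which installs the PySocks dependency: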

pip install "requests[socks]"
Then modify your code:

proxies = {
    'http': 'socks5h://abcd.org:1234',
    'https': 'socks5h://abcd.org:1234'
}
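
For reference, the `socks5h` scheme asks the proxy to resolve hostnames, while plain `socks5` resolves them on the client. A small sketch of building the dictionary, where `build_proxies` is a hypothetical helper and the host and port are the placeholder values from this thread:

```python
from urllib.parse import quote

def build_proxies(host, port, scheme="socks5h", username=None, password=None):
    """Return a proxies dict usable as requests.get(..., proxies=...)."""
    auth = ""
    if username is not None:
        # Percent-encode credentials so characters like '@' or ':' are safe.
        auth = quote(username, safe="")
        if password is not None:
            auth += ":" + quote(password, safe="")
        auth += "@"
    proxy_url = f"{scheme}://{auth}{host}:{port}"
    # The same proxy URL is used for both plain HTTP and HTTPS targets.
    return {"http": proxy_url, "https": proxy_url}

print(build_proxies("abcd.org", 1234))
# {'http': 'socks5h://abcd.org:1234', 'https': 'socks5h://abcd.org:1234'}
```

Passing `scheme="http"` produces the HTTP-proxy form used in the first post, which makes it easy to try both against the same target.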
Check the Proxy:

Ensure the proxy is active and supports HTTPS by testing it with another HTTPS website.
Use a tool like curl or Postman to manually test your proxy:
curl -x http://abcd.org:1234 https://www.google.com
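
A similar check can be done from Python. An HTTP proxy handles an HTTPS URL by accepting a CONNECT request and then relaying the encrypted bytes, so the proxy's reply to CONNECT shows whether it is willing to open a tunnel to a given host. The sketch below uses only the standard library; `connect_status` is a hypothetical name, it assumes a proxy that needs no authentication, and it performs no TLS, reading only the first line of the proxy's reply.

```python
import socket

def connect_status(proxy_host, proxy_port, target_host, target_port=443, timeout=10):
    """Send a bare CONNECT request to an HTTP proxy and return its status line."""
    request = (
        f"CONNECT {target_host}:{target_port} HTTP/1.1\r\n"
        f"Host: {target_host}:{target_port}\r\n\r\n"
    ).encode("ascii")
    with socket.create_connection((proxy_host, proxy_port), timeout=timeout) as sock:
        sock.sendall(request)
        reply = sock.recv(4096)
    # The first line of the reply is the status line, e.g. "HTTP/1.1 200 ...".
    return reply.split(b"\r\n", 1)[0].decode("iso-8859-1")

# Possible usage with the placeholder proxy from this thread:
# print(connect_status('abcd.org', 1234, 'www.stockx.com'))
# print(connect_status('abcd.org', 1234, 'www.google.com'))
```

A 200 status line means the proxy opened the tunnel, while a 403 or 407 points at proxy policy or authentication. If the call itself raises ConnectionResetError, the connection to the proxy was closed before any reply arrived.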
Rotate User-Agent:
Websites like StockX may block certain User-Agents. Use a random User-Agent for each request to avoid detection:

headers = {'User-Agent': 'Your custom User-Agent string'}
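
A minimal sketch of picking a User-Agent at random for each request. The `USER_AGENTS` list and the `random_headers` name are illustrative assumptions: the first string is the one from the first post and the second is only an example of a different browser.

```python
import random

# Example strings only; replace with the User-Agents you actually want to send.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0",
]

def random_headers(rng=random):
    """Return a headers dict with a User-Agent picked at random from the list."""
    return {"User-Agent": rng.choice(USER_AGENTS)}

# Possible usage:
# r = requests.get(url, headers=random_headers(), proxies=proxies)
```

Changing the User-Agent does nothing about IP-based blocking, so it is worth trying only after the proxy itself has been ruled out.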
Simplified Code for Testing:
Here’s a minimal version of your script to check connectivity:

import requests
from bs4 import BeautifulSoup

url = 'https://www.stockx.com'

try:
    response = requests.get(url)
    response.raise_for_status()  # Check for HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup.find('title').text)
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")
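
One gap in scripts like this: `soup.find('title')` returns None when the response has no title tag (a block page or an empty body, for instance), and `.text` then raises AttributeError, which an `except` clause for `RequestException` does not catch. A small sketch of a guard, where `extract_title` is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

def extract_title(html):
    """Return the page title as a stripped string, or None if there is no <title>."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("title")
    # find() returns None when the tag is absent, so guard before reading text.
    return tag.get_text(strip=True) if tag is not None else None

# Possible usage:
# title = extract_title(response.text)
# print(title if title is not None else 'No <title> found in the response')
```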
Why the Alternative Code Worked:
In the alternative code you shared, no proxy was used. It’s possible the issue lies in how the proxy interacts with HTTPS connections or how StockX blocks certain proxies.

If the problem persists, you can explore the Requests GitHub Issues page or use more advanced tools like Selenium or Scrapy to handle dynamic content and stricter blocking mechanisms.
Larz60+ write Dec-16-2024, 04:58 AM:
clickbait link removed -- this is your second notice --
Reply

