Intro to WebScraping

d1rjr03 · (This post was last modified: Aug-14-2018, 07:25 PM by d1rjr03.)

Hello,

I am starting to build a web scraper for work. I am very inexperienced with Python and I am running into a strange issue. I am starting with very basic concepts, for example finding and pulling the title from a webpage. When I use an HTTP site like Wikipedia.org I can pull the title with no problem. However, when I use a website starting with HTTPS I get the following error:

ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

Here is my code, if that helps; I have replaced my proxy names with fake values. I am using Python 3.7. Any guidance is much appreciated!

import requests
from bs4 import BeautifulSoup

# Placeholder proxy addresses (the real values are replaced with fake ones)
http_proxy = 'http://abcd.org:1234'
https_proxy = 'http://abcd.org:1234'

url = 'https://www.stockx.com'

proxies = {
    'http': http_proxy,
    'https': https_proxy
}

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}

# Fetch the page through the proxy
r = requests.get(url, headers=headers, proxies=proxies)

# Parse the HTML and pull out the text of the <title> tag
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find('title')
products = results.text.strip()

print(products)

So I've done some experimenting and I have found some HTTPS websites that I can work with. Any leads on why StockX won't work would still be valuable.

**Larz60+** · Aug-14-2018, 10:36 PM

I tried the following which worked:

import requests
from bs4 import BeautifulSoup

def try_scrape():
    url = 'https://www.stockx.com'

    response = requests.get(url)
    if response.status_code != 200:
        # Stop here: there is no page to parse
        print('Problem fetching page')
        return

    # Parse the raw bytes with the lxml parser (requires the lxml package)
    soup = BeautifulSoup(response.content, 'lxml')
    results = soup.find('title')
    print(results)

if __name__ == '__main__':
    try_scrape()

Results:

Output:
(venv) Larz60p@linux-nnem: forum:$ python d1rjr03.py
<title>StockX: Buy and Sell Sneakers, Streetwear, Handbags, Watches</title>
(venv) Larz60p@linux-nnem: forum:$

***metulburr*** · (This post was last modified: Aug-15-2018, 12:06 AM by metulburr.)

I vaguely remember having this same issue when using a proxy. I just can't recall what the solution was. I think I was using SOCKS5 to proxy to my server, though. I am not sure if you are using a public crap one or what.

If you cannot find an answer here, I would ask on the requests module's GitHub issues. You can scan the related issues there and maybe find an existing solution; otherwise, start a new issue.
https://github.com/requests/requests/iss...s%3Aclosed

bobprogrammer · (This post was last modified: Dec-16-2024, 07:08 AM by buran.)

Hello,

It seems like you’re encountering issues when scraping HTTPS websites using a proxy. The error ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host is likely due to how your proxy or target server is handling HTTPS requests. Here's some guidance to help troubleshoot:

Potential Causes:

Proxy Issues: Public proxies often have reliability issues or restrictions with HTTPS connections. Ensure the proxy you're using supports HTTPS and isn’t rate-limited or blocked by the target website. For reliable results, consider using a professional proxy service like [link removed], which provides stable, high-speed residential and data center proxies.

Target Website Blocking: Websites like StockX may implement security measures, such as bot detection or IP blocking, that prevent access through certain proxies or non-standard headers.

SSL/TLS Handshake: Some proxies mishandle secure SSL/TLS connections, leading to abrupt disconnections.

Debugging Steps:

Test Without Proxy: Try running your code without the proxy to confirm whether the issue is proxy-specific:

r = requests.get(url, headers=headers)
print(r.status_code)
print(r.text)

If it works without the proxy, the issue is likely related to the proxy configuration or the proxy itself.
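One way to make that comparison repeatable is a small helper that attempts the same request both directly and through the proxy, then reports how each attempt ended. The sketch below is illustrative only: `probe`, `compare_direct_and_proxied`, and the injectable `get` parameter are hypothetical names (not part of requests), and the 10-second timeout is an arbitrary choice.

```python
def probe(url, proxies=None, get=None, timeout=10):
    """Try one GET request and return ('ok', status_code) or ('error', message)."""
    if get is None:
        # Imported lazily so the helper can also be exercised with a stub.
        import requests
        get = requests.get
    try:
        response = get(url, proxies=proxies, timeout=timeout)
    except OSError as exc:
        # requests' exceptions derive from OSError, as does ConnectionResetError.
        return ("error", f"{type(exc).__name__}: {exc}")
    return ("ok", response.status_code)


def compare_direct_and_proxied(url, proxies, get=None):
    """Run the same request without and with the proxy mapping (hypothetical helper)."""
    return {
        # Note: with proxies=None, requests may still pick up HTTP_PROXY/HTTPS_PROXY
        # from the environment, so "direct" is only direct if those are unset.
        "direct": probe(url, proxies=None, get=get),
        "proxied": probe(url, proxies=proxies, get=get),
    }
```

Calling `compare_direct_and_proxied(url, proxies)` with the `url` and `proxies` from the original script returns a dictionary with one entry per attempt. If only the `"proxied"` entry is an error, the proxy is the more likely culprit.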

Use a SOCKS5 Proxy: If you’re using an HTTP proxy and your proxy server also offers a SOCKS5 endpoint, consider trying that instead. SOCKS5 relays arbitrary TCP connections rather than only HTTP traffic, so it sometimes behaves differently with HTTPS. Requests supports SOCKS through its optional socks extra:

pip install requests[socks]

Then modify your code:

proxies = {
    'http': 'socks5h://abcd.org:1234',
    'https': 'socks5h://abcd.org:1234'
}
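As a rough sketch, the same mapping can be built by a helper, together with a check that the SOCKS dependency is actually installed. `socks_support_installed` and `build_proxies` are hypothetical names used for illustration. The check relies on `requests[socks]` installing PySocks, whose importable module is `socks`. With the `socks5h` scheme the proxy resolves hostnames; with plain `socks5` they are resolved locally.

```python
import importlib.util


def socks_support_installed():
    """Return True if the PySocks module (pulled in by requests[socks]) is importable."""
    return importlib.util.find_spec("socks") is not None


def build_proxies(host, port, scheme="socks5h"):
    """Build a requests-style proxies mapping that routes HTTP and HTTPS through one proxy."""
    proxy_url = f"{scheme}://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}
```

For example, `build_proxies('abcd.org', 1234)` yields the same dictionary as above, and `build_proxies('abcd.org', 1234, scheme='http')` gives the original HTTP proxy configuration. This assumes the placeholder host really offers a SOCKS5 endpoint on that port.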

Check the Proxy:

Ensure the proxy is active and supports HTTPS by testing it with another HTTPS website.
Use a tool like curl or Postman to manually test your proxy:

curl -x http://abcd.org:1234 https://www.google.com

Rotate User-Agent:
Websites like StockX may block certain User-Agents. Use a random User-Agent for each request to avoid detection:

headers = {'User-Agent': 'Your custom User-Agent string'}
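A minimal sketch of picking a random User-Agent per request might look like the following. The `USER_AGENTS` list and the `random_headers` helper are hypothetical names: the first string is the one from the original post and the second is just an example value. Varying the header is no guarantee that a site will stop blocking requests.

```python
import random

# Example pool of User-Agent strings (illustrative values, extend as needed).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0",
]


def random_headers(user_agents=USER_AGENTS, rng=random):
    """Return a headers dict with one User-Agent chosen at random from the pool."""
    return {"User-Agent": rng.choice(user_agents)}
```

Each request would then call the helper again, for example `requests.get(url, headers=random_headers(), proxies=proxies)`.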

Simplified Code for Testing:
Here’s a minimal version of your script to check connectivity:

import requests
from bs4 import BeautifulSoup

url = 'https://www.stockx.com'

try:
    response = requests.get(url)
    response.raise_for_status()  # Check for HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup.find('title').text)
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")

Why the Alternative Code Worked:
In the alternative code Larz60+ shared, no proxy was used. That suggests the issue lies in how the proxy handles HTTPS connections or in StockX blocking certain proxies.

If the problem persists, you can explore the Requests GitHub Issues page or use more advanced tools like Selenium or Scrapy to handle dynamic content and stricter blocking mechanisms.

Larz60+ wrote Dec-16-2024, 04:58 AM:
clickbait link removed -- this is your second notice --
