Python Forum

Intro to WebScraping
Hello,

I am starting to build a web scraper for work. I am very inexperienced with Python and I am running into a strange issue. I am starting with very basic concepts, for example finding and pulling the title from a webpage. When I use an HTTP site like Wikipedia.org, I can pull the title with no problem. However, when I use a website starting with HTTPS, I get the following error:

ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

Here is my code, if that helps; I have replaced my proxy names with fake values. I am using Python 3.7. Any guidance is much appreciated!

import requests
from bs4 import BeautifulSoup

# Placeholder proxy addresses (the real values were replaced with fake ones)
http_proxy = 'http://abcd.org:1234'
https_proxy = 'http://abcd.org:1234'

url = 'https://www.stockx.com'

proxies = {
    'http': http_proxy,
    'https': https_proxy
}

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}

# Fetch the page through the proxy
r = requests.get(url, headers=headers, proxies=proxies)

# Parse the HTML and pull out the <title> element
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find('title')
products = results.text.strip()

print(products)

So I've done some experimenting and I have found some HTTPS websites that I can work with. Any leads on why StockX won't work would still be valuable.
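One way to narrow down where the reset comes from is to send the same request once through the proxy and once directly, then compare the outcomes. The following is a minimal sketch, not a definitive diagnosis: the helper name `probe` and the 10-second timeout are illustrative assumptions, and any proxy address passed in would be the fake placeholder from the code above.

```python
import requests


def probe(url, proxies=None, timeout=10):
    """Send one GET request and return (ok, detail) instead of raising.

    `probe` is a hypothetical helper name, not part of requests.
    """
    session = requests.Session()
    if proxies is None:
        # Ignore HTTP_PROXY/HTTPS_PROXY environment variables so the
        # request really goes out directly.
        session.trust_env = False
    try:
        response = session.get(url, proxies=proxies, timeout=timeout)
    except requests.exceptions.ProxyError as exc:
        # The proxy itself refused or dropped the connection.
        return False, f"proxy error: {exc}"
    except requests.exceptions.SSLError as exc:
        # The TLS handshake failed.
        return False, f"TLS error: {exc}"
    except requests.exceptions.ConnectionError as exc:
        # A reset such as WinError 10054 would typically surface here.
        return False, f"connection error: {exc}"
    except requests.exceptions.RequestException as exc:
        # Anything else requests can raise (read timeouts, bad URLs, ...).
        return False, f"request failed: {exc}"
    finally:
        session.close()
    return True, f"HTTP {response.status_code}"
```

Calling `probe(url, proxies=proxies)` and then `probe(url)` gives two results to compare. If only the proxied call fails, the proxy (or a filter behind it) is the more likely culprit; if both fail, the site may be rejecting the client itself.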
I tried the following which worked:
import requests
from bs4 import BeautifulSoup

def try_scrape():
    url = 'https://www.stockx.com'

    response = requests.get(url)
    if response.status_code != 200:
        # Stop here: without a page there is nothing to parse
        print('Problem fetching page')
        return

    soup = BeautifulSoup(response.content, 'lxml')
    results = soup.find('title')
    print(results)

if __name__ == '__main__':
    try_scrape()
results:
Output:
(venv) Larz60p@linux-nnem: forum:$python d1rjr03.py
<title>StockX: Buy and Sell Sneakers, Streetwear, Handbags, Watches</title>
(venv) Larz60p@linux-nnem: forum:$
I vaguely remember having this same issue when using a proxy; I just can't recall what the solution was. I think I was using SOCKS5 to proxy to my server, though. I am not sure whether you are using a poor-quality public proxy or something else.
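For reference, requests can talk to a SOCKS proxy when it is installed with the optional extra (`pip install requests[socks]`). Below is a minimal sketch of building the `proxies` mapping for that case. The helper name `socks5_proxies` and port 1080 are assumptions for illustration, and `abcd.org` is the fake host from the original post, not a real proxy.

```python
from urllib.parse import quote


def socks5_proxies(host, port, user=None, password=None, remote_dns=True):
    """Build a requests-style proxies dict for a SOCKS5 proxy.

    Hypothetical helper, not part of requests. With remote_dns=True the
    "socks5h" scheme is used, so hostnames are resolved by the proxy
    rather than by the local machine.
    """
    scheme = "socks5h" if remote_dns else "socks5"
    auth = ""
    if user is not None:
        # Percent-encode credentials so characters like "@" or ":" are safe.
        auth = f"{quote(user, safe='')}:{quote(password or '', safe='')}@"
    proxy_url = f"{scheme}://{auth}{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


# Usage sketch (placeholder host and port; needs `pip install requests[socks]`):
#   import requests
#   requests.get("https://www.stockx.com",
#                proxies=socks5_proxies("abcd.org", 1080), timeout=10)
```

Whether this helps depends on the proxy actually speaking SOCKS5; an ordinary HTTP proxy like the one in the original post would keep the `http://` scheme.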

If you cannot find an answer here, I would ask on the requests module's GitHub issues. You can scan the related issues there and maybe find an existing solution; otherwise, start a new issue:
https://github.com/requests/requests/iss...s%3Aclosed