Python Forum

Intro to WebScraping

I am starting to create a web scraper for work. I am very inexperienced with Python, and I am running into a strange issue. I am starting with very basic concepts, for example, finding and pulling the title from a webpage. When I use an HTTP site, I can pull the title with no problem. However, when I use a website starting with HTTPS, I get the following error:

ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

Here is my code, if that helps. I have replaced my proxy names with fake values. I am using Python 3.7. Any guidance is much appreciated!

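import requests
from bs4 import BeautifulSoup

# Proxy URLs were replaced with fake values in the post and are elided here.
http_proxy = ''
https_proxy = ''

# Target URL (elided in the post).
url = ''

proxies = {
    'http': http_proxy,
    'https': https_proxy
}
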
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}

r = requests.get(url, headers = headers, proxies = proxies)

soup = BeautifulSoup(r.text, 'html.parser')

results = soup.find('title')

products = results.text.strip()
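
One thing to watch for in the snippet above: if the proxy or the site sends back an error page with no <title>, soup.find('title') returns None and results.text raises AttributeError, which can hide the real problem. A minimal sketch of a guard, assuming the same 'html.parser' backend (extract_title is a hypothetical helper name, not something from the post):

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the stripped text of the first <title> tag, or None if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('title')
    if tag is None:
        # No <title> in the document, e.g. an error page or an empty body.
        return None
    return tag.get_text(strip=True)


if __name__ == '__main__':
    sample = '<html><head><title> Example page </title></head></html>'
    print(extract_title(sample))
```

With the code from the post, this would be called as extract_title(r.text) after the request succeeds.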


So I've done some experimenting and I have found some HTTPS websites that I can work with. Any leads on why StockX won't work would still be valuable.
I tried the following, which worked:
import requests
from bs4 import BeautifulSoup


def try_scrape():
    url = ''  # URL elided in the post
    response = requests.get(url)
    if response.status_code != 200:
        # Nothing to parse, so report the failure and stop.
        print('Problem fetching page')
        return
    soup = BeautifulSoup(response.content, 'lxml')
    results = soup.find('title')
    print(results)


if __name__ == '__main__':
    try_scrape()

(venv) Larz60p@linux-nnem: forum:$python
<title>StockX: Buy and Sell Sneakers, Streetwear, Handbags, Watches</title>
(venv) Larz60p@linux-nnem: forum:$

I vaguely remember having this same issue when using a proxy. I just can't recall what the solution was. I think I was using SOCKS5 to proxy to my server, though. I am not sure whether you are using a low-quality public proxy or something else.

If you cannot find an answer here, I would ask on the requests module's GitHub issues. You can scan the related issues there and maybe find an existing solution; otherwise, start a new issue.
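
Before opening an issue, it may help to narrow down where the reset happens by running the same request with and without the proxy and noting which exception class requests raises. The following is only a rough sketch: probe is a hypothetical helper, the get parameter exists just so the function can be tried without network access, and the URL and proxy values are left empty because they were elided in the thread.

```python
import requests


def probe(url, proxies=None, timeout=10, get=requests.get):
    """Fetch url once and return a short label describing the outcome.

    The exception class name is the useful part: ProxyError suggests the
    failure happened while talking to the proxy, SSLError points at the TLS
    handshake, and a plain ConnectionError covers resets such as
    WinError 10054.
    """
    try:
        response = get(url, proxies=proxies, timeout=timeout)
    except requests.exceptions.RequestException as exc:
        # All requests exceptions derive from RequestException.
        return f'{type(exc).__name__}: {exc}'
    return f'HTTP {response.status_code}'


if __name__ == '__main__':
    url = ''  # target URL, elided as in the thread
    proxies = {'http': '', 'https': ''}  # proxy URLs, elided as in the thread
    if url:
        print('no proxies argument:', probe(url))
        print('with proxies       :', probe(url, proxies=proxies))
```

Note that when no proxies argument is passed, requests still honours the HTTP_PROXY and HTTPS_PROXY environment variables by default, so the first call is not necessarily a direct connection. The two labels printed are the kind of detail worth including in an issue report.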