Python Forum
Intro to WebScraping
#1
Hello,

I am starting to build a web scraper for work. I am very inexperienced with Python, and I am running into a strange issue. I am starting with very basic concepts, for example finding and pulling the title from a webpage. When I use an HTTP site like Wikipedia.org, I can pull the title with no problem. However, when I use a website starting with HTTPS, I get the following error:

ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

Here is my code, in case that helps. I have replaced my proxy names with fake values. I am using Python 3.7. Any guidance is much appreciated!

import requests
from bs4 import BeautifulSoup

# Placeholder proxy addresses (the real values were replaced before posting)
http_proxy = 'http://abcd.org:1234'
https_proxy = 'http://abcd.org:1234'

url = 'https://www.stockx.com'

# Route both HTTP and HTTPS traffic through the proxy
proxies = {
    'http': http_proxy,
    'https': https_proxy
}

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}

r = requests.get(url, headers=headers, proxies=proxies)

soup = BeautifulSoup(r.text, 'html.parser')

# Pull the text of the <title> tag
results = soup.find('title')
products = results.text.strip()

print(products)

So I've done some experimenting and I have found some HTTPS websites that I can work with. Any leads on why StockX won't work would still be valuable.
#2
I tried the following, which worked:
import requests
from bs4 import BeautifulSoup

def try_scrape():
    url = 'https://www.stockx.com'

    response = requests.get(url)
    if response.status_code != 200:
        # Stop here: without a page there is nothing to parse
        print('Problem fetching page')
        return

    # The 'lxml' parser requires the lxml package to be installed
    soup = BeautifulSoup(response.content, 'lxml')
    results = soup.find('title')
    print(results)

if __name__ == '__main__':
    try_scrape()
Results:
(venv) Larz60p@linux-nnem: forum:$python d1rjr03.py
<title>StockX: Buy and Sell Sneakers, Streetwear, Handbags, Watches</title>
(venv) Larz60p@linux-nnem: forum:$
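
This version sends the request without the `proxies` argument used in the first post, so one way to narrow the problem down is to run the same request both ways and compare the outcomes. The following is a minimal sketch under stated assumptions, not a confirmed fix: `probe` is a hypothetical helper name, and the 10-second timeout is an arbitrary choice.

```python
import requests


def probe(url, proxies=None, timeout=10):
    """Fetch url once and return a short description of the outcome.

    probe is a hypothetical helper for this thread, not part of requests.
    """
    with requests.Session() as session:
        if proxies is None:
            # Also ignore proxy settings from environment variables,
            # so this really is a direct connection.
            session.trust_env = False
        try:
            response = session.get(url, proxies=proxies, timeout=timeout)
        except requests.exceptions.ProxyError as exc:
            return f'proxy error: {exc}'
        except requests.exceptions.SSLError as exc:
            return f'TLS error: {exc}'
        except requests.exceptions.Timeout as exc:
            return f'timed out: {exc}'
        except requests.exceptions.ConnectionError as exc:
            # A reset such as WinError 10054 typically surfaces here
            return f'connection error: {exc}'
        return f'HTTP {response.status_code}'
```

Calling `probe(url)` and then `probe(url, proxies=proxies)` with the `proxies` dictionary from the first post gives two outcomes to compare. If only the second call fails, the proxy is a more likely culprit than the site itself.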
#3
I vaguely remember having this same issue when using a proxy; I just can't recall what the solution was. I think I was using SOCKS5 to proxy to my server, though. I am not sure whether you are using a low-quality public proxy or something else.
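
For reference, requests can also go through a SOCKS5 proxy when its optional SOCKS extra is installed (`pip install requests[socks]`). A minimal sketch of the `proxies` dictionary for that case follows. `socks5_proxies` is a hypothetical helper, and the host and port are the placeholder values from the first post, not a real proxy.

```python
def socks5_proxies(host, port, remote_dns=True):
    """Build a requests-style proxies dict for a SOCKS5 proxy.

    socks5_proxies is a hypothetical helper, not part of requests.
    With remote_dns=True the 'socks5h' scheme asks the proxy to resolve
    hostnames; 'socks5' resolves them on the client instead.
    """
    scheme = 'socks5h' if remote_dns else 'socks5'
    proxy_url = f'{scheme}://{host}:{port}'
    return {'http': proxy_url, 'https': proxy_url}


# Placeholder host and port, as in the first post
proxies = socks5_proxies('abcd.org', 1234)
print(proxies)
```

The resulting dictionary is passed to `requests.get(url, proxies=proxies)` in the same way as the HTTP proxy settings in the first post.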

If you cannot find an answer here, I would ask on the requests module's GitHub issues. You can scan the related issues there and maybe find an existing solution; otherwise, start a new issue.
https://github.com/requests/requests/iss...s%3Aclosed