Posts: 1,090
Threads: 143
Joined: Jul 2017
Is there a way to circumvent the access denied page?
I am trying to find importers of agricultural chemicals in various countries for my girlfriend, so she can contact them.
import requests
from bs4 import BeautifulSoup
mylink = "https://www.distrilist.eu/cis/russia/39-import-export-companies-in-russia/"
res = requests.get(mylink)
I get:
Output: res.text
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>403 Forbidden</title>\n</head><body>\n<h1>Forbidden</h1>\n<p>You don\'t have permission to access this resource.</p>\n</body></html>\n'
I can open the webpage normally on my computer and see the table I want to get.
I can open the source code and copy and paste the data I want, but it would be nicer with BeautifulSoup! Then I could put the output in a pandas DataFrame and export it to Excel!
Posts: 7,313
Threads: 123
Joined: Sep 2016
Jun-13-2024, 04:34 PM
(This post was last modified: Jun-13-2024, 04:35 PM by snippsat.)
Add a User-Agent header.
import requests
from bs4 import BeautifulSoup
mylink = "https://www.distrilist.eu/cis/russia/39-import-export-companies-in-russia/"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(mylink, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.select_one('header > h1'))
Output: <h1 class="entry-title">39 Import Export Companies in Russia</h1>
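To continue toward the DataFrame-and-Excel goal from the first post, a minimal sketch of the remaining step; the inline `html` string here is only a stand-in for `response.content` from the request above, and pandas plus an Excel engine such as openpyxl are assumed to be installed:

```python
import pandas as pd
from bs4 import BeautifulSoup

# stand-in for response.content; the live page's table would be
# parsed exactly the same way
html = """
<table>
  <tr><td>Company</td><td>City</td></tr>
  <tr><td>AgroChem Ltd</td><td>Moscow</td></tr>
  <tr><td>RusImport</td><td>Kazan</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# first row is the header, the rest are data rows
header, *rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
]
df = pd.DataFrame(rows, columns=header)

try:
    # .xlsx export needs an engine such as openpyxl
    df.to_excel("companies.xlsx", index=False)
except ImportError:
    # fall back to CSV when no Excel engine is installed
    df.to_csv("companies.csv", index=False)
```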
Pedroski55 likes this post
Posts: 2,125
Threads: 11
Joined: May 2017
Jun-13-2024, 05:12 PM
(This post was last modified: Jun-13-2024, 05:12 PM by DeaD_EyE.)
import csv
import socket

# external dependency
import requests
from bs4 import BeautifulSoup

proxies = {}

# do not use this code
try:
    import socks
    # if you want to use Tor via SOCKS, install pysocks:
    # the import name is socks, but via pip you have to install pysocks,
    # or install requests[socks]
    # (square brackets are often used for a package's optional
    # extra dependencies)
except ImportError:
    print("Could not import pysocks, so Tor cannot be used as a proxy")
    # not using proxies if socks is not installed
else:
    with socket.socket() as sock:
        sock.settimeout(1)
        try:
            sock.connect(("127.0.0.1", 9050))
        except (TimeoutError, ConnectionError):
            print("Tor service seems not to be running")
        else:
            # using socks5; the h signals use of the DNS provided via Tor
            proxies = {"http": "socks5h://127.0.0.1:9050"}
            proxies["https"] = proxies["http"]
# end of tor


def get_export_companies():
    url = "https://www.distrilist.eu/cis/russia/39-import-export-companies-in-russia/"
    # some webservers disallow access if no valid User-Agent is sent
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"}
    response = requests.get(url, headers=headers, proxies=proxies)
    # parse the raw content (bytes) of the response with bs4
    doc = BeautifulSoup(response.content, "html.parser")
    # table_header is the first element,
    # table_rows is a list of the remaining table rows
    table_header, *table_rows = doc.find_all("tr")
    # yield the header as title; inside the tuple is a generator
    # expression that gets the text and calls str.title() on it
    yield tuple(td.text.title() for td in table_header.find_all("td"))
    # yield the table rows
    for table_row in table_rows:
        # yield a tuple with the table data; inside the tuple is a
        # generator expression that gets the text
        yield tuple(table_data.text for table_data in table_row.find_all("td"))


def save_csv(file):
    # newline="" keeps the csv module from writing blank lines on Windows
    with open(file, "w", encoding="utf8", newline="") as fd:
        csv.writer(fd).writerows(get_export_companies())


save_csv("export_companies.csv")
Pedroski55 likes this post
Posts: 1,090
Threads: 143
Joined: Jul 2017
Thanks both of you!
I did figure out the bit with User Agent in the end:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
result = requests.get(mylink, headers=headers)
This morning, another related problem. Again, I can open the webpage in my browser and see the page source code and the information I want to save, but I am getting the following error when I try to get the page with requests:
Quote:raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.schneider-group.com', port=443): Max retries exceeded with url: /en/about/contacts/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)')))
Is there anything I can do about this error?
companyweb = "https://www.schneider-group.com/en/about/contacts/"
result = requests.get(companyweb, headers=headers)
The contact page has various addresses, emails and phone numbers of offices in Western and Eastern Europe.
Posts: 2,125
Threads: 11
Joined: May 2017
Try this:
result = requests.get(companyweb, verify=False)
But this is only a workaround. Normally requests ships all required CA certificates as a bundle and uses them to verify the certs from the webserver. If verification is disabled, then there is no check at all.
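The bundle in question comes from the certifi package, which requests uses by default. A sketch of pointing `verify` at a bundle explicitly instead of disabling it; the live request is left commented out, and the extra-PEM path is a hypothetical placeholder for a locally obtained issuer certificate:

```python
import os

import certifi  # the CA bundle requests verifies against by default

bundle = certifi.where()  # filesystem path to certifi's cacert.pem
print(bundle)

# pass the bundle (or a local PEM containing the missing issuer) explicitly:
# result = requests.get(companyweb, verify=bundle)
# result = requests.get(companyweb, verify="/path/to/extra-ca.pem")  # hypothetical
```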
Posts: 1,090
Threads: 143
Joined: Jul 2017
@ DeaD_EyE Thanks again!
Quote:result = requests.get(companyweb, verify=False)
Warning (from warnings module):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 1020
warnings.warn(
InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.schneider-group.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest...l-warnings
Oh well, I will try the next company!
Posts: 7,313
Threads: 123
Joined: Sep 2016
Jun-14-2024, 05:08 PM
(This post was last modified: Jun-14-2024, 05:08 PM by snippsat.)
Using verify=False will give a warning, but it is not a stopping error; you can still parse fine.
You can also use e.g. Selenium: then there is no warning, and if you want to parse JavaScript-rendered content, that works too.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
# Setup
# https://storage.googleapis.com/chrome-for-testing-public/125.0.6422.141/win64/chromedriver-win64.zip
options = Options()
options.add_argument("--headless=new")
ser = Service(r"C:\cmder\bin\chromedriver.exe")
browser = webdriver.Chrome(service=ser, options=options)
# Parse or automation
url = "https://www.schneider-group.com/en/about/contacts/"
browser.get(url)
Armenia = browser.find_element(By.CSS_SELECTOR, '#bx_3218110189_13 > p')
print(Armenia.text)
Output: Business Center "Yerevan Plaza", Grigor Lusavorich str. 9, Yerevan, 0015, Armenia
Pedroski55 likes this post
Posts: 1,090
Threads: 143
Joined: Jul 2017
@ snippsat Thanks!
You are right, this
result = requests.get(companyweb, headers=headers, verify=False)
got me the warning:
Output: Warning (from warnings module):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 1020
warnings.warn(
InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.schneider-group.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
but also:
Output: result
<Response [200]>
result.text
Squeezed text (545 lines)
Thanks for the selenium tip, I will try it!