Python Forum
Circumvent the "access denied" page?
#1
Is there a way to circumvent the access denied page?

I am trying to find importers of agricultural chemicals in various countries for my girlfriend, so she can contact them.

import requests
from bs4 import BeautifulSoup

mylink = "https://www.distrilist.eu/cis/russia/39-import-export-companies-in-russia/"
res = requests.get(mylink)
I get:

Output:
res.text
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>403 Forbidden</title>\n</head><body>\n<h1>Forbidden</h1>\n<p>You don\'t have permission to access this resource.</p>\n</body></html>\n'
I can open the webpage normally on my computer and see the table I want to get.

I can open the source code and copy and paste the data I want, but it would be nicer with BeautifulSoup: put the output in a pandas DataFrame and export it to Excel!
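As a sketch of that DataFrame-to-Excel step (the table HTML below is a made-up stand-in for the real page, so this runs without network access):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical sample table standing in for the fetched page.
html = """
<table>
  <tr><td>company</td><td>country</td></tr>
  <tr><td>AgroChem Ltd</td><td>Russia</td></tr>
  <tr><td>FieldGrow LLC</td><td>Belarus</td></tr>
</table>
"""

# Collect the text of every cell, row by row.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in BeautifulSoup(html, "html.parser").find_all("tr")
]

# First row is the header, the rest is data.
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.shape)  # (2, 2)

# Export to Excel (needs the openpyxl package installed):
# df.to_excel("importers.xlsx", index=False)
```

With the real page, `rows` would come from the `response.content` parse instead of the sample string.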
Reply
#2
Add a User-Agent header.
import requests
from bs4 import BeautifulSoup

mylink = "https://www.distrilist.eu/cis/russia/39-import-export-companies-in-russia/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(mylink, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.select_one('header > h1'))
Output:
<h1 class="entry-title">39 Import Export Companies in Russia</h1>
Pedroski55 likes this post
Reply
#3
import csv
import socket

# external dependencies
import requests
from bs4 import BeautifulSoup



proxies = {}

# optional: use Tor as a proxy (skip this block if you don't need it)
try:
    import socks
    # if you want to use tor via socks, then install pysocks
    # the import name is socks, but via pip you have to install pysocks
    # or install requests[socks]
    # the square brackets are used often for additional dependencies for
    # packages
except ImportError:
    print("Could not import pysocks, so Tor cannot be used as a proxy")
    # no proxies are used if pysocks is not installed
else:
    with socket.socket() as sock:
        sock.settimeout(1)

        try:
            sock.connect(("127.0.0.1", 9050))
        except (TimeoutError, ConnectionError):
            print("Tor service does not seem to be running")
        else:
            # using socks5 and the h signals the use of the dns provided via tor
            proxies = {"http": "socks5h://127.0.0.1:9050"}
            proxies["https"] = proxies["http"]
# end of tor

def get_export_companies():
    url = "https://www.distrilist.eu/cis/russia/39-import-export-companies-in-russia/"
    
    # some webservers deny access if no valid User-Agent header is sent
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"}
    response = requests.get(url, headers=headers, proxies=proxies)

    # parse the raw content (bytes) of the response with bs4
    doc = BeautifulSoup(response.content, "html.parser")
    
    # table_header is the first element
    # table_rows is a list of table rows
    table_header, *table_rows = doc.find_all("tr")
    
    # yield the header as title
    # inside the tuple is a generator expression to get the text and
    # calling the .title() method on each string
    yield tuple(td.text.title() for td in table_header.find_all("td"))

    # yield the table rows
    for table_row in table_rows:
        # yield a tuple with the table data
        # inside the tuple is a generator expression to get the text
        yield tuple(table_data.text for table_data in table_row.find_all("td"))


def save_csv(file):
    # newline="" prevents blank rows in the CSV on Windows
    with open(file, "w", encoding="utf8", newline="") as fd:
        csv.writer(fd).writerows(get_export_companies())
        

save_csv("export_companies.csv")
Pedroski55 likes this post
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
Reply
#4
Thanks both of you!

I did figure out the bit with User Agent in the end:

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
result = requests.get(mylink, headers=headers)
This morning another, related problem. Again, I can open the webpage in my browser and see the page source code and the information I want to save, but I am getting the following error when I try to get the page with requests:

Quote:raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.schneider-group.com', port=443): Max retries exceeded with url: /en/about/contacts/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)')))

Is there anything I can do about this error?

companyweb = "https://www.schneider-group.com/en/about/contacts/"
result = requests.get(companyweb, headers=headers)
The contact page has various addresses, emails and phone numbers of offices in West and East Europe.
Reply
#5
Try this:
result = requests.get(companyweb, verify=False)
But this is only a workaround. requests ships with a bundle of trusted CA certificates (via certifi) and uses it to verify the certificate the webserver presents. If verification is disabled, no check is performed at all, which leaves the connection open to man-in-the-middle attacks.
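An error like this often means Python cannot find a CA bundle, or the server's CA is missing from it. A small stdlib sketch to inspect the defaults (the bundle path in the last comment is hypothetical):

```python
import ssl

# Where this Python installation looks for CA certificates by default.
paths = ssl.get_default_verify_paths()
print(paths.cafile, paths.capath)

# A default context checks both the certificate chain and the hostname.
ctx = ssl.create_default_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)  # True
print(ctx.check_hostname)                    # True

# With requests, a missing or corporate CA can be trusted explicitly
# instead of disabling verification (this bundle path is hypothetical):
# requests.get(companyweb, verify="/path/to/company-ca-bundle.pem")
```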
Reply
#6
@DeaD_EyE Thanks again!

Quote:result = requests.get(companyweb, verify=False)

Warning (from warnings module):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 1020
warnings.warn(
InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.schneider-group.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest...l-warnings

Oh well, I will try the next company!
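If you decide to keep verify=False anyway, that warning can be silenced selectively rather than ignoring all warnings; a sketch (urllib3 ships alongside requests):

```python
import warnings

try:
    import urllib3
    # Silence only this specific warning class, not all warnings.
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
except ImportError:
    # Fallback: match the warning by its message text.
    warnings.filterwarnings("ignore", message="Unverified HTTPS request")

# An "ignore" filter is now registered.
print(any(f[0] == "ignore" for f in warnings.filters))  # True
```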
Reply
#7
Using verify=False only produces a warning; it is not a stopping error, so you can still parse the page fine.
You can also use e.g. Selenium, which gives no warning, and if you need to parse JavaScript-rendered content it works for that too.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

# Setup
# https://storage.googleapis.com/chrome-for-testing-public/125.0.6422.141/win64/chromedriver-win64.zip
options = Options()
options.add_argument("--headless=new")
ser = Service(r"C:\cmder\bin\chromedriver.exe")
browser = webdriver.Chrome(service=ser, options=options)
# Parse or automation
url = "https://www.schneider-group.com/en/about/contacts/"
browser.get(url)
Armenia = browser.find_element(By.CSS_SELECTOR, '#bx_3218110189_13 > p')
print(Armenia.text)
Output:
Business Center "Yerevan Plaza", Grigor Lusavorich str. 9, Yerevan, 0015, Armenia
Pedroski55 likes this post
Reply
#8
@snippsat Thanks!

You are right, this

result = requests.get(companyweb, headers=headers, verify=False)
got me the warning:

Output:
Warning (from warnings module):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 1020
    warnings.warn(
InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.schneider-group.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
but also:

Output:
result
<Response [200]>
result.text
Squeezed text (545 lines)
Thanks for the selenium tip, I will try it!
Reply

