Mar-31-2019, 06:24 AM
Hello,
I'm trying to learn web scraping for data science/collection, but I've been struggling to use urllib (urlopen) with Wikipedia in any shape or form.
I'm following a tutorial that scrapes the same webpage, https://en.wikipedia.org/wiki/Genome, but I can't even get past the first part! It's kind of driving me mad.
At the moment, this is the code I have been following to learn web scraping:
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen

wiki_url = 'http://www.wikipedia.org/'
wiki_data = urlopen(wiki_url)
wiki_html = wiki_data.read()
wiki_data.close()
page_soup = bs(wiki_html, 'html.parser')
print(page_soup)

This code throws the following error:
---------------------------------------------------------------------------
SSLCertVerificationError                  Traceback (most recent call last)
~\Anaconda3\lib\urllib\request.py in do_open(self, http_class, req, **http_conn_args)
   1316                 h.request(req.get_method(), req.selector, req.data, headers,
-> 1317                           encode_chunked=req.has_header('Transfer-encoding'))
   1318             except OSError as err: # timeout error

~\Anaconda3\lib\http\client.py in request(self, method, url, body, headers, encode_chunked)
   1228         """Send a complete request to the server."""
-> 1229         self._send_request(method, url, body, headers, encode_chunked)
   1230

~\Anaconda3\lib\http\client.py in _send_request(self, method, url, body, headers, encode_chunked)
   1274             body = _encode(body, 'body')
-> 1275         self.endheaders(body, encode_chunked=encode_chunked)
   1276

~\Anaconda3\lib\http\client.py in endheaders(self, message_body, encode_chunked)
   1223             raise CannotSendHeader()
-> 1224         self._send_output(message_body, encode_chunked=encode_chunked)
   1225

~\Anaconda3\lib\http\client.py in _send_output(self, message_body, encode_chunked)
   1015         del self._buffer[:]
-> 1016         self.send(msg)
   1017

~\Anaconda3\lib\http\client.py in send(self, data)
    955             if self.auto_open:
--> 956                 self.connect()
    957             else:

~\Anaconda3\lib\http\client.py in connect(self)
   1391             self.sock = self._context.wrap_socket(self.sock,
-> 1392                                                   server_hostname=server_hostname)
   1393

~\Anaconda3\lib\ssl.py in wrap_socket(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, session)
    411             context=self,
--> 412             session=session
    413         )

~\Anaconda3\lib\ssl.py in _create(cls, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, context, session)
    852                         raise ValueError("do_handshake_on_connect should not be specified for non-blocking sockets")
--> 853                     self.do_handshake()
    854                 except (OSError, ValueError):

~\Anaconda3\lib\ssl.py in do_handshake(self, block)
   1116                 self.settimeout(None)
-> 1117             self._sslobj.do_handshake()
   1118         finally:

SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1051)

During handling of the above exception, another exception occurred:

URLError                                  Traceback (most recent call last)
<ipython-input-37-8d9d978790db> in <module>
      1 wiki_url = 'https://en.wikipedia.org/wiki/Genome'
----> 2 wiki_data = urlopen(wiki_url)
      3 wiki_html = wiki_data.read()
      4 wiki_data.close()
      5 page_soup = soup(wiki_html, 'html.parser')

~\Anaconda3\lib\urllib\request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223
    224 def install_opener(opener):

~\Anaconda3\lib\urllib\request.py in open(self, fullurl, data, timeout)
    523             req = meth(req)
    524
--> 525         response = self._open(req, data)
    526
    527         # post-process response

~\Anaconda3\lib\urllib\request.py in _open(self, req, data)
    541         protocol = req.type
    542         result = self._call_chain(self.handle_open, protocol, protocol +
--> 543                                   '_open', req)
    544         if result:
    545             return result

~\Anaconda3\lib\urllib\request.py in _call_chain(self, chain, kind, meth_name, *args)
    501         for handler in handlers:
    502             func = getattr(handler, meth_name)
--> 503             result = func(*args)
    504             if result is not None:
    505                 return result

~\Anaconda3\lib\urllib\request.py in https_open(self, req)
   1358     def https_open(self, req):
   1359         return self.do_open(http.client.HTTPSConnection, req,
-> 1360                             context=self._context, check_hostname=self._check_hostname)
   1361
   1362     https_request = AbstractHTTPHandler.do_request_

~\Anaconda3\lib\urllib\request.py in do_open(self, http_class, req, **http_conn_args)
   1317                           encode_chunked=req.has_header('Transfer-encoding'))
   1318             except OSError as err: # timeout error
-> 1319                 raise URLError(err)
   1320             r = h.getresponse()
   1321         except:

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1051)>

I tried looking all over Google, but the only information I could find about this error was either for Amazon, for Macs, or the fix didn't solve my issue. The code I have written does pull from other websites; I tested it with Google, amazon.com, and random sites. It is only the certificate for Wikipedia that fails.
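(For reference, a workaround that is often suggested for this kind of error on Anaconda setups is to hand urlopen an SSL context built from certifi's CA bundle, since certifi ships with Anaconda. This is only a sketch of that idea, not something I can confirm fixes my machine:

import ssl
import certifi
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen

wiki_url = 'https://en.wikipedia.org/wiki/Genome'

# Build an SSL context that trusts certifi's CA bundle instead of whatever
# (possibly incomplete) certificate store urlopen finds by default.
context = ssl.create_default_context(cafile=certifi.where())

wiki_data = urlopen(wiki_url, context=context)  # pass the context explicitly
wiki_html = wiki_data.read()
wiki_data.close()
page_soup = bs(wiki_html, 'html.parser')
print(page_soup.title)
)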
This is just annoying. I am on a laptop running Windows 10, I'm using Anaconda3 with Jupyter Notebook, and I haven't learned much about Python yet (I'm still getting a handle on the programming side, so this secondary issue is too complicated for me at the moment).
I will be forever indebted to whoever helps me! If you need any more information, just let me know.
Thanks in advance,
FalseFact
EDIT: I saw there was a Python interpreter on the forum, so I checked whether the code worked there, and it did. So I know this is more than likely something I'm missing on my end, but I cannot find a solution to it. I hope someone else has an idea for this! THANKS!
Sorry for the spam; I couldn't edit my original post,
but I just wanted to say that my failure to search the forum first created this unnecessary thread, and I apologize.
I didn't know that requests is generally used now instead of urllib,
but I ended up rewriting my code with requests to get the same outcome I was trying to achieve before:
from bs4 import BeautifulSoup as soup
import requests

wiki_url = 'https://en.wikipedia.org/wiki/Genome'
wiki_html = requests.get(wiki_url, verify=True)
page_soup = soup(wiki_html.text, 'html.parser')
print(page_soup)

Thanks, and sorry again for posting this. If someone can delete this post, that would be great; otherwise I've marked it as solved.
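P.S. In case it helps anyone following the same tutorial, here is a tiny sketch of pulling something useful out of page_soup. It assumes the article still exposes its heading in the usual h1 element with id "firstHeading" and its body text in <p> tags, which may change:

# Grab the article heading (Wikipedia currently renders it as <h1 id="firstHeading">).
heading = page_soup.find('h1', id='firstHeading')
print(heading.get_text() if heading else 'heading not found')

# Print the first few paragraphs of the article body.
for paragraph in page_soup.find_all('p')[:3]:
    print(paragraph.get_text())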
FalseFact