Mar-31-2019, 06:24 AM
Hello,
I'm trying to learn web scraping for data science/collection, but I've been struggling to use urllib (urlopen) with Wikipedia in any shape or form.
I'm following a tutorial that scrapes the same webpage, https://en.wikipedia.org/wiki/Genome, but I can't even get past the first part! It's kind of driving me mad.
At the moment, this is the code I have been following to learn web scraping:
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen

wiki_url = 'http://www.wikipedia.org/'
wiki_data = urlopen(wiki_url)
wiki_html = wiki_data.read()
wiki_data.close()
page_soup = bs(wiki_html, 'html.parser')
print(page_soup)

This code throws the following error:
---------------------------------------------------------------------------
SSLCertVerificationError                  Traceback (most recent call last)
~\Anaconda3\lib\urllib\request.py in do_open(self, http_class, req, **http_conn_args)
   1316                 h.request(req.get_method(), req.selector, req.data, headers,
-> 1317                           encode_chunked=req.has_header('Transfer-encoding'))
   1318             except OSError as err: # timeout error

~\Anaconda3\lib\http\client.py in request(self, method, url, body, headers, encode_chunked)
   1228         """Send a complete request to the server."""
-> 1229         self._send_request(method, url, body, headers, encode_chunked)
   1230

~\Anaconda3\lib\http\client.py in _send_request(self, method, url, body, headers, encode_chunked)
   1274             body = _encode(body, 'body')
-> 1275         self.endheaders(body, encode_chunked=encode_chunked)
   1276

~\Anaconda3\lib\http\client.py in endheaders(self, message_body, encode_chunked)
   1223             raise CannotSendHeader()
-> 1224         self._send_output(message_body, encode_chunked=encode_chunked)
   1225

~\Anaconda3\lib\http\client.py in _send_output(self, message_body, encode_chunked)
   1015         del self._buffer[:]
-> 1016         self.send(msg)
   1017

~\Anaconda3\lib\http\client.py in send(self, data)
    955             if self.auto_open:
--> 956                 self.connect()
    957             else:

~\Anaconda3\lib\http\client.py in connect(self)
   1391             self.sock = self._context.wrap_socket(self.sock,
-> 1392                                                   server_hostname=server_hostname)
   1393

~\Anaconda3\lib\ssl.py in wrap_socket(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, session)
    411             context=self,
--> 412             session=session
    413         )

~\Anaconda3\lib\ssl.py in _create(cls, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, context, session)
    852                         raise ValueError("do_handshake_on_connect should not be specified for non-blocking sockets")
--> 853                     self.do_handshake()
    854                 except (OSError, ValueError):

~\Anaconda3\lib\ssl.py in do_handshake(self, block)
   1116                 self.settimeout(None)
-> 1117             self._sslobj.do_handshake()
   1118         finally:

SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1051)

During handling of the above exception, another exception occurred:

URLError                                  Traceback (most recent call last)
<ipython-input-37-8d9d978790db> in <module>
      1 wiki_url = 'https://en.wikipedia.org/wiki/Genome'
----> 2 wiki_data = urlopen(wiki_url)
      3 wiki_html = wiki_data.read()
      4 wiki_data.close()
      5 page_soup = soup(wiki_html, 'html.parser')

~\Anaconda3\lib\urllib\request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223
    224 def install_opener(opener):

~\Anaconda3\lib\urllib\request.py in open(self, fullurl, data, timeout)
    523             req = meth(req)
    524
--> 525         response = self._open(req, data)
    526
    527         # post-process response

~\Anaconda3\lib\urllib\request.py in _open(self, req, data)
    541         protocol = req.type
    542         result = self._call_chain(self.handle_open, protocol, protocol +
--> 543                                   '_open', req)
    544         if result:
    545             return result

~\Anaconda3\lib\urllib\request.py in _call_chain(self, chain, kind, meth_name, *args)
    501         for handler in handlers:
    502             func = getattr(handler, meth_name)
--> 503             result = func(*args)
    504             if result is not None:
    505                 return result

~\Anaconda3\lib\urllib\request.py in https_open(self, req)
   1358     def https_open(self, req):
   1359         return self.do_open(http.client.HTTPSConnection, req,
-> 1360                             context=self._context, check_hostname=self._check_hostname)
   1361
   1362     https_request = AbstractHTTPHandler.do_request_

~\Anaconda3\lib\urllib\request.py in do_open(self, http_class, req, **http_conn_args)
   1317                           encode_chunked=req.has_header('Transfer-encoding'))
   1318             except OSError as err: # timeout error
-> 1319                 raise URLError(err)
   1320             r = h.getresponse()
   1321         except:

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1051)>

I tried looking all over Google, but the only information I could find about this error was either for Amazon, for Macs, or the fix didn't solve my issue. The code I have written does pull from other websites; I tested it with Google, amazon.com, and random sites. It is only the certificate for Wikipedia that fails.
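(For reference, a workaround that is often suggested for this kind of error on Anaconda setups is to hand urlopen an SSL context built from certifi's CA bundle, since certifi ships with Anaconda. This is only a sketch of that idea, not something I can confirm fixes my machine:

import ssl
import certifi
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen

wiki_url = 'https://en.wikipedia.org/wiki/Genome'

# Build an SSL context that trusts certifi's CA bundle instead of whatever
# (possibly incomplete) certificate store urlopen finds by default.
context = ssl.create_default_context(cafile=certifi.where())

wiki_data = urlopen(wiki_url, context=context)  # pass the context explicitly
wiki_html = wiki_data.read()
wiki_data.close()
page_soup = bs(wiki_html, 'html.parser')
print(page_soup.title)
)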
This is just annoying. I am on a laptop running Windows 10, I'm using Anaconda3 with Jupyter Notebook, and I haven't learned much about Python yet (I'm still getting a handle on the programming side, so this secondary issue is too complicated for me at the moment).
I will be forever indebted to whoever helps me! If you need any more information, just let me know.
Thanks in advance,
FalseFact
EDIT: I saw there was a Python interpreter on the forum, so I checked whether the code worked there, and it did. So I know this is more than likely something I'm missing on my end, but I cannot find a solution to it. I hope someone else has an idea for this! THANKS!
Sorry for the spam; I couldn't edit my original post,
but I just wanted to say that my failure to search the forum first created this unnecessary thread, and I apologize.
I didn't know that requests is generally used now instead of urllib,
but I ended up rewriting my code with requests to get the same outcome I was trying to achieve before:
from bs4 import BeautifulSoup as soup
import requests

wiki_url = 'https://en.wikipedia.org/wiki/Genome'
wiki_html = requests.get(wiki_url, verify=True)
page_soup = soup(wiki_html.text, 'html.parser')
print(page_soup)

Thanks, and sorry again for posting this. If someone can delete this post, that would be great; otherwise I've marked it as solved.
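P.S. In case it helps anyone following the same tutorial, here is a tiny sketch of pulling something useful out of page_soup. It assumes the article still exposes its heading in the usual h1 element with id "firstHeading" and its body text in <p> tags, which may change:

# Grab the article heading (Wikipedia currently renders it as <h1 id="firstHeading">).
heading = page_soup.find('h1', id='firstHeading')
print(heading.get_text() if heading else 'heading not found')

# Print the first few paragraphs of the article body.
for paragraph in page_soup.find_all('p')[:3]:
    print(paragraph.get_text())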
FalseFact