Python Forum
urlib - to use or not to use ( for web scraping )?
#37
(Dec-10-2018, 11:15 PM)Truman Wrote: Any idea what substitute to use with requests for read() and decode() attributes that are a part of urlib?
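You do not need to decode with Requests. One of its big advantages is that it works out the correct encoding from the website's response, so r.text is already a decoded string (r.content holds the raw bytes, like read()).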
>>> import requests
>>> 
>>> r = requests.get('http://python.org')
>>> r.status_code
200
>>> r.encoding
'utf-8'  # The encoding this website uses
So print(r.text) gives back correctly decoded text.
Output:
>>> print(r.text)
<!doctype html>
<!--[if lt IE 7]> <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9"> <![endif]-->
<!--[if IE 7]> <html class="no-js ie7 lt-ie8 lt-ie9"> <![endif]-->
<!--[if IE 8]> <html class="no-js ie8 lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr"> <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">
<meta name="application-name" content="Python.org">
<meta name="msapplication-tooltip" content="The official home of the Python Programming Language">
<meta name="apple-mobile-web-app-title" content="Python.org">
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="HandheldFriendly" content="True">
<meta name="format-detection" content="telephone=no">
<meta http-equiv="cleartype" content="on">
<meta http-equiv="imagetoolbar" content="false">
<script src="/static/js/libs/modernizr.js"></script>
<link href="/static/stylesheets/style.css" rel="stylesheet" type="text/css" title="default" />
<link href="/static/stylesheets/mq.css" rel="stylesheet" type="text/css" media="not print, braille, embossed, speech, tty" />
<!--[if (lte IE 8)&(!IEMobile)]> <link href="/static/stylesheets/no-mq.css" rel="stylesheet" type="text/css" media="screen" /> <![endif]-->
.........................
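To line the two libraries up side by side, here is a rough sketch of how read() and decode() in urllib map onto Requests. The function names, the timeout value, and the utf-8 fallback are my own assumptions for illustration, not something either library prescribes.

```python
from urllib.request import urlopen


def fetch_with_urllib(url):
    """urllib way: read() gives bytes, decode() turns them into str."""
    with urlopen(url) as resp:
        raw = resp.read()  # bytes
        # Charset from the Content-Type header; the utf-8 fallback is an assumption.
        charset = resp.headers.get_content_charset() or "utf-8"
    return raw.decode(charset)


def fetch_with_requests(url):
    """Requests way: .content is the bytes, .text is the decoded str."""
    # Imported here so the urllib half also runs where Requests is not installed.
    import requests

    r = requests.get(url, timeout=10)  # the timeout value is an arbitrary choice
    r.raise_for_status()
    return r.text  # decoded using r.encoding


if __name__ == "__main__":
    print(fetch_with_urllib("https://www.python.org/")[:60])
    print(fetch_with_requests("https://www.python.org/")[:60])
```

So roughly: read() corresponds to r.content, and read().decode(...) corresponds to r.text.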
Just remember to use content and not text when you pass the page to a parser, e.g. BS (BeautifulSoup).
BS does its own conversion to Unicode, so giving it the raw bytes means the page is not decoded twice.
Example:
from bs4 import BeautifulSoup
import requests
 
url = 'https://www.python.org/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml') # Note that content is used here, not text
print(soup.select('head > title')[0].text)
Output:
Welcome to Python.org
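To show why handing over bytes is the safer choice, here is a small standard-library sketch of what happens when bytes are decoded with the wrong encoding before the parser ever sees them. The sample string is made up, and latin-1 just stands in for any wrong guess.

```python
# Bytes as they would arrive from the server (what r.content holds).
raw = "<title>Café</title>".encode("utf-8")

# Decoding with the right encoding gives the original text back.
good = raw.decode("utf-8")

# Decoding with a wrong guess succeeds silently but mangles non-ASCII characters.
bad = raw.decode("latin-1")

print(good)  # <title>Café</title>
print(bad)   # <title>CafÃ©</title>
```

Once the text is mangled like this the parser cannot repair it, whereas with the raw bytes it can still look at things like the meta charset tag in the page itself.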


Messages In This Thread
RE: urlib - to use or not to use ( for web scraping )? - by snippsat - Dec-10-2018, 11:51 PM
