Python Forum
urllib - to use or not to use (for web scraping)?
#31
Quote: that's cool, but it requires knowing in the first place that utils exists... Is there any command, such as dir, that gives a list of all the methods? It seems the answer is negative.
Well, you don't know unless someone tells you or you go digging.
You can start by looking at the documentation of top-level packages.
For example, if you run help() on requests itself, you'll see:
Output:
PACKAGE CONTENTS
    __version__
    _internal_utils
    adapters
    api
    auth
    certs
    compat
    cookies
    exceptions
    help
    hooks
    models
    packages
    sessions
    status_codes
    structures
    utils
Each of these has separate documentation, and utils is listed there.

As a habit, when I'm not busy, I browse various packages to see what they contain. There's no way
to know what every package is and what it contains; as of this minute, PyPI hosts 159,959 packages.
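To answer the dir question directly: dir() does work once the package is imported, and the standard-library pkgutil module can list a package's submodules. A minimal sketch, assuming requests is installed:
import pkgutil
import requests

# dir() lists the names bound on the imported package object
print([name for name in dir(requests) if not name.startswith('_')])

# pkgutil walks the package's directory and yields its submodules,
# which is essentially the PACKAGE CONTENTS list help() shows
for module_info in pkgutil.iter_modules(requests.__path__):
    print(module_info.name)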
#32
(Nov-28-2018, 10:25 PM)Truman Wrote: that's cool, but it requires knowing in the first place that utils exists... Is there any command, such as dir, that gives a list of all the methods? It seems the answer is negative.

There is a way, if you use IPython or bpython as a REPL, for example. Import the requests module, type requests. and hit TAB for autocompletion. You will see it: requests.utils is there.

You can get autocompletion in the plain Python REPL too: https://gableroux.com/python/2016/01/20/...ocomplete/
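For reference, the trick behind that link boils down to the standard-library readline and rlcompleter pair; a minimal sketch (recent Python versions enable this by default in the interactive interpreter):
import readline
import rlcompleter  # importing this registers a completer with readline

readline.parse_and_bind('tab: complete')

import requests
# typing requests.<TAB> at the prompt now lists attributes,
# including requests.utils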
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#33
Larz, where do you get this PACKAGE CONTENTS list? I don't see it on PyPI.

wavic, the things you mentioned are completely new to me. So far I have used only Microsoft Azure Notebooks (with Jupyter). Should I install IPython through Anaconda?
#34
Open a Python interpreter:
Book $ python
Python 3.7.1 (default, Nov 20 2018, 18:13:14) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> help(requests)
Scroll down (or press the spacebar for the next page) and you'll find the list near the top.
There are also classes within the package.
#35
(Nov-29-2018, 11:10 PM)Truman Wrote: Larz, where do you get this PACKAGE CONTENTS list? I don't see it on PyPI.

wavic, the things you mentioned are completely new to me. So far I have used only Microsoft Azure Notebooks (with Jupyter). Should I install IPython through Anaconda?
You can do the same with Jupyter. It is the successor of IPython.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#36
Any idea what substitute to use with requests for the read() and decode() methods that are part of urllib?
For example, in this code:
response = requests.get("http://freegeoip.net/json/"+ipAddress).read().decode('utf-8')
I tried adding .utils to this code in several places, but it doesn't work.
#37
(Dec-10-2018, 11:15 PM)Truman Wrote: Any idea what substitute to use with requests for the read() and decode() methods that are part of urllib?
You do not need to decode with Requests; one of its big advantages is that it gets the correct encoding from the website.
>>> import requests
>>> 
>>> r = requests.get('http://python.org')
>>> r.status_code
200
>>> r.encoding  # the encoding this website uses
'utf-8'
So print(r.text) gives you the correctly decoded text.
Output:
>>> print(r.text)
<!doctype html>
<!--[if lt IE 7]> <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9"> <![endif]-->
<!--[if IE 7]> <html class="no-js ie7 lt-ie8 lt-ie9"> <![endif]-->
<!--[if IE 8]> <html class="no-js ie8 lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr"> <!--<![endif]-->
<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">
    <meta name="application-name" content="Python.org">
    <meta name="msapplication-tooltip" content="The official home of the Python Programming Language">
    <meta name="apple-mobile-web-app-title" content="Python.org">
    <meta name="apple-mobile-web-app-capable" content="yes">
    <meta name="apple-mobile-web-app-status-bar-style" content="black">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta name="HandheldFriendly" content="True">
    <meta name="format-detection" content="telephone=no">
    <meta http-equiv="cleartype" content="on">
    <meta http-equiv="imagetoolbar" content="false">
    <script src="/static/js/libs/modernizr.js"></script>
    <link href="/static/stylesheets/style.css" rel="stylesheet" type="text/css" title="default" />
    <link href="/static/stylesheets/mq.css" rel="stylesheet" type="text/css" media="not print, braille, embossed, speech, tty" />
    <!--[if (lte IE 8)&(!IEMobile)]>
    <link href="/static/stylesheets/no-mq.css" rel="stylesheet" type="text/css" media="screen" />
    <![endif]-->
    .........................
Just remember to use content and not text when feeding a parser such as BeautifulSoup (BS),
because BS does its own decoding to Unicode; using text would decode the document twice.
Example:
from bs4 import BeautifulSoup
import requests
 
url = 'https://www.python.org/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')  # note that content is used here
print(soup.select('head > title')[0].text)
Output:
Welcome to Python.org
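Tying this back to the earlier freegeoip example: for a JSON endpoint, requests can replace the whole read()/decode()/json.loads() chain with the Response.json() method. A minimal sketch (the freegeoip URL is taken from the question and may no longer be live):
import requests

ip_address = '8.8.8.8'  # hypothetical example address
response = requests.get('http://freegeoip.net/json/' + ip_address)
data = response.json()  # decodes the body and parses the JSON in one step
print(data)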
#38
Thank you.

Now I'm trying to write code that downloads images from a page, and as you can imagine... it isn't going that well.
import requests
from bs4 import BeautifulSoup
import shutil

html = requests.get("http://www.pythonscraping.com", stream=True)
bsObj = BeautifulSoup(html.content, 'html.parser')
imageLocation = bsObj.find("a", {"id":"logo"}).find("img")["src"]
with open('img.jpg', 'wb') as out_file:
	shutil.copyfileobj(imageLocation, out_file)
Error:
Traceback (most recent call last):
  File "C:\Python36\kodovi\crawler3.py", line 9, in <module>
    shutil.copyfileobj(imageLocation, out_file)
  File "C:\Python36\lib\shutil.py", line 79, in copyfileobj
    buf = fsrc.read(length)
AttributeError: 'str' object has no attribute 'read'
This is the urllib code from the book that I'm trying to transform:
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html)
imageLocation = bsObj.find("a", {"id": "logo"}).find("img")["src"]
urlretrieve(imageLocation, "logo.jpg")
#39
Try:
import requests
from bs4 import BeautifulSoup


url = "http://www.pythonscraping.com"
html = requests.get(url, stream=True)
if html.status_code == 200:
    bsObj = BeautifulSoup(html.content, 'html.parser')
    imageLocation = bsObj.find("a", {"id":"logo"}).find("img")["src"]

    image = requests.get(imageLocation)
    if image.status_code == 200:
        with open('img.jpg', 'wb') as out_file:
            out_file.write(image.content)
    else:
        print(f'Problem fetching image status code: {image.status_code}')
else:
    print(f'Problem fetching {url} status code: {html.status_code}')
-- Edit: modified the 2nd request to check the status code --
#40
Larz60+'s code is correct.
In your first code you only need to change line 9 to this, and remove shutil:
out_file.write(requests.get(image_location).content)
Sometimes it's also nice to keep the original image name:
import requests, os
from bs4 import BeautifulSoup

html = requests.get("http://www.pythonscraping.com")
bs_obj = BeautifulSoup(html.content, 'html.parser')
image_location = bs_obj.find("a", id='logo').find("img")["src"]
image_name = os.path.basename(image_location)
with open(image_name, 'wb') as out_file:
    out_file.write(requests.get(image_location).content)
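One possible refinement for large images: pass stream=True to the image request and write it in chunks with iter_content(), so the whole file never has to sit in memory. A minimal sketch along the lines of the code above:
import os
import requests
from bs4 import BeautifulSoup

html = requests.get("http://www.pythonscraping.com")
bs_obj = BeautifulSoup(html.content, 'html.parser')
image_location = bs_obj.find("a", id='logo').find("img")["src"]
image_name = os.path.basename(image_location)

# stream=True defers downloading the body until it is iterated
image = requests.get(image_location, stream=True)
if image.status_code == 200:
    with open(image_name, 'wb') as out_file:
        for chunk in image.iter_content(chunk_size=8192):
            out_file.write(chunk)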