urlib - to use or not to use ( for web scraping )? - Printable Version

Python Forum (https://python-forum.io) - Thread: urlib - to use or not to use ( for web scraping )? (/thread-13080.html)
RE: urlib - to use or not to use ( for web scraping )? - Larz60+ - Nov-28-2018

Quote: that's cool, but the condition to be met is that we know at first that utils exists... Is there any command such as dir that gives a list of all the methods? It looks like the answer is negative.

Well, you don't know unless someone tells you or you go digging. You can start by looking at the documentation of top-level packages. For example, if you look at the documentation for requests itself, you'll see that each of its sub-modules has its own documentation, and utils is listed there. As a habit, when I'm not busy, I browse various packages to see what they contain. There's no way to know what every one of them is or what it contains. As of this minute, PyPI contains 159,959 packages.

RE: urlib - to use or not to use ( for web scraping )? - wavic - Nov-29-2018

(Nov-28-2018, 10:25 PM) Truman Wrote: that's cool, but the condition to be met is that we know at first that utils exists... Is there any command such as dir that gives a list of all the methods? It looks like the answer is negative.

There is a way. Use IPython (or bpython) as a REPL, for example: import the requests module, type requests. and hit TAB for autocompletion. You will see it - requests.utils is there. You can get autocompletion in the plain Python REPL too: https://gableroux.com/python/2016/01/20/python-interpreter-autocomplete/

RE: urlib - to use or not to use ( for web scraping )? - Truman - Nov-29-2018

Larz, where do you get this PACKAGE CONTENTS? I don't see it on PyPI. wavic, the things you mentioned are completely new to me. So far I have only used Microsoft Azure Notebooks (with Jupyter). Should I install IPython through Anaconda?

RE: urlib - to use or not to use ( for web scraping )? - Larz60+ - Nov-29-2018

Open a Python interpreter:

$ python
Python 3.7.1 (default, Nov 20 2018, 18:13:14)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> help(requests)

Scroll down (or press the spacebar for the next page) and you'll find the PACKAGE CONTENTS list near the top. There are also classes within the package.

RE: urlib - to use or not to use ( for web scraping )? - wavic - Nov-30-2018

(Nov-29-2018, 11:10 PM) Truman Wrote: Larz, where do you get this PACKAGE CONTENTS? I don't see it on PyPI.

You can do the same with Jupyter. It is the successor of IPython.

RE: urlib - to use or not to use ( for web scraping )? - Truman - Dec-10-2018

Any idea what to use in requests as a substitute for the read() and decode() methods that are part of urllib? For example, in this code:

response = requests.get("http://freegeoip.net/json/" + ipAddress).read().decode('utf-8')

I tried adding .utils to this code in several positions, but it doesn't work.
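Pulling the dir()/TAB/help() suggestions above together: the same package exploration can also be done without IPython at all. The sketch below is only an illustration (it is not from the thread) and uses nothing beyond the standard library's dir() and pkgutil plus an installed requests:

import pkgutil
import requests

# dir() lists the names bound on the imported package object; requests
# imports several of its sub-modules at import time, so utils shows up here.
print([name for name in dir(requests) if not name.startswith('_')])

# pkgutil.iter_modules walks the package directory and reports every
# sub-module, imported or not - roughly what help()'s PACKAGE CONTENTS shows.
for module_info in pkgutil.iter_modules(requests.__path__):
    print(module_info.name)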
RE: urlib - to use or not to use ( for web scraping )? - snippsat - Dec-10-2018

(Dec-10-2018, 11:15 PM) Truman Wrote: Any idea what to use in requests as a substitute for the read() and decode() methods that are part of urllib?

You do not need to decode with Requests; one of its big advantages is that it picks up the correct encoding from the web site.

>>> import requests
>>>
>>> r = requests.get('http://python.org')
>>> r.status_code
200
>>> r.encoding
'utf-8'  # the encoding this web site uses

So print(r.text) gives you the correctly decoded text. Just remember to use content rather than text when handing the response to a parser such as BeautifulSoup, because BS does its own decoding to Unicode and the data shouldn't be decoded twice. Example:

from bs4 import BeautifulSoup
import requests

url = 'https://www.python.org/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')  # note that content is used here
print(soup.select('head > title')[0].text)
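For Truman's concrete freegeoip line, the requests equivalent is simply the .text property or the .json() method on the response object - no read() or decode() needed. A rough sketch (assuming the freegeoip endpoint from the post still answers, and using a hypothetical ip_address value):

import requests

ip_address = '8.8.8.8'  # hypothetical example value
response = requests.get('http://freegeoip.net/json/' + ip_address)

text = response.text    # body already decoded to str with the detected encoding
data = response.json()  # or parse the JSON body straight into a dict
print(text)
print(data)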
RE: urlib - to use or not to use ( for web scraping )? - Truman - Dec-11-2018

Thank you. Now I'm trying to write code that downloads an image from a page, and as you can imagine... it doesn't go that well.

import requests
from bs4 import BeautifulSoup
import shutil

html = requests.get("http://www.pythonscraping.com", stream=True)
bsObj = BeautifulSoup(html.content, 'html.parser')
imageLocation = bsObj.find("a", {"id":"logo"}).find("img")["src"]
with open('img.jpg', 'wb') as out_file:
    shutil.copyfileobj(imageLocation, out_file)

This is the urllib code from the book that I'm trying to transform:

from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html)
imageLocation = bsObj.find("a", {"id": "logo"}).find("img")["src"]
urlretrieve(imageLocation, "logo.jpg")

RE: urlib - to use or not to use ( for web scraping )? - Larz60+ - Dec-12-2018

Try:

import requests
from bs4 import BeautifulSoup

url = "http://www.pythonscraping.com"
html = requests.get(url, stream=True)
if html.status_code == 200:
    bsObj = BeautifulSoup(html.content, 'html.parser')
    imageLocation = bsObj.find("a", {"id": "logo"}).find("img")["src"]
    image = requests.get(imageLocation)
    if image.status_code == 200:
        with open('img.jpg', 'wb') as out_file:
            out_file.write(image.content)
    else:
        print(f'Problem fetching image, status code: {image.status_code}')
else:
    print(f'Problem fetching {url}, status code: {html.status_code}')

-- Edit: modified the 2nd request; it should check the status code --

RE: urlib - to use or not to use ( for web scraping )? - snippsat - Dec-12-2018

Larz60+'s code is correct. In your first code you only need to change line 9 (the shutil.copyfileobj call) to this, and remove shutil:

out_file.write(requests.get(imageLocation).content)

Sometimes it's also nice to keep the original image name:

import requests, os
from bs4 import BeautifulSoup

html = requests.get("http://www.pythonscraping.com")
bs_obj = BeautifulSoup(html.content, 'html.parser')
image_location = bs_obj.find("a", id='logo').find("img")["src"]
image_name = os.path.basename(image_location)
with open(image_name, 'wb') as out_file:
    out_file.write(requests.get(image_location).content)
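As a footnote to the exchange above: Truman's original stream=True / shutil.copyfileobj idea also works, as long as copyfileobj is handed the image response's raw file object rather than the URL string. The following is a sketch only (assuming the page still serves an <a id="logo"> wrapping an <img>, and using urljoin in case the src value is relative):

import shutil
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.pythonscraping.com'
page = requests.get(base_url)
soup = BeautifulSoup(page.content, 'html.parser')

# urljoin copes with both absolute and relative src values
image_location = urljoin(base_url, soup.find('a', id='logo').find('img')['src'])

image = requests.get(image_location, stream=True)
if image.status_code == 200:
    image.raw.decode_content = True  # let urllib3 undo any gzip/deflate transfer encoding
    with open('logo.jpg', 'wb') as out_file:
        shutil.copyfileobj(image.raw, out_file)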