urlib - to use or not to use ( for web scraping )?

RE: urlib - to use or not to use ( for web scraping )? - Truman - Nov-27-2018

I've now come to the part of the book that uses urllib.parse, for example:

includeUrl = urlparse(includeUrl).scheme+"://"+urlparse(includeUrl).netloc

It seems to me that the author used functions that are specific to the urllib module. Is there any alternative with BeautifulSoup?

RE: urlib - to use or not to use ( for web scraping )? - Larz60+ - Nov-27-2018

The only two I ever need with BeautifulSoup are lxml and html.
What does the urllib statement do?

RE: urlib - to use or not to use ( for web scraping )? - stranac - Nov-27-2018

bs4 is an HTML parser (as is lxml.html); it doesn't parse URLs.
urllib.parse is pretty good at what it does. I use it from time to time along with w3lib.url.
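
To illustrate what the book's statement quoted above does: it keeps only the scheme and the network location of a URL and drops the path, query and fragment. A minimal sketch using urllib.parse; the helper name base_url is hypothetical and not from the book:

```python
from urllib.parse import urlparse


def base_url(url):
    """Return only the scheme and network location of a URL."""
    parts = urlparse(url)  # parse once instead of twice
    return parts.scheme + "://" + parts.netloc


print(base_url('http://www.cwi.nl:80/%7Eguido/Python.html'))
# prints: http://www.cwi.nl:80
```

Parsing once and reusing the result gives the same string as the original one-liner, which calls urlparse twice.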

RE: urlib - to use or not to use ( for web scraping )? - snippsat - Nov-27-2018

Requests has it in utils.

>>> import requests
>>> 
>>> url = 'http://www.cwi.nl:80/%7Eguido/Python.html'
>>> u = requests.utils.urlparse(url)
>>> u
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment='')
>>> u.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
>>> u.scheme
'http'
>>> u.netloc
'www.cwi.nl:80'

There's no problem with using urllib.parse for a task as specific as parsing a URL.
requests.utils contains things that were not the main goal of the Requests project, so the documentation is sparse.

RE: urlib - to use or not to use ( for web scraping )? - stranac - Nov-27-2018

Oh wow, I've never seen requests.utils before...

RE: urlib - to use or not to use ( for web scraping )? - Truman - Nov-27-2018

When I check which methods requests has with

dir('requests')

there is no mention of utils. We'll see if I have to use urllib after all. This is a long book.
Interestingly,

dir('urllib')

doesn't show it either.
Still confused. I guess that practice is the only remedy.
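
A likely explanation, offered as a sketch rather than a definitive diagnosis: the dir calls above are given a string in quotes, so they list the attributes of str instead of those of the module. Passing the imported module object gives a different result. The example uses the standard-library urllib package so that it does not assume requests is installed:

```python
import urllib.parse  # importing the submodule also binds the name `urllib`

# dir() on a string literal lists the attributes of the str type.
string_attrs = dir('urllib')

# dir() on the module object lists the names defined in the module.
module_attrs = dir(urllib)

print('upper' in string_attrs)   # a str method, so it appears here
print('parse' in string_attrs)   # not a str attribute
print('parse' in module_attrs)   # the submodule imported above
```

Note that a submodule only appears in dir(package) once it has been imported, either by your own code or by the package itself.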

RE: urlib - to use or not to use ( for web scraping )? - Larz60+ - Nov-27-2018

The documentation (auto-generated) is short:

Output:
Help on module requests.utils in requests:

NAME
    requests.utils

DESCRIPTION
    requests.utils
    ~~~~~~~~~~~~~~
    
    This module provides utility functions that are used within Requests
    that are also useful for external consumption.

FUNCTIONS
    add_dict_to_cookiejar(cj, cookie_dict)
        Returns a CookieJar from a key/value dictionary.
        
        :param cj: CookieJar to insert cookies into.
        :param cookie_dict: Dict of key/values to insert into CookieJar.
        :rtype: CookieJar
    
    address_in_network(ip, net)
        This function allows you to check if an IP belongs to a network subnet
        
        Example: returns True if ip = 192.168.1.1 and net = 192.168.1.0/24
                 returns False if ip = 192.168.1.1 and net = 192.168.100.0/24
        
        :rtype: bool

RE: urlib - to use or not to use ( for web scraping )? - Truman - Nov-27-2018

Larz, where do you find this stuff? It's not in the official docs for requests.
http://docs.python-requests.org/en/master/
And how should we know that all these things exist if we can't find them by calling dir...

By the way, I transformed the code from message #21 and it works fine:

includeUrl = requests.utils.urlparse(includeUrl).scheme+"://"+requests.utils.urlparse(includeUrl).netloc

RE: urlib - to use or not to use ( for web scraping )? - Larz60+ - Nov-28-2018

It's simple: open an interactive Python session, import the module of interest and issue a command like:

>>> import requests
>>> help(requests.utils)
Help on module requests.utils in requests:

NAME
    requests.utils

DESCRIPTION
    requests.utils
    ~~~~~~~~~~~~~~
    
    This module provides utility functions that are used within Requests
    that are also useful for external consumption.
...

RE: urlib - to use or not to use ( for web scraping )? - Truman - Nov-28-2018

That's cool, but the condition is that we already know that utils exists... Is there any command such as dir that gives a list of all the methods? It looks like the answer is negative.
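
One possible way to discover submodules without knowing their names in advance is the standard-library pkgutil module, which can walk a package's directory. A minimal sketch, shown on the urllib package; list_submodules is a hypothetical helper name, and the approach assumes the package is installed as ordinary files on disk:

```python
import pkgutil
import urllib


def list_submodules(package):
    """Return the names of the modules found directly inside a package."""
    # __path__ holds the directories that make up the package.
    return sorted(info.name for info in pkgutil.iter_modules(package.__path__))


print(list_submodules(urllib))
```

The same helper should work on any imported package, for example list_submodules(requests). This lists submodules only; calling dir on an imported module object then shows the functions it defines.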