Python Forum

Full Version: urllib - to use or not to use (for web scraping)?
Now I've come to the part of the book that uses urllib.parse, for example:
from urllib.parse import urlparse
includeUrl = urlparse(includeUrl).scheme + "://" + urlparse(includeUrl).netloc
It seems to me that the author used methods that are specific to the urllib module. Is there any alternative with BeautifulSoup?
The only two parsers I ever need with BeautifulSoup are lxml and html.parser.
What does the urllib statement do?
bs4 is an HTML parser (as is lxml.html); it doesn't parse URLs.
urllib.parse is pretty good at what it does; I use it from time to time, along with w3lib.url.
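To answer the question above: that statement strips a URL down to its scheme and host, discarding the path. A quick sketch (the URL here is just a made-up example):
>>> from urllib.parse import urlparse
>>> includeUrl = 'http://example.com/pages/page3.html'
>>> urlparse(includeUrl).scheme + "://" + urlparse(includeUrl).netloc
'http://example.com'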
Requests has it in utils.
>>> import requests
>>> 
>>> url = 'http://www.cwi.nl:80/%7Eguido/Python.html'
>>> u = requests.utils.urlparse(url)
>>> u
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment='')
>>> u.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
>>> u.scheme
'http'
>>> u.netloc
'www.cwi.nl:80'
There's no problem using urllib.parse for a task as specific as parsing a URL.
requests.utils has stuff that was not the main goal of the Requests project, so the documentation is sparse.
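In fact, as far as I can tell, requests.utils just re-exports urlparse from the standard library, so the two names point at literally the same function. A quick check (assuming a Python 3 install of Requests):
>>> import requests, urllib.parse
>>> requests.utils.urlparse is urllib.parse.urlparse
True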
Oh wow, I've never seen requests.utils before...
When I check the methods that requests has with
dir('requests')
there is no mention of utils. We'll see if I'll have to use urllib after all. This is a long book.
Interestingly,
dir('urllib')
doesn't show it either.
Still confused; I guess practice is the only remedy.
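Side note: dir('requests') passes the string 'requests' to dir(), so it lists the methods of str, not of the module. You have to import the module and pass the module object itself. Also, urllib is a package, and a submodule like urllib.parse only shows up in dir(urllib) once it has been imported somewhere. In a fresh session:
>>> import urllib
>>> 'parse' in dir(urllib)   # submodule not imported yet
False
>>> import urllib.parse
>>> 'parse' in dir(urllib)
True
>>> import requests          # requests imports its own utils submodule
>>> 'utils' in dir(requests)
True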
The documentation (auto-generated) is short:
Output:
Help on module requests.utils in requests:

NAME
    requests.utils

DESCRIPTION
    requests.utils
    ~~~~~~~~~~~~~~

    This module provides utility functions that are used within Requests
    that are also useful for external consumption.

FUNCTIONS
    add_dict_to_cookiejar(cj, cookie_dict)
        Returns a CookieJar from a key/value dictionary.

        :param cj: CookieJar to insert cookies into.
        :param cookie_dict: Dict of key/values to insert into CookieJar.
        :rtype: CookieJar

    address_in_network(ip, net)
        This function allows you to check if an IP belongs to a network subnet

        Example: returns True if ip = 192.168.1.1 and net = 192.168.1.0/24
                 returns False if ip = 192.168.1.1 and net = 192.168.100.0/24

        :rtype: bool
Larz, where do you find this stuff? It's not in the official docs for Requests:
http://docs.python-requests.org/en/master/
And how are we supposed to know that all these things exist if we can't find them by calling dir()...

By the way, I transformed the code from message #21 and it works fine:
includeUrl = requests.utils.urlparse(includeUrl).scheme + "://" + requests.utils.urlparse(includeUrl).netloc
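One small refinement: that line parses the same URL twice. Parsing once and reusing the result is a bit cleaner:
>>> import requests
>>> u = requests.utils.urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> u.scheme + "://" + u.netloc
'http://www.cwi.nl:80'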
It's simple: open an interactive Python session, import the module of interest, and issue a command like:
>>> import requests
>>> help(requests.utils)

>>>
Help on module requests.utils in requests:

NAME
    requests.utils

DESCRIPTION
    requests.utils
    ~~~~~~~~~~~~~~
    
    This module provides utility functions that are used within Requests
    that are also useful for external consumption.
...
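If you'd rather skip the import, help() also accepts a dotted name as a string and pulls up the same page:
>>> help('requests.utils')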
That's cool, but the condition to be met is that we already know utils exists... Is there any command, like dir(), that gives a list of everything a package contains? It looks like the answer is negative.
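Not entirely negative, actually. Beyond dir() on an imported module, the standard library's pkgutil can list a package's submodules even before they are imported. A sketch (Python 3.6+, where iter_modules yields named tuples):
>>> import pkgutil, requests
>>> names = [m.name for m in pkgutil.iter_modules(requests.__path__)]
>>> 'utils' in names
True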