Python Forum
urlib - to use or not to use ( for web scraping )?
#21
Now I've come to the part of the book that uses
urllib.parse
for example:
includeUrl = urlparse(includeUrl).scheme + "://" + urlparse(includeUrl).netloc
It seems to me that the author used methods that are specific to the urllib module. Is there any alternative with BeautifulSoup?
Reply
#22
The only two parsers I ever need with BeautifulSoup are lxml and html.parser.
What does that urllib statement do?
Reply
#23
bs4 is an HTML parser (as is lxml.html); it doesn't parse URLs.
urllib.parse is pretty good at what it does. I use it from time to time, along with w3lib.url.
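For the scraping use case in the original question, a minimal sketch of what urllib.parse gives you (the URL here is just the example from later in this thread):

```python
from urllib.parse import urlparse, urljoin

# Split a URL into its components and rebuild the base URL:
parts = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
base = parts.scheme + '://' + parts.netloc
print(base)  # http://www.cwi.nl:80

# urljoin is handy for resolving relative links found while scraping:
print(urljoin(base, '/docs/index.html'))  # http://www.cwi.nl:80/docs/index.html
```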
Reply
#24
Requests has it in utils.
>>> import requests
>>> 
>>> url = 'http://www.cwi.nl:80/%7Eguido/Python.html'
>>> u = requests.utils.urlparse(url)
>>> u
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment='')
>>> u.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
>>> u.scheme
'http'
>>> u.netloc
'www.cwi.nl:80'
There's no problem using urllib.parse for a task as specific as parsing a URL.
requests.utils contains stuff that was not the main goal of the Requests project, so the documentation is sparse.
Reply
#25
Oh wow, I've never seen requests.utils before...
Reply
#26
When I check the methods that requests has with
dir('requests')
there is no mention of utils. We'll see if I'll have to use urllib after all. This is a long book.
Interestingly,
dir('urllib')
doesn't show it either.
Still confused; I guess practice is the only remedy.
Reply
#27
The documentation (auto-generated) is short:
Output:
Help on module requests.utils in requests:

NAME
    requests.utils

DESCRIPTION
    requests.utils
    ~~~~~~~~~~~~~~

    This module provides utility functions that are used within Requests
    that are also useful for external consumption.

FUNCTIONS
    add_dict_to_cookiejar(cj, cookie_dict)
        Returns a CookieJar from a key/value dictionary.

        :param cj: CookieJar to insert cookies into.
        :param cookie_dict: Dict of key/values to insert into CookieJar.
        :rtype: CookieJar

    address_in_network(ip, net)
        This function allows you to check if an IP belongs to a network subnet

        Example: returns True if ip = 192.168.1.1 and net = 192.168.1.0/24
                 returns False if ip = 192.168.1.1 and net = 192.168.100.0/24
        :rtype: bool
Reply
#28
Larz, where do you find this stuff? It's not in the official docs for Requests:
http://docs.python-requests.org/en/master/
And how are we supposed to know all these things exist if we can't find them by calling dir...

By the way, I transformed the code from message #21 and it works fine:
includeUrl = requests.utils.urlparse(includeUrl).scheme + "://" + requests.utils.urlparse(includeUrl).netloc
Reply
#29
It's simple: open an interactive Python session, import the module of interest, and issue a command like:
>>> import requests
>>> help(requests.utils)

>>>
Help on module requests.utils in requests:

NAME
    requests.utils

DESCRIPTION
    requests.utils
    ~~~~~~~~~~~~~~
    
    This module provides utility functions that are used within Requests
    that are also useful for external consumption.
...
Reply
#30
That's cool, but the condition to be met is that we already know utils exists... Is there any command, such as dir, that gives a list of all the methods? It looks like the answer is no.
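For what it's worth, dir() does list utils; the catch is that it has to be given the imported module object, not a quoted name. A quick check with the standard library (so it runs even without Requests installed):

```python
import urllib
import urllib.parse

# dir('urllib') inspects the *string* 'urllib', so it lists str methods:
print('upper' in dir('urllib'))         # True -- str methods, not module contents
# Pass the imported module object instead:
print('urlparse' in dir(urllib.parse))  # True
# A package only shows a submodule after that submodule has been imported:
print('parse' in dir(urllib))           # True here, thanks to the import above
```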
Reply