Python Forum
urlib - to use or not to use ( for web scraping )? - Printable Version



RE: urlib - to use or not to use ( for web scraping )? - Truman - Nov-27-2018

I've now come to the part of the book that uses urllib.parse, for example:
includeUrl = urlparse(includeUrl).scheme+"://"+urlparse(includeUrl).netloc
It seems to me that the author used methods that are specific to the urllib module. Is there any alternative with BeautifulSoup?


RE: urlib - to use or not to use ( for web scraping )? - Larz60+ - Nov-27-2018

The only two parsers I ever need with BeautifulSoup are lxml and html.
What does the urllib statement do?


RE: urlib - to use or not to use ( for web scraping )? - stranac - Nov-27-2018

bs4 is an HTML parser (as is lxml.html); it doesn't parse URLs.
urllib.parse is pretty good at what it does; I use it from time to time along with w3lib.url.
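
For what it's worth, the line from the book can be sketched with urllib.parse alone. This is only a minimal illustration: the helper name base_url is made up here, and the example URL is just this forum's address. It parses the URL once and reuses the result rather than calling urlparse twice.

```python
from urllib.parse import urlparse


def base_url(url):
    """Return scheme://netloc for a URL (hypothetical helper, not from the book)."""
    parts = urlparse(url)  # parse once, reuse the result
    return parts.scheme + "://" + parts.netloc


print(base_url("https://python-forum.io/forum-13.html"))
# https://python-forum.io
```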


RE: urlib - to use or not to use ( for web scraping )? - snippsat - Nov-27-2018

Requests has it in utils.
>>> import requests
>>> 
>>> url = 'http://www.cwi.nl:80/%7Eguido/Python.html'
>>> u = requests.utils.urlparse(url)
>>> u
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment='')
>>> u.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
>>> u.scheme
'http'
>>> u.netloc
'www.cwi.nl:80'
There is no problem with using urllib.parse for a task as specific as parsing a URL.
requests.utils contains things that were not the main goal of the Requests project, so the documentation is sparse.
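
The same session can be reproduced with the standard library's urllib.parse, so Requests is not required just for this. A small sketch, assuming only the URL from the example above; hostname and port are additional attributes of the parse result that split the netloc apart.

```python
from urllib.parse import urlparse

url = 'http://www.cwi.nl:80/%7Eguido/Python.html'
u = urlparse(url)

print(u.scheme)    # http
print(u.netloc)    # www.cwi.nl:80
print(u.hostname)  # www.cwi.nl
print(u.port)      # 80
```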


RE: urlib - to use or not to use ( for web scraping )? - stranac - Nov-27-2018

Oh wow, I've never seen requests.utils before...


RE: urlib - to use or not to use ( for web scraping )? - Truman - Nov-27-2018

When I check which methods requests has with
dir('requests')
there is no mention of utils. We'll see whether I have to use urllib after all. This is a long book.
Interestingly,
dir('urlib')
doesn't show it either.
I'm still confused; I guess practice is the only remedy.
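
One thing worth checking here: dir('requests') is called on a string, so it lists the attributes of str rather than those of the module. A minimal sketch of the difference, using the standard library's urllib package so that nothing extra needs to be installed:

```python
import urllib.parse  # importing the submodule also binds it as urllib.parse

string_attrs = dir('urllib')   # attributes of the str object 'urllib'
module_attrs = dir(urllib)     # names defined on the imported package

print('upper' in string_attrs)  # True: upper() is a str method
print('parse' in string_attrs)  # False: a string knows nothing about urllib
print('parse' in module_attrs)  # True: the imported submodule shows up
```

The same should hold for requests: after import requests, calling dir(requests) without the quotes lists the names the package exposes.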


RE: urlib - to use or not to use ( for web scraping )? - Larz60+ - Nov-27-2018

The documentation (auto generated) is short:
Output:
Help on module requests.utils in requests:

NAME
    requests.utils

DESCRIPTION
    requests.utils
    ~~~~~~~~~~~~~~

    This module provides utility functions that are used within Requests
    that are also useful for external consumption.

FUNCTIONS
    add_dict_to_cookiejar(cj, cookie_dict)
        Returns a CookieJar from a key/value dictionary.

        :param cj: CookieJar to insert cookies into.
        :param cookie_dict: Dict of key/values to insert into CookieJar.
        :rtype: CookieJar

    address_in_network(ip, net)
        This function allows you to check if an IP belongs to a network subnet

        Example: returns True if ip = 192.168.1.1 and net = 192.168.1.0/24
                 returns False if ip = 192.168.1.1 and net = 192.168.100.0/24

        :rtype: bool



RE: urlib - to use or not to use ( for web scraping )? - Truman - Nov-27-2018

Larz, where do you find this stuff? It's not in the official docs for requests.
http://docs.python-requests.org/en/master/
And how should we know that all these things exist if we can't find them by calling dir...

By the way, I transformed the code from message #21 and it works fine:
includeUrl = requests.utils.urlparse(includeUrl).scheme+"://"+requests.utils.urlparse(includeUrl).netloc



RE: urlib - to use or not to use ( for web scraping )? - Larz60+ - Nov-28-2018

It's simple: open an interactive Python session, import the module of interest, and issue a command like:
>>> import requests
>>> help(requests.utils)

>>>
Help on module requests.utils in requests:

NAME
    requests.utils

DESCRIPTION
    requests.utils
    ~~~~~~~~~~~~~~
    
    This module provides utility functions that are used within Requests
    that are also useful for external consumption.
...



RE: urlib - to use or not to use ( for web scraping )? - Truman - Nov-28-2018

That's cool, but it only works if we already know that utils exists... Is there any command such as dir that gives a list of all the methods? It looks like the answer is negative.
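
One possible answer, as a sketch: pkgutil.iter_modules from the standard library walks a package's directory and lists its submodules whether or not they have been imported yet. The helper name list_submodules is invented for this example, and urllib is used only because it ships with Python; passing an imported requests package should work the same way.

```python
import pkgutil
import urllib


def list_submodules(package):
    """Return the names of the submodules found on the package's search path."""
    return sorted(info.name for info in pkgutil.iter_modules(package.__path__))


print(list_submodules(urllib))
```

dir() only shows names that already exist on the object, whereas this looks at the files that make up the package.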