urlib - to use or not to use ( for web scraping )? - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: urlib - to use or not to use ( for web scraping )? (/thread-13080.html)
RE: urlib - to use or not to use ( for web scraping )? - Truman - Nov-27-2018

Now I've come to the part of the book that uses, for example:

```python
includeUrl = urlparse(includeUrl).scheme + "://" + urlparse(includeUrl).netloc
```

It seems to me that the author used methods that are specific to the urllib module. Is there any alternative with BeautifulSoup?

RE: urlib - to use or not to use ( for web scraping )? - Larz60+ - Nov-27-2018

The only two I ever need with BeautifulSoup are lxml and html. What does the urllib statement do?

RE: urlib - to use or not to use ( for web scraping )? - stranac - Nov-27-2018

bs4 is an HTML parser (as is lxml.html); it doesn't parse URLs. urllib.parse is pretty good at what it does; I use it from time to time, along with w3lib.url.
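For context, the statement quoted above rebuilds the base URL (scheme plus host) from a full URL. A minimal sketch using only the standard library, with the example URL taken from later in this thread:

```python
from urllib.parse import urlparse

# Rebuild the base URL (scheme + "://" + netloc) from a full URL,
# which is all the book's includeUrl statement does.
includeUrl = "http://www.cwi.nl:80/%7Eguido/Python.html"
parsed = urlparse(includeUrl)
base = parsed.scheme + "://" + parsed.netloc
print(base)  # http://www.cwi.nl:80
```

So the statement strips the path, query, and fragment, keeping only the part needed to resolve site-internal links.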
RE: urlib - to use or not to use ( for web scraping )? - snippsat - Nov-27-2018

Requests has it in utils.

```python
>>> import requests
>>> url = 'http://www.cwi.nl:80/%7Eguido/Python.html'
>>> u = requests.utils.urlparse(url)
>>> u
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment='')
>>> u.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
>>> u.scheme
'http'
>>> u.netloc
'www.cwi.nl:80'
```

There's no problem using urllib.parse for a task as specific as parsing a URL. requests.utils has stuff that was not the main goal of the Requests project, so the documentation is sparse.

RE: urlib - to use or not to use ( for web scraping )? - stranac - Nov-27-2018

Oh wow, I've never seen requests.utils before...
RE: urlib - to use or not to use ( for web scraping )? - Truman - Nov-27-2018

When I check the methods that requests has with dir('requests'), there is no mention of utils. We'll see if I'll have to use urllib after all. This is a long book. Interestingly, dir('urlib') doesn't show it either. Still confused; I guess that practice is the only remedy.

RE: urlib - to use or not to use ( for web scraping )? - Larz60+ - Nov-27-2018

The documentation (auto generated) is short:
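A likely reason utils didn't show up: dir('requests') inspects the string 'requests' (so it lists str methods), not the requests module itself. A minimal sketch of the difference, using a stdlib package so it runs without Requests installed:

```python
import urllib.parse

# dir() on a string literal lists str methods, not module attributes,
# so the submodule name never appears.
print('parse' in dir('urllib'))  # False: this inspects the string 'urllib'

# dir() on the imported package does show the submodule,
# because 'import urllib.parse' bound it as an attribute.
print('parse' in dir(urllib))    # True
```

The same applies to dir('requests') versus dir(requests): pass the imported module object, not its name in quotes.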
RE: urlib - to use or not to use ( for web scraping )? - Truman - Nov-27-2018

Larz, where do you find this stuff? It's not in the official docs for Requests: http://docs.python-requests.org/en/master/ And how should we know that all these things exist if we can't find them by calling dir...? By the way, I transformed the code from message #21 and it works fine:

```python
includeUrl = requests.utils.urlparse(includeUrl).scheme + "://" + urlparse(includeUrl).netloc
```

RE: urlib - to use or not to use ( for web scraping )? - Larz60+ - Nov-28-2018

It's simple: open an interactive Python session, import the module of interest, and issue a command like:

```python
>>> import requests
>>> help(requests.utils)
Help on module requests.utils in requests:

NAME
    requests.utils

DESCRIPTION
    requests.utils
    ~~~~~~~~~~~~~~

    This module provides utility functions that are used within Requests
    that are also useful for external consumption.
...
```

RE: urlib - to use or not to use ( for web scraping )? - Truman - Nov-28-2018

That's cool, but the condition to be met is that we know in the first place that utils exists... Is there any command, such as dir, that gives a list of all the methods? It looks like the answer is negative.
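One follow-up on that last question: dir() on a package only lists submodules that have already been imported, but the standard library's pkgutil can enumerate a package's submodules without importing each one. A minimal sketch, again using a stdlib package as the example:

```python
import pkgutil
import urllib

# Enumerate the submodules shipped inside the urllib package,
# without importing any of them first.
submodules = [info.name for info in pkgutil.iter_modules(urllib.__path__)]
print(submodules)  # e.g. ['error', 'parse', 'request', 'response', 'robotparser']
```

The same call with requests.__path__ would have listed utils (among others), which is one way to discover submodules that dir() hides until they are imported.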