urlib - to use or not to use ( for web scraping )? - Printable Version

Python Forum (https://python-forum.io) - Thread: urlib - to use or not to use ( for web scraping )? (/thread-13080.html)
RE: urlib - to use or not to use ( for web scraping )? - Larz60+ - Nov-28-2018

Quote: that's cool, but the condition to be met is that we know at first that utils exists... Is there any command such as dir that gives a list of all the methods? It looks like the answer is negative.

Well, you don't know unless someone tells you or you go digging. You can start by looking at the documentation of top-level packages. For example, if you look at the documentation for requests itself, you'll see that each of its sub-modules has its own documentation, and utils is listed there. As a habit, when I'm not busy, I browse various packages to see what they contain. There's no way to know what every one of them is or what it contains. As of this minute, PyPI contains 159,959 packages.

RE: urlib - to use or not to use ( for web scraping )? - wavic - Nov-29-2018

(Nov-28-2018, 10:25 PM) Truman Wrote: that's cool, but the condition to be met is that we know at first that utils exists... Is there any command such as dir that gives a list of all the methods? It looks like the answer is negative.

There is a way. Use IPython (or bpython) as a REPL, for example: import the requests module, type requests. and hit TAB for autocompletion. You will see it - requests.utils is there. You can get autocompletion in the plain Python REPL too: https://gableroux.com/python/2016/01/20/python-interpreter-autocomplete/

RE: urlib - to use or not to use ( for web scraping )? - Truman - Nov-29-2018

Larz, where do you get this PACKAGE CONTENTS? I don't see it on PyPI. wavic, the things you mentioned are completely new to me. So far I have only used Microsoft Azure Notebooks (with Jupyter). Should I install IPython through Anaconda?

RE: urlib - to use or not to use ( for web scraping )? - Larz60+ - Nov-29-2018

Open a Python interpreter:

$ python
Python 3.7.1 (default, Nov 20 2018, 18:13:14)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> help(requests)

Scroll down (or press the spacebar for the next page) and you'll find the PACKAGE CONTENTS list near the top. There are also classes within the package.

RE: urlib - to use or not to use ( for web scraping )? - wavic - Nov-30-2018

(Nov-29-2018, 11:10 PM) Truman Wrote: Larz, where do you get this PACKAGE CONTENTS? I don't see it on PyPI.

You can do the same with Jupyter. It is the successor of IPython.

RE: urlib - to use or not to use ( for web scraping )? - Truman - Dec-10-2018

Any idea what to use in requests as a substitute for the read() and decode() methods that are part of urllib? For example, in this code:

response = requests.get("http://freegeoip.net/json/" + ipAddress).read().decode('utf-8')

I tried adding .utils to this code in several positions, but it doesn't work.
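Pulling the dir()/TAB/help() suggestions above together: the same package exploration can also be done without IPython at all. The sketch below is only an illustration (it is not from the thread) and uses nothing beyond the standard library's dir() and pkgutil plus an installed requests:

import pkgutil
import requests

# dir() lists the names bound on the imported package object; requests
# imports several of its sub-modules at import time, so utils shows up here.
print([name for name in dir(requests) if not name.startswith('_')])

# pkgutil.iter_modules walks the package directory and reports every
# sub-module, imported or not - roughly what help()'s PACKAGE CONTENTS shows.
for module_info in pkgutil.iter_modules(requests.__path__):
    print(module_info.name)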
RE: urlib - to use or not to use ( for web scraping )? - snippsat - Dec-10-2018

(Dec-10-2018, 11:15 PM) Truman Wrote: Any idea what to use in requests as a substitute for the read() and decode() methods that are part of urllib?

You do not need to decode with Requests; one of its big advantages is that it picks up the correct encoding from the web site.

>>> import requests
>>>
>>> r = requests.get('http://python.org')
>>> r.status_code
200
>>> r.encoding
'utf-8'  # the encoding this web site uses

So print(r.text) gives you the correctly decoded text. Just remember to use content rather than text when handing the response to a parser such as BeautifulSoup, because BS does its own decoding to Unicode and the data shouldn't be decoded twice. Example:

from bs4 import BeautifulSoup
import requests

url = 'https://www.python.org/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')  # note that content is used here
print(soup.select('head > title')[0].text)
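For Truman's concrete freegeoip line, the requests equivalent is simply the .text property or the .json() method on the response object - no read() or decode() needed. A rough sketch (assuming the freegeoip endpoint from the post still answers, and using a hypothetical ip_address value):

import requests

ip_address = '8.8.8.8'  # hypothetical example value
response = requests.get('http://freegeoip.net/json/' + ip_address)

text = response.text    # body already decoded to str with the detected encoding
data = response.json()  # or parse the JSON body straight into a dict
print(text)
print(data)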
RE: urlib - to use or not to use ( for web scraping )? - Truman - Dec-11-2018

Thank you. Now I'm trying to write code that downloads an image from a page, and as you can imagine... it doesn't go that well.

import requests
from bs4 import BeautifulSoup
import shutil

html = requests.get("http://www.pythonscraping.com", stream=True)
bsObj = BeautifulSoup(html.content, 'html.parser')
imageLocation = bsObj.find("a", {"id":"logo"}).find("img")["src"]
with open('img.jpg', 'wb') as out_file:
    shutil.copyfileobj(imageLocation, out_file)

This is the urllib code from the book that I'm trying to transform:

from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html)
imageLocation = bsObj.find("a", {"id": "logo"}).find("img")["src"]
urlretrieve(imageLocation, "logo.jpg")

RE: urlib - to use or not to use ( for web scraping )? - Larz60+ - Dec-12-2018

Try:

import requests
from bs4 import BeautifulSoup

url = "http://www.pythonscraping.com"
html = requests.get(url, stream=True)
if html.status_code == 200:
    bsObj = BeautifulSoup(html.content, 'html.parser')
    imageLocation = bsObj.find("a", {"id": "logo"}).find("img")["src"]
    image = requests.get(imageLocation)
    if image.status_code == 200:
        with open('img.jpg', 'wb') as out_file:
            out_file.write(image.content)
    else:
        print(f'Problem fetching image, status code: {image.status_code}')
else:
    print(f'Problem fetching {url}, status code: {html.status_code}')

-- Edit: modified the 2nd request; it should check the status code --

RE: urlib - to use or not to use ( for web scraping )? - snippsat - Dec-12-2018

Larz60+'s code is correct. In your first code you only need to change line 9 (the shutil.copyfileobj call) to this, and remove shutil:

out_file.write(requests.get(imageLocation).content)

Sometimes it's also nice to keep the original image name:

import requests, os
from bs4 import BeautifulSoup

html = requests.get("http://www.pythonscraping.com")
bs_obj = BeautifulSoup(html.content, 'html.parser')
image_location = bs_obj.find("a", id='logo').find("img")["src"]
image_name = os.path.basename(image_location)
with open(image_name, 'wb') as out_file:
    out_file.write(requests.get(image_location).content)
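As a footnote to the exchange above: Truman's original stream=True / shutil.copyfileobj idea also works, as long as copyfileobj is handed the image response's raw file object rather than the URL string. The following is a sketch only (assuming the page still serves an <a id="logo"> wrapping an <img>, and using urljoin in case the src value is relative):

import shutil
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.pythonscraping.com'
page = requests.get(base_url)
soup = BeautifulSoup(page.content, 'html.parser')

# urljoin copes with both absolute and relative src values
image_location = urljoin(base_url, soup.find('a', id='logo').find('img')['src'])

image = requests.get(image_location, stream=True)
if image.status_code == 200:
    image.raw.decode_content = True  # let urllib3 undo any gzip/deflate transfer encoding
    with open('logo.jpg', 'wb') as out_file:
        shutil.copyfileobj(image.raw, out_file)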