urlib - to use or not to use ( for web scraping )? - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: urlib - to use or not to use ( for web scraping )? (/thread-13080.html)
RE: urlib - to use or not to use ( for web scraping )? - Truman - Dec-12-2018

If I understand correctly, os.path.basename is used to get the original name of the image. We could do without it, just by passing any file name to open(). When you two write code it looks very simple.
RE: urlib - to use or not to use ( for web scraping )? - snippsat - Dec-13-2018

(Dec-12-2018, 11:09 PM)Truman Wrote: If I understand correctly

Yes, it was just to show that option. A tip: get used to testing things out interactively by taking parts of the code out; a better REPL like IPython or ptpython (what I use) also helps.

```python
>>> import os
>>>
>>> image_location = 'http://www.pythonscraping.com/sites/default/files/lrg_0.jpg'
>>> os.path.basename(image_location)
'lrg_0.jpg'
>>>
>>> help(os.path.basename)
Help on function basename in module ntpath:

basename(p)
    Returns the final component of a pathname
```

RE: urlib - to use or not to use ( for web scraping )? - Truman - Dec-14-2018

Jupyter with Python looks useful: https://hub.mybinder.org/user/ipython-ipython-in-depth-lib8lcz3/notebooks/binder/Index.ipynb#
Now trying to execute some code - Jupyter doesn't recognize bs4.
I'm working now on a more complex code and will post it here when I try something.

RE: urlib - to use or not to use ( for web scraping )? - Truman - Dec-15-2018

This code takes images from a page and prints them:

```python
import os
import requests
from bs4 import BeautifulSoup

downloadDirectory = "downloaded"
baseUrl = "http://pythonscraping.com"

def getAbsoluteURL(baseUrl, source):
    if source.startswith("http://www."):
        url = "http://" + source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        url = "http://" + source[4:]
    else:
        url = baseUrl + "/" + source
    if baseUrl not in url:
        return None
    return url

def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory + path
    directory = os.path.dirname(path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    return path

html = requests.get("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html.content, 'html.parser')
downloadList = bsObj.find_all(src=True)
```

Now I want to download these images to the directory downloaded, so I added this piece of code:
```python
with open(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory)) as out_file:
    out_file.write(fileUrl.content)
```

but I get this error. P.S. Execution of this code did create the folder downloaded and an empty folder img within it.
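The error most likely comes from open() itself: its second positional parameter is the mode string (e.g. 'wb'), so handing it the result of getDownloadPath gives it a file path where it expects a mode. A minimal reproduction (the path below is just an illustration):

```python
# open()'s signature is open(file, mode='r', ...). Passing a file path as
# the second argument makes Python reject it as an invalid mode string,
# before any file is touched.
def bad_open_message():
    try:
        open("lrg_0.jpg", "downloaded/img/lrg_0.jpg")
    except ValueError as exc:
        return str(exc)

print(bad_open_message())
```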
RE: urlib - to use or not to use ( for web scraping )? - Truman - Dec-17-2018

One error corrected by moving open under the for loop:

```python
for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl, download["src"])
    if fileUrl is not None:
        print(fileUrl)
        with open(fileUrl, 'wb', getDownloadPath(baseUrl, fileUrl, downloadDirectory)) as out_file:
            out_file.write(fileUrl.content)
```

but now another one appears.
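The new error is probably again in the open() call: open()'s third positional parameter is buffering, which must be an integer, so the path returned by getDownloadPath raises a TypeError there. And even once that is fixed, fileUrl is a plain string, so fileUrl.content will fail; the bytes have to come from a requests.get(fileUrl) response. A minimal reproduction of the buffering problem:

```python
# open()'s third positional parameter is buffering (an int), so passing
# a path string there raises TypeError before any file is opened.
def bad_buffering_error():
    try:
        open("lrg_0.jpg", "wb", "downloaded/img/lrg_0.jpg")
    except TypeError as exc:
        return exc

print(type(bad_buffering_error()).__name__)
```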
RE: urlib - to use or not to use ( for web scraping )? - Truman - Dec-19-2018

```python
import os
import requests
from bs4 import BeautifulSoup

downloadDirectory = "downloaded"
baseUrl = "http://pythonscraping.com"

def getAbsoluteURL(baseUrl, source):
    if source.startswith("http://www."):
        url = "http://" + source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        url = "http://" + source[4:]
    else:
        url = baseUrl + "/" + source
    if baseUrl not in url:
        return None
    return url

def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory + path
    directory = os.path.dirname(path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    return path

html = requests.get("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html.content, 'html.parser')
downloadList = bsObj.find_all('img')

for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl, download["src"])
    if fileUrl is not None:
        print(fileUrl)
        r = requests.get(fileUrl, allow_redirects=True)
        filename = fileUrl.split('/')[-1]
        with open(filename, 'wb') as out_file:
            out_file.write(r.content)
```

I made some corrections in the last 10 lines, but the problem now is that it completely omits the folder 'downloaded' and the getDownloadPath function.
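To bring the 'downloaded' folder back, the write path in the loop just needs to come from getDownloadPath instead of the bare file name. A sketch of that last step, reusing the getDownloadPath helper from the code above (the save helper name is my own):

```python
import os

def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    # Same helper as in the code above: map the absolute URL to a local path
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory + path
    directory = os.path.dirname(path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    return path

def save(fileUrl, content, baseUrl="http://pythonscraping.com",
         downloadDirectory="downloaded"):
    # Write the downloaded bytes under downloadDirectory instead of the
    # bare file name, so the folder structure is kept
    path = getDownloadPath(baseUrl, fileUrl, downloadDirectory)
    with open(path, 'wb') as out_file:
        out_file.write(content)
    return path
```

In the loop, the last two lines would then become save(fileUrl, r.content) after r = requests.get(fileUrl, allow_redirects=True).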