Nov-30-2018, 11:40 PM
this piece of code can be added too:
import requests
from bs4 import BeautifulSoup

# getInternalLinks() and getExternalLinks() are the helper functions
# defined earlier in the thread.

# Collects a list of all external URLs found on the site
allExtLinks = set()
allIntLinks = set()

def getAllExternalLinks(siteUrl):
    html = requests.get(siteUrl)
    parsed = requests.utils.urlparse(siteUrl)
    domain = parsed.scheme + "://" + parsed.netloc
    bsObj = BeautifulSoup(html.content, 'html.parser')
    internalLinks = getInternalLinks(bsObj, domain)
    externalLinks = getExternalLinks(bsObj, domain)
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print(link)
    for link in internalLinks:
        if link not in allIntLinks:
            allIntLinks.add(link)
            getAllExternalLinks(link)

allIntLinks.add("http://oreilly.com")
getAllExternalLinks("http://oreilly.com")

Question: why do we need both bsObj and domain if both are derived from the same webpage?
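On the bsObj-vs-domain question: they come from the same URL but play different roles. bsObj is the parsed HTML tree that gets searched for anchor tags, while domain is just a plain string (scheme + host) used as the yardstick to classify each href as internal or external. A minimal sketch of that classification, using only the standard library (classify_link is a hypothetical helper, not code from the thread, illustrating roughly how getInternalLinks/getExternalLinks split links):

```python
from urllib.parse import urlparse

def classify_link(href, domain):
    """Return 'internal' if href belongs to domain, else 'external'.

    A relative href (no netloc) is internal by definition; an absolute
    href is internal only if its host matches the domain's host.
    """
    netloc = urlparse(href).netloc
    if netloc == "" or netloc == urlparse(domain).netloc:
        return "internal"
    return "external"

print(classify_link("/about", "http://oreilly.com"))                # internal
print(classify_link("http://oreilly.com/x", "http://oreilly.com"))  # internal
print(classify_link("http://google.com", "http://oreilly.com"))     # external
```

So the domain string can't replace bsObj (it holds no HTML), and bsObj can't replace domain (the tree alone doesn't say which host counts as "home").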