crawler: Get external page
#4
This piece of code can be added too:

import requests
from urllib.parse import urlparse
from bs4 import BeautifulSoup

# Collects a list of all external URLs found on the site
allExtLinks = set()
allIntLinks = set()

def getAllExternalLinks(siteUrl):
    html = requests.get(siteUrl)
    # Rebuild just the scheme://host part of the URL for classifying links
    domain = urlparse(siteUrl).scheme + "://" + urlparse(siteUrl).netloc
    bsObj = BeautifulSoup(html.content, 'html.parser')
    internalLinks = getInternalLinks(bsObj, domain)
    externalLinks = getExternalLinks(bsObj, domain)  # the per-page helper, not this function itself

    # Both loops have to live inside the function, where the two link lists are in scope
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print(link)
    for link in internalLinks:
        if link not in allIntLinks:
            allIntLinks.add(link)
            getAllExternalLinks(link)  # recurse into every newly found internal page

allIntLinks.add("http://oreilly.com")
getAllExternalLinks("http://oreilly.com")
Question: why do we need both bsObj and domain if both come from parsing the same webpage?
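
To make the question concrete: the two values hold different kinds of data even though both are derived from siteUrl. A quick check with a hypothetical URL:

from urllib.parse import urlparse

url = "http://oreilly.com/about/index.html"
parts = urlparse(url)
print(parts.scheme + "://" + parts.netloc)  # http://oreilly.com -- a plain string used for host comparisons
# bsObj, by contrast, is the BeautifulSoup parse tree of the downloaded page,
# the object the helpers walk with find_all('a')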