crawler: Get external page
#4
This piece of code can be added too:

import requests
from urllib.parse import urlparse
from bs4 import BeautifulSoup

# Collects a list of all external URLs found on the site
allExtLinks = set()
allIntLinks = set()

def getAllExternalLinks(siteUrl):
    html = requests.get(siteUrl)
    # Rebuild just the scheme://host part of the URL for classifying links
    domain = urlparse(siteUrl).scheme + "://" + urlparse(siteUrl).netloc
    bsObj = BeautifulSoup(html.content, 'html.parser')
    internalLinks = getInternalLinks(bsObj, domain)
    externalLinks = getExternalLinks(bsObj, domain)  # the per-page helper, not this function itself

    # Both loops have to live inside the function, where the two link lists are in scope
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print(link)
    for link in internalLinks:
        if link not in allIntLinks:
            allIntLinks.add(link)
            getAllExternalLinks(link)  # recurse into every newly found internal page

allIntLinks.add("http://oreilly.com")
getAllExternalLinks("http://oreilly.com")
Question: why do we need both bsObj and domain if both come from parsing the same webpage?
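
To make the question concrete: the two values hold different kinds of data even though both are derived from siteUrl. A quick check with a hypothetical URL:

from urllib.parse import urlparse

url = "http://oreilly.com/about/index.html"
parts = urlparse(url)
print(parts.scheme + "://" + parts.netloc)  # http://oreilly.com -- a plain string used for host comparisons
# bsObj, by contrast, is the BeautifulSoup parse tree of the downloaded page,
# the object the helpers walk with find_all('a')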