Python Forum
crawler: Get external page
#1
The code is a bit confusing to me, but let's stick to the errors for a start. The goal is to start from some page and then follow external links one by one:

import requests
from bs4 import BeautifulSoup
import re
import datetime
import random

pages = set()
random.seed(datetime.datetime.now())

# Retrieves a list of all Internal links found on a page 
def getInternalLinks(bsObj, includeUrl):
	includeUrl = requests.utils.urlparse(includeUrl).scheme+"://"+urlparse(includeUrl).netloc
	internalLinks = []
	# finds all links that begin with "/"
	for link in bsObj.find_all("a", href=re.compile("^(/|.*"+includeUrl+")")):
		if link.attrs['href'] is not None:
			if link.attrs['href'] not in internalLinks:
				if(link.attrs['href'].startswith("/")):
					internalLinks.append(includeUrl+link.attrs['href'])
				else:
					internalLinks.append(link.attrs['href'])
	return internalLinks

# Retrieves a list of all external links found on a page
def getExternalLinks(bsObj, excludeUrl):
	externalLinks = []
	# finds all links that start with "http" that do not contain the current URL
	for link in bsObj.find_all("a", href=re.compile("^(http|www)((?!"+excludeUrl+").)*$")):
		if link.attrs['href'] not in externalLinks:
			externalLinks.append(link.attrs['href'])
		return externalLinks 

def getRandomExternalLink(startingPage):
	html = requests.get(startingPage)
	bsObj = BeautifulSoup(html.content, 'html.parser')
	externalLinks = getExternalLinks(bsObj, requests.utils.urlparse(startingPage).netloc)
	if len(externalLinks) == 0:
		print("No external links, looking around the site for one")
		domain = requests.utils.urlparse(startingPage).scheme+"://"+urlparse(startingPage).netloc
		internalLinks = getInternalLinks(bsObj, domain)
		return getRandomExternalLink(internalLinks[random.randint(0, len(internalLinks)-1)])
	else:
		return externalLinks[random.randint(0, len(externalLinks)-1)]

def followExternalOnly(startingSite):
	externalLink = getRandomExternalLink(startingSite)
	print("Random external link is: "+externalLink)
	followExternalOnly(externalLink)

followExternalOnly("http://oreilly.com")
error:
Error:
Random external link is: https://www.safaribooksonline.com/public/free-trial/
Traceback (most recent call last):
  File "C:\Python36\kodovi\crawler.py", line 52, in <module>
    followExternalOnly("http://oreilly.com")
  File "C:\Python36\kodovi\crawler.py", line 50, in followExternalOnly
    followExternalOnly(externalLink)
  File "C:\Python36\kodovi\crawler.py", line 48, in followExternalOnly
    externalLink = getRandomExternalLink(startingSite)
  File "C:\Python36\kodovi\crawler.py", line 39, in getRandomExternalLink
    if len(externalLinks) == 0:
TypeError: object of type 'NoneType' has no len()
I tried replacing the check on line 37 with if externalLinks is None
but then I get this error:
Error:
Random external link is: https://www.safaribooksonline.com/public/free-trial/
No external links, looking around the site for one
Traceback (most recent call last):
  File "C:\Python36\kodovi\crawler.py", line 52, in <module>
    followExternalOnly("http://oreilly.com")
  File "C:\Python36\kodovi\crawler.py", line 50, in followExternalOnly
    followExternalOnly(externalLink)
  File "C:\Python36\kodovi\crawler.py", line 48, in followExternalOnly
    externalLink = getRandomExternalLink(startingSite)
  File "C:\Python36\kodovi\crawler.py", line 41, in getRandomExternalLink
    domain = requests.utils.urlparse(startingPage).scheme+"://"+urlparse(startingPage).netloc
NameError: name 'urlparse' is not defined
I'm not sure how urlparse can be undefined when it's part of the requests module. If you have any suggestions on how to improve this code, I'll be glad to hear them. I also have the impression that it's too complicated, but I'm only following a book.
#2
(Nov-29-2018, 11:40 PM)Truman Wrote:
Error:
urlparse(startingPage).netloc
It's not preceded by requests.utils.
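Either add the requests.utils prefix there as well, or simpler: import urlparse from the standard library once at the top and use it directly. A minimal sketch (assuming Python 3, with an example URL standing in for your variable):

from urllib.parse import urlparse

startingPage = "http://oreilly.com"  # example value, stands in for your function argument
domain = urlparse(startingPage).scheme + "://" + urlparse(startingPage).netloc
print(domain)  # -> http://oreilly.com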
#3
It's not preceded by it in line 12 either, but there is no problem there. Why?

P.S. Now I see that I have to add it there too, and finally the program works! I'll add some extra function tomorrow, with another question. Stay tuned!
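For the record, these are the two lines after adding the missing prefix (lines 12 and 41 of my script, roughly):

# line 12, in getInternalLinks
includeUrl = requests.utils.urlparse(includeUrl).scheme+"://"+requests.utils.urlparse(includeUrl).netloc
# line 41, in getRandomExternalLink
domain = requests.utils.urlparse(startingPage).scheme+"://"+requests.utils.urlparse(startingPage).netloc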
#4
This piece of code can be added too:
# Collects a list of all external URLs found on the site
allExtLinks = set()
allIntLinks = set()

def getAllExternalLinks(siteUrl):
	html = requests.get(siteUrl)
	domain = requests.utils.urlparse(siteUrl).scheme+"://"+requests.utils.urlparse(siteUrl).netloc
	bsObj = BeautifulSoup(html.content, 'html.parser')
	internalLinks = getInternalLinks(bsObj, domain)
	externalLinks = getExternalLinks(bsObj, domain)

	# print every external link we haven't seen yet
	for link in externalLinks:
		if link not in allExtLinks:
			allExtLinks.add(link)
			print(link)
	# recurse into every internal link we haven't visited yet
	for link in internalLinks:
		if link not in allIntLinks:
			allIntLinks.add(link)
			getAllExternalLinks(link)

allIntLinks.add("http://oreilly.com")
getAllExternalLinks("http://oreilly.com")
Question: why do we need both bsObj and domain if they both come from parsing the same page?