Python Forum
crawler: Get external page
#1
The code is a bit confusing to me, but let's stick to the errors for a start. The goal is to start from some page and then follow external links one by one:

import requests
from bs4 import BeautifulSoup
import re
import datetime
import random

pages = set()
random.seed(datetime.datetime.now())

# Retrieves a list of all Internal links found on a page 
def getInternalLinks(bsObj, includeUrl):
	includeUrl = requests.utils.urlparse(includeUrl).scheme+"://"+urlparse(includeUrl).netloc
	internalLinks = []
	# finds all links that begin with "/"
	for link in bsObj.find_all("a", href=re.compile("^(/|.*"+includeUrl+")")):
		if link.attrs['href'] is not None:
			if link.attrs['href'] not in internalLinks:
				if(link.attrs['href'].startswith("/")):
					internalLinks.append(includeUrl+link.attrs['href'])
				else:
					internalLinks.append(link.attrs['href'])
	return internalLinks

# Retrieves a list of all external links found on a page
def getExternalLinks(bsObj, excludeUrl):
	externalLinks = []
	# finds all links that start with "http" that do not contain the current URL
	for link in bsObj.find_all("a", href=re.compile("^(http|www)((?!"+excludeUrl+").)*$")):
		if link.attrs['href'] not in externalLinks:
			externalLinks.append(link.attrs['href'])
		return externalLinks 

def getRandomExternalLink(startingPage):
	html = requests.get(startingPage)
	bsObj = BeautifulSoup(html.content, 'html.parser')
	externalLinks = getExternalLinks(bsObj, requests.utils.urlparse(startingPage).netloc)
	if len(externalLinks) == 0:
		print("No external links, looking around the site for one")
		domain = requests.utils.urlparse(startingPage).scheme+"://"+urlparse(startingPage).netloc
		internalLinks = getInternalLinks(bsObj, domain)
		return getRandomExternalLink(internalLinks[random.randint(0, len(internalLinks)-1)])
	else:
		return externalLinks[random.randint(0, len(externalLinks)-1)]

def followExternalOnly(startingSite):
	externalLink = getRandomExternalLink(startingSite)
	print("Random external link is: "+externalLink)
	followExternalOnly(externalLink)

followExternalOnly("http://oreilly.com")
error:
Error:
Random external link is: https://www.safaribooksonline.com/public/free-trial/
Traceback (most recent call last):
  File "C:\Python36\kodovi\crawler.py", line 52, in <module>
    followExternalOnly("http://oreilly.com")
  File "C:\Python36\kodovi\crawler.py", line 50, in followExternalOnly
    followExternalOnly(externalLink)
  File "C:\Python36\kodovi\crawler.py", line 48, in followExternalOnly
    externalLink = getRandomExternalLink(startingSite)
  File "C:\Python36\kodovi\crawler.py", line 39, in getRandomExternalLink
    if len(externalLinks) == 0:
TypeError: object of type 'NoneType' has no len()
I tried replacing the check on line 37 with if externalLinks is None
but then I get this error:
Error:
Random external link is: https://www.safaribooksonline.com/public/free-trial/
No external links, looking around the site for one
Traceback (most recent call last):
  File "C:\Python36\kodovi\crawler.py", line 52, in <module>
    followExternalOnly("http://oreilly.com")
  File "C:\Python36\kodovi\crawler.py", line 50, in followExternalOnly
    followExternalOnly(externalLink)
  File "C:\Python36\kodovi\crawler.py", line 48, in followExternalOnly
    externalLink = getRandomExternalLink(startingSite)
  File "C:\Python36\kodovi\crawler.py", line 41, in getRandomExternalLink
    domain = requests.utils.urlparse(startingPage).scheme+"://"+urlparse(startingPage).netloc
NameError: name 'urlparse' is not defined
I'm not sure how urlparse can be undefined when it's part of the requests module. If you have any suggestions on how to improve this code, I'll be glad to hear them. I also have the impression that it's too complicated, but I'm only following a book.
#2
(Nov-29-2018, 11:40 PM)Truman Wrote:
Error:
urlparse(startingPage).netloc
It's not preceded by requests.utils.
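Either add the requests.utils prefix there as well, or simpler: import urlparse from the standard library once at the top and use it directly. A minimal sketch (assuming Python 3, with an example URL standing in for your variable):

from urllib.parse import urlparse

startingPage = "http://oreilly.com"  # example value, stands in for your function argument
domain = urlparse(startingPage).scheme + "://" + urlparse(startingPage).netloc
print(domain)  # -> http://oreilly.com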
#3
It's not preceded by it in line 12 either, but there is no problem there. Why?

P.S. Now I see that I have to add it there too, and finally the program works! I'll add some extra function tomorrow, with another question. Stay tuned!
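For the record, these are the two lines after adding the missing prefix (lines 12 and 41 of my script, roughly):

# line 12, in getInternalLinks
includeUrl = requests.utils.urlparse(includeUrl).scheme+"://"+requests.utils.urlparse(includeUrl).netloc
# line 41, in getRandomExternalLink
domain = requests.utils.urlparse(startingPage).scheme+"://"+requests.utils.urlparse(startingPage).netloc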
#4
This piece of code can be added too:
# Collects a list of all external URLs found on the site
allExtLinks = set()
allIntLinks = set()

def getAllExternalLinks(siteUrl):
	html = requests.get(siteUrl)
	domain = requests.utils.urlparse(siteUrl).scheme+"://"+requests.utils.urlparse(siteUrl).netloc
	bsObj = BeautifulSoup(html.content, 'html.parser')
	internalLinks = getInternalLinks(bsObj, domain)
	externalLinks = getExternalLinks(bsObj, domain)

	# print every external link we haven't seen yet
	for link in externalLinks:
		if link not in allExtLinks:
			allExtLinks.add(link)
			print(link)
	# recurse into every internal link we haven't visited yet
	for link in internalLinks:
		if link not in allIntLinks:
			allIntLinks.add(link)
			getAllExternalLinks(link)

allIntLinks.add("http://oreilly.com")
getAllExternalLinks("http://oreilly.com")
Question: why do we need both bsObj and domain if they both come from parsing the same page?