Python Forum

• A single function, getLinks, that takes in a Wikipedia article URL of the form
/wiki/<Article_Name> and returns a list of all linked article URLs in the same
form.
• A main function that calls getLinks with some starting article, chooses a random
article link from the returned list, and calls getLinks again, until we stop the
program or until there are no article links found on the new page.

code:

import requests
from bs4 import BeautifulSoup
import re
import datetime
import random

random.seed(datetime.datetime.now().timestamp())  # seed() rejects datetime objects on Python 3.11+
def getLinks(articleUrl):
    html = requests.get("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html.content, 'html.parser')
    return bsObj.find("div", {"id":"bodyContent"}).find_all("a", href=re.compile("^(/wiki/)((?!:).)*$"))
links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)

Line 11 (the return line with re.compile) contains a caret character, and I don't understand why it is used. The point of this program is to find all links on the Kevin Bacon Wikipedia page that lead to other articles. These URLs contain "wiki", so why is a negative character class used? Interestingly, when I removed the caret, the results made more sense. I'm not sure why the author of the book used it; I hope you can figure this out.

It's not a negative character class. ^ only indicates a negative character class if it is the first character inside square brackets. Outside brackets it matches the beginning of the string. So that pattern only matches strings that start with '/wiki/'.
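To see the difference, here is a small sketch that compares the pattern from the post with the same pattern minus the caret, plus a real negated class for contrast. The href values are made up for illustration, and it assumes Beautiful Soup applies a compiled pattern with search(), which is how its documentation describes regex filters.

```python
import re

# Pattern from the post: "^" is outside any brackets, so it is an anchor.
anchored = re.compile(r"^(/wiki/)((?!:).)*$")
# Same pattern with the caret removed (what the original poster tried).
unanchored = re.compile(r"(/wiki/)((?!:).)*$")
# For contrast: "^" as the first character inside brackets negates the class.
not_a_digit = re.compile(r"[^0-9]")

# Hypothetical href values, chosen for illustration only.
hrefs = [
    "/wiki/Footloose",
    "/wiki/Category:American_actors",
    "//de.wikipedia.org/wiki/Kevin_Bacon",
]

# search() scans the whole string, so only "^" forces a match at position 0.
anchored_hits = [h for h in hrefs if anchored.search(h)]
unanchored_hits = [h for h in hrefs if unanchored.search(h)]

print(anchored_hits)    # ['/wiki/Footloose']
print(unanchored_hits)  # ['/wiki/Footloose', '//de.wikipedia.org/wiki/Kevin_Bacon']
print(not_a_digit.findall("a1b2"))  # ['a', 'b']
```

With the caret, only hrefs that begin with /wiki/ get through. Without it, any href that merely contains /wiki/ somewhere (and has no colon after it) also matches, which would explain getting a different set of links when the caret is removed. The ((?!:).)* part is what rejects the Category: link in both cases.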

Interesting. I read several sources on regex and didn't find that explanation. Thanks.

It's in the Python re documentation, which is where I learned regexes from.

Truman

ichabod801

Truman

ichabod801