Nov-23-2018, 01:11 AM
• A single function, getLinks, that takes in a Wikipedia article URL of the form
/wiki/<Article_Name> and returns a list of all linked article URLs in the same
form.
• A main function that calls getLinks with some starting article, chooses a random
article link from the returned list, and calls getLinks again, until we stop the
program or until there are no article links found on the new page.
code:
import requests
from bs4 import BeautifulSoup
import re
import datetime
import random

random.seed(datetime.datetime.now())
def getLinks(articleUrl):
    html = requests.get("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html.content, 'html.parser')
    return bsObj.find("div", {"id":"bodyContent"}).find_all("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)

Line 11 has a caret character. What I don't understand is why it is used. The point of this program is to find all links on the Kevin Bacon Wikipedia page that lead to other articles. These URLs contain "wiki", so why is a negative character class used? Interestingly, when I removed the caret character, the results made more sense. I'm not sure why the author of the book used it, and I hope that you can figure this out.
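
To make the caret question easier to test without hitting Wikipedia, here is a minimal sketch that runs the pattern from getLinks, with and without the leading caret, against a few href strings. The href values and the helper name matching are made up for illustration, and the sketch assumes BeautifulSoup applies a compiled pattern with search(), so treat it as an experiment rather than an explanation of the author's intent.

```python
import re

# Pattern from getLinks, plus a variant with the leading caret removed.
anchored = re.compile(r"^(/wiki/)((?!:).)*$")
unanchored = re.compile(r"(/wiki/)((?!:).)*$")

# Hypothetical href values, invented for this illustration.
hrefs = [
    "/wiki/Footloose_(1984_film)",                 # ordinary article link
    "/wiki/Category:American_actors",              # contains a colon
    "/w/index.php?title=Kevin_Bacon&action=edit",  # does not contain /wiki/
    "https://example.org/wiki/Kevin_Bacon",        # /wiki/ is not at the start
]

def matching(pattern, candidates):
    """Return the candidates in which pattern.search() finds a match."""
    return [href for href in candidates if pattern.search(href)]

with_caret = matching(anchored, hrefs)
without_caret = matching(unanchored, hrefs)
print(with_caret)
print(without_caret)

# For contrast: a caret as the first character inside square brackets
# does negate. [^:]+ means "one or more characters that are not a colon".
negated_pieces = re.findall(r"[^:]+", "Category:American_actors")
print(negated_pieces)
```

In Python's re syntax, a caret outside square brackets anchors the match to the start of the string. It only negates when it is the first character inside a bracketed class. In this sketch the anchored pattern keeps only the first href, while the version without the caret also accepts the example.org link, because /wiki/ may then appear anywhere in the string. The part that excludes colons is the negative lookahead (?!:), not the caret.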
