Extracting all the links on a website - Printable Version
+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: Extracting all the links on a website (/thread-23628.html)
Extracting all the links on a website - randeniyamohan - Jan-09-2020

I am trying to build a web crawler to extract all the links on a webpage. I have created two Python files (class: scanner.py and object: vulnerability-scanner.py). When I run the script, an error shows up. I am unable to find the error. Help me to solve this.

Source code
----------------------------------------------------------------------------
scanner.py

import requests
import re
import urllib.parse


class Scanner:
    def __init__(self, url):
        self.target_url = url
        self.target_links = []

    def extract_links_from(self, url):
        response = requests.get(url)
        return re.findall('"((http|ftp)s?://.*?)"', response.content.decode('utf-8'))

    def crawl(self, url=None):
        if url == None:
            url = self.target_url
        href_links = self.extract_links_from(url)
        for link in href_links:
            link = urllib.parse.urljoin(url, link)
            if '#' in link:
                link = link.split("#")[0]
            if self.target_url in link and link not in self.target_links:
                self.target_links.append(link)
                print(link)
                self.crawl(link)
------------------------------------------------------------------------------------------
vulnerability-scanner.py

import scanner

target_url = "http://localhost/DVWA/"
vul_scanner = scanner.Scanner(target_url)
vul_scanner.crawl(target_url)
-------------------------------------------------------------------------------------------
error
RE: Extracting all the links on a website - Clunk_Head - Jan-09-2020

Maybe replace line 19 with this:

link = urllib.parse.urljoin(str(url), str(link))

That's where the problem hits, but that may not be where it starts. If that doesn't work you will need to print the url and the link to see what's going in and breaking it.
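A likely root cause, for anyone hitting the same error: the pattern in extract_links_from contains two capturing groups, so re.findall returns a list of tuples rather than strings, and urllib.parse.urljoin then receives a tuple as its second argument (typically raising a TypeError). Wrapping it in str() only masks this by producing the tuple's text representation. Making the inner group non-capturing with (?:...) returns plain strings instead. A minimal sketch of the difference, using a made-up example.com snippet rather than the original DVWA target:

```python
import re
import urllib.parse

html = '<a href="https://example.com/page#frag">link</a>'

# With two capturing groups, re.findall returns a list of tuples,
# which later breaks urljoin(url, link) when link is a tuple.
as_tuples = re.findall('"((http|ftp)s?://.*?)"', html)
print(as_tuples)   # [('https://example.com/page#frag', 'http')]

# A non-capturing inner group (?:...) makes findall return plain strings.
as_strings = re.findall('"((?:http|ftp)s?://.*?)"', html)
print(as_strings)  # ['https://example.com/page#frag']

# Now urljoin works, and the fragment can be stripped as crawl() does.
link = urllib.parse.urljoin('https://example.com/', as_strings[0]).split('#')[0]
print(link)        # https://example.com/page
```

With that one-character-class change in scanner.py, the rest of crawl() can stay as posted.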