Python Forum
Extracting all the links on a website
#1
I am trying to build a web crawler to extract all the links on a webpage. I have created two Python files: a class (scanner.py) and a script that uses it (vulnerability-scanner.py). When I run the script, an error shows up. I am unable to find the cause. Help me solve this.

Source code
----------------------------------------------------------------------------
scanner.py
import requests
import re
import urllib.parse

class Scanner:
    def __init__(self, url):
        self.target_url = url
        self.target_links = []

    def extract_links_from(self, url):
        response = requests.get(url)
        return re.findall('"((http|ftp)s?://.*?)"', response.content.decode('utf-8'))

    def crawl(self, url=None):
        if url == None:
            url = self.target_url
        href_links = self.extract_links_from(url)
        for link in href_links:
            link = urllib.parse.urljoin(url, link)

            if '#' in link:
                link = link.split("#")[0]

            if self.target_url in link and link not in self.target_links:
                self.target_links.append(link)
                print(link)
                self.crawl(link)
------------------------------------------------------------------------------------------

vulnerability-scanner.py
import scanner

target_url = "http://localhost/DVWA/"
vul_scanner = scanner.Scanner(target_url)
vul_scanner.crawl(target_url)
-------------------------------------------------------------------------------------------
Error:
Traceback (most recent call last):
  File "C:/xampp/htdocs/WebVIM/vulnerability-scanner.py", line 5, in <module>
    vul_scanner.crawl(target_url)
  File "C:\xampp\htdocs\WebVIM\scanner.py", line 19, in crawl
    link = urllib.parse.urljoin(url, link)
  File "C:\Users\HouseMoNaRa\AppData\Local\Programs\Python\Python37-32\lib\urllib\parse.py", line 487, in urljoin
    base, url, _coerce_result = _coerce_args(base, url)
  File "C:\Users\HouseMoNaRa\AppData\Local\Programs\Python\Python37-32\lib\urllib\parse.py", line 120, in _coerce_args
    raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments
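A likely cause, sketched below on a made-up HTML snippet (the URL is illustrative, not from the thread): when a pattern contains more than one capture group, `re.findall` returns a list of tuples rather than strings, so `crawl` ends up passing a tuple to `urllib.parse.urljoin`, which raises exactly this `TypeError`. Making the inner group non-capturing with `(?:...)` leaves a single group, so `findall` yields plain strings.

```python
import re

# Illustrative HTML, standing in for the page the crawler fetches.
html = '<a href="https://example.com/page#top">link</a>'

# The original pattern has two capture groups, so findall returns tuples:
matches = re.findall('"((http|ftp)s?://.*?)"', html)
# matches == [('https://example.com/page#top', 'http')]
# urljoin(base, matches[0]) would then raise:
#   TypeError: Cannot mix str and non-str arguments

# Making the alternation non-capturing leaves one group, so findall
# yields the matched URLs as plain strings:
links = re.findall('"((?:http|ftp)s?://.*?)"', html)
# links == ['https://example.com/page#top']
```

With the non-capturing pattern, each `link` handed to `urljoin` is a string and the crawl loop should proceed; an alternative would be to keep the original pattern and take the first element of each tuple.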
Messages In This Thread
Extracting all the links on a website - by randeniyamohan - Jan-09-2020, 10:42 AM

