Python Forum
Extracting all the links on a website
#1
I am trying to build a web crawler to extract all the links on a webpage. I have created two Python files: a class (scanner.py) and a script that uses it (vulnerability-scanner.py). When I run the script, an error shows up. I am unable to find the cause. Help me solve this.

Source code
----------------------------------------------------------------------------
scanner.py
import requests
import re
import urllib.parse

class Scanner:
    def __init__(self, url):
        self.target_url = url
        self.target_links = []

    def extract_links_from(self, url):
        response = requests.get(url)
        return re.findall('"((http|ftp)s?://.*?)"', response.content.decode('utf-8'))

    def crawl(self, url=None):
        if url == None:
            url = self.target_url
        href_links = self.extract_links_from(url)
        for link in href_links:
            link = urllib.parse.urljoin(url, link)

            if '#' in link:
                link = link.split("#")[0]

            if self.target_url in link and link not in self.target_links:
                self.target_links.append(link)
                print(link)
                self.crawl(link)
------------------------------------------------------------------------------------------

vulnerability-scanner.py
import scanner

target_url = "http://localhost/DVWA/"
vul_scanner = scanner.Scanner(target_url)
vul_scanner.crawl(target_url)
-------------------------------------------------------------------------------------------
Error:
Traceback (most recent call last):
  File "C:/xampp/htdocs/WebVIM/vulnerability-scanner.py", line 5, in <module>
    vul_scanner.crawl(target_url)
  File "C:\xampp\htdocs\WebVIM\scanner.py", line 19, in crawl
    link = urllib.parse.urljoin(url, link)
  File "C:\Users\HouseMoNaRa\AppData\Local\Programs\Python\Python37-32\lib\urllib\parse.py", line 487, in urljoin
    base, url, _coerce_result = _coerce_args(base, url)
  File "C:\Users\HouseMoNaRa\AppData\Local\Programs\Python\Python37-32\lib\urllib\parse.py", line 120, in _coerce_args
    raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments
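A likely cause, sketched below on a made-up HTML snippet (the URL is illustrative, not from the thread): when a pattern contains more than one capture group, `re.findall` returns a list of tuples rather than strings, so `crawl` ends up passing a tuple to `urllib.parse.urljoin`, which raises exactly this `TypeError`. Making the inner group non-capturing with `(?:...)` leaves a single group, so `findall` yields plain strings.

```python
import re

# Illustrative HTML, standing in for the page the crawler fetches.
html = '<a href="https://example.com/page#top">link</a>'

# The original pattern has two capture groups, so findall returns tuples:
matches = re.findall('"((http|ftp)s?://.*?)"', html)
# matches == [('https://example.com/page#top', 'http')]
# urljoin(base, matches[0]) would then raise:
#   TypeError: Cannot mix str and non-str arguments

# Making the alternation non-capturing leaves one group, so findall
# yields the matched URLs as plain strings:
links = re.findall('"((?:http|ftp)s?://.*?)"', html)
# links == ['https://example.com/page#top']
```

With the non-capturing pattern, each `link` handed to `urljoin` is a string and the crawl loop should proceed; an alternative would be to keep the original pattern and take the first element of each tuple.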
Messages In This Thread
Extracting all the links on a website - by randeniyamohan - Jan-09-2020, 10:42 AM

