Python Forum
Extracting all the links on a website
#1
I am trying to build a web crawler that extracts all the links on a webpage. I have created two Python files: scanner.py (the class) and vulnerability-scanner.py (the script that uses it). When I run the script an error shows up, and I am unable to find the cause. Please help me solve this.

Source code
----------------------------------------------------------------------------
scanner.py
import requests
import re
import urllib.parse

class Scanner:
    def __init__(self, url):
        self.target_url = url
        self.target_links = []

    def extract_links_from(self, url):
        response = requests.get(url)
        return re.findall('"((http|ftp)s?://.*?)"', response.content.decode('utf-8'))

    def crawl(self, url=None):
        if url == None:
            url = self.target_url
        href_links = self.extract_links_from(url)
        for link in href_links:
            link = urllib.parse.urljoin(url, link)

            if '#' in link:
                link = link.split("#")[0]

            if self.target_url in link and link not in self.target_links:
                self.target_links.append(link)
                print(link)
                self.crawl(link)
------------------------------------------------------------------------------------------

vulnerability-scanner.py
import scanner

target_url = "http://localhost/DVWA/"
vul_scanner = scanner.Scanner(target_url)
vul_scanner.crawl(target_url)
-------------------------------------------------------------------------------------------
error
Error:
Traceback (most recent call last):
  File "C:/xampp/htdocs/WebVIM/vulnerability-scanner.py", line 5, in <module>
    vul_scanner.crawl(target_url)
  File "C:\xampp\htdocs\WebVIM\scanner.py", line 19, in crawl
    link = urllib.parse.urljoin(url, link)
  File "C:\Users\HouseMoNaRa\AppData\Local\Programs\Python\Python37-32\lib\urllib\parse.py", line 487, in urljoin
    base, url, _coerce_result = _coerce_args(base, url)
  File "C:\Users\HouseMoNaRa\AppData\Local\Programs\Python\Python37-32\lib\urllib\parse.py", line 120, in _coerce_args
    raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments
#2
Maybe replace line 19 with this:
link = urllib.parse.urljoin(str(url), str(link))
That's where the error is raised, but it may not be where the problem starts. If that doesn't work, print url and link just before that line to see what's going in and breaking it.
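For what it's worth, the likely root cause is upstream of that line: re.findall returns a list of tuples (not strings) whenever the pattern contains more than one capturing group, and the OP's pattern '"((http|ftp)s?://.*?)"' has two. Passing a tuple to urljoin triggers exactly the "Cannot mix str and non-str arguments" TypeError. A minimal sketch showing the difference (the example.com HTML snippet here is just an illustration, not from the OP's target):

```python
import re
import urllib.parse

html = '<a href="https://example.com/page#top">link</a>'

# Two capturing groups -> findall returns tuples, which urljoin rejects.
matches = re.findall('"((http|ftp)s?://.*?)"', html)
print(matches)  # [('https://example.com/page#top', 'http')]

# A non-capturing group (?:...) keeps the scheme alternation without
# creating a second group, so findall returns plain strings.
links = re.findall('"((?:http|ftp)s?://.*?)"', html)
print(links)    # ['https://example.com/page#top']

# Now urljoin receives two strings and works as expected.
link = urllib.parse.urljoin('https://example.com/', links[0])
print(link)     # https://example.com/page#top
```

So changing the pattern in extract_links_from to use (?:http|ftp) should fix the crawl without needing the str() wrap, which would otherwise stringify the whole tuple.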
