The Python script runs continuously without stopping
#1
Hi, I am trying to build a web crawler to extract all the links on a webpage. I have created two Python files: scanner.py (the class) and vulnerability-scanner.py (the script that uses it). When I run the script, it runs continuously without stopping. I am unable to find the error. Please help me solve this.

Here is my source code:

scanner.py

import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import colorama

class Scanner:

    colorama.init()

    def __init__(self, url):
        self.target_url = url
        self.target_links = []

    def is_valid(self, url):
        parsed = urlparse(url)
        return bool(parsed.netloc) and bool(parsed.scheme)

    def get_all_website_links(self, url):

        GREEN = colorama.Fore.GREEN
        WHITE = colorama.Fore.WHITE
        RESET = colorama.Fore.RESET

        urls = set()
        internal_urls = set()
        external_urls = set()
        domain_name = urlparse(url).netloc
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
        for a_tag in soup.findAll("a"):
            href = a_tag.attrs.get("href")
            if href == "" or href is None:
                continue
            href = urljoin(url, href)
            parsed_href = urlparse(href)
            href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path

            if not self.is_valid(href):
                continue
            if href in internal_urls:
                continue
            if domain_name not in href:
                if href not in external_urls:
                    print(f"{WHITE}[*] External link: {href}{RESET}")
                    external_urls.add(href)
                continue
            print(f"{GREEN}[*] Internal link: {href}{RESET}")
            urls.add(href)
            internal_urls.add(href)
        return urls

    def crawl(self, url):
        href_links = self.get_all_website_links(url)
        for link in href_links:
            print(link)
            self.crawl(link)
vulnerability-scanner.py

import argu

target_url = "https://hack.me/"
vul_scanner = argu.Scanner(target_url)
vul_scanner.crawl(target_url)

Output:
C:\xampp\htdocs\WebVIM\venv\Scripts\python.exe C:/xampp/htdocs/WebVIM/argument.py
[*] Internal link: https://hack.me/
[*] Internal link: https://hack.me/explore/
[*] Internal link: https://hack.me/faq
[*] Internal link: https://hack.me/about
[*] Internal link: https://me.hack.me/login
[*] Internal link: https://me.hack.me/signup
[*] Internal link: https://hack.me/s/
[*] Internal link: https://me.hack.me/developer.php
[*] External link: http://www.eLearnSecurity.com
[*] External link: http://www.elearnsecurity.com
[*] Internal link: https://hack.me/hackmeterms.txt
[*] External link: https://twitter.com/hackmeproject
[*] External link: https://www.facebook.com/hackmeproject
https://hack.me/about
[*] Internal link: https://hack.me/
[*] Internal link: https://hack.me/explore/
[*] Internal link: https://hack.me/faq
[*] Internal link: https://hack.me/about
[*] Internal link: https://me.hack.me/login
[*] Internal link: https://me.hack.me/signup
[*] External link: http://www.elearnsecurity.com
[*] Internal link: https://hack.me/hackmeterms.txt
[*] External link: https://twitter.com/Giutro
[*] External link: https://twitter.com/eLearnSecurity
[*] External link: https://hackmeproject.uservoice.com/
[*] External link: https://www.elearnsecurity.com/course/
[*] External link: https://twitter.com/hackmeproject
[*] External link: https://www.facebook.com/hackmeproject
https://hack.me/about
Process finished with exit code -1
#2
The repetition is in the Scanner.crawl() method, which recursively calls itself (self.crawl(link)) without tracking which pages have already been visited. As soon as any page contains a link back to a page that was already crawled, the crawl starts over from that page, so the recursion never terminates. This does not need to be the first page; the behaviour will occur with any cycle of links.
I also see that __init__() initializes self.target_links = [], which is never used. I would suggest making target_links a set instead of a list and using it to filter out links that have already been visited. (Sets make this easy, for example: return urls - self.target_links.)
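Here is a minimal sketch of the idea. To keep the example runnable without network access, the link extraction is injected as a function over a toy link graph instead of calling get_all_website_links() with requests; in the real scanner you would subtract self.target_links from the result of get_all_website_links() the same way.

```python
class Scanner:
    def __init__(self, url, get_links):
        # get_links is a stand-in for get_all_website_links(), injected
        # so the sketch runs without hitting the network.
        self.target_url = url
        self.target_links = set()   # a set, not a list, so we can filter
        self.get_links = get_links

    def crawl(self, url):
        # Mark the current page as visited BEFORE recursing,
        # so a page linking to itself cannot restart the crawl.
        self.target_links.add(url)
        # Set difference drops everything we have already seen.
        new_links = self.get_links(url) - self.target_links
        for link in new_links:
            # A sibling recursion may have visited this link meanwhile.
            if link not in self.target_links:
                self.crawl(link)

# Toy link graph containing a cycle (the front page links to itself
# and the FAQ links back to the front page):
site = {
    "https://hack.me/": {"https://hack.me/", "https://hack.me/faq"},
    "https://hack.me/faq": {"https://hack.me/"},
}

scanner = Scanner("https://hack.me/", lambda u: site.get(u, set()))
scanner.crawl("https://hack.me/")
print(sorted(scanner.target_links))
```

With the unmodified crawl() this graph would recurse forever; with the visited set it terminates after visiting each page exactly once.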