Python Forum
Make a Web crawler without using the recursion method.
#1
I'm working on a crawler application.
My goal is to write a broken-link detection tool, so I need to check every link on a site.

After crawl() has been called 966 times, it raises the following error.

Error:
RecursionError: maximum recursion depth exceeded in comparison
To work around the problem, I increased the recursion limit with the following code:

import sys
sys.setrecursionlimit(30000)
However, once the recursion depth reaches 3924, the Python process simply crashes and closes.

I want to solve this without recursion.
How can I implement the crawler with a loop or some other non-recursive approach?

My web crawler code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
import re
from urllib.parse import urljoin

target_links = []
url_Adres = "http://crummy.com"


def get_links(url):
    """Fetch a page and return every href value found in it."""
    try:
        # Add a scheme if the URL does not already contain one.
        if "http://" not in url and "https://" not in url:
            url = "http://" + url
        response = requests.get(url)
        return re.findall('(?:href=")(.*?)"', str(response.content))
    except (requests.exceptions.ConnectionError,
            requests.exceptions.InvalidSchema,
            requests.exceptions.InvalidURL,
            UnicodeError):
        return None


def crawl(url):
    href_links = get_links(url)
    if href_links:
        for link in href_links:
            # Resolve relative links and strip fragment identifiers.
            link = urljoin(url, link)
            if "#" in link:
                link = link.split("#")[0]
            # Only follow same-site links that have not been seen yet.
            if url_Adres in link and link not in target_links:
                target_links.append(link)
                print("Crawler:" + link)
                crawl(link)  # recursive call -- this is what hits the limit


crawl(url_Adres)
Reply
#2
Have a list of links to crawl, initialized as [url]. As long as there is something in that list, pop a link out, scrape the links off that page, and add the valid ones to the list of links to crawl.

This does not exactly solve the problem, but it shifts a recursion limit problem to a memory problem, which may be sufficient.
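In rough code it would look something like this (an untested sketch; fetch_links() is just a placeholder for whatever function returns the hrefs on a page):

def crawl(start_url):
    to_crawl = [start_url]             # links waiting to be checked
    crawled = set()                    # links already checked
    while to_crawl:                    # loop instead of recursing
        url = to_crawl.pop()           # pop a link out of the list
        if url in crawled:
            continue
        crawled.add(url)
        # fetch_links() stands in for your own link-scraping function
        for link in fetch_links(url):
            if link not in crawled:    # only queue links still to visit
                to_crawl.append(link)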
Reply
#3
(Jul-06-2019, 09:15 PM)ichabod801 Wrote: Have a list of links to crawl, initialized as [url]. As long as there is something in that list, pop a link out, scrape the links off that page, and add the valid ones to the list of links to crawl.

This does not exactly solve the problem, but it shifts a recursion limit problem to a memory problem, which may be sufficient.

I'm sorry, I couldn't fully understand.
Could you change my code to work the way you describe?
Reply
#4
Crawling URLs is just an example of a graph breadth-first traversal. There is pseudocode on Wikipedia that you can adapt to Python by using a list, as ichabod801 suggests, or a collections.deque instance, which makes it easy to implement a FIFO queue.
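For illustration, a minimal breadth-first traversal using a deque as the FIFO could look like this (the toy graph below is made up):

from collections import deque

def bfs(graph, start):
    # graph is a dict mapping each node to a list of its neighbours
    seen = {start}
    queue = deque([start])        # FIFO queue of nodes still to visit
    while queue:
        node = queue.popleft()    # consume from the front
        print(node)
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)   # push new nodes on the back

bfs({"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}, "a")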
Reply
#5
(Jul-06-2019, 09:47 PM)Gribouillis Wrote: Crawling URLs is just an example of a graph breadth-first traversal. There is pseudocode on Wikipedia that you can adapt to Python by using a list, as ichabod801 suggests, or a collections.deque instance, which makes it easy to implement a FIFO queue.

I'm not familiar with the algorithms you mention.
I'm reading up on them.
Reply
#6
Hi,

Quote: I'm sorry, I couldn't fully understand.
Don't call crawl() recursively. Instead, keep a collections.deque object: push new links onto the end and consume the links to check from the beginning of the deque. When the deque is empty, you are done. No recursion needed.

You could do this with a list instead of collections.deque as well, but deque is more efficient when consuming data from the beginning.
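Applied to your code, roughly (an untested sketch; it reuses your existing get_links(), urljoin, url_Adres and target_links):

from collections import deque

def crawl(start_url):
    to_crawl = deque([start_url])       # links still to visit (FIFO)
    while to_crawl:
        url = to_crawl.popleft()        # consume from the beginning
        href_links = get_links(url) or []
        for link in href_links:
            link = urljoin(url, link)
            if "#" in link:
                link = link.split("#")[0]
            if url_Adres in link and link not in target_links:
                target_links.append(link)
                print("Crawler:" + link)
                to_crawl.append(link)   # push on the end instead of recursing

crawl(url_Adres)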

Regards, noisefloor
Reply
#7
Hmmm maybe like this?

#! /usr/bin/python3


def recursive_urls(links):
    if len(links) > 0:
        # remove first element of list:
        current = links.pop(0)
        # instead of the print statement,
        # make your request/do business:
        print(current + '\n')
        # recursively call the method again
        # and pass remaining links:
        recursive_urls(links)


if __name__ == '__main__':
    urls = [
        'http://test1.com',
        'https://test2.com',
        'https://test3.com',
        'https://test4.com'
    ]
    recursive_urls(urls)
EDIT: My apologies, I thought you were trying to use recursion... not avoid it :)
Reply

