Web Crawler help - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: Web Crawler help (/thread-25549.html)
Web Crawler help - Mr_Mafia - Apr-02-2020

This may be a rather unconventional way of asking for help with my program, so forgive me. As the title suggests, I've been trying to build a web crawler by following a certain tutorial that dates back to 2016. The reason I've been following this tutorial series in particular is how cohesive it is, for the most part. I'm nearing the end of it, but when I run the program, nothing happens. The tutorial series comes from the YouTube channel thenewboston. Here's the URL for the video where I'm stuck: https://www.youtube.com/watch?v=vKFc3-5Y17U&t=441s

I'm asking in particular those who have followed this series in the past whether they were able to get it running correctly. This question is a long shot, but I'm too stubborn to quit on this project just yet. If anyone out there did watch this mini-series and got it working, I'll happily share all my code, along with my notes, if you're willing to help.

RE: Web Crawler help - Larz60+ - Apr-02-2020

Suggested reading (on this forum):
web scraping part 1
web scraping part 2

RE: Web Crawler help - Mr_Mafia - Apr-04-2020

Thank you, but I think I found out why it doesn't work, and I still need a little help.
Here is the code:

    def __init__(self, project_name, base_url, domain_name):
        spider.project_name = project_name
        spider.base_url = base_url
        spider.domain_name = domain_name
        spider.queue_file = spider.project_name + '/queue.txt'  # file path for the queue text file
        spider.crawled_file = spider.project_name + '/crawled.txt'
        self.boot()
        self.crawl_page('First spider', spider.base_url)  # the first spider crawls the main page of the website

    def crawl_page(thread_name, page_url):  # reports that the page is being crawled, so the user knows it's working
        if page_url not in spider.crawled:  # check the crawled set, not the file, for faster lookups
            print(thread_name + ' currently crawling ' + page_url)
            # len() gives the number of items in each set; str() converts it for concatenation
            print('Queue ' + str(len(spider.queue)) + ' | crawled ' + str(len(spider.crawled)))
            # gather_links() connects to a web page and collects its links;
            # add_links_to_queue() adds those links to the waiting list
            spider.add_links_to_queue(spider.gather_links(page_url))
            spider.queue.remove(page_url)  # remove the link from the queue set
            spider.crawled.add(page_url)   # add the removed link to the crawled set
            spider.update_files()

And the error:

    line 20, in __init__
        self.crawl_page("First spider", spider.base_url)
    TypeError: crawl_page() takes 2 positional arguments but 3 were given

I'm not seeing how I'm giving a third argument when calling crawl_page().
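As a minimal sketch of what produces that error (class and method names here are illustrative, modeled on the snippet above, not taken from the tutorial's full source): calling a plain method through an instance makes Python pass the instance itself as an implicit first argument, so a method defined without a self parameter receives one more argument than its signature allows. Declaring the method a @staticmethod suppresses that implicit argument.

```python
class BrokenSpider:
    # Defined without 'self': calling this through an instance passes the
    # instance as a hidden first argument, for three arguments in total.
    def crawl_page(thread_name, page_url):
        return thread_name + ' -> ' + page_url

class FixedSpider:
    @staticmethod  # no implicit instance argument is passed
    def crawl_page(thread_name, page_url):
        return thread_name + ' -> ' + page_url

try:
    BrokenSpider().crawl_page('First spider', 'https://example.com')
except TypeError as e:
    print('TypeError:', e)  # ... takes 2 positional arguments but 3 were given

print(FixedSpider().crawl_page('First spider', 'https://example.com'))
```

Under this assumption, the fix for the snippet above would be adding @staticmethod above def crawl_page(thread_name, page_url).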