Web Crawler help - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: Web Crawler help (/thread-25549.html)
Web Crawler help - Mr_Mafia - Apr-02-2020

This may be a rather unconventional way of asking for help with my program, so forgive me. As the title suggests, I've been trying to build a web crawler by following a certain tutorial that dates back to 2016. The reason I've been following this tutorial series in particular is how cohesive it is, for the most part. I'm nearing the end of it, but when I run the program, nothing happens. The tutorial series comes from the YouTube channel thenewboston. Here's the URL for the video where I'm stuck: https://www.youtube.com/watch?v=vKFc3-5Y17U&t=441s

I'm asking in particular those who have followed this series in the past whether they were able to get it running correctly. This question is a long shot, but I'm too stubborn to quit on this project just yet. If anyone out there did watch this mini-series and got it working, I'll happily share all my code, along with my notes, if you're willing to help.

RE: Web Crawler help - Larz60+ - Apr-02-2020

Suggested reading (on this forum):
web scraping part 1
web scraping part 2

RE: Web Crawler help - Mr_Mafia - Apr-04-2020

Thank you, but I think I found out why it doesn't work, and I still need a little help.
Here is the code:

    def __init__(self, project_name, base_url, domain_name):
        spider.project_name = project_name
        spider.base_url = base_url
        spider.domain_name = domain_name
        spider.queue_file = spider.project_name + '/queue.txt'  # file path for the queue text file
        spider.crawled_file = spider.project_name + '/crawled.txt'
        self.boot()
        self.crawl_page('First spider', spider.base_url)  # the first spider crawls the main page of the website

    def crawl_page(thread_name, page_url):  # reports that the page is being crawled, so the user knows it's working
        if page_url not in spider.crawled:  # check the crawled set, not the file, for faster lookups
            print(thread_name + ' currently crawling ' + page_url)
            # len() gives the number of items in each set; str() converts it for concatenation
            print('Queue ' + str(len(spider.queue)) + ' | crawled ' + str(len(spider.crawled)))
            # gather_links() connects to a web page and collects its links;
            # add_links_to_queue() adds those links to the waiting list
            spider.add_links_to_queue(spider.gather_links(page_url))
            spider.queue.remove(page_url)  # remove the link from the queue set
            spider.crawled.add(page_url)   # add the removed link to the crawled set
            spider.update_files()

And the error:

    line 20, in __init__
        self.crawl_page("First spider", spider.base_url)
    TypeError: crawl_page() takes 2 positional arguments but 3 were given

I'm not seeing how I'm giving a third argument when calling crawl_page().
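As a minimal sketch of what produces that error (class and method names here are illustrative, modeled on the snippet above, not taken from the tutorial's full source): calling a plain method through an instance makes Python pass the instance itself as an implicit first argument, so a method defined without a self parameter receives one more argument than its signature allows. Declaring the method a @staticmethod suppresses that implicit argument.

```python
class BrokenSpider:
    # Defined without 'self': calling this through an instance passes the
    # instance as a hidden first argument, for three arguments in total.
    def crawl_page(thread_name, page_url):
        return thread_name + ' -> ' + page_url

class FixedSpider:
    @staticmethod  # no implicit instance argument is passed
    def crawl_page(thread_name, page_url):
        return thread_name + ' -> ' + page_url

try:
    BrokenSpider().crawl_page('First spider', 'https://example.com')
except TypeError as e:
    print('TypeError:', e)  # ... takes 2 positional arguments but 3 were given

print(FixedSpider().crawl_page('First spider', 'https://example.com'))
```

Under this assumption, the fix for the snippet above would be adding @staticmethod above def crawl_page(thread_name, page_url).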