Sep-16-2020, 10:46 AM
Hey guys,
I got an assignment to pass the internship knowledge check.
The mission was to build a basic spider that crawls a site's map and saves the output as JSON.
The instructions were as follows:
A. Create a class that gets a web link from the user
B. The class must map all the pages on the site recursively
C. Save all paths (PHP files) and the number of images in each path to a JSON file
Note: Do not use automated tools such as DirBuster, BeautifulSoup (bs4), etc.
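Since bs4 and friends are banned, my assumption is that the standard library's html.parser still counts as fair game (it's not a third-party scraping tool). A minimal sketch of collecting hrefs and counting <img> tags with it:

from html.parser import HTMLParser

class LinkAndImageParser(HTMLParser):
    # Collects href targets and counts <img> tags, standard library only
    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.images = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)
        elif tag == "img":
            self.images += 1

parser = LinkAndImageParser()
parser.feed('<a href="courses/python.php">Python</a> <img src="x.png">')
print(parser.hrefs, parser.images)  # ['courses/python.php'] 1

If even html.parser counts as an automated tool, plain string scanning (as in my code further down) is the fallback.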
The output should look like this:
{
    "courses":
    {
        "courses/python.php": 2,
        "courses/linux.php": 7,
        "courses/mobile_hacking.php": 1
    },
    "blogs":
    {
        "blogs/google_hack.php": 3,
        "blogs/alexa_hack.php": 1,
        "blogs/whatsapp_hack.php": 5
    }
}
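From what I understand of that shape, once the per-page image counts exist, building and dumping the nested dict could look like this minimal sketch (page_counts here is hypothetical sample data, not real results):

import json
from collections import defaultdict

# Hypothetical flat results: PHP path -> number of images found on that page
page_counts = {
    "courses/python.php": 2,
    "courses/linux.php": 7,
    "blogs/google_hack.php": 3,
}

# Group each path under its first directory component to get the nested shape
grouped = defaultdict(dict)
for path, count in page_counts.items():
    section = path.split("/")[0]  # e.g. "courses" or "blogs"
    grouped[section][path] = count

with open("sitemap.json", "w") as jfile:
    json.dump(grouped, jfile, indent=4)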
I've gotten this far with my code, but I don't know how to actually produce this output. I've been working on it for a couple of weeks and have tried everything on the web, as well as asking skilled friends.
Would appreciate any help. Thanks!!

import requests
import threading
import os
import json

threads = []


# noinspection SpellCheckingInspection
class ITCrowlerSafe(object):  # Main class
    link = ''
    links = []

    def __init__(self):
        # Fetch the page source and dump it to a text file so readlines() can be used
        self.web = weblink
        self.output = requests.get(self.web)
        self.handle()
        os.remove("output.txt")

    def __call__(self, *args, **kwargs):  # Main function
        with open('links.txt', 'w') as cfile:
            for line in self.data:
                index = 0
                while index < len(line):
                    # Stay on each line so no href gets left behind
                    index = line.find('href', index)  # Find the next occurrence
                    if index == -1:  # No more hrefs on this line
                        break
                    pend = line.find('"', index + 7)  # End marker of the attribute value
                    link = line[index:pend]
                    if "javascript:" in link or len(link) < 10 or \
                            "tel" in link or "mailto" in link or \
                            "target" in link or "</" in link or \
                            "href=" not in link or ".ico" in link or \
                            ".jpeg" in link or ".png" in link:
                        # Not a legal page link; skip it and keep scanning the line
                        index += 5
                        continue
                    if ">" in link:
                        pend = line.find('>', index + 7)
                        link = line[index:pend]
                    index += 5
                    self.links.append(link)
                    cfile.write("\n{0}\n".format(link))

    def handle(self):
        with open('output.txt', 'w') as txf:
            txf.write(self.output.text)
        with open('output.txt', 'r+') as rfile:
            self.data = rfile.readlines()


weblink = "https://www.itsafe.co.il/"
crawler = ITCrowlerSafe()
t = threading.Thread(target=crawler)
t.start()
threads.append(t)
for thread in threads:
    thread.join()
print(crawler.links)
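And a rough sketch of how the recursive mapping (requirement B) might sit on top of the extraction above. The helper names crawl and extract_links are hypothetical, and html.count("<img") is only a crude image count:

import requests
from urllib.parse import urljoin, urlparse

def extract_links(html):
    # Same string-scanning idea as the class above, simplified;
    # assumes double-quoted href attributes
    links, index = [], 0
    while True:
        index = html.find('href="', index)
        if index == -1:
            break
        end = html.find('"', index + 6)
        if end == -1:
            break
        links.append(html[index + 6:end])
        index = end + 1
    return links

def crawl(url, base, visited, counts):
    # Depth-first walk over same-site links; visited prevents infinite loops
    if url in visited:
        return
    visited.add(url)
    html = requests.get(url).text
    path = urlparse(url).path.lstrip("/")
    if path.endswith(".php"):
        counts[path] = html.count("<img")  # crude per-page image count
    for link in extract_links(html):
        target = urljoin(url, link)
        if urlparse(target).netloc == urlparse(base).netloc:  # stay on-site
            crawl(target, base, visited, counts)

counts = {}
crawl("https://www.itsafe.co.il/", "https://www.itsafe.co.il/", set(), counts)
print(counts)  # feed this into the grouping/json.dump sketch above

The resulting counts dict would then be grouped by its first path component and dumped exactly as in the earlier JSON sketch.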