Python Forum

Full Version: Extracting link list to json file
Hey guys,
I got an assignment that I need to pass as part of an internship knowledge test.
The mission is to build a basic spider that crawls a site map and saves the output as JSON.

The instructions are below:
A. Create a class that gets a weblink from the user
B. The class must map all the pages on the site recursively
C. You must save all paths (PHP files) and the number of images in each path in a JSON file
Note: Do not use automated tools such as DirBuster, BeautifulSoup (bs4), etc.
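
One possible reading of requirements B and C, sketched with only the standard library: `html.parser.HTMLParser` can collect `href` values and count `<img>` tags in a single pass over a page. Whether `html.parser` counts as an "automated tool" under the note above is an assumption worth checking with whoever set the assignment. The names `PageScanner` and `scan` and the sample HTML are hypothetical, not part of the original post.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Hypothetical helper: collects <a href> values and counts <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []        # instance attribute, not shared between scanners
        self.image_count = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img":
            self.image_count += 1


def scan(html_text):
    """Return (links, image_count) for one page of HTML."""
    scanner = PageScanner()
    scanner.feed(html_text)
    scanner.close()
    return scanner.links, scanner.image_count


# Made-up sample page, only for illustration.
sample = '<a href="courses/python.php">Python</a><img src="a.png"><img src="b.png">'
links, images = scan(sample)
print(links, images)
```

Filtering out `mailto:`, `tel:` and `javascript:` links, and resolving relative paths against the page URL, would still need to happen before following any of the collected links.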

The output should look like this:

{
    "courses":
    {
        "courses/python.php": 2,
        "courses/linux.php": 7,
        "courses/mobile_hacking.php": 1
    },
    "blogs":
    {
        "blogs/google_hack.php": 3,
        "blogs/alexa_hack.php": 1,
        "blogs/whatsapp_hack.php": 5
    }
}
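
A minimal sketch of how a flat mapping of paths to image counts could be turned into the nested shape above. It assumes the outer key is simply the first path segment, which is a guess based on the sample output; the function name `group_by_directory` and the sample data are hypothetical.

```python
import json
from collections import defaultdict


def group_by_directory(image_counts):
    """Group {"courses/python.php": 2, ...} under the first path segment."""
    grouped = defaultdict(dict)
    for path, count in image_counts.items():
        # Assumption: pages without a directory go under an empty-string key.
        directory = path.split("/", 1)[0] if "/" in path else ""
        grouped[directory][path] = count
    return dict(grouped)


# Sample data copied from the desired output, only for illustration.
flat = {
    "courses/python.php": 2,
    "courses/linux.php": 7,
    "blogs/google_hack.php": 3,
}
nested = group_by_directory(flat)
print(json.dumps(nested, indent=4))
```

To save the result to a file instead of printing it, `json.dump(nested, file_object, indent=4)` writes the same structure to an open file.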

I've gotten this far with my code, but I don't know how to actually produce this output. I have been working on it for a couple of weeks and have tried everything I could find on the web, and I have asked skilled friends.
I would appreciate any help. Thanks!

import requests
import threading
import os
import json

threads = []


# noinspection SpellCheckingInspection
class ITCrowlerSafe(object):  # Main class
    link = ''
    links = []
    global weblink

    def __init__(self):  # Getting the webCode and extracting to a text file for the function can do readlines
        self.web = weblink
        self.output = requests.get(self.web)
        self.handle()
        os.remove("output.txt")

    def __call__(self, *args, **kwargs):  # Main function
        with open('links.txt', 'w') as cfile:
            for line in self.data:
                index = 0
                while index < len(line):  # Staying at each line for making sure no href has been left behind
                    index = line.find('href', index)  # Setting first occurrence
                    if index == -1:  # If not found..
                        break

                    pend = line.find('"', index + 7)  # Setting end marker
                    link = line[index:pend]

                    if "javascript:" in link or len(link) < 10 or \
                            "tel" in link or "mailto" in link or \
                            "target" in link or "</" in link or \
                            "href=" not in link or ".ico" in link or \
                            ".jpeg" in link or ".png" in link:  # Dealing with cases for the link to be legal
                        break

                    if ">" in link:
                        pend = line.find('>', index + 7)
                        link = line[index:pend]
                    index += 5

                    self.links.append(link)
                    cfile.write("\n{0}\n".format(link))

    def handle(self):
        with open('output.txt', 'w') as txf:
            txf.write(self.output.text)
        with open('output.txt', 'r+') as rfile:
            self.data = rfile.readlines()


weblink = "https://www.itsafe.co.il/"
crawler = ITCrowlerSafe()
t = threading.Thread(target=crawler)
t.start()

threads.append(t)
for thread in threads:
    thread.join()
print(crawler.links)
Anybody, please?
What's your question? A good question would be something like, "here's my actual output, here's my desired output, and here's my attempt." You've provided the desired output, but you haven't said what's blocking you from turning what you have into what you want. The question should also be minimal - if you're satisfied with your code so far, you can frame the question in terms of that result, and how you're trying to process it.

That aside, I see issues here. There's no need for a global variable. Why are you spawning a single thread? The file stuff you're doing is hacky and unnecessary. You're using a class variable, which might be fine here but wouldn't scale. Your class API is very weird. I would suggest seriously cleaning this up before submitting it, even if you can get it "working" easily.
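
A rough sketch of what the cleanup described above might look like, not a definitive implementation: the start URL is passed to the constructor instead of read from a global, state lives on the instance rather than the class, the response text is parsed directly with no temporary files, and no thread is spawned. The class name `SiteMapper`, the `fetch` hook, and the regular expressions are all assumptions made for illustration; the regexes only handle double-quoted `href` values and are a simplification.

```python
import re
from urllib.parse import urldefrag, urljoin, urlparse

HREF_RE = re.compile(r'href="([^"]+)"')          # simplification: double quotes only
IMG_RE = re.compile(r"<img\b", re.IGNORECASE)    # counts opening <img tags


class SiteMapper:
    """Hypothetical crawler shape: no globals, no temp files, no threads."""

    def __init__(self, start_url, fetch):
        self.start_url = start_url
        self.fetch = fetch          # callable: url -> HTML text (assumed hook)
        self.image_counts = {}      # per-instance state, doubles as the visited set

    def crawl(self, url=None):
        url = url or self.start_url
        if url in self.image_counts:        # already visited
            return
        html = self.fetch(url)
        self.image_counts[url] = len(IMG_RE.findall(html))
        host = urlparse(self.start_url).netloc
        for href in HREF_RE.findall(html):
            # Resolve relative links and drop any #fragment.
            target = urldefrag(urljoin(url, href)).url
            # Stay on the same site; mailto:/tel:/javascript: have no netloc.
            if urlparse(target).netloc == host:
                self.crawl(target)
```

With the `requests` import from the original post, the hook could be something like `SiteMapper(weblink, lambda url: requests.get(url).text)`. Passing the fetcher in also lets the class be exercised against a dictionary of canned pages without touching the network. Plain recursion like this can hit Python's recursion limit on a large site, so an explicit queue may be a safer choice there.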
(Sep-16-2020, 08:41 PM) micseydel wrote: What's your question? A good question would be something like, "here's my actual output, here's my desired output, and here's my attempt." You've provided the desired output, but you haven't said what's blocking you from turning what you have into what you want. The question should also be minimal - if you're satisfied with your code so far, you can frame the question in terms of that result, and how you're trying to process it.

That aside, I see issues here. There's no need for a global variable. Why are you spawning a single thread? The file stuff you're doing is hacky and unnecessary. You're using a class variable, which might be fine here but wouldn't scale. Your class API is very weird. I would suggest seriously cleaning this up before submitting it, even if you can get it "working" easily.

My question was written in black and white:

I've gotten this far with my code, but I don't know how to actually produce this output. I have been working on it for a couple of weeks and have tried everything I could find on the web, and I have asked skilled friends.
I would appreciate any help. Thanks!

I am a relatively new developer.
I ask questions to learn. Don't forget that everyone starts from this position.
Despite the condescending tone, thank you for your feedback and for the time you devoted.
(Sep-17-2020, 08:04 AM) naor wrote: I don't know how to really make this output
This is not a question but a statement of fact. If you knew, you would not be here looking for help. I strongly suggest you read and follow the advice in @micseydel's post.
My intent is not to be condescending. I welcome you to suggest exactly how you would re-word my post. I was trying to be helpful; it seemed your post was (is) unlikely to get a helpful answer, and I was trying to help you improve the post. I was also trying to flag things that I, as someone who has evaluated candidate "take home" assignments before, would flag in a candidate. You can ignore my feedback, but if the people looking at your assignment are anything like me, they'll feel like you still have more to learn before you could join. (And by the way, if your response to "please help us to help you" is "My question was written in black on white", you're unlikely to make progress.)

Also, I hadn't mentioned this before because I was trying to be gentle, but since the cat is out of the bag - if you're asking for help on a forum with an internship challenge question, you're probably not qualified to have that internship. If I were the person evaluating you, and I Googled some of your code and found this thread, I would probably disqualify you for cheating (and probably ban you from joining the company full-time in the future). Cheating aside, I would disqualify you for your attitude - you asked a question on the internet, didn't get a reply, made a second request, and when the free volunteer-based answer you got was a request for clarity, you said that you were actually clear and then name-called me. Would you hire an intern with that behavior?