Python Forum
Extracting link list to json file
#1
Hey guys,
I got an assignment as part of an internship knowledge test.
The task was to build a basic spider that crawls a site and saves the output as JSON.

The instructions were:
A. Create a class that gets a weblink from the user
B. The class must map all the pages on the site recursively
C. You must save all paths (PHP files) and the number of images in each path in a JSON file
Note: Do not use automated tools such as DirBuster, BeautifulSoup (bs4), etc.
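
Steps A and B can be sketched roughly as follows. This is a minimal illustration, not the expected solution: `crawl`, `HREF_RE`, and the injected `fetch` callable are hypothetical names that do not appear in the thread, and the regular expression is a crude way of finding `href` values that will miss unusual markup.

```python
import re
from urllib.parse import urljoin, urlparse

# Crude pattern for quoted href values (assumption: links are written as href="...")
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def crawl(url, fetch, pages=None):
    """Recursively visit every same-site page reachable from `url`.

    `fetch` is a hypothetical callable that takes a URL and returns the
    page's HTML text (or None on failure). Returns a dict of URL -> HTML.
    """
    if pages is None:
        pages = {}
    if url in pages:  # already visited, so stop the recursion here
        return pages

    html = fetch(url) or ""
    pages[url] = html
    site = urlparse(url).netloc

    for href in HREF_RE.findall(html):
        # Resolve relative links against the current page and drop #fragments
        target = urljoin(url, href).split("#", 1)[0]
        parts = urlparse(target)
        # Follow only http(s) links that stay on the same host
        if parts.scheme in ("http", "https") and parts.netloc == site:
            crawl(target, fetch, pages)
    return pages
```

With `requests`, `fetch` could be something like `lambda u: requests.get(u).text`. On a large site the recursion may hit Python's recursion limit, in which case an explicit queue of URLs to visit is the usual alternative.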

The output should look like this:

{
"courses":
{
"courses/python.php":2,
"courses/linux.php":7,
"courses/mobile_hacking.php":1
},
"blogs":
{
"blogs/google_hack.php":3,
"blogs/alexa_hack.php":1,
"blogs/whatsapp_hack.php":5
}
}
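
One way to produce a structure like this, assuming you already have a mapping from each page URL to its image count, is to group the `.php` paths by their first directory and hand the result to `json`. The names `group_by_section` and `save_report` are hypothetical, and how to treat top-level files or deeper directories is a guess, since the assignment text does not say.

```python
import json
from urllib.parse import urlparse


def group_by_section(image_counts):
    """Turn {url: image_count} into {section: {path: image_count}}.

    Assumption: the "section" is the first directory of the path; files
    directly under the site root are grouped under the empty string.
    """
    result = {}
    for url, count in image_counts.items():
        path = urlparse(url).path.lstrip("/")
        if not path.endswith(".php"):
            continue  # the assignment only asks for PHP files
        section = path.split("/", 1)[0] if "/" in path else ""
        result.setdefault(section, {})[path] = count
    return result


def save_report(image_counts, filename):
    """Write the grouped result to `filename` as indented JSON."""
    with open(filename, "w") as out:
        json.dump(group_by_section(image_counts), out, indent=4)
```

For example, `group_by_section({"https://example.test/courses/python.php": 2})` gives `{"courses": {"courses/python.php": 2}}`.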

I've gotten this far with my code, but I don't know how to actually produce this output. I've been working on it for a couple of weeks and have tried everything I could find on the web and asked skilled friends.
I would appreciate any help. Thanks!

import requests
import threading
import os
import json

threads = []


# noinspection SpellCheckingInspection
class ITCrowlerSafe(object):  # Main class
    link = ''
    links = []
    global weblink

    def __init__(self):  # Fetch the page and dump it to a text file so handle() can use readlines()
        self.web = weblink
        self.output = requests.get(self.web)
        self.handle()
        os.remove("output.txt")

    def __call__(self, *args, **kwargs):  # Main function
        with open('links.txt', 'w') as cfile:
            for line in self.data:
                index = 0
                while index < len(line):  # Staying at each line for making sure no href has been left behind
                    index = line.find('href', index)  # Setting first occurrence
                    if index == -1:  # If not found..
                        break

                    pend = line.find('"', index + 7)  # Setting end marker
                    link = line[index:pend]

                    if "javascript:" in link or len(link) < 10 or \
                            "tel" in link or "mailto" in link or \
                            "target" in link or "</" in link or \
                            "href=" not in link or ".ico" in link or \
                            ".jpeg" in link or ".png" in link:  # Dealing with cases for the link to be legal
                        index += 5  # Skip this href but keep scanning the rest of the line
                        continue

                    if ">" in link:
                        pend = line.find('>', index + 7)
                        link = line[index:pend]
                    index += 5

                    self.links.append(link)
                    cfile.write("\n{0}\n".format(link))

    def handle(self):
        with open('output.txt', 'w') as txf:
            txf.write(self.output.text)
        with open('output.txt', 'r+') as rfile:
            self.data = rfile.readlines()


weblink = "https://www.itsafe.co.il/"
crawler = ITCrowlerSafe()
t = threading.Thread(target=crawler)
t.start()

threads.append(t)
for thread in threads:
    thread.join()
print(crawler.links)
Reply
#2
anybody please?
Reply
#3
What's your question? A good question would be something like, "here's my actual output, here's my desired output, and here's my attempt." You've provided the desired output, but you haven't said what's blocking you from turning what you have into what you want. The question should also be minimal - if you're satisfied with your code so far, you can frame the question in terms of that result, and how you're trying to process it.

That aside, I see issues here. There's no need for a global variable. Why are you spawning a single thread? The file stuff you're doing is hacky and unnecessary. You're using a class variable, which might be fine here but wouldn't scale. Your class API is very weird. I would suggest seriously cleaning this up before submitting it, even if you can get it "working" easily.
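
To make those points concrete, here is a rough sketch of a per-page scanner with no global, no thread, no temporary files, and instance attributes instead of class attributes. It uses the standard library's `html.parser`; whether the assignment's ban on tools like BeautifulSoup also covers that module is an assumption you would need to check. `PageScanner` and `scan` are hypothetical names.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collects <a href> values and counts <img> tags for a single page."""

    def __init__(self):
        super().__init__()
        self.links = []        # instance attribute: not shared between scanners
        self.image_count = 0

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag, including self-closing ones like <img/>
        if tag == "img":
            self.image_count += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def scan(html):
    """Return (links, image_count) for one page's HTML text."""
    scanner = PageScanner()
    scanner.feed(html)
    return scanner.links, scanner.image_count
```

The HTML is passed in as a string (for example `requests.get(url).text`), so the scanning step can be tried on small hand-written snippets without touching the network.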
Reply
#4
(Sep-16-2020, 08:41 PM)micseydel Wrote: What's your question? A good question would be something like, "here's my actual output, here's my desired output, and here's my attempt." You've provided the desired output, but you haven't said what's blocking you from turning what you have into what you want. The question should also be minimal - if you're satisfied with your code so far, you can frame the question in terms of that result, and how you're trying to process it.

That aside, I see issues here. There's no need for a global variable. Why are you spawning a single thread? The file stuff you're doing is hacky and unnecessary. You're using a class variable, which might be fine here but wouldn't scale. Your class API is very weird. I would suggest seriously cleaning this up before submitting it, even if you can get it "working" easily.

My question was written in black on white

I've gotten this far with my code, but I don't know how to actually produce this output. I've been working on it for a couple of weeks and have tried everything I could find on the web and asked skilled friends.
I would appreciate any help. Thanks!

I am a relatively new developer.
I ask questions to learn. Don't forget that everyone starts out in this position.
Despite the condescending tone, thank you for your feedback and for the time you devoted.
Reply
#5
(Sep-17-2020, 08:04 AM)naor Wrote: i don't know how to really make this output
This is not a question but a statement of fact. If you knew, you would not be here looking for help. I strongly suggest you read and follow the advice in @micseydel's post.
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create an MCVE (minimal, complete, verifiable example)
Debug small programs

Reply
#6
My intent is not to be condescending. I welcome you to suggest exactly how you would re-word my post. I was trying to be helpful; it seemed your post was (is) unlikely to get a helpful answer, and I was trying to help you improve the post. I was also trying to flag things that I, as someone who has evaluated candidate "take home" assignments before, would flag in a candidate. You can ignore my feedback, but if the people looking at your assignment are anything like me, they'll feel like you still have more to learn before you could join. (And by the way, if your response to "please help us to help you" is "My question was written in black on white", you're unlikely to make progress.)

Also, I hadn't mentioned this before because I was trying to be gentle, but since the cat is out of the bag: if you're asking for help on a forum with an internship challenge question, you're probably not qualified to have that internship. If I were the person evaluating you, and I Googled some of your code and found this thread, I would probably disqualify you for cheating (and probably ban you from joining the company full-time in the future). Cheating aside, I would disqualify you for your attitude. You asked a question on the internet, didn't get a reply, made a second request, and when the free volunteer-based answer you got was a request for clarity, you said that you were actually clear and then name-called me. Would you hire an intern with that behavior?
Reply

