Python Forum
im completely stuck (webcrawler)
#1
So I recently started working on a webcrawler project, literally starting from scratch with the help of a few tutorials, but midway through I got very lost.

The spider, the link finder, and the domain parser (at least I think that's what it's called) do literally nothing, even though they should, and I'm still getting exit code 0 (actually, with the domain .py I'm getting an exit code -1 error message, even though the IDE itself shows no error).

The IDE I'm using is PyCharm 2017.1


This is the code for the linkfinder:
from html.parser import HTMLParser
from urllib import parse
from general import *


class LinkFinder(HTMLParser):

    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    # When we call HTMLParser's feed(), this is called for every opening tag;
    # we only care about <a> tags and their href attribute
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                if attribute == 'href':
                    url = parse.urljoin(self.base_url, value)
                    self.links.add(url)

    def page_links(self):
        return self.links

    def error(self, message):
        pass
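For what it's worth, the class can be exercised on its own without the rest of the crawler by feeding it a small piece of HTML (the URLs below are just made-up examples, and the `general` import is dropped since nothing here uses it):

```python
from html.parser import HTMLParser
from urllib import parse


class LinkFinder(HTMLParser):
    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; collect href values from <a> tags,
        # resolving relative links against base_url
        if tag == 'a':
            for attribute, value in attrs:
                if attribute == 'href':
                    self.links.add(parse.urljoin(self.base_url, value))

    def page_links(self):
        return self.links


finder = LinkFinder('https://example.com/', 'https://example.com/index.html')
finder.feed('<a href="/about">About</a> <a href="https://other.site/x">X</a>')
print(finder.page_links())
# a set like {'https://example.com/about', 'https://other.site/x'}
```

If this prints the expected set, the parser itself works and the problem is elsewhere (e.g. it never being fed any HTML).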




and this one is for the spider:

from html.parser import HTMLParser
from urllib import parse
from general import *


class LinkFinder(HTMLParser):

    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    # When we call HTMLParser's feed(), this is called for every opening tag;
    # we only care about <a> tags and their href attribute
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                if attribute == 'href':
                    url = parse.urljoin(self.base_url, value)
                    self.links.add(url)

    def page_links(self):
        return self.links

    def error(self, message):
        pass
And this one is what I try to use for the domain parsing part (for which I'm getting the exit code 1 error):

from urllib.parse import urlparse


# Get domain name (example.com)
def get_domain_name(url):
    try:
        results = get_sub_domain_name(url).split('.')
        return results[-2] + '.' + results[-1]
    except:
        return ''


# Get sub domain name (name.example.com)
def get_sub_domain_name(url):
    try:
        return urlparse(url).netloc
    except:
        return ''


# The URL must be a string; without quotes Python treats www.startlap.hu as a
# chain of name lookups and raises NameError, which is the exit code 1 crash
print(get_domain_name('www.startlap.hu'))
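One more thing worth checking: urlparse only fills in netloc when the URL includes a scheme like http://, so even once the URL is passed as a quoted string, 'www.startlap.hu' yields an empty netloc and get_domain_name falls into the except branch and returns ''. A quick sketch to see the difference:

```python
from urllib.parse import urlparse

# Without a scheme, everything lands in 'path' and netloc stays empty
print(urlparse('www.startlap.hu').netloc)         # ''
print(urlparse('http://www.startlap.hu').netloc)  # 'www.startlap.hu'
```

So the test URL probably needs to be 'http://www.startlap.hu' for the domain functions to return anything.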
I would really appreciate any help on where I screwed up the code, as I'm very much out of ideas ><

Thank you in advance for your help

For reference's sake: I took the files from https://github.com/buckyroberts/Spider to basically experiment on them and see how they work. Sadly, despite following the YouTube tutorial that goes along with it, I basically got nowhere.
#2
Try placing some print statements at critical points in your program, followed by
input()
so execution pauses there. Make sure what you expect is actually what you get!

If you use an IDE with a debugger, use it; it really does speed things up.
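A minimal sketch of that idea, using a simplified stand-in for the domain function (the input() pause is left out here so the snippet runs unattended):

```python
def get_domain_name(url):
    parts = url.split('.')
    print('DEBUG parts:', parts)  # checkpoint: is this what we expect?
    if len(parts) < 2:
        print('DEBUG: too few parts, returning empty string')
        return ''
    return parts[-2] + '.' + parts[-1]


print(get_domain_name('www.startlap.hu'))  # 'startlap.hu'
```

Seeing the intermediate value printed usually reveals exactly where the data stops looking like what you assumed.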