Python Forum

Full Version: I'm completely stuck (webcrawler)
So I recently started working on a webcrawler project, literally starting from scratch with the help of a few tutorials, but midway through I got very lost.

The spider, the link finder and the domain parser (at least I think that's what it's called) do literally nothing, even though they should, and I'm still getting exit code 0 (actually with the domain .py I'm getting an exit code -1 error message, even though the IDE itself shows no error message).

The IDE I'm using is PyCharm 2017.1


This is the code for the linkfinder:
from html.parser import HTMLParser
from urllib import parse
from general import *


class LinkFinder(HTMLParser):

    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    # When we call HTMLParser feed() this function is called when it encounters an opening tag <a>
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                if attribute == 'href':
                    url = parse.urljoin(self.base_url, value)
                    self.links.add(url)

    def page_links(self):
        return self.links

    def error(self, message):
        pass
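For anyone wanting to reproduce this, the class can be sanity-checked on its own with a small snippet like the one below (the `from general import *` line isn't needed for this check, so it's left out; the HTML string and URLs are just made-up examples):

```python
from html.parser import HTMLParser
from urllib import parse


class LinkFinder(HTMLParser):

    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    # Called by feed() for every opening tag; we only care about <a href="...">
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                if attribute == 'href':
                    url = parse.urljoin(self.base_url, value)
                    self.links.add(url)

    def page_links(self):
        return self.links

    def error(self, message):
        pass


finder = LinkFinder('https://example.com/', 'https://example.com/index.html')
# Feed it some HTML by hand instead of a downloaded page:
finder.feed('<a href="/about">About</a> <a href="https://other.com/x">X</a>')
print(finder.page_links())
# relative hrefs get resolved against base_url by urljoin
```

If this prints an empty set, the parser is never seeing any `<a>` tags, which would point at whatever code downloads and feeds the page rather than at `LinkFinder` itself.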




and this one is for the spider:

from html.parser import HTMLParser
from urllib import parse
from general import *


class LinkFinder(HTMLParser):

    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    # When we call HTMLParser feed() this function is called when it encounters an opening tag <a>
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                if attribute == 'href':
                    url = parse.urljoin(self.base_url, value)
                    self.links.add(url)

    def page_links(self):
        return self.links

    def error(self, message):
        pass
And this one is what I try to use for the domain parsing part (for which I'm getting the exit code 1 error):

from urllib.parse import urlparse


# Get domain name (example.com)
def get_domain_name(url):
    try:
        results = get_sub_domain_name(url).split('.')
        return results[-2] + '.' + results[-1]
    except Exception:
        return ''


# Get sub domain name (name.example.com)
def get_sub_domain_name(url):
    try:
        return urlparse(url).netloc
    except Exception:
        return ''

# The argument must be a quoted string (unquoted, Python looks for a name
# called `www` and crashes); urlparse also needs a scheme to fill in netloc
print(get_domain_name('http://www.startlap.hu'))
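For reference, `urlparse` only fills in `netloc` when the URL actually has a scheme like `http://` in front; without one, the whole string lands in `path` and `get_domain_name` silently returns `''`. A quick demonstration (the URLs are just examples):

```python
from urllib.parse import urlparse

# With a scheme, the host ends up in netloc as expected:
print(repr(urlparse('http://www.startlap.hu/foo').netloc))  # 'www.startlap.hu'

# Without a scheme, netloc is empty and everything goes into path:
print(repr(urlparse('www.startlap.hu/foo').netloc))         # ''
print(repr(urlparse('www.startlap.hu/foo').path))           # 'www.startlap.hu/foo'
```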
I would really appreciate any help as to where I screwed up the code, as I'm very much out of ideas ><

Thank you in advance for your help

For reference's sake: I took the files from https://github.com/buckyroberts/Spider to basically experiment on them and see how they work. Sadly, despite following the YouTube tutorial that goes along with it, I basically got nowhere.
Try placing some print statements at critical points in your software, followed by
input()
to pause. Make sure what you expect is actually what you get!
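Something along these lines, say (the function names here are stand-ins, not the ones from the Spider project):

```python
def find_links(url):
    # Stand-in for the real link finder, so the example runs on its own
    return ['https://example.com/a', 'https://example.com/b']


def crawl(url):
    print(f'crawl() called with {url!r}')       # checkpoint: did we even get here?
    links = find_links(url)
    print(f'crawl() found {len(links)} links')  # checkpoint: what did we actually get back?
    # input('press Enter to continue...')       # uncomment to pause between steps
    return links


crawl('https://example.com/')
```

If a checkpoint never prints, the code before it never ran; if it prints something unexpected (an empty list, an empty string), you've narrowed down exactly where things go wrong.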

If you use an IDE with a debugger, use it; it really does speed things up.