Python Forum
im completely stuck (webcrawler)
#1
So I recently started working on a webcrawler project, literally starting from scratch with the help of a few tutorials, but midway through I got very lost.

The spider, the link finder, and the domain parser (at least I think that's what it's called) do literally nothing, even though they should, and I'm still getting exit code 0 (actually, with the domain .py I'm getting an exit code -1 error message, even though the IDE itself shows no error).

The IDE I'm using is PyCharm 2017.1


This is the code for the linkfinder:
from html.parser import HTMLParser
from urllib import parse
from general import *


class LinkFinder(HTMLParser):

    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    # When we call HTMLParser's feed(), this is called for every opening tag;
    # we only care about <a> tags and their href attribute
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                if attribute == 'href':
                    url = parse.urljoin(self.base_url, value)
                    self.links.add(url)

    def page_links(self):
        return self.links

    def error(self, message):
        pass
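For what it's worth, the class can be exercised on its own without the rest of the crawler by feeding it a small piece of HTML (the URLs below are just made-up examples, and the `general` import is dropped since nothing here uses it):

```python
from html.parser import HTMLParser
from urllib import parse


class LinkFinder(HTMLParser):
    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; collect href values from <a> tags,
        # resolving relative links against base_url
        if tag == 'a':
            for attribute, value in attrs:
                if attribute == 'href':
                    self.links.add(parse.urljoin(self.base_url, value))

    def page_links(self):
        return self.links


finder = LinkFinder('https://example.com/', 'https://example.com/index.html')
finder.feed('<a href="/about">About</a> <a href="https://other.site/x">X</a>')
print(finder.page_links())
# a set like {'https://example.com/about', 'https://other.site/x'}
```

If this prints the expected set, the parser itself works and the problem is elsewhere (e.g. it never being fed any HTML).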




and this one is for the spider:

from html.parser import HTMLParser
from urllib import parse
from general import *


class LinkFinder(HTMLParser):

    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    # When we call HTMLParser's feed(), this is called for every opening tag;
    # we only care about <a> tags and their href attribute
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                if attribute == 'href':
                    url = parse.urljoin(self.base_url, value)
                    self.links.add(url)

    def page_links(self):
        return self.links

    def error(self, message):
        pass
And this one is what I try to use for the domain parsing part (for which I'm getting the exit code 1 error):

from urllib.parse import urlparse


# Get domain name (example.com)
def get_domain_name(url):
    try:
        results = get_sub_domain_name(url).split('.')
        return results[-2] + '.' + results[-1]
    except:
        return ''


# Get sub domain name (name.example.com)
def get_sub_domain_name(url):
    try:
        return urlparse(url).netloc
    except:
        return ''


# The URL must be a string; without quotes Python treats www.startlap.hu as a
# chain of name lookups and raises NameError, which is the exit code 1 crash
print(get_domain_name('www.startlap.hu'))
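One more thing worth checking: urlparse only fills in netloc when the URL includes a scheme like http://, so even once the URL is passed as a quoted string, 'www.startlap.hu' yields an empty netloc and get_domain_name falls into the except branch and returns ''. A quick sketch to see the difference:

```python
from urllib.parse import urlparse

# Without a scheme, everything lands in 'path' and netloc stays empty
print(urlparse('www.startlap.hu').netloc)         # ''
print(urlparse('http://www.startlap.hu').netloc)  # 'www.startlap.hu'
```

So the test URL probably needs to be 'http://www.startlap.hu' for the domain functions to return anything.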
I would really appreciate any help on where I screwed up the code, as I'm very much out of ideas ><

Thank you in advance for your help

For reference's sake: I took the files from https://github.com/buckyroberts/Spider to basically experiment on them and see how they work. Sadly, despite following the YouTube tutorial that goes along with it, I basically got nowhere.
#2
Try placing some print statements at critical points in your program, followed by
input()
so execution pauses there. Make sure what you expect is actually what you get!

If you use an IDE with a debugger, use it; it really does speed things up.
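A minimal sketch of that idea, using a simplified stand-in for the domain function (the input() pause is left out here so the snippet runs unattended):

```python
def get_domain_name(url):
    parts = url.split('.')
    print('DEBUG parts:', parts)  # checkpoint: is this what we expect?
    if len(parts) < 2:
        print('DEBUG: too few parts, returning empty string')
        return ''
    return parts[-2] + '.' + parts[-1]


print(get_domain_name('www.startlap.hu'))  # 'startlap.hu'
```

Seeing the intermediate value printed usually reveals exactly where the data stops looking like what you assumed.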