Apr-20-2017, 12:31 PM
So I recently started working on a web crawler project, literally from scratch, with the help of a few tutorials, but midway through I got very lost.
The spider, the link finder and the domain parser (at least I think that's what it's called) do literally nothing, even though they should, and I'm still getting exit code 0 (actually with the domain .py I'm getting an exit code 1 error message, even though the IDE itself shows no error message).
The IDE I'm using is PyCharm 2017.1.
This is the code for the linkfinder:
```python
from html.parser import HTMLParser
from urllib import parse
from general import *


class LinkFinder(HTMLParser):

    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    # When we call HTMLParser's feed(), this method is called each time
    # the parser encounters an opening <a> tag
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                if attribute == 'href':
                    url = parse.urljoin(self.base_url, value)
                    self.links.add(url)

    def page_links(self):
        return self.links

    def error(self, message):
        pass
```

and this one is for the spider:

```python
from html.parser import HTMLParser
from urllib import parse
from general import *


class LinkFinder(HTMLParser):

    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    # When we call HTMLParser's feed(), this method is called each time
    # the parser encounters an opening <a> tag
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                if attribute == 'href':
                    url = parse.urljoin(self.base_url, value)
                    self.links.add(url)

    def page_links(self):
        return self.links

    def error(self, message):
        pass
```

And this one is what I try to use for the domain parsing part (for which I'm getting the exit code 1 error):
```python
from urllib.parse import urlparse


# Get domain name (example.com)
def get_domain_name(url):
    try:
        results = get_sub_domain_name(url).split('.')
        return results[-2] + '.' + results[-1]
    except:
        return ''


# Get sub domain name (name.example.com)
def get_sub_domain_name(url):
    try:
        return urlparse(url).netloc
    except:
        return ''


print(get_domain_name(www.startlap.hu))
```

I would really appreciate any help as to where I screwed up the code, as I'm very much out of ideas ><
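For what it's worth, here is a quick standalone check of how `urlparse` handles a host string (using the same `www.startlap.hu` host as above); `netloc` only gets filled in when the URL carries a scheme:

```python
from urllib.parse import urlparse

# With no scheme, urlparse treats the whole string as a relative path,
# so netloc comes back empty
print(urlparse('www.startlap.hu').netloc)           # → ''

# With an explicit scheme, netloc holds the host
print(urlparse('https://www.startlap.hu').netloc)   # → 'www.startlap.hu'
```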
Thank you in advance for your help
For reference's sake: I took the files from https://github.com/buckyroberts/Spider to basically experiment on them and see how they work. Sadly, despite following the YouTube tutorial that goes along with it, I basically got nowhere.
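In case it helps anyone reproduce this, here is a minimal standalone sketch of the LinkFinder class from above, fed a small HTML string directly via feed() (I dropped the `from general import *` line since nothing from it is used by the class itself, and the example.com URLs and HTML snippet are made up for the test):

```python
from html.parser import HTMLParser
from urllib import parse


class LinkFinder(HTMLParser):

    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    # Called by feed() for every opening tag; collects href targets of <a> tags,
    # resolving relative links against base_url
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                if attribute == 'href':
                    url = parse.urljoin(self.base_url, value)
                    self.links.add(url)

    def page_links(self):
        return self.links

    def error(self, message):
        pass


finder = LinkFinder('https://example.com/', 'https://example.com/index.html')
finder.feed('<a href="/about">About</a> <a href="https://other.org/">Other</a>')
print(sorted(finder.page_links()))
# → ['https://example.com/about', 'https://other.org/']
```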