HTMLParser Reading even the Closing Tag
Hi guys, since I didn't get any clear response in the Scrapy thread, I shifted and am trying my luck with HTMLParser.

Here's the problem. Whenever I handle the 'a' tag, it seems to read
Quote:<a and also includes the </a>
Of course, I only check for 'a' and didn't state to read only the opening a-tag. I tried adding "<" to my 'a', but then it didn't read or output right, it just outputs nothing. It was a mystery to me at first why I'm getting 4 outputs/prints for only the two hyperlinks I had created, until I finally figured it out.

For those who are very familiar with HTMLParser, I hope you can help me out; I tried finding a clear solution on the internet with no luck.

Here's the code:
from html.parser import HTMLParser

class myHtmlParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                if value == 'nofollow':
                    print(value)
                else:
                    print('dofollow')

finder = myHtmlParser()
finder.feed('<html><head></head><title>Test</title><body><h1>Parse me!</h1><a rel="nofollow" href="http://sampledomain.com">sample anchor text</a><a rel="author" href="/video-page.html"></a></body></html>')
Output:
nofollow
dofollow
dofollow
dofollow
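Maybe I could add something like a handle_endtag method to check what is actually being read? Just a rough sketch of what I mean, not sure if it's the right way:

from html.parser import HTMLParser

class myHtmlParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            print('start tag:', tag, attrs)   # should fire once per opening <a ...>

    def handle_endtag(self, tag):
        if tag == 'a':
            print('end tag:', tag)            # should fire once per closing </a>

finder = myHtmlParser()
finder.feed('<a rel="nofollow" href="http://sampledomain.com">sample anchor text</a>')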
Using BeautifulSoup

from bs4 import BeautifulSoup
html = '''
<html>
 <head>
 </head>
 <title>
  Test
 </title>
 <body>
  <h1>
   Parse me!
  </h1>
  <a href="http://sampledomain.com" rel="nofollow">sample anchor text
  </a>
  <a href="/video-page.html" rel="author">some other anchor text
  </a>
  <a href="/video-page.html">yet another anchor text
  </a>
 </body>
</html>'''
soup = BeautifulSoup(html, 'html.parser')
for a_tag in soup.find_all('a'):
    if a_tag.get('rel', None):
        print(a_tag['rel'])
        if a_tag['rel'] != ['nofollow']:
            print(a_tag['href'])
        else:
            print(a_tag.text)
Output:
['nofollow']
sample anchor text
['author']
/video-page.html
Is it possible using only HTMLParser, or can I combine BeautifulSoup with HTMLParser in the same script? How about Scrapy, can I use all of them inside my .py file?
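Something like this is what I mean by using them in the same .py file (just a sketch to show the idea; LinkParser is only a name I made up):

from html.parser import HTMLParser
from bs4 import BeautifulSoup

class LinkParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            print('HTMLParser saw:', dict(attrs).get('href'))

page = '<a href="http://sampledomain.com">sample anchor text</a>'
LinkParser().feed(page)                                            # handled by html.parser
print('BeautifulSoup saw:', BeautifulSoup(page, 'html.parser').a['href'])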
html = '''
<html>
 <head>
 </head>
 <title>
  Test
 </title>
 <body>
  <h1>
   Parse me!
  </h1>
  <a href="http://sampledomain.com" rel="nofollow">sample anchor text
  </a>
  <a href="/video-page.html" rel="author">some other anchor text
  </a>
  <a href="/video-page.html">yet another anchor text
  </a>
 </body>
</html>'''

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            print("Encountered a start tag:", tag)
            for attr in attrs:
                print(': '.join(attr))

parser = MyHTMLParser()
parser.feed(html)
Output:
Encountered a start tag: a
href: http://sampledomain.com
rel: nofollow
Encountered a start tag: a
href: /video-page.html
rel: author
Encountered a start tag: a
href: /video-page.html
And your code is working fine, it just does what you told it to do - print something for every attribute of every a tag.
Here is your code with a little change, so you can see what's going on:

from html.parser import HTMLParser

class myHtmlParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                print('attribute: {}, value: {}, now print:'.format(attribute, value))
                if value == 'nofollow':
                    print(value)
                else:
                    print('dofollow')
            print()

finder = myHtmlParser()
finder.feed(html)    # feed the html string from the post above
Output:
attribute: href, value: http://sampledomain.com, now print:
dofollow
attribute: rel, value: nofollow, now print:
nofollow

attribute: href, value: /video-page.html, now print:
dofollow
attribute: rel, value: author, now print:
dofollow

attribute: href, value: /video-page.html, now print:
dofollow
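If you want one decision per link instead of one per attribute, you could look at the rel attribute only, roughly like this (RelChecker is just a name I made up for the sketch):

from html.parser import HTMLParser

class RelChecker(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # check only the rel attribute, once per <a> tag
            rel = dict(attrs).get('rel')
            print(rel if rel == 'nofollow' else 'dofollow')

RelChecker().feed(html)    # prints nofollow, dofollow, dofollow for the html above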
Do you think that HTMLParser can handle big crawls, or should I go with Scrapy?
Frankly, I have never used either of them in production. I prefer BeautifulSoup over HTMLParser (and I almost always install lxml in addition, to use it as the parser instead of the default html.parser), and Scrapy has always looked like overkill for my sporadic do-it-once scraping use cases.
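For example, after pip install lxml the only change on the BeautifulSoup side would be the parser name, roughly:

from bs4 import BeautifulSoup

# requires: pip install lxml
soup = BeautifulSoup(html, 'lxml')          # lxml parser instead of the default html.parser
for a_tag in soup.find_all('a'):
    print(a_tag.get('href'))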
I got you.

I'm confused by both; as a rookie, there are areas of difficulty I run into every now and then, with HTMLParser or with other modules. It's easy to write a conditional, but as of now all I know is how to 'print' the result and watch the changes. How can I tell it "if it's not this, then change it to this" and return the result, instead of just printing it, and then write that result to a CSV file? See the sketch below for what I mean.
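Something like this is roughly what I have in mind (using the html string from above, collecting the values instead of printing them and then writing them to a CSV file), but I'm not sure if it's the right approach:

import csv
from bs4 import BeautifulSoup

rows = []
for a_tag in BeautifulSoup(html, 'html.parser').find_all('a'):
    # "if it's not this then change it to this": no rel attribute -> treat it as dofollow
    rel = a_tag.get('rel') or ['dofollow']
    rows.append([a_tag.get('href', ''), ' '.join(rel)])

with open('links.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)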

Many thanks for the help @buran