HTMLParser Reading even the Closing Tag

soothsayerpg · Jul-27-2018, 10:21 AM

Hi guys, since I didn't get any clear response in the Scrapy thread, I've switched and am trying my luck with HTMLParser.

Here's the problem. Whenever I check for 'a', it seems to read

Quote: <a and also includes the </a>

Of course, I checked for 'a' and didn't say to read only the opening a-tag. I tried adding "<" to my 'a', but then nothing matched and it printed nothing. At first it was a mystery to me why I was getting four lines of output for only the two hyperlinks I had created, until I finally figured it out.

For those who are very familiar with HTMLParser, I hope you can help me out. I tried finding a clear solution on the internet, with no luck.

Here's the code:

from html.parser import HTMLParser

class myHtmlParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                if value == 'nofollow':
                    print(value)
                else:
                    print('dofollow')

finder = myHtmlParser()
finder.feed('<html><head></head><title>Test</title><body><h1>Parse me!</h1><a rel="nofollow" href="http://sampledomain.com">sample anchor text</a><a rel="author" href="/video-page.html"></a></body></html>')

Output:
nofollow
dofollow
dofollow
dofollow

**buran** · Jul-27-2018, 11:19 AM

Using BeautifulSoup

from bs4 import BeautifulSoup
html = '''
<html>
 <head>
 </head>
 <title>
  Test
 </title>
 <body>
  <h1>
   Parse me!
  </h1>
  <a href="http://sampledomain.com" rel="nofollow">sample anchor text
  </a>
  <a href="/video-page.html" rel="author">some other anchor text
  </a>
    <a href="/video-page.html">yet another anchor text
  </a>
 </body>
</html>'''
soup = BeautifulSoup(html, 'html.parser')
for a_tag in soup.find_all('a'):
    if a_tag.get('rel', None):
        print(a_tag['rel'])
        if a_tag['rel'] != ['nofollow']:
            print(a_tag['href'])
        else:
            print(a_tag.text)

Output:
['nofollow']
sample anchor text

['author']
/video-page.html

soothsayerpg · Jul-28-2018, 12:27 PM

Is it possible using only HTMLParser, or can I combine BeautifulSoup with HTMLParser in the same file? How about Scrapy, can I call all of them inside my .py file?

**buran** · (This post was last modified: Jul-28-2018, 12:59 PM by buran.)

html = '''
<html>
 <head>
 </head>
 <title>
  Test
 </title>
 <body>
  <h1>
   Parse me!
  </h1>
  <a href="http://sampledomain.com" rel="nofollow">sample anchor text
  </a>
  <a href="/video-page.html" rel="author">some other anchor text
  </a>
    <a href="/video-page.html">yet another anchor text
  </a>
 </body>
</html>'''

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            print("Encountered a start tag:", tag)
            for attr in attrs:
                print(': '.join(attr))

parser = MyHTMLParser()
parser.feed(html)

Output:
Encountered a start tag: a
href: http://sampledomain.com
rel: nofollow
Encountered a start tag: a
href: /video-page.html
rel: author
Encountered a start tag: a
href: /video-page.html

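Your code is working fine. It does exactly what you told it to do: print something for every attribute of every a tag. The closing tag is not involved, because handle_starttag is only called for opening tags.
Here is your code with a small change, so you can see what's going on: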

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                print('attribute: {}, value: {}, now print:'.format(attribute, value))
                if value == 'nofollow':
                    print(value)
                else:
                    print('dofollow')
            print()

Output:
attribute: href, value: http://sampledomain.com, now print:
dofollow
attribute: rel, value: nofollow, now print:
nofollow

attribute: href, value: /video-page.html, now print:
dofollow
attribute: rel, value: author, now print:
dofollow

attribute: href, value: /video-page.html, now print:
dofollow
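
Following the explanation above, one way to get a single result per link is to look up rel directly instead of looping over every attribute. This is only a minimal sketch: the class name RelChecker and the results list are made up for illustration, and any link whose rel does not contain nofollow is assumed to count as 'dofollow'.

```python
from html.parser import HTMLParser


class RelChecker(HTMLParser):
    """Collects one 'nofollow'/'dofollow' verdict per <a> start tag."""

    def __init__(self):
        super().__init__()
        self.results = []  # one entry per <a> tag, not one per attribute

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; a dict allows direct lookup
        attr_map = dict(attrs)
        # rel may be missing or valueless, so fall back to an empty string
        rel = attr_map.get('rel') or ''
        # rel can hold several space-separated tokens, e.g. "nofollow noopener"
        verdict = 'nofollow' if 'nofollow' in rel.lower().split() else 'dofollow'
        self.results.append((attr_map.get('href'), verdict))


checker = RelChecker()
checker.feed('<a rel="nofollow" href="http://sampledomain.com">sample anchor text</a>'
             '<a rel="author" href="/video-page.html"></a>')
for href, verdict in checker.results:
    print(href, verdict)
```

With the two links from the original post this prints two lines, one per link, rather than four.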

soothsayerpg · Aug-02-2018, 07:14 AM

Do you think HTMLParser can handle big crawls, or should I go with Scrapy?

**buran** · (This post was last modified: Aug-02-2018, 07:19 AM by buran.)

Frankly, I have never used either of them in production. I prefer BeautifulSoup to HTMLParser (and I almost always install lxml as well, to use as the parser instead of the default html.parser), and Scrapy has always looked like overkill for my sporadic, do-it-once scraping use cases.

soothsayerpg · Aug-02-2018, 07:34 AM

Got it.

As a rookie, I find both confusing. Each has its own difficulties that I run into every now and then, whether with HTMLParser or with other approaches, with or without modules. Writing a condition is easy, but so far all I know how to do is 'print' the result to see what changed. How can I tell it "if it's not this, then change it to this" and return the result rather than just printing it, and then write that result to a CSV file?

Many thanks for the help @buran
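
One possible way to return results instead of printing them, and then write them to CSV, is sketched below. It is not from the thread: the names LinkCollector, collect_links and write_rows are hypothetical, the 'dofollow' fallback is an assumption about what "change to this" should mean, and io.StringIO stands in for a real file so the example runs without touching the disk.

```python
import csv
import io
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Stores one row per <a> tag on the instance instead of printing it."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            attr_map = dict(attrs)
            # "if it's not this, then change it to this":
            # anything other than nofollow is recorded as dofollow (assumption)
            rel = 'nofollow' if attr_map.get('rel') == 'nofollow' else 'dofollow'
            self.rows.append({'href': attr_map.get('href', ''), 'rel': rel})


def collect_links(html):
    """Parse the HTML and return the collected rows as a list of dicts."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.rows


def write_rows(rows, fileobj):
    """Write the rows as CSV, with a header line, to any file-like object."""
    writer = csv.DictWriter(fileobj, fieldnames=['href', 'rel'])
    writer.writeheader()
    writer.writerows(rows)


rows = collect_links('<a rel="nofollow" href="http://sampledomain.com">x</a>'
                     '<a rel="author" href="/video-page.html">y</a>')

# For a real file, something like this should work:
#     with open('links.csv', 'w', newline='') as f:
#         write_rows(rows, f)
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

The callback only appends to self.rows; the caller reads the list after feed() finishes, which is why the parser methods themselves never need to return anything.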
