Getting Correct 'a'-tag output

soothsayerpg · (This post was last modified: Jul-25-2018, 10:21 AM by soothsayerpg.)

I've been trying for almost a week to get the right output for this attribute:

Quote:rel="nofollow"

It works when the a-tag contains that attribute, but when it doesn't, the output is whatever else is inside 'rel=': 'author', 'bookmark', 'tag', and so on.

What I want is this: if the link has 'nofollow', output 'nofollow'; otherwise output 'dofollow'.

Here's my code:

import scrapy


class LinkSpider(scrapy.Spider):
	name =  'TestBOT'
	base_url = ['example.com']
	start_urls = ['I don't know if it's right to publicise the domain name, but if you are familiar with the 'rel' you can test my script on a site you know that has it']

	def parse(self, response):
		linktype = response.css('a::attr(rel)').extract()
		if linktype != 'nofollow':
			print('dofollow')
		else:
			print('nofollow')

		for data in response.css('a'):
			yield {
				'Link': data.css('a::attr(href)').extract(),
				'Anchor': data.css('a::text').extract(),
				'LinkType': data.css('a::attr(rel)').extract(),

			}

I tried XPath, but I'm not getting any more output than with the CSS selector, so I would appreciate it if we stick with the .css selector.

I hope someone can help me with this. Thanks a bunch!

**Larz60+** · (This post was last modified: Jul-25-2018, 10:32 AM by Larz60+.)

parse is a generator, so the statements before the loop that contains yield:

    name =  'TestBOT'
    base_url = ['example.com']
    start_urls = ['I don't know if it's right to publicise the domain name, but if you are familiar with the 'rel' you can test my script on a site you know that has it']
 
    def parse(self, response):
        linktype = response.css('a::attr(rel)').extract()
        if linktype != 'nofollow':
            print('dofollow')
        else:
            print('nofollow')

will only be executed once per call to parse (think of it as initialization).
The loop body is executed each time the generator is asked for its next item, until it is exhausted.
So if you want something to run for every link, it has to be inside the loop.
It is best to make it a separate function and call it from within the loop (much easier for others to read).
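
A minimal sketch of that idea, under some assumptions: link_type and parse_links are made-up names, and plain strings stand in for the rel attribute of each a-tag so the example runs without Scrapy. Note also that .extract() returns a list, so comparing its result with the string 'nofollow' is never equal; the helper below checks a single rel value instead.

```python
def link_type(rel_value):
    """Return 'nofollow' if the rel value contains that token, else 'dofollow'."""
    # rel may be missing (None) or hold several space-separated tokens,
    # e.g. "nofollow noopener" or "author".
    tokens = (rel_value or '').lower().split()
    return 'nofollow' if 'nofollow' in tokens else 'dofollow'


def parse_links(rel_values):
    # Runs once per call, before the first item is produced.
    print('starting')
    for rel in rel_values:
        # Runs once per link, so the per-link check belongs here.
        yield {'rel': rel, 'LinkType': link_type(rel)}


# Plain strings stand in for the rel attribute of each <a> tag.
items = list(parse_links(['nofollow', 'author', None, 'noopener nofollow']))
print([item['LinkType'] for item in items])
```

In the spider, the same helper could be called inside the for loop with a single value such as data.css('a::attr(rel)').extract_first(), which returns None when the attribute is missing.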

soothsayerpg · Jul-26-2018, 05:18 AM

I'm green and didn't get what you're trying to say. Could you elaborate by posting the script you have in mind?

soothsayerpg · Jul-26-2018, 06:25 AM

 
        for data in response.css('a'):
            yield {
                'url': data.css('a::attr(href)').extract(),
                'text': data.css('a::text').extract(),
                'rel': data.css('a::attr(rel)').extract()
            }
            linktype = response.css('a::attr(rel)').extract()
            if linktype != 'nofollow':
                print('dofollow')
            else:
                print('nofollow')

I just had a look at the terminal output, and it seems to print what I intended (the if statement).

But since I'm new to functions, how can I, instead of using 'print', write this to my Excel file when I run the spider, replacing the:

yield {
    'rel': data.css('a::attr(rel)').extract()
}

with the 'if statement output'?
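
One possible way to do this, as a sketch only: compute the label first and put it into the yielded dict in place of the raw rel value. The build_item helper and the sample links are made up for illustration, and the csv module is used here just to show what the resulting rows look like in a file Excel can open.

```python
import csv
import io


def build_item(href, text, rel):
    # rel is None when the <a> tag has no rel attribute.
    tokens = (rel or '').lower().split()
    return {
        'Link': href,
        'Anchor': text,
        # The computed label replaces the raw rel value.
        'LinkType': 'nofollow' if 'nofollow' in tokens else 'dofollow',
    }


# Made-up sample links standing in for what the selectors would return.
rows = [
    build_item('/about', 'About', None),
    build_item('https://example.com/x', 'Sponsor', 'nofollow'),
]

# Write the rows as CSV, a format Excel can open.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['Link', 'Anchor', 'LinkType'])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Inside parse, the loop could then yield build_item(...) with each argument taken from data.css(...).extract_first(). Scrapy can also write the yielded items to CSV on its own when the spider is run as scrapy crawl TestBOT -o links.csv (links.csv is an arbitrary file name).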
