Python Forum

[Scrapy] web scrape help
Hello,
I downloaded Scrapy and went through the tutorials, but I'm still trying to understand the selector
aspect of scraping. So I thought I'd scrape a different quotes web page:
website
I created a new project and spider:
# -*- coding: utf-8 -*-
import scrapy


class InspiderSpider(scrapy.Spider):
    name = 'inspider'
    allowed_domains = ['https://www.keepinspiring.me/famous-quotes/']
    start_urls = ['https://www.keepinspiring.me/famous-quotes//']

    def parse(self, response):
        for quotes in response.css('div.author-quotes'):
            yield {
                'text': quotes.css('span.text::text').extract_first(),
                'author': quotes.css('span.quote-author-name::text').extract_first()
            }
I can extract the authors, but I've had no luck with the quotes.
Output:
{"text": null, "author": "-Dr. Suess"}, {"text": null, "author": "-Marilyn Monroe"}, {"text": null, "author": null}, {"text": null, "author": "-Stephen King"}, {"text": null, "author": "-Mark Caine"}, {"text": null, "author": "-Helen Keller"}, .....
When I examine the quote element and copy its XPath, I get:
Output:
//*[@id="entry-4812"]/div/div[1]/div[6]/text()
Any help appreciated,
Joe
span.text::text selects the text of a span with the class text.
No such element exists; the quote text is placed directly in the div.

A CSS selector that would work here is simply ::text.
Technically, this selects all the text nodes inside the div (including the author's), but .extract_first() returns only the first one, which is the quote you are after.

An alternative is an XPath such as ./text(), which selects only the text nodes that are direct children of the div.

A couple of non-selector-related notes:
  • Your allowed_domains is being ignored because it contains full URLs instead of domains (it's optional, so your code still works).
  • You should use .get() instead of .extract_first(); it has been the recommended API for a while now.
Thanks, I got the quotes and authors.
Joe