Python Forum
[Scrapy] web scrape help


[Scrapy] web scrape help - joe_momma - Sep-30-2019

Hello,
I downloaded Scrapy and went through the tutorials, but I'm still trying to understand the selector
aspect of scraping. So I thought I'd scrape a different quotes web page:
https://www.keepinspiring.me/famous-quotes/
I created a new project and spider:
# -*- coding: utf-8 -*-
import scrapy


class InspiderSpider(scrapy.Spider):
    name = 'inspider'
    allowed_domains = ['https://www.keepinspiring.me/famous-quotes/']
    start_urls = ['https://www.keepinspiring.me/famous-quotes//']

    def parse(self, response):
        for quotes in response.css('div.author-quotes'):
            yield {
                'text': quotes.css('span.text::text').extract_first(),
                'author': quotes.css('span.quote-author-name::text').extract_first()
            }
I can extract the authors, but no luck with the quotes.
Output:
{"text": null, "author": "-Dr. Suess"}, {"text": null, "author": "-Marilyn Monroe"}, {"text": null, "author": null}, {"text": null, "author": "-Stephen King"}, {"text": null, "author": "-Mark Caine"}, {"text": null, "author": "-Helen Keller"}, .....
When I examine the quote element and copy its XPath, I get:
Output:
//*[@id="entry-4812"]/div/div[1]/div[6]/text()
Any help appreciated,
Joe


RE: [Scrapy] web scrape help - stranac - Sep-30-2019

span.text::text tries to select the text of a span with the class text.
No such element exists; the quote text is placed directly in the div.

A CSS selector that would work here is simply ::text.
This technically selects all the text nodes inside the div (including the author's), but .extract_first() returns only the first one, which is the quote you are after.

An alternative is to use an XPath such as ./text(), which selects only the text nodes that are direct children of the div.

A couple of non-selector-related notes:
  • Your allowed_domains is being ignored since it contains full URLs instead of domains (it's optional, so your code still works).
  • You should use .get() instead of .extract_first(); that's been the recommended API for a while now.
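
On the allowed_domains note, a small sketch of what "domain instead of full URL" means, using only the standard library. The variable names are just for illustration; in a real spider you would normally hard-code the domain string in the class body.

```python
from urllib.parse import urlparse

start_url = 'https://www.keepinspiring.me/famous-quotes/'

# allowed_domains expects bare host names, with no scheme and no path.
# urlparse().netloc is the host part of the URL.
domain = urlparse(start_url).netloc

allowed_domains = [domain]
start_urls = [start_url]

print(allowed_domains)
```

So the spider's setting would be a list holding the host name only, rather than the URL that was used in the original post.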



RE: [Scrapy] web scrape help - joe_momma - Oct-01-2019

Thanks, I got the quotes and authors.
Joe