Python Forum
[Scrapy] web scrape help
#1
Hello,
I downloaded Scrapy and went through the tutorials, but I'm still trying to understand the selector
aspect of scraping. So I thought I'd scrape a different quotes web page:
website
I created a new project and spider:
# -*- coding: utf-8 -*-
import scrapy


class InspiderSpider(scrapy.Spider):
    name = 'inspider'
    allowed_domains = ['https://www.keepinspiring.me/famous-quotes/']
    start_urls = ['https://www.keepinspiring.me/famous-quotes//']

    def parse(self, response):
        for quotes in response.css('div.author-quotes'):
            yield {
                'text': quotes.css('span.text::text').extract_first(),
                'author': quotes.css('span.quote-author-name::text').extract_first()
            }
I can extract the authors, but I have no luck with the quote text.
Output:
{"text": null, "author": "-Dr. Suess"}, {"text": null, "author": "-Marilyn Monroe"}, {"text": null, "author": null}, {"text": null, "author": "-Stephen King"}, {"text": null, "author": "-Mark Caine"}, {"text": null, "author": "-Helen Keller"}, .....
When I examine the quote element and copy its XPath, I get:
Output:
//*[@id="entry-4812"]/div/div[1]/div[6]/text()
Any help appreciated,
Joe
#2
span.text::text selects the text of a span with class text.
No such element exists on this page; the quote text sits directly inside the div.

A CSS selector that works here is simply ::text.
Technically this selects every text node inside the div (including the author's), but .extract_first() returns only the first one, which is the quote you are after.

An alternative is an XPath such as ./text(), which selects only the text nodes that are direct children of the div.

A couple of non-selector-related notes:
  • Your allowed_domains is being ignored because it contains a full URL instead of a domain (the setting is optional, so your code still works)
  • You should use .get() instead of .extract_first(); it has been the recommended API for a while now
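
For the allowed_domains note, here is a small sketch of what a valid value could look like. It derives the host name from the start URL with the standard library; using urlparse for this is only an illustration on my part, and hard-coding the domain string works just as well.

```python
from urllib.parse import urlparse

# Start URL from the spider above, with the doubled trailing slash removed.
start_urls = ["https://www.keepinspiring.me/famous-quotes/"]

# allowed_domains expects bare host names, without scheme or path.
allowed_domains = sorted({urlparse(url).netloc for url in start_urls})

print(allowed_domains)
```

Both lists can then be assigned as class attributes on the spider. If you also want to follow links to other subdomains, listing the bare keepinspiring.me domain should cover them.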
#3
Thanks, I got quotes and authors.
Joe