Python Forum
Why doesn't my spider find body text?
#1
So I'm trying to work with a Scrapy project I found, called RISJbot (GitHub), to extract the contents of news articles for my research. However, I've run into a problem whose source I can't find, nor a way to fix it: the spider can't find body texts (in my case, most importantly, the articles themselves). I've tried two of these spiders so far: the CNN one gave mixed results, and the Washington Post one couldn't find a single body text.

It gives me this error message:

Error:
ERROR:RISJbot.pipelines.checkcontent:No bodytext: https://www.washingtonpost.com/world/europe/russia-and-cuba-rebuild-ties-that-frayed-after-cold-war/2019/10/29/d046cc0a-fa09-11e9-9e02-1d45cb3dfa8f_story.html
It also returns this error message; I'm not sure whether it's related to my problem:
Error:
ERROR:scrapy.utils.signal:Error caught on signal handler:
Traceback (most recent call last):
  File "C:\Users\sigalizer\Anaconda3\envs\scrapyenv\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
    result = f(*args, **kw)
  File "C:\Users\sigalizer\Anaconda3\envs\scrapyenv\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "C:\Users\sigalizer\Anaconda3\envs\scrapyenv\lib\site-packages\scrapy\extensions\feedexport.py", line 243, in item_scraped
    slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'
When it doesn't find the body text, as a fallback it stores a gzipped, Base64-encoded version of the whole page. I managed to turn this fallback off to check whether the dump contains the part I'm looking for, and it does include the body text (albeit in a very distorted form, with all the HTML markup, but I found a couple of the words). So the page loads in fully, and the body text isn't rendered by JavaScript.
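For anyone who wants to inspect that fallback dump themselves, here is a minimal sketch of how to decode it: Base64-decode, then gunzip. The function name and the round-trip demo are my own; the exact field RISJbot stores the dump in may differ, so treat this as an assumption about the encoding described above, not RISJbot's actual API:

```python
import base64
import gzip

def decode_raw_page(raw_b64: str) -> str:
    """Decode a gzipped, Base64-encoded page dump back into HTML text.
    (Hypothetical helper; the field layout is assumed from the behaviour
    described above, not taken from RISJbot's code.)"""
    compressed = base64.b64decode(raw_b64)
    return gzip.decompress(compressed).decode("utf-8", errors="replace")

# Round-trip demo with a stand-in page:
html = "<html><body><p>Sample body text</p></body></html>"
encoded = base64.b64encode(gzip.compress(html.encode("utf-8"))).decode("ascii")
print("body text" in decode_raw_page(encoded).lower())  # True
```

Searching the decoded HTML for a phrase from the article is a quick way to confirm the text really is in the raw response.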

What would you recommend I do? I haven't been able to find a fix so far.

Here's the spider itself (although as you can see, it imports a lot from other files in the project, which is why I shared the GitHub page with you):

# -*- coding: utf-8 -*-
from RISJbot.spiders.newssitemapspider import NewsSitemapSpider
from RISJbot.loaders import NewsLoader
# Note: mutate_selector_del_xpath is somewhat naughty. Read its docstring.
from RISJbot.utils import mutate_selector_del_xpath
from scrapy.loader.processors import Identity, TakeFirst
from scrapy.loader.processors import Join, Compose, MapCompose
import re

class WashingtonPostSpider(NewsSitemapSpider):
    name = 'washingtonpost'
    # allowed_domains = ['washingtonpost.com']
    # A list of XML sitemap files, or suitable robots.txt files with pointers.
    sitemap_urls = ['https://www.washingtonpost.com/news-sitemaps/index.xml']

    def parse_page(self, response):
        """@url http://www.washingtonpost.com/business/2019/10/25/us-deficit-hit-billion-marking-nearly-percent-increase-during-trump-era/?hpid=hp_hp-top-table-main_deficit-210pm%3Ahomepage%2Fstory-ans
        @returns items 1
        @scrapes bodytext bylines fetchtime firstpubtime headline source url 
        @noscrapes modtime
        """
        s = response.selector
        # Remove any content from the tree before passing it to the loader.
        # There aren't native scrapy loader/selector methods for this.        
        #mutate_selector_del_xpath(s, '//*[@style="display:none"]')

        l = NewsLoader(selector=s)

        # WaPo's ISO date/time strings are invalid: <datetime>-500 instead of
        # <datetime>-05:00. Note that the various standardised l.add_* methods
        # will generate 'Failed to parse data' log items. We've got it properly
        # here, so they aren't important.
        l.add_xpath('firstpubtime',
                    '//*[@itemprop="datePublished" or '
                        '@property="datePublished"]/@content',
                    MapCompose(self.fix_iso_date)) # CreativeWork

        # These are duplicated in the markup, so uniquise them.
        l.add_xpath('bylines',
                    '//div[@itemprop="author-names"]/span/text()',
                    set)
        l.add_xpath('section',
                    '//*[contains(@class, "headline-kicker")]//text()')


        # Add a number of items of data that should be standardised across
        # providers. Can override these (for TakeFirst() fields) by making
        # l.add_* calls above this line, or supplement gaps by making them
        # below.
        l.add_fromresponse(response)
        l.add_htmlmeta()
        l.add_schemaorg(response)
        l.add_opengraph()
        l.add_scrapymeta(response)

        return l.load_item()

    def fix_iso_date(self, s):
        return re.sub(r'^([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}[+-])'
                            '([0-9])([0-9]{2})$',
                      r'\g<1>0\g<2>:\g<3>',
                      s)
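As a side note, the fix_iso_date method above can be tried on its own: it pads WaPo's malformed UTC offset ("-500") into the valid "-05:00" form and leaves already-valid timestamps untouched. A standalone sketch of the same regex:

```python
import re

def fix_iso_date(s):
    # Same substitution as in the spider: turn a trailing "-500"-style
    # offset into a valid "-05:00" ISO 8601 offset.
    return re.sub(r'^([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}[+-])'
                  r'([0-9])([0-9]{2})$',
                  r'\g<1>0\g<2>:\g<3>',
                  s)

print(fix_iso_date('2019-10-29T14:05-500'))    # 2019-10-29T14:05-05:00
print(fix_iso_date('2019-10-29T14:05-05:00'))  # unchanged: already valid
```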
#2
Update: I also checked the NY Times and FOX spiders, and they didn't find the bodytext either, so apparently it's a systematic issue, and those few CNN articles are the outliers (for example this one).

Does anyone have any idea why this might be, and why the CNN one might be different?

Edit: The CBS spider also found the bodytext everywhere (for example here), which makes me even more confused.
#3
two different programming teams -- two different methods
#4
(Oct-30-2019, 05:02 PM)Larz60+ Wrote: two different programming teams -- two different methods

Could you elaborate on that? I guess I left out an important detail: these developers created a separate spider for each media outlet. So technically all of them should work, yet that's not the case. Furthermore, the ones that don't work all have the exact same problem. I just can't find the root cause of the issue.
#5
What I was trying to say is that web sites can be put together in many different ways, so a spider for one won't necessarily work for another, especially if JavaScript is being used.
#6
That's why I pointed out that these are actually separate spiders, slightly modified for each website, yet they all have the same issue. And I can confirm that in a couple of cases (like the Post), the site doesn't use JavaScript for the bodytext, so the issue is somewhere else.