Oct-30-2019, 08:54 AM
So I'm trying to work with a Scrapy project I found, called RISJbot (GitHub), for extracting the contents of news articles for my research. However, I've run into a problem whose source I can't find, and I haven't found a way to fix it: the spiders can't find the body text (in my case, most importantly, the article text itself). I've tried two of the spiders so far: the CNN one gave mixed results, and the Washington Post one couldn't find a single body text.
It gives me this error message:
ERROR:RISJbot.pipelines.checkcontent:No bodytext: https://www.washingtonpost.com/world/europe/russia-and-cuba-rebuild-ties-that-frayed-after-cold-war/2019/10/29/d046cc0a-fa09-11e9-9e02-1d45cb3dfa8f_story.html
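For what it's worth, the failing page can be opened in scrapy shell to see whether the article paragraphs are reachable with a plain XPath. The selectors below are only my guesses, not necessarily what RISJbot uses internally:

scrapy shell "https://www.washingtonpost.com/world/europe/russia-and-cuba-rebuild-ties-that-frayed-after-cold-war/2019/10/29/d046cc0a-fa09-11e9-9e02-1d45cb3dfa8f_story.html"
>>> response.xpath('//article//p//text()').getall()
>>> response.xpath('//*[@itemprop="articleBody"]//p//text()').getall()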
It also returns this error message; I'm not sure whether it's related to my problem:

ERROR:scrapy.utils.signal:Error caught on signal handler:
Traceback (most recent call last):
  File "C:\Users\sigalizer\Anaconda3\envs\scrapyenv\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
    result = f(*args, **kw)
  File "C:\Users\sigalizer\Anaconda3\envs\scrapyenv\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "C:\Users\sigalizer\Anaconda3\envs\scrapyenv\lib\site-packages\scrapy\extensions\feedexport.py", line 243, in item_scraped
    slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'
When it doesn't find the body text, as a fallback it generates a gzipped, Base64-encoded version of the whole page. I managed to turn off this behaviour to check whether the saved page contains the part I'm looking for, and it does include the body text (in a very distorted form, with all the HTML around it, but I found a couple of the words), so the page does load and the article isn't rendered by JavaScript. What would you recommend I do? I haven't found a way to fix this so far.
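In case it helps, this is roughly how I inspected that fallback copy. The output file name ('items.jl') and the field name ('rawpagegzipb64') are just my assumptions about what the export looks like, so they may need adjusting:

# Decode the gzipped, Base64-encoded fallback copy of each page and check
# whether a phrase from the article body is present in the raw HTML.
# NOTE: 'items.jl' and the 'rawpagegzipb64' field name are assumptions.
import base64
import gzip
import json

with open('items.jl', encoding='utf-8') as f:
    for line in f:
        item = json.loads(line)
        blob = item.get('rawpagegzipb64')
        if not blob:
            continue
        html = gzip.decompress(base64.b64decode(blob)).decode('utf-8', errors='replace')
        # If the phrase is found, the article text was downloaded with the page
        # and isn't injected later by JavaScript.
        print(item.get('url'), 'rebuild ties' in html)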
Here's the spider itself (as you can see, it imports a lot from other files in the project, which is why I shared the GitHub page with you):
# -*- coding: utf-8 -*-
from RISJbot.spiders.newssitemapspider import NewsSitemapSpider
from RISJbot.loaders import NewsLoader
# Note: mutate_selector_del_xpath is somewhat naughty. Read its docstring.
from RISJbot.utils import mutate_selector_del_xpath
from scrapy.loader.processors import Identity, TakeFirst
from scrapy.loader.processors import Join, Compose, MapCompose
import re


class WashingtonPostSpider(NewsSitemapSpider):
    name = 'washingtonpost'
    # allowed_domains = ['washingtonpost.com']
    # A list of XML sitemap files, or suitable robots.txt files with pointers.

    def parse_page(self, response):
        """@returns items 1
           @scrapes bodytext bylines fetchtime firstpubtime headline source url
           @noscrapes modtime
        """
        s = response.selector
        # Remove any content from the tree before passing it to the loader.
        # There aren't native scrapy loader/selector methods for this.
        #mutate_selector_del_xpath(s, '//*[@style="display:none"]')

        l = NewsLoader(selector=s)

        # WaPo's ISO date/time strings are invalid: <datetime>-500 instead of
        # <datetime>-05:00. Note that the various standardised l.add_* methods
        # will generate 'Failed to parse data' log items. We've got it properly
        # here, so they aren't important.
        l.add_xpath('firstpubtime',
                    '//*[@itemprop="datePublished" or '
                    '@property="datePublished"]/@content',
                    MapCompose(self.fix_iso_date))  # CreativeWork

        # These are duplicated in the markup, so uniquise them.
        l.add_xpath('bylines',
                    '//div[@itemprop="author-names"]/span/text()',
                    set)

        l.add_xpath('section',
                    '//*[contains(@class, "headline-kicker")]//text()')

        # Add a number of items of data that should be standardised across
        # providers. Can override these (for TakeFirst() fields) by making
        # l.add_* calls above this line, or supplement gaps by making them
        # below.
        l.add_fromresponse(response)
        l.add_htmlmeta()
        l.add_schemaorg(response)
        l.add_opengraph()
        l.add_scrapymeta(response)

        return l.load_item()

    def fix_iso_date(self, s):
        return re.sub(r'^([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}[+-])'
                      '([0-9])([0-9]{2})$',
                      r'\g<1>0\g<2>:\g<3>',
                      s)
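Just to illustrate what that last regex does (this is only my own quick check, not part of the spider, and the sample timestamp is made up): it pads the Washington Post's truncated UTC offset and inserts the colon, so "-500" becomes "-05:00".

import re

def fix_iso_date(s):
    return re.sub(r'^([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}[+-])'
                  r'([0-9])([0-9]{2})$',
                  r'\g<1>0\g<2>:\g<3>', s)

# Hypothetical WaPo-style timestamp with a broken offset:
print(fix_iso_date('2019-10-29T14:30-500'))   # -> 2019-10-29T14:30-05:00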