Strange BS4 Problem While Scraping RSS Feeds

digitalmatic7 · Feb-15-2018, 01:04 AM

For some reason, when I try to scrape links from any RSS feed, it saves them with improper syntax.

For example, instead of:

['<link>http://url.com/1/</link>', '<link>http://url.com/2/</link>', '<link>http://url.com/3/</link>']

It gives me results like this:

[<link>http://url.com/1/</link>, <link>http://url.com/2/</link>, <link>http://url.com/3/</link>]

When I try to pull the inner text to get a clean list of links with no tags, I get errors:

"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

Here's my full code:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.cbc.ca/cmlink/rss-topstories",
                    headers=
                    {"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) "
                                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                                   "Chrome/60.0.3112.90 Safari/537.36"})

soup = BeautifulSoup(page.content, features="xml")

link_list = soup.find_all('link')
link_list = link_list.text  # the AttributeError is raised here

Any ideas why the list it scrapes is broken?

wavic · Feb-15-2018, 02:05 AM

soup = BeautifulSoup(page.content, 'xml')  # is enough to parse the RSS

When you use the find_all method, it returns a list of the found elements (a ResultSet). The list itself has no 'text' attribute; only the individual elements in it do.

This should give you what you want, probably. I am not as familiar with XML as I would like.

for link in soup.find_all('link'):
    print(link.text)
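
If the goal is the clean list of URL strings from the original question rather than printed output, the same loop can be written as a list comprehension. This is only a minimal sketch: SAMPLE_RSS and extract_links are hypothetical names, the inline feed stands in for page.content so the example runs without a network call, and it assumes the "xml" parser from the question (which needs lxml installed). HTML parsers treat <link> as a void element, so the URL text typically ends up outside the tag with them.

```python
from bs4 import BeautifulSoup

# Hypothetical sample feed standing in for the downloaded page.content.
SAMPLE_RSS = b"""<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://url.com/</link>
    <item><title>One</title><link>http://url.com/1/</link></item>
    <item><title>Two</title><link>http://url.com/2/</link></item>
  </channel>
</rss>"""


def extract_links(xml_bytes):
    """Return the text of every <link> element as a list of plain strings."""
    # Assumption: the "xml" feature (backed by lxml) is available.
    soup = BeautifulSoup(xml_bytes, features="xml")
    # find_all() returns a ResultSet of Tag objects; read the text of each
    # Tag individually instead of asking the ResultSet for .text.
    return [link.get_text(strip=True) for link in soup.find_all("link")]


links = extract_links(SAMPLE_RSS)
print(links)
```

Note that find_all('link') also picks up the channel-level <link>, so the first entry here is the feed's own URL. Searching inside each <item> instead would skip it.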

digitalmatic7 · Feb-15-2018, 03:09 AM

Perfect thanks!

Larz60+ · Feb-15-2018, 03:18 AM

Try this:

import requests
from bs4 import BeautifulSoup

class GetStories:
    def __init__(self):
        self.stories = {}

    def get_stories(self):
        page = requests.get('http://www.cbc.ca/cmlink/rss-topstories')

        soup = BeautifulSoup(page.content, features="lxml")
        next_node = soup.select('item')

        item_number = 1
        for item in next_node:
            stories_key = 'story{}'.format(item_number)
            self.stories[stories_key] = {}
            self.stories[stories_key]['title'] = item.find('title')
            self.stories[stories_key]['link'] = item.find('link')
            self.stories[stories_key]['pubdate'] = item.find('pubdate')
            self.stories[stories_key]['author'] = item.find('author')
            self.stories[stories_key]['category'] = item.find('category')
            self.stories[stories_key]['description'] = item.find('p')
            item_number += 1

    def testit(self):

        for story, content in self.stories.items():
            print('\nstory Number: {}'.format(story))
            print('title: {}'.format(content['title']))
            print('link: {}'.format(content['link']))
            print('pubdate: {}'.format(content['pubdate']))
            print('author: {}'.format(content['author']))
            print('category: {}'.format(content['category']))
            print('description: {}'.format(content['description']))

if __name__ == '__main__':
    gs = GetStories()
    gs.get_stories()
    gs.testit()
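
A variation on the same idea, in case plain strings are wanted instead of Tag objects. This is a rough sketch under stated assumptions, not a drop-in replacement: SAMPLE_RSS, FIELDS, tag_text and parse_stories are hypothetical names, the inline feed stands in for the CBC response, and it assumes the "xml" parser (lxml installed). Under that parser tag names are case-sensitive, so the sketch looks up pubDate rather than pubdate, and <link> keeps its text. The description field is left out because its structure depends on the feed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample feed; a real run would pass page.content instead.
SAMPLE_RSS = b"""<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://url.com/</link>
    <item>
      <title>First story</title>
      <link>http://url.com/1/</link>
      <pubDate>Thu, 15 Feb 2018 01:00:00 EST</pubDate>
      <category>News</category>
    </item>
    <item>
      <title>Second story</title>
      <link>http://url.com/2/</link>
    </item>
  </channel>
</rss>"""

# Child tags to read from each <item>; names are case-sensitive in XML mode.
FIELDS = ("title", "link", "pubDate", "author", "category")


def tag_text(item, name):
    """Return the stripped text of the tag `name` inside item, or None if absent."""
    tag = item.find(name)
    return tag.get_text(strip=True) if tag is not None else None


def parse_stories(xml_bytes):
    """Return one dict of plain strings (or None) per <item> in the feed."""
    soup = BeautifulSoup(xml_bytes, features="xml")
    return [{name: tag_text(item, name) for name in FIELDS}
            for item in soup.find_all("item")]


stories = parse_stories(SAMPLE_RSS)
for number, story in enumerate(stories, start=1):
    print("story {}: {} ({})".format(number, story["title"], story["link"]))
```

Missing tags come back as None rather than raising, so feeds that omit author or category still parse.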
