Python Forum
Strange BS4 Problem While Scraping RSS Feeds
#1
For some reason, when I try to scrape links from any RSS feed, it saves them with improper syntax.

For example, instead of:

['<link>http://url.com/1/</link>', '<link>http://url.com/2/</link>', '<link>http://url.com/3/</link>']
It gives me results like this:

[<link>http://url.com/1/</link>, <link>http://url.com/2/</link>, <link>http://url.com/3/</link>]
When I try to pull the inner text to get a clean list of links with no tags, I get errors:

"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

Here's my full code:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.cbc.ca/cmlink/rss-topstories",
                    headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) "
                                           "AppleWebKit/537.36 (KHTML, like Gecko) "
                                           "Chrome/60.0.3112.90 Safari/537.36"})

soup = BeautifulSoup(page.content, features="xml")

link_list = soup.find_all('link')
link_list = link_list.text  # this line raises the AttributeError shown above
Any ideas why the list it scrapes is broken?
#2
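soup = BeautifulSoup(page.content, 'xml')  # pass the downloaded content, not the URL; keep the 'xml' parser, since HTML parsers such as 'lxml' treat <link> as an empty tag
When you use the find_all method, it returns a list (a ResultSet) of the found elements. That list has no 'text' attribute; only the individual elements in it do.

This should give you what you want. Probably. I am not as familiar with XML as I would like.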
for link in soup.find_all('link'):
    print(link.text) 
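
To see why the original list printed without quotes, and why the parser choice matters for <link>, here is a minimal offline sketch. It assumes a small made-up feed (the SAMPLE_RSS string and its URLs are placeholders, not the CBC feed) and that lxml is installed, which the "xml" parser requires.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded feed; not real CBC data.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://url.com/</link>
    <item><title>One</title><link>http://url.com/1/</link></item>
    <item><title>Two</title><link>http://url.com/2/</link></item>
  </channel>
</rss>"""

soup = BeautifulSoup(SAMPLE_RSS, features="xml")

# find_all() returns a ResultSet of Tag objects, not strings,
# which is why printing it shows no quotes around each entry.
tags = soup.find_all("link")

# Take the text of each Tag to build a plain list of strings.
links = [tag.get_text(strip=True) for tag in tags]
print(links)

# For comparison: an HTML parser treats <link> as an empty (void) tag,
# so the URL ends up outside the tag and its text is empty.
html_soup = BeautifulSoup(SAMPLE_RSS, features="html.parser")
html_links = [tag.get_text(strip=True) for tag in html_soup.find_all("link")]
print(html_links)
```

Note that the channel's own <link> is picked up as well; search inside each <item> if only the story links are wanted.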
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#3
Perfect, thanks!
#4
Try this:
import requests
from bs4 import BeautifulSoup

class GetStories:
    def __init__(self):
        self.stories = {}

    def get_stories(self):
        page = requests.get('http://www.cbc.ca/cmlink/rss-topstories')

        # The "xml" parser is needed here: HTML parsers (including "lxml")
        # treat <link> as an empty tag and leave the URL outside it.
        soup = BeautifulSoup(page.content, features="xml")
        next_node = soup.select('item')

        item_number = 1
        for item in next_node:
            stories_key = 'story{}'.format(item_number)
            self.stories[stories_key] = {}
            self.stories[stories_key]['title'] = item.find('title')
            self.stories[stories_key]['link'] = item.find('link')
            # XML tag names are case-sensitive, so the tag is pubDate, not pubdate
            self.stories[stories_key]['pubdate'] = item.find('pubDate')
            self.stories[stories_key]['author'] = item.find('author')
            self.stories[stories_key]['category'] = item.find('category')
            # find() returns None when an item does not have the tag
            self.stories[stories_key]['description'] = item.find('description')
            item_number += 1

    def testit(self):

        for story, content in self.stories.items():
            print('\nstory Number: {}'.format(story))
            print('title: {}'.format(content['title']))
            print('link: {}'.format(content['link']))
            print('pubdate: {}'.format(content['pubdate']))
            print('author: {}'.format(content['author']))
            print('category: {}'.format(content['category']))
            print('description: {}'.format(content['description']))

if __name__ == '__main__':
    gs = GetStories()
    gs.get_stories()
    gs.testit()
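
The class above stores Tag objects, so the printout still shows the surrounding tags. If plain strings are wanted instead, something along these lines might work. This is only a sketch on a made-up, in-memory feed: SAMPLE_RSS and the helper text_or_none are hypothetical names for illustration, and lxml is assumed to be installed for the "xml" parser.

```python
from bs4 import BeautifulSoup

# Hypothetical one-item feed used instead of a live request.
SAMPLE_RSS = """<rss version="2.0"><channel>
<item>
  <title>One</title>
  <link>http://url.com/1/</link>
  <pubDate>Mon, 01 Jan 2018 00:00:00 GMT</pubDate>
</item>
</channel></rss>"""


def text_or_none(item, name):
    """Return the stripped text of the first <name> tag inside item, or None."""
    tag = item.find(name)
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(SAMPLE_RSS, features="xml")

# One dict of plain strings per <item>; missing tags become None.
fields = ("title", "link", "pubDate", "author")
stories = [
    {name: text_or_none(item, name) for name in fields}
    for item in soup.find_all("item")
]
print(stories)
```

Checking for None before calling get_text() avoids an AttributeError on items that lack a tag such as <author>.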