Python Forum
Scraper issue
#1
Hi,

I am new to Python, but trying to learn. I am experimenting with the scraper code below:

Given:

# import required modules
from bs4 import BeautifulSoup

# read the saved feed; the with-block closes the file afterwards
with open("output.xml", "r") as file:
    contents = file.read()

# parse as XML (the 'xml' parser needs lxml installed)
soup = BeautifulSoup(contents, 'xml')
titles = soup.find_all('title')

# display the text of each title element
for data in titles:
    print(data.get_text())
I would like to extract only the contents of the title tags from a webpage. The problem is that the page uses infinite scrolling. I can get the XML for the page, but I only seem to get the titles from the default (first) page. Let me be a bit more detailed:

When I load the page, there is an RSS feed link at the top, which is what I use to get the XML file I eventually extract the titles from. I tried scrolling down until the page 2 URL showed in the address bar, but at that point I can no longer see the RSS button I used at the top of the page. Does the Python script need more code, or can this be resolved by changing the RSS feed URL? I did try changing the URL and got no RSS results, so I am assuming I need something additional in my Python code.

So my main concern is this: if my source for the XML is a link at the top of the webpage (which has infinite scrolling), how can I get the entire webpage's XML into that one XML file, or is that even possible? If it is not possible, how should I change my Python code to make it possible?

Any help greatly appreciated!
swb
Reply
#2
Does the XML include a 'title' element?
Reply
#3
(Aug-13-2024, 09:53 PM)Larz60+ Wrote: Does the XML include a 'title' element?
Yes.
Reply
#4
Are the rows consistent (2 elements in each row)?
Does 'title' have any capitalization? (XML is case-sensitive.)

If the rows are consistent and 'title' is all lowercase, Beautiful Soup should find the element.

One thing you might try, just on a whim, is using 'lxml' instead of 'xml' as the parser argument to BeautifulSoup.
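
A quick way to check the capitalization question is to count the element names that actually appear in the file. This is only a rough sketch: it uses the standard library's xml.etree.ElementTree rather than BeautifulSoup so it runs without extra installs, and the `tag_counts` helper and the sample feed are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(xml_text):
    """Count element names in an XML document, ignoring any namespace."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        # ElementTree writes namespaced tags as "{uri}name"; keep only the name.
        counts[element.tag.rsplit("}", 1)[-1]] += 1
    return counts


# Hypothetical sample feed; in practice pass the text read from output.xml.
sample = """<rss><channel>
  <title>Feed</title>
  <item><title>First</title></item>
  <item><Title>Second</Title></item>
</channel></rss>"""

counts = tag_counts(sample)
# 'title' and 'Title' are counted separately because XML is case-sensitive.
print(counts["title"], counts["Title"])
```

If the counts show a spelling such as 'Title' alongside 'title', a search for 'title' alone would miss those elements.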
Reply
#5
(Aug-14-2024, 07:48 AM)Larz60+ Wrote: Are the rows consistent (2 elements in each row)?
Does 'title' have any capitalization? (XML is case-sensitive.)

If the rows are consistent and 'title' is all lowercase, Beautiful Soup should find the element.

One thing you might try, just on a whim, is using 'lxml' instead of 'xml' as the parser argument to BeautifulSoup.

Thanks,

I will give that a go and check back.

Edit: just to be clear, the Python code DOES work; I just can't seem to get past the first page. So I believe the points in the first part of your last reply check out OK, but I will double-check those possible causes.

Thanks again
Reply
#6
Link to Stack Overflow question about getting scrolled content.

https://stackoverflow.com/questions/6904...e-scroller
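
If the site happens to serve its feed in pages, one possible approach is to request each page of the feed in turn and stop once a page adds no new titles. The sketch below rests on assumptions that may not hold for the site in question: the `paged` query parameter, the example.com URL, and the helper names `fetch`, `item_titles`, and `collect_titles` are all hypothetical. It uses only the standard library instead of BeautifulSoup so it can be run as-is.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch(url):
    """Download a URL and return its body as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def item_titles(xml_text):
    """Return the text of every <title> that is a direct child of an <item>."""
    root = ET.fromstring(xml_text)
    return [title.text or ""
            for item in root.iter("item")
            for title in item.findall("title")]


def collect_titles(url_template, fetch_page=fetch, max_pages=50):
    """Walk feed pages 1, 2, 3, ... until a page contributes no new titles."""
    seen = set()
    titles = []
    for page in range(1, max_pages + 1):
        try:
            page_titles = item_titles(fetch_page(url_template.format(page=page)))
        except (OSError, ET.ParseError):
            break  # page missing or not valid XML: treat it as the end
        added = 0
        for title in page_titles:
            if title not in seen:
                seen.add(title)
                titles.append(title)
                added += 1
        if added == 0:
            break  # nothing new on this page, so stop requesting more
    return titles


# Hypothetical usage; the URL pattern is an assumption, not a known feed:
# for title in collect_titles("https://example.com/feed/?paged={page}"):
#     print(title)
```

The `max_pages` limit is there so the loop cannot run forever if the site keeps returning fresh content. If the extra items are only loaded by JavaScript as you scroll, the feed will not contain them and a browser-driven approach like the one in the linked question would be needed instead.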
Reply