Scraper issue

scraperwannaB · (This post was last modified: Aug-13-2024, 06:06 PM by scraperwannaB.)

Hi,

I am new to Python and trying to learn. I am working with the scraper code below:

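# import required modules
from bs4 import BeautifulSoup

# read the saved feed; the with-block closes the file automatically
# (encoding is an assumption: most RSS feeds are UTF-8)
with open("output.xml", "r", encoding="utf-8") as file:
    contents = file.read()

# parse as XML (the 'xml' parser needs lxml installed)
soup = BeautifulSoup(contents, 'xml')
titles = soup.find_all('title')

# display the text of each title element
for data in titles:
    print(data.get_text())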

I would like to extract only the contents of the title tags from a webpage. The problem is that the page uses infinite scrolling. I can get the XML from the page, but I only seem to get the titles from the default (first) page. Let me be a bit more detailed:

When I load the page, there is an RSS feed link at the top, which I use to get the XML file that I extract the titles from. I tried scrolling down until the page 2 URL showed in the address bar, but at that point I can no longer see the RSS button that I used at the top of the page. Does the Python script need more code, or can this be resolved by changing the RSS feed URL? I did try changing the URL, but I wasn't getting RSS results, so I am assuming I need something additional in my Python code.

So my main concern is this: if my source for the XML is a link at the top of the webpage (which has infinite scrolling), how can I get the entire webpage's XML into that one XML file, or is that even possible? If it is not possible, how should I change my Python code to make it work?

Any help greatly appreciated!
swb

**Larz60+** · Aug-13-2024, 09:53 PM

Does the XML include a 'title' element?

scraperwannaB · Aug-14-2024, 12:59 AM

(Aug-13-2024, 09:53 PM)Larz60+ Wrote: Does the XML include a 'title' element?

Yes.

**Larz60+** · Aug-14-2024, 07:48 AM

Are the rows consistent (2 elements to each row)?
Does 'title' have any capitalization? (XML is case sensitive)

If all are true, Beautiful Soup should find the element.

One thing you might try is using 'lxml' instead of 'xml' as the parser argument to BeautifulSoup, just on a whim.
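
A quick way to check the capitalization question is to count how each spelling of the tag name appears in the saved file. The following is only a minimal sketch: it uses the standard library's xml.etree.ElementTree rather than BeautifulSoup so it runs without extra installs, and the function name tag_case_variants is made up for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tag_case_variants(xml_text, name="title"):
    """Count every spelling of `name` (ignoring case) used as a tag name."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        # ElementTree writes namespaced tags as "{uri}local"; keep the local part.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name.lower() == name.lower():
            counts[local_name] += 1
    return counts


# Small inline sample (hypothetical data) instead of the real output.xml.
sample = (
    "<rss><channel><title>Feed</title>"
    "<item><Title>Mixed case</Title></item>"
    "<item><title>Lower case</title></item>"
    "</channel></rss>"
)
print(tag_case_variants(sample))
```

If more than one spelling shows up, a case-sensitive search for 'title' will silently skip the others. To run it on the real file, read output.xml into a string first and pass that string in.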

scraperwannaB · (This post was last modified: Aug-14-2024, 12:06 PM by scraperwannaB.)

(Aug-14-2024, 07:48 AM)Larz60+ Wrote: Are the rows consistent (2 elements to each row)?
Does 'title' have any capitalization? (XML is case sensitive)

If all are true, Beautiful Soup should find the element.

One thing you might try is using 'lxml' instead of 'xml' as the parser argument to BeautifulSoup, just on a whim.

Thanks,

I will give that a go and check back.

Edit: just to be clear, the Python code DOES work; I just can't seem to get past the first page. So I believe the first part of your last reply checks out OK, but I will double-check those possible causes.

Thanks again

**deanhystad** · Aug-14-2024, 02:35 PM

Link to Stack Overflow question about getting scrolled content.

https://stackoverflow.com/questions/6904...e-scroller
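
Separately from the scrolling approach, some sites serve the feed itself in pages, so it may be possible to request page after page of XML and merge the titles. The sketch below is only that, a sketch under assumptions: the feed URL and the "paged" query parameter are hypothetical and must be checked against the real site, the feed is assumed to be plain RSS 2.0 without namespaces on title, and xml.etree.ElementTree from the standard library stands in for BeautifulSoup.

```python
import xml.etree.ElementTree as ET
from urllib.error import HTTPError
from urllib.request import urlopen


def titles_from_xml(xml_text):
    """Return the stripped text of every non-empty <title> element."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter("title")
            if el.text and el.text.strip()]


def collect_titles(fetch_page, max_pages=50):
    """Fetch pages 1, 2, ... and merge their titles.

    Stops when a page is missing or contributes no title not seen before.
    Repeated titles (such as the channel title) are kept only once.
    """
    seen = set()
    titles = []
    for page in range(1, max_pages + 1):
        xml_text = fetch_page(page)
        if not xml_text:
            break
        new_titles = [t for t in titles_from_xml(xml_text) if t not in seen]
        if not new_titles:
            break
        seen.update(new_titles)
        titles.extend(new_titles)
    return titles


def fetch_from_site(page, base_url="https://example.com/feed/"):
    """Download one feed page. URL and 'paged' parameter are assumptions."""
    try:
        with urlopen(f"{base_url}?paged={page}") as response:
            return response.read().decode("utf-8")
    except HTTPError:
        return None  # e.g. a 404 once the pages run out
```

Calling collect_titles(fetch_from_site) would then return one list of titles across all pages. Because the page-fetching function is passed in, the merging logic can be tried on saved XML strings before touching the network. If the site has no paged feed at all, a browser-automation tool that performs the scrolling is the usual fallback.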
