Python Forum

Full Version: Navigate Pages in Beautifulsoup
Hi everyone, I am facing an issue. I want to cycle through web pages, printing each upcoming URL and stopping at a specific URL. However, when the script starts, only the second URL is printed. I used information from this url and this one.
What modifications can I make to solve this issue? Thanks

from lxml import etree
import html5lib
import requests
from bs4 import BeautifulSoup


url = "https://www.startpage.com"
while True:
    request = requests.get(url)  # Get URL server status
    soup = BeautifulSoup(request.content, 'html5lib')  # Pass url content to Soup
    dom = etree.HTML(str(soup))  # Init etree
    pages = dom.xpath('//*[@id="content-column"]/div[3]/div/div[6]/div/div/a[2]')[0].get("href")  # Find Next Page URL
    print('pages', pages)
    nextpage = requests.get(pages)  # Get New URL server status
    nextsoup = BeautifulSoup(nextpage.content, 'html5lib')  # Pass New url content to NextSoup
    print(nextsoup)  # Check to see if next page content is being viewed
    endpage = dom.xpath('//*[@id="content-column"]/div[3]/div/div[6]/div/div/a[2]')[0].get("href")
    print(dom.xpath('//*[@id="content-column"]/div[3]/div/div[6]/div/div/a[2]')[0].get("href"))  # Print the link Pages
    if endpage is 'https://www.endpage.com':  # Page to Stop
        break  # Break out of loop
I have no experience with your way of searching the DOM of a webpage so I am not sure. But I see lines 14 to 18 seem to be a copy of lines 9 to 13.
So it occurs to me you should rename the variable "pages" on line 12 to "url" and remove lines 14 to 18 so the "while" loop can handle the new url. "endpage" should then be named "url".
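A minimal sketch of the loop shape suggested above, not the poster's actual code. To keep it runnable without network access, the requests + XPath lookup is replaced by a hypothetical in-memory table (FAKE_SITE) and a hypothetical helper (get_next_url); the /page2 and /page3 paths are made up for illustration. The point is only that a single variable "url" is reassigned on each pass, so the "while" loop fetches a new page every time.

```python
# Hypothetical stand-in for the real site: each URL maps to its "next page" URL.
# In the original script this lookup is the requests.get + XPath step.
FAKE_SITE = {
    "https://www.startpage.com": "https://www.startpage.com/page2",
    "https://www.startpage.com/page2": "https://www.startpage.com/page3",
    "https://www.startpage.com/page3": "https://www.endpage.com",
}


def get_next_url(url):
    """Assumed helper: return the next-page href found on `url`."""
    return FAKE_SITE[url]


def follow_links(start_url, stop_url):
    """Do one lookup per iteration and let `url` carry the new address."""
    printed = []
    url = start_url
    while True:
        url = get_next_url(url)  # formerly "pages"; reassigning url drives the loop
        print("pages", url)
        printed.append(url)
        if url == stop_url:      # compare by value
            break
    return printed


visited = follow_links("https://www.startpage.com", "https://www.endpage.com")
```

Note that in this shape the first URL printed is the second page, because the variable is overwritten with the next link before anything is printed.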
(Feb-19-2022, 01:52 PM)giddyhead Wrote: [ -> ]I am looking to cycle through webpages while printing the upcoming url and stop at a specific url, however when it starts only the second url prints.
Post your real code (with a working URL). With example code, we can only guess at what your problem is.
(Feb-19-2022, 03:01 PM)ibreeden Wrote: [ -> ]I have no experience with your way of searching the DOM of a webpage so I am not sure. But I see lines 14 to 18 seem to be a copy of lines 9 to 13.
So it occurs to me you should rename the variable "pages" on line 12 to "url" and remove lines 14 to 18 so the "while" loop can handle the new url. "endpage" should then be named "url".

Thanks for the information, it worked. I also changed the condition to
if url in 'https://www.endpage.com': #Page to Stop
to have it break at the URL. The issue now is that it starts from page 2 and not the first page. I can have it stop on the start page. Thank you for your help. Besides that, I have a question: how might I make it start from the first page (startpage) instead of the second page? Thanks
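For readers comparing the stop conditions used in this thread, here is a small sketch of how the three operators differ on strings. The variable names are illustrative only. "is" tests object identity rather than value (and Python 3.8+ emits a SyntaxWarning when it is used with a literal), "==" tests for an equal value, and "in" tests whether the left string is a substring of the right one, so it also matches partial URLs.

```python
stop_url = "https://www.endpage.com"

# Build an equal string at runtime, as would happen when an href is read from a page.
url = "".join(["https://www.", "endpage", ".com"])

same_value = url == stop_url   # value comparison: what a stop check usually wants
same_object = url is stop_url  # identity comparison: depends on the interpreter, not reliable

# "in" is a substring test, so a shorter string (even an empty one) also matches.
partial = "https://www.end"
partial_matches_in = partial in stop_url
partial_matches_eq = partial == stop_url
empty_matches_in = "" in stop_url


def should_stop(current_url, target_url):
    """Assumed helper: stop only on an exact match."""
    return current_url == target_url
```

With "in", the loop would also stop on any URL that happens to be a fragment of the stop URL, so "==" is the safer choice when an exact match is intended.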
(Feb-19-2022, 07:03 PM)giddyhead Wrote: [ -> ]The issue now is that it starts from page 2 and not the first page.

Completed: I figured it out. The issue was that I was not checking the value before the next URL took hold. Once again, thanks.
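The final code was not posted, so the following is only one possible reading of that fix: print and check the current URL first, and only then replace it with the next one. NEXT_PAGE and walk_pages are hypothetical names, and the in-memory table stands in for the real requests + XPath lookup so the sketch runs offline.

```python
# Hypothetical stand-in for the site: current URL -> next-page URL.
NEXT_PAGE = {
    "https://www.startpage.com": "https://www.startpage.com/page2",
    "https://www.startpage.com/page2": "https://www.endpage.com",
}


def walk_pages(start_url, stop_url):
    """Handle the current URL before advancing, so the start page is included."""
    seen = []
    url = start_url
    while True:
        print("page", url)        # report the page we are on right now
        seen.append(url)
        if url == stop_url:       # check before the next URL takes hold
            break
        url = NEXT_PAGE[url]      # in a real script: fetch url, read the next href
    return seen


pages = walk_pages("https://www.startpage.com", "https://www.endpage.com")
```

Here the start page is reported first and the stop page last, because the check happens before the variable is overwritten.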