Python Forum
Navigate Pages in Beautifulsoup
#1
Hi everyone, I am facing an issue. I want to cycle through web pages, printing each upcoming URL, and stop at a specific URL. However, when the script runs, only the second URL is printed. I used information from this url and this one.
What modifications can I make to solve this issue? Thanks

from lxml import etree
import html5lib
import requests
from bs4 import BeautifulSoup


url = "https://www.startpage.com"  
while True:
     request = requests.get(url) #Get URL server status
     soup = BeautifulSoup(request.content, 'html5lib') #Pass url content to Soup
     dom = etree.HTML(str(soup)) #Ini etree
     pages = dom.xpath('//*[@id="content-column"]/div[3]/div/div[6]/div/div/a[2]')[0].get("href") #Find Next Page URL
     print('pages',pages)
     nextpage = requests.get(pages) #Get New URL server status
     nextsoup = BeautifulSoup(nextpage.content, 'html5lib') #Pass New url content to NextSoup
     print(nextsoup) #Check to see if next page content is being viewed
     endpage = dom.xpath('//*[@id="content-column"]/div[3]/div/div[6]/div/div/a[2]')[0].get("href")
     print(dom.xpath('//*[@id="content-column"]/div[3]/div/div[6]/div/div/a[2]')[0].get("href")) #Print the link Pages
     if endpage is 'https://www.endpage.com': #Page to Stop
            break #Break out of loop
Reply
#2
I have no experience with your way of searching the DOM of a webpage so I am not sure. But I see lines 14 to 18 seem to be a copy of lines 9 to 13.
So it occurs to me you should rename the variable "pages" on line 12 to "url" and remove lines 14 to 18 so the "while" loop can handle the new url. "endpage" should then be named "url".
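A minimal sketch of the loop shape suggested above, assuming the page lookup is wrapped in a helper. The names `follow_next_links`, `find_next_url`, `max_pages` and the `fake_site` dictionary are hypothetical and only stand in for the `requests.get` plus `dom.xpath(...)` steps from the original post, so the example runs without network access.

```python
from typing import Callable, List, Optional


def follow_next_links(
    start_url: str,
    stop_url: str,
    find_next_url: Callable[[str], Optional[str]],
    max_pages: int = 50,
) -> List[str]:
    """Follow 'next page' links with one lookup per loop pass."""
    found = []
    url = start_url
    for _ in range(max_pages):  # safety cap so a bad stop URL cannot loop forever
        # Reassign `url` instead of using separate `pages` / `endpage` variables,
        # so the next pass of the loop works on the new page.
        url = find_next_url(url)
        if url is None:  # no next link found
            break
        found.append(url)
        if url == stop_url:  # compare values with ==, not with `is`
            break
    return found


# Hypothetical stand-in for fetching a page and reading its "next" href.
fake_site = {
    "https://www.startpage.com": "https://www.startpage.com/2",
    "https://www.startpage.com/2": "https://www.endpage.com",
}

print(follow_next_links("https://www.startpage.com", "https://www.endpage.com", fake_site.get))
```

With this shape the first URL reported is the one found on the start page, not the start page itself.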
#3
(Feb-19-2022, 01:52 PM)giddyhead Wrote: I am looking to cycle through webpages while printing the upcoming url and stop at a specific url, however when it starts only the second url prints.
Post your real code with a working URL. With example code like this we can only guess what your problem is.
Reply
#4
(Feb-19-2022, 03:01 PM)ibreeden Wrote: I have no experience with your way of searching the DOM of a webpage so I am not sure. But I see lines 14 to 18 seem to be a copy of lines 9 to 13.
So it occurs to me you should rename the variable "pages" on line 12 to "url" and remove lines 14 to 18 so the "while" loop can handle the new url. "endpage" should then be named "url".

Thanks for the information, it worked. I also changed the stop check to
if url in 'https://www.endpage.com': #Page to Stop
to have it break at the URL. The issue now is that it starts from page 2 and not from the first page. I can have it stop on the start page. Thank you for your help. One more question: how might I make it start from the first page (startpage) instead of the second page? Thanks
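A small sketch of how the stop check behaves, assuming the goal is an exact match on the stop URL. `should_stop` is a hypothetical helper, not something from the original script. Note that `in` on two strings is a substring test, and `is` compares object identity rather than text, so `==` is usually the safer choice here.

```python
def should_stop(url: str, stop_url: str) -> bool:
    """True only when url is exactly the stop URL."""
    return url == stop_url


stop_url = "https://www.endpage.com"

print(should_stop("https://www.endpage.com", stop_url))  # True: exact match
print(should_stop("https://www.end", stop_url))          # False: different text
print("https://www.end" in stop_url)                     # True: `in` also matches partial URLs
```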
#5
(Feb-19-2022, 07:03 PM)giddyhead Wrote:
Thanks for the information, it worked. I also changed the stop check to
if url in 'https://www.endpage.com': #Page to Stop
to have it break at the URL. The issue now is that it starts from page 2 and not from the first page. I can have it stop on the start page. Thank you for your help. One more question: how might I make it start from the first page (startpage) instead of the second page? Thanks

Solved, I figured it out. The issue was that I was not checking the value before the next URL took hold. Once again, thanks.
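The final code was not posted, so the following is only a guess at what "checking the value before the next URL took hold" could look like: report and test the current URL first, and only then look up the next one. `walk_pages`, `find_next_url` and `fake_site` are hypothetical names; the dictionary replaces the real `requests.get` and XPath lookup so the sketch runs offline.

```python
from typing import Callable, Iterator, Optional


def walk_pages(
    start_url: str,
    stop_url: str,
    find_next_url: Callable[[str], Optional[str]],
) -> Iterator[str]:
    """Yield each page URL, beginning with the start page itself."""
    url: Optional[str] = start_url
    while url is not None:
        yield url                  # handle the current URL first
        if url == stop_url:        # stop check happens before moving on
            break
        url = find_next_url(url)   # only now replace url with the next page


# Hypothetical stand-in for the site's "next page" links.
fake_site = {
    "https://www.startpage.com": "https://www.startpage.com/2",
    "https://www.startpage.com/2": "https://www.endpage.com",
}

for page in walk_pages("https://www.startpage.com", "https://www.endpage.com", fake_site.get):
    print("page", page)
```

Because the current URL is yielded before the lookup, the start page is printed first and the stop page is printed last.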