Python Forum
BeautifulSoup not parsing other URLs
#1
Hello again, everyone. Here is the issue I currently have. The script moves on to the second page, for example "https://www.startpage.com/lookup/?search=position%202&version=NUM2200", and then returns to the first page, https://www.startpage.com/lookup/?search=position%201&version=NUM2200. I used urljoin to combine the base URL with the relative URL of the next page, but the script keeps cycling back and forth between page 1 and page 2. Why does it do that, and what can I do to fix it? Thanks.

import re
from urllib.parse import urljoin

import html5lib  # Parser backend used by BeautifulSoup below
import requests
from bs4 import BeautifulSoup
from lxml import etree

# Assumed values: the original post does not show how BASE_URL and url are set.
BASE_URL = 'https://www.startpage.com'
url = 'https://www.startpage.com/lookup/?search=position%201&version=NUM2200'

# Span texts to filter out of the output
SKIP_WORDS = {'cook Book', 'Study Tools', 'Explore More', 'WayPlus', 'Store'}

# Absolute XPath to the link that is followed to reach the next page
NEXT_LINK_XPATH = '/html/body/div[2]/div/section/div[3]/div/div[2]/section/div[1]/div[1]/div[1]/a'

while True:
    request = requests.get(url)  # Fetch the current page
    soup = BeautifulSoup(request.content, 'html5lib')  # Parse the page content
    dom = etree.HTML(str(soup))  # Build an lxml tree for the XPath lookup

    # Join the base URL and the relative href to get the full URL of the next page
    url = urljoin(BASE_URL, dom.xpath(NEXT_LINK_XPATH)[0].get('href'))
    print('This is next url', url)

    # Print the text of every span whose class starts with "text", skipping the filtered words
    for a in soup.find_all('span', {'class': re.compile(r'^text')}):
        if a.text not in SKIP_WORDS:
            print('\n', a.text)

            # with open(f'{chp}.txt', 'w', encoding='utf-8') as f:
            #     f.write(chp + '\n' + i.text)

    print('pages', url)
    print('This is url', url)

    # break  # Break out of loop (the condition guarding this break is not shown in the post)
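
One way to keep a loop like this from bouncing between two pages is to remember which URLs have already been fetched and stop as soon as the "next" link points at one of them. The following is only a minimal sketch under stated assumptions, not a confirmed fix: `crawl`, `fetch`, and `find_next_href` are hypothetical names that do not appear in the post, and the `example.com` URLs are placeholders. A possible cause of the cycling, which cannot be confirmed without the page's HTML, is that the positional XPath matches the "next" link on page 1 but a "previous" link on page 2.

```python
from urllib.parse import urljoin


def crawl(start_url, fetch, find_next_href, max_pages=50):
    """Follow 'next page' links, stopping on a repeat, a dead end, or max_pages.

    fetch(url) -> page content and find_next_href(content) -> href or None
    are hypothetical callables supplied by the caller.
    """
    visited = []   # URLs in the order they were fetched
    seen = set()   # Same URLs, kept in a set for fast membership checks
    url = start_url
    while url and url not in seen and len(visited) < max_pages:
        seen.add(url)
        visited.append(url)
        content = fetch(url)
        href = find_next_href(content)
        # Resolve the (possibly relative) href against the page it came from.
        url = urljoin(url, href) if href else None
    return visited


# Stand-in for the network: each fake "page" is just the href of its link.
pages = {
    "https://example.com/lookup/?page=1": "?page=2",
    "https://example.com/lookup/?page=2": "?page=1",  # Link points back to page 1
}
result = crawl("https://example.com/lookup/?page=1", pages.get, lambda href: href)
print(result)
```

In the script above, `fetch` could wrap `requests.get(url).content` and `find_next_href` could wrap the XPath lookup. The visited set only stops the loop from repeating; if the XPath really does select a "previous" link on page 2, the selector itself would still need to target the "next" link, for example by its link text or a `rel` attribute if the site provides one.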