Python Forum

Full Version: Problem with scraping a website
Good day everyone. I am having issues with web scraping, and I am not sure why it does not want to scrape. I am using XPath and also BeautifulSoup to gather the next URL and check whether it works, but it does not. What am I doing wrong?

import requests
from lxml import etree
import html5lib
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time, re
import csv

start = time.time()

print('Starting Program')       
base ="https://www.studylight.org/lexicons/eng/hebrew/1.html"
url = "https://www.studylight.org/lexicons/eng/hebrew/1.html"

while True:
     
     request = requests.get(urljoin(base,url)) #Get URL server status
     soup = BeautifulSoup(request.content, 'html5lib') #Pass url content to Soup
     
     dom = etree.HTML(str(soup)) #Ini etree
     url = dom.xpath('/html/body/div[1]/div[3]/div[2]/div[4]/form/div/div[3]/div[2]/a') #Find Next Page URL
     url2 = urljoin(base,url)

     urltest2 = soup.find_all("span", class_="greek-hebrew fs-21") #Find next url
     print('Test First url', url2,' Test number 2 ' , urltest2)
     # #for line in soup.find_all('a'):
     #       #print(urljoin(base,line.text))#.get('href'))

     if url2 in 'https://www.studylight.org/lexicons/eng/hebrew/3.html':  # Page to Stop
          break  # Break out of loop

print('Program Completed')
A few issues in your code:

The XPath expression '/html/body/div[1]/div[3]/div[2]/div[4]/form/div/div[3]/div[2]/a' might not be accurately targeting the next-page link, and an absolute path like this breaks as soon as the page layout changes. Make sure the expression actually points at the anchor tag (<a>) that links to the next page.

After retrieving the URL with XPath, you join it with the base URL using urljoin(base, url). However, dom.xpath(...) returns a list of elements, not a string, so urljoin raises a TypeError. You need to take the first element from the list and read its href attribute before joining.

Also, your stop condition is reversed: if url2 in 'https://www.studylight.org/lexicons/eng/hebrew/3.html' tests whether url2 is a substring of that literal, not whether the literal matches url2. You probably meant a plain equality check, if url2 == 'https://www.studylight.org/lexicons/eng/hebrew/3.html'.
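To make the second point concrete, here is a minimal, self-contained sketch (using an inline HTML snippet rather than the live site) showing why the xpath() result has to be unpacked before passing it to urljoin:

```python
from lxml import etree
from urllib.parse import urljoin

base = "https://www.studylight.org/lexicons/eng/hebrew/1.html"
html = '<html><body><a href="/lexicons/eng/hebrew/2.html">Next</a></body></html>'

dom = etree.HTML(html)
links = dom.xpath('//a')          # xpath() always returns a LIST of elements
next_href = links[0].get('href')  # pull the href string off the first match
next_url = urljoin(base, next_href)
print(next_url)  # https://www.studylight.org/lexicons/eng/hebrew/2.html
```

Passing the list itself to urljoin (as in the original code) fails, because urljoin expects strings.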

Here's a revised version of your code:

import requests
from lxml import etree
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time

start = time.time()

print('Starting Program')
base = "https://www.studylight.org/lexicons/eng/hebrew/1.html"
url = "https://www.studylight.org/lexicons/eng/hebrew/1.html"

visited = set()
while True:
    full_url = urljoin(base, url)
    if full_url in visited:  # guard against looping back to a page we already fetched
        break
    visited.add(full_url)

    request = requests.get(full_url)
    soup = BeautifulSoup(request.content, 'html5lib')

    url_tags = soup.select('a[href^="/lexicons/eng/hebrew/"]')  # CSS selector for next-page links
    if not url_tags:
        break
    url = url_tags[0]['href']
    print('Next Page URL:', url)

print('Program Completed')
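One more point worth keeping in mind: when following "next" links, it is easy to loop forever if a page links back to one you have already visited. Here is a minimal, self-contained sketch of a visited-set guard, using a hypothetical fetch_next_href callback in place of the real requests/BeautifulSoup fetch:

```python
from urllib.parse import urljoin

def follow_next_links(fetch_next_href, base, start_url, max_pages=50):
    """Follow 'next page' hrefs until they run out, repeat, or hit a cap.

    fetch_next_href(url) is a hypothetical callback that returns the next
    href for the given page, or None when there is none.
    """
    visited = set()
    url = start_url
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        href = fetch_next_href(url)
        url = urljoin(base, href) if href else None
    return visited

# Tiny fake "site" where page 3 links back to page 1.
base = "https://example.com/"
pages = {
    "https://example.com/1": "2",
    "https://example.com/2": "3",
    "https://example.com/3": "1",  # would loop forever without the visited set
}
crawled = follow_next_links(pages.get, base, "https://example.com/1")
print(crawled)  # all three pages, each fetched exactly once
```

The max_pages cap is a second safety net: even if the site generates an endless chain of distinct URLs, the crawl still terminates.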