Python Forum
Problem with scraping Website
#1
Good day, everyone. I am having trouble with web scraping and am not sure why my script does not scrape anything. I am using both XPath and BeautifulSoup to find the next page's URL and check that it works, but neither approach gives me a usable link. What am I doing wrong?

import requests
from lxml import etree
import html5lib
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time, re
import csv
import time

start = time.time()

print('Starting Program')       
base ="https://www.studylight.org/lexicons/eng/hebrew/1.html"
url = "https://www.studylight.org/lexicons/eng/hebrew/1.html"

while True:
     
     request = requests.get(urljoin(base,url)) #Get URL server status
     soup = BeautifulSoup(request.content, 'html5lib') #Pass url content to Soup
     
     dom = etree.HTML(str(soup)) #Ini etree
     url = dom.xpath('/html/body/div[1]/div[3]/div[2]/div[4]/form/div/div[3]/div[2]/a') #Find Next Page URL
     url2 = urljoin(base,url)

     urltest2 = soup.find_all("span", class_="greek-hebrew fs-21") #Find next url
     print('Test First url', url2,' Test number 2 ' , urltest2)
     # #for line in soup.find_all('a'):
     #       #print(urljoin(base,line.text))#.get('href'))

     if url2 in 'https://www.studylight.org/lexicons/eng/hebrew/3.html':  # Page to Stop
          break  # Break out of loop

print('Program Completed')
#2
I found a couple of issues in your code:

The XPath expression '/html/body/div[1]/div[3]/div[2]/div[4]/form/div/div[3]/div[2]/a' may not be targeting the next-page link. An absolute path like this breaks as soon as the parsed document differs from the layout it was copied from, so check that it actually matches the anchor tag (<a>) containing the link to the next page.

After running the XPath query, you pass its result straight to urljoin(base, url). However, dom.xpath() returns a list, not a string, so you need to extract the URL string (the href of the first match) from that list before joining it.
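To make the second point concrete, here is a small sketch that uses only urllib.parse. The plain lists stand in for what dom.xpath() hands back, and next_url is a hypothetical helper, not something from your code. It suggests one possible reason the original loop never errors and never advances: if the XPath matches nothing, urljoin receives an empty list and quietly returns base.

```python
from urllib.parse import urljoin

base = "https://www.studylight.org/lexicons/eng/hebrew/1.html"


def next_url(base, xpath_result):
    """Return an absolute URL from an XPath-style result list, or None if empty.

    Hypothetical helper: xpath_result stands in for the list of href strings
    that an XPath query ending in /@href would return.
    """
    if not xpath_result:
        return None
    return urljoin(base, xpath_result[0])


# An empty list is falsy, so urljoin hands back base unchanged.
empty_join = urljoin(base, [])

# A non-empty list is rejected, because urljoin expects a string.
try:
    urljoin(base, ["/lexicons/eng/hebrew/2.html"])
    list_error = None
except TypeError as exc:
    list_error = type(exc).__name__

# Taking the first item out of the list gives urljoin what it needs.
resolved = next_url(base, ["/lexicons/eng/hebrew/2.html"])
missing = next_url(base, [])
```

Returning None for an empty result lets the calling loop tell "no next link found" apart from a real URL, instead of silently refetching the same page.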

Here's a revised version of your code:

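import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

print('Starting Program')
base = "https://www.studylight.org/lexicons/eng/hebrew/1.html"
stop_url = "https://www.studylight.org/lexicons/eng/hebrew/3.html"  # Page to stop at
url = base
visited = set()  # Pages already fetched, so the loop cannot cycle forever

while url not in visited:
    visited.add(url)
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Fail loudly on HTTP errors
    soup = BeautifulSoup(response.content, 'html5lib')

    if url == stop_url:  # Exact comparison, not a substring test
        break

    # This selector matches any link to a Hebrew lexicon page, so the first
    # match is not guaranteed to be the "next" link. Narrow it once you have
    # inspected the page's HTML.
    url_tags = soup.select('a[href^="/lexicons/eng/hebrew/"]')
    if not url_tags:
        break
    url = urljoin(base, url_tags[0]['href'])  # Make the relative href absolute
    print('Next Page URL:', url)

print('Program Completed')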
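One more thing about the loop: the stop test in the original post, url2 in 'https://www.studylight.org/lexicons/eng/hebrew/3.html', is a substring check, and a loop that always follows the first matching link can revisit the same page forever. Below is a rough sketch of the control flow that runs without network access. FAKE_PAGES, LinkCollector and crawl are made-up names for illustration, and the fake pages are an assumption about how the links chain together, not the real site's HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


# Hypothetical stand-in for the site: page URL -> HTML body.
BASE = "https://www.studylight.org/lexicons/eng/hebrew/"
FAKE_PAGES = {
    BASE + "1.html": '<a href="/lexicons/eng/hebrew/2.html">Next</a>',
    BASE + "2.html": '<a href="/lexicons/eng/hebrew/3.html">Next</a>',
    BASE + "3.html": '<a href="/lexicons/eng/hebrew/4.html">Next</a>',
}


def crawl(start_url, stop_url, fetch):
    """Follow the first link on each page until stop_url, a dead end, or a repeat."""
    visited = []
    url = start_url
    while url not in visited:
        visited.append(url)
        if url == stop_url:  # exact match, not a substring test
            break
        parser = LinkCollector()
        parser.feed(fetch(url) or "")  # treat a missing page as empty HTML
        if not parser.hrefs:
            break
        url = urljoin(url, parser.hrefs[0])
    return visited


pages = crawl(BASE + "1.html", BASE + "3.html", FAKE_PAGES.get)
```

Passing fetch in as a parameter means the same loop can later be pointed at a function that wraps requests.get, while the visited list guarantees termination even if a page links back to itself.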
Larz60+ wrote Mar-11-2024, 05:31 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
Tags have been added this time. Please use BBCode tags on future posts.