Python Forum
Problem with searching over Beautiful Soup object
#21
Well... if I stay on the current version of Selenium, how do I find the Publisher?
I've just tried:
publisher = browser.find_element_by_name('Publisher')
The search failed and threw an exception.
Reply
#22
Well, this instruction does the job:
publisher = browser.find_elements_by_xpath("//*[contains(text(), 'Publisher')]")
But the actual value of Publisher (i.e. Springer) is in the next field.
How do I advance to the next field?
Reply
#23
browser.find_elements_by_xpath is the old (deprecated) way. With Selenium 4 it is done like this.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

#--| Setup
options = Options()
options.add_argument("--headless")
options.add_argument("--window-size=1920,1080")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
ser = Service(r"C:\cmder\bin\chromedriver.exe")
browser = webdriver.Chrome(service=ser, options=options)
#--| Parse or automation
url = "https://www.amazon.com/Advanced-Artificial-Intelligence-Robo-Justice-Georgios-ebook/dp/B0B1H2MZKX/ref=sr_1_1?keywords=9783030982058&qid=1653563461&sr=8-1"
browser.get(url)
title = browser.find_element(By.CSS_SELECTOR, '#productTitle')
print(title.text)
# with CSS selector
publisher = browser.find_element(By.CSS_SELECTOR, '#detailBullets_feature_div > ul > li:nth-child(2) > span > span:nth-child(2)')
print(publisher.text)
# With XPath
publisher1 = browser.find_element(By.XPATH, '//*[@id="detailBullets_feature_div"]/ul/li[2]/span/span[2]')
print(publisher1.text) 
Output:
Advanced Artificial Intelligence and Robo-Justice
Springer (May 16, 2022)
Springer (May 16, 2022)
Reply
#24
(May-28-2022, 02:15 PM)snippsat Wrote: browser.find_elements_by_xpath is the old (deprecated) way. With Selenium 4 it is done like this.
Output:
Advanced Artificial Intelligence and Robo-Justice
Springer (May 16, 2022)
Springer (May 16, 2022)
Ok, it works.
But this method relies on the layout of this book.
With another book, the layout may be slightly different.
I think a safer method is to find the tag containing "Publisher", then move to the next tag at the same level of hierarchy, and finally extract the text from that tag.
Does Selenium provide such methods?
If I remember correctly, BeautifulSoup provides navigation functions over neighboring tags.
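For what it's worth, the "find the label, then step to the next tag at the same level" idea can be expressed in Selenium with the XPath following-sibling axis. Below is a minimal sketch, not a definitive solution: the helper names sibling_value_xpath and find_detail_value are hypothetical, and it assumes the label and its value sit in adjacent <span> tags inside #detailBullets_feature_div, as the selectors earlier in this thread suggest.

```python
def sibling_value_xpath(label, container_id="detailBullets_feature_div"):
    """Build an XPath for the <span> that directly follows the label <span>.

    Assumes label and value are sibling <span> tags inside the container.
    """
    return (
        f'//*[@id="{container_id}"]'
        f'//span[contains(text(), "{label}")]'
        "/following-sibling::span[1]"
    )


def find_detail_value(browser, label):
    """Return the text of the field that follows `label`, or None if absent."""
    # Imported here so the XPath helper above stays usable without Selenium.
    from selenium.webdriver.common.by import By

    # find_elements returns an empty list instead of raising when nothing matches.
    matches = browser.find_elements(By.XPATH, sibling_value_xpath(label))
    return matches[0].text.strip() if matches else None
```

With a live session this would be called as find_detail_value(browser, "Publisher"). Because the lookup keys on the label text rather than on the position of the list item, it should be less sensitive to the order of the fields, though it still depends on the assumed label/value markup.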
Reply
#25
(May-28-2022, 02:30 PM)Pavel_47 Wrote: I think a safer method is to find the tag containing "Publisher", then move to the next tag at the same level of hierarchy, and finally extract the text from that tag.
Find the tag that holds the whole Product details list.
publisher = browser.find_element(By.CSS_SELECTOR, '#detailBulletsWrapper_feature_div')
print(publisher.text)
Output:
Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming
Product details
Publisher : No Starch Press; 2nd edition (May 3, 2019)
Language : English
Paperback : 544 pages
ISBN-10 : 1593279280
ISBN-13 : 978-1593279288
Reading age : 12 years and up
Lexile measure : 1050L
Item Weight : 2.3 pounds
Dimensions : 7 x 1.25 x 9.25 inches
Best Sellers Rank: #780 in Books (See Top 100 in Books)
#1 in Object-Oriented Design
#1 in Python Programming
#2 in Software Development (Books)
Customer Reviews:
6,496 ratings
Getting a single element would be:
>>> p = publisher.find_elements_by_css_selector('li:nth-child(1) > span > span:nth-child(2)')
>>> p
[<selenium.webdriver.remote.webelement.WebElement (session="26ba57aa713155834023884ce6f18ab7", element="43c80e5e-eee0-49d8-94d4-fd69305b17ec")>]
>>> p[0].text
'No Starch Press; 2nd edition (May 3, 2019)'
>>> p = publisher.find_elements_by_css_selector('li:nth-child(5) > span > span:nth-child(2)')
>>> p[0].text
'978-1593279288'
Quote:If I remember correctly, BeautifulSoup provides a navigation functions over neighboring tags.
You can use BS together with Selenium, e.g. as in this post.
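As a rough illustration of combining the two: Selenium renders the page, and BeautifulSoup then navigates between neighboring tags with find_next_sibling. This is only a sketch under assumptions: detail_value is a hypothetical helper, and the sample HTML is a made-up fragment shaped like the Product details list, not the real Amazon markup.

```python
from bs4 import BeautifulSoup


def detail_value(html, label):
    """Return the text of the <span> that follows the <span> mentioning `label`."""
    soup = BeautifulSoup(html, "html.parser")
    # Match the first <span> whose own string contains the label text.
    label_tag = soup.find("span", string=lambda s: s and label in s)
    if label_tag is None:
        return None
    # Step to the next <span> at the same level of the hierarchy.
    value_tag = label_tag.find_next_sibling("span")
    return value_tag.get_text(strip=True) if value_tag else None


# Made-up sample (an assumption); with a live session pass browser.page_source.
sample = """
<div id="detailBullets_feature_div"><ul>
  <li><span><span>Publisher : </span><span>No Starch Press; 2nd edition (May 3, 2019)</span></span></li>
  <li><span><span>Language : </span><span>English</span></span></li>
</ul></div>
"""
print(detail_value(sample, "Publisher"))  # No Starch Press; 2nd edition (May 3, 2019)
```

In a real run the call would look like detail_value(browser.page_source, "Publisher").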
Reply
#26
Product details - Ok.
Works fine. Exploring this fragment we can extract Publisher and date.
I also tried CSS_SELECTOR to find the Author (please see the screenshot below).

[Image: amazon-search-author-book-location.png]

This method doesn't work for Author.
I tried using find_element_by_class_name. Doesn't work either.
Reply
#27
Did you know that you can copy the CSS selector or XPath of a tag in the browser's inspector?
This is a copy of the CSS selector: '#bylineInfo > span'
title = browser.find_element(By.CSS_SELECTOR, '#productTitle')
print(title.text)
author = browser.find_element(By.CSS_SELECTOR, '#bylineInfo > span')
print(author.text)
Output:
Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming
Eric Matthes (Author)
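A note on the '>' in '#bylineInfo > span': it is the CSS child combinator, so it matches only <span> tags that are direct children of #bylineInfo, whereas a plain space ('#bylineInfo span') matches <span> tags at any depth. A small sketch with BeautifulSoup's select(), on a made-up HTML fragment (the real Amazon markup may differ):

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration; not the real Amazon markup.
html = """
<div id="bylineInfo">
  <span class="author">Eric Matthes <span class="role">(Author)</span></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# '>' is the child combinator: only <span> tags directly inside #bylineInfo.
direct_children = soup.select("#bylineInfo > span")
# A space is the descendant combinator: <span> tags at any depth.
all_descendants = soup.select("#bylineInfo span")

print(len(direct_children), len(all_descendants))  # 1 2
```

The same selector strings behave the same way when passed to Selenium with By.CSS_SELECTOR.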
Reply
#28
I'm not sure I understand how it works... I mean the use of the '>' symbol.
I'm searching for the Reviews section of this book:
https://www.amazon.com/Discovering-Moder...136677649/
I tried a more classic approach: first find the relevant section by its unique ID, then search within that section, by class name, for the information to extract (the class name identifies the element I want, the string "3.6 by 5").

Here is the snippet I used for that:
reviews_section = browser.find_element_by_id('acrPopover')
score = reviews_section.find_elements_by_class_name('a-icon-alt')
print(score[0].text)
Unfortunately, the print output is empty.
Here is a screenshot of the relevant fragment of the book page, with the "centers of interest" outlined:
[Image: amazon-reviews-section.png]
Reply
#29
I mean like this: click on the ... (or in some cases right-click works).
Then it is easier, as you get the correct selector or XPath for the chosen tag.
[Image: 306D7j.png]
Reply
#30
I tried with both css_selector and class name: nothing in the print output.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
 
options = Options()
options.add_argument("--headless")
browser = webdriver.Chrome('/usr/bin/chromedriver', options=options)
url = 'https://www.amazon.com/Discovering-Modern-Depth-Peter-Gottschling/dp/0136677649/'
browser.get(url)
reviews1 = browser.find_element_by_css_selector('span.a-icon-alt')
reviews2 = browser.find_element_by_class_name('a-icon-alt')
print(reviews1.text)
print(reviews2.text)
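One likely explanation, offered as an assumption rather than a confirmed diagnosis: no exception is raised, so the element is found, but Selenium's .text only returns text that is actually rendered as visible. If the 'a-icon-alt' span is hidden by CSS, .text comes back empty while the DOM textContent still holds the string. A minimal sketch of a fallback (text_of is a hypothetical helper, and FakeHiddenElement is a stand-in so it can be tried without a browser):

```python
def text_of(element):
    """Return visible text, falling back to the DOM textContent when it is empty."""
    visible = element.text.strip()
    if visible:
        return visible
    # textContent includes text that CSS hides from the rendered page.
    raw = element.get_attribute("textContent") or ""
    return raw.strip()


# Stand-in for a WebElement whose text is hidden by CSS (made-up content).
class FakeHiddenElement:
    text = ""

    def get_attribute(self, name):
        return "  hidden text  " if name == "textContent" else None


print(text_of(FakeHiddenElement()))  # hidden text
```

With the session above this would be used as print(text_of(reviews1)).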
Reply

