Problem with searching over Beautiful Soup object

Pavel_47 · May-28-2022, 01:50 PM

Well... if I stay on the current version of Selenium, how do I find the Publisher?
I've just tried:

publisher = browser.find_element_by_name('Publisher')

The search failed and threw an exception.

Pavel_47 · May-28-2022, 02:02 PM

Well, this instruction does the job:

publisher = browser.find_elements_by_xpath("//*[contains(text(), 'Publisher')]")

But the actual value of Publisher (i.e. Springer) is in the next field.
How do I advance to the next field?

***snippsat*** · (This post was last modified: May-28-2022, 02:15 PM by snippsat.)

browser.find_elements_by_xpath is the old (deprecated) way. With Selenium 4 it looks like this.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

#--| Setup
options = Options()
options.add_argument("--headless")
options.add_argument("--window-size=1920,1080")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
ser = Service(r"C:\cmder\bin\chromedriver.exe")
browser = webdriver.Chrome(service=ser, options=options)
#--| Parse or automation
url = "https://www.amazon.com/Advanced-Artificial-Intelligence-Robo-Justice-Georgios-ebook/dp/B0B1H2MZKX/ref=sr_1_1?keywords=9783030982058&qid=1653563461&sr=8-1"
browser.get(url)
title = browser.find_element(By.CSS_SELECTOR, '#productTitle')
print(title.text)
# with CSS selector
publisher = browser.find_element(By.CSS_SELECTOR, '#detailBullets_feature_div > ul > li:nth-child(2) > span > span:nth-child(2)')
print(publisher.text)
# With XPath
publisher1 = browser.find_element(By.XPATH, '//*[@id="detailBullets_feature_div"]/ul/li[2]/span/span[2]')
print(publisher1.text)

Output:
Advanced Artificial Intelligence and Robo-Justice
Springer (May 16, 2022)
Springer (May 16, 2022)

Pavel_47 · May-28-2022, 02:30 PM

(May-28-2022, 02:15 PM)snippsat Wrote: browser.find_elements_by_xpath is the old (deprecated) way. With Selenium 4 it looks like this.
Output:
Advanced Artificial Intelligence and Robo-Justice
Springer (May 16, 2022)
Springer (May 16, 2022)

OK, it works.
But this method relies on the layout of this particular book's page.
With another book, the layout may be slightly different.
I think a safer method is to find the tag containing "Publisher", then move to the next tag at the same level of the hierarchy, and finally extract the text from that tag.
Does Selenium provide such methods?
If I remember correctly, BeautifulSoup provides navigation functions for neighboring tags.

***snippsat*** · (This post was last modified: May-28-2022, 06:06 PM by snippsat.)

(May-28-2022, 02:30 PM)Pavel_47 Wrote: I think a safer method is to find the tag containing "Publisher", then move to the next tag at the same level of hierarchy, and finally extract the text from that tag.

Find the tag that holds the whole Product details list.

publisher = browser.find_element(By.CSS_SELECTOR, '#detailBulletsWrapper_feature_div')
print(publisher.text)

Output:
Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming
Product details
Publisher : No Starch Press; 2nd edition (May 3, 2019)
Language : English
Paperback : 544 pages
ISBN-10 : 1593279280
ISBN-13 : 978-1593279288
Reading age : 12 years and up
Lexile measure : 1050L
Item Weight : 2.3 pounds
Dimensions : 7 x 1.25 x 9.25 inches
Best Sellers Rank: #780 in Books (See Top 100 in Books)
#1 in Object-Oriented Design
#1 in Python Programming
#2 in Software Development (Books)
Customer Reviews:
6,496 ratings

Getting a single element would be:

>>> p = publisher.find_elements_by_css_selector('li:nth-child(1) > span > span:nth-child(2)')
>>> p
[<selenium.webdriver.remote.webelement.WebElement (session="26ba57aa713155834023884ce6f18ab7", element="43c80e5e-eee0-49d8-94d4-fd69305b17ec")>]
>>> p[0].text
'No Starch Press; 2nd edition (May 3, 2019)'
>>> p = publisher.find_elements_by_css_selector('li:nth-child(5) > span > span:nth-child(2)')
>>> p[0].text
'978-1593279288'
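
For the "find the label, then step to the next tag" idea quoted above, the XPath following-sibling axis is one option. This is only a sketch: the helper names are made up for illustration, and it assumes the label and its value are sibling span tags, which is what the 'span > span:nth-child(2)' selectors suggest but is not checked here against other book pages.

```python
def value_after_label_xpath(label):
    """Build a relative XPath: first <span> sibling after the <span> whose text contains `label`."""
    # Assumption: label and value are sibling <span> tags inside the details list.
    # A label containing a single quote would need extra escaping.
    return f".//span[contains(text(), '{label}')]/following-sibling::span[1]"


def find_detail(details, label):
    """Return the text of the value that follows `label` inside the `details` element."""
    # Imported here so the XPath helper above can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    return details.find_element(By.XPATH, value_after_label_xpath(label)).text.strip()


# Usage, with the `browser` object created in the setup code earlier in the thread:
# details = browser.find_element(By.CSS_SELECTOR, '#detailBulletsWrapper_feature_div')
# print(find_detail(details, 'Publisher'))
```

Because the search starts from the label text instead of a fixed li:nth-child(n) position, it should keep working when the order of the list items changes, as long as the label/value markup stays the same.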

Quote: If I remember correctly, BeautifulSoup provides navigation functions for neighboring tags.

You can use BS together with Selenium, e.g. as in the linked post.

Pavel_47 · May-29-2022, 10:22 AM

Product details - Ok.
Works fine. Exploring this fragment we can extract Publisher and date.
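
A rough sketch of how that extraction could look in plain Python, working only on the text printed from the details block. The function names are invented for this example, and the "Label : value" line format and the trailing "(date)" are assumptions based on the output shown above.

```python
import re


def parse_details(text):
    """Turn 'Label : value' lines into a dict, skipping lines without a value."""
    details = {}
    for line in text.splitlines():
        # Drop invisible direction marks that may surround the colon (assumption).
        line = line.replace("\u200e", "").replace("\u200f", "")
        label, sep, value = line.partition(":")
        if sep and value.strip():
            details[label.strip()] = value.strip()
    return details


def split_publisher(value):
    """Split 'Name (date)' into (name, date); return (value, None) if there is no date."""
    match = re.match(r"^(.*?)\s*\(([^()]*)\)\s*$", value)
    return (match.group(1), match.group(2)) if match else (value, None)


sample = """Product details
Publisher : No Starch Press; 2nd edition (May 3, 2019)
Language : English
ISBN-13 : 978-1593279288"""

info = parse_details(sample)
print(split_publisher(info["Publisher"]))
```

With a live page, `sample` would be replaced by the `.text` of the '#detailBulletsWrapper_feature_div' element.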
I also tried CSS_SELECTOR to find the Author (please see the screenshot below):

[Image: amazon-search-author-book-location.png]

This method doesn't work for the Author.
I tried using find_element_by_class_name. That doesn't work either.

***snippsat*** · May-29-2022, 11:28 AM

Do you know that you can copy the CSS selector or XPath of a tag when you hover over it in the inspector?
This is a copied CSS selector: '#bylineInfo > span'

title = browser.find_element(By.CSS_SELECTOR, '#productTitle')
print(title.text)
# '#bylineInfo > span' selects the author line, so name the variable accordingly
author = browser.find_element(By.CSS_SELECTOR, '#bylineInfo > span')
print(author.text)

Output:
Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming
Eric Matthes (Author)

Pavel_47 · May-29-2022, 05:12 PM

I'm not sure I understood how it works... I mean the use of the '>' symbol.
I was searching for the Reviews section of this book:
https://www.amazon.com/Discovering-Moder...136677649/
I tried a more classic approach: first find the relevant section by its unique ID, then search within that section, by class name, for the information to extract (the element with that class holds what I want to extract: the string "3.6 by 5").

Here is the snippet I used for that:

reviews_section = browser.find_element_by_id('acrPopover')
score = reviews_section.find_elements_by_class_name('a-icon-alt')
print(score[0].text)

Unfortunately, the print output is empty.
Here is a screenshot of the relevant fragment of the book page, with the points of interest outlined:
[Image: amazon-reviews-section.png]

***snippsat*** · May-30-2022, 09:46 AM

I mean like this: click on ... or, in some cases, right-click works.
Then it is easier, as you get the correct selector or XPath for the chosen tag.

Pavel_47 · Jun-30-2022, 11:17 AM

I tried with css_selector and with the class name: nothing in the print output.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
 
options = Options()
options.add_argument("--headless")
browser = webdriver.Chrome('/usr/bin/chromedriver', options=options)
url = 'https://www.amazon.com/Discovering-Modern-Depth-Peter-Gottschling/dp/0136677649/'
browser.get(url)
reviews1 = browser.find_element_by_css_selector('span.a-icon-alt')
reviews2 = browser.find_element_by_class_name('a-icon-alt')
print(reviews1.text)
print(reviews2.text)
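
One possible explanation, which the thread does not confirm: Selenium's .text returns only text that is rendered as visible, so it comes back empty if the a-icon-alt span is hidden by CSS. Reading the textContent attribute is a common workaround. The helper below is a hypothetical sketch of that fallback, not a verified fix for this page.

```python
def element_text(element):
    """Return the visible text, or fall back to the DOM textContent if that is empty."""
    visible = element.text
    if visible:
        return visible
    # textContent is read from the DOM, so it also covers elements hidden by CSS.
    return (element.get_attribute("textContent") or "").strip()


# Usage with the `browser` object from the snippet above:
# reviews = browser.find_element_by_css_selector('#acrPopover span.a-icon-alt')
# print(element_text(reviews))
```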
