Python Forum

Full Version: Problem with searching over Beautiful Soup object
Hello,

Here is the BS object from which I want to extract the "Publisher" value.
The value I want to extract is "Springer; 1st ed. 2020 edition (April 27, 2020)".
How do I proceed?
Thanks in advance.

<ul class="a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list">
 <li>
  <span class="a-list-item">
   <span class="a-text-bold">
    ASIN
                                    ‏
                                        :
                                    ‎
   </span>
   <span>
    B087R8CYZB
   </span>
  </span>
 </li>
 <li>
  <span class="a-list-item">
   <span class="a-text-bold">
    Publisher
                                    ‏
                                        :
                                    ‎
   </span>
   <span>
    Springer; 1st ed. 2020 edition (April 27, 2020)
   </span>
  </span>
 </li>
 <li>
  <span class="a-list-item">
   <span class="a-text-bold">
    Publication date
                                    ‏
                                        :
                                    ‎
   </span>
   <span>
    April 27, 2020
   </span>
  </span>
 </li>
 <li>
  <span class="a-list-item">
   <span class="a-text-bold">
    Language
                                    ‏
                                        :
                                    ‎
   </span>
   <span>
    English
   </span>
  </span>
 </li>
 <li>
  <span class="a-list-item">
   <span class="a-text-bold">
    File size
                                    ‏
                                        :
                                    ‎
   </span>
   <span>
    98586 KB
   </span>
  </span>
 </li>
 <li>
  <span class="a-list-item">
   <span class="a-text-bold">
    Text-to-Speech
                                    ‏
                                        :
                                    ‎
   </span>
   <span>
    Enabled
   </span>
  </span>
 </li>
 <li>
  <span class="a-list-item">
   <span class="a-text-bold">
    Screen Reader
                                    ‏
                                        :
                                    ‎
   </span>
   <span>
    Supported
   </span>
  </span>
 </li>
 <li>
  <span class="a-list-item">
   <span class="a-text-bold">
    Enhanced typesetting
                                    ‏
                                        :
                                    ‎
   </span>
   <span>
    Enabled
   </span>
  </span>
 </li>
 <li>
  <span class="a-list-item">
   <span class="a-text-bold">
    X-Ray
                                    ‏
                                        :
                                    ‎
   </span>
   <span>
    Not Enabled
   </span>
  </span>
 </li>
 <li>
  <span class="a-list-item">
   <span class="a-text-bold">
    Word Wise
                                    ‏
                                        :
                                    ‎
   </span>
   <span>
    Not Enabled
   </span>
  </span>
 </li>
 <li>
  <span class="a-list-item">
   <span class="a-text-bold">
    Print length
                                    ‏
                                        :
                                    ‎
   </span>
   <span>
    832 pages
   </span>
  </span>
 </li>
 <li>
  <span class="a-list-item">
   <span class="a-text-bold">
    Lending
                                    ‏
                                        :
                                    ‎
   </span>
   <span>
    Not Enabled
   </span>
  </span>
 </li>
</ul>
(May-15-2022, 10:55 AM)Larz60+ Wrote: please start here:
Web-Scraping part-1
Web-scraping part-2
Thanks.
Well, it seems I've found a solution. It's probably not very elegant, but it works:

# Loop over every <li> detail row; keep only the letters of the label text,
# and when the label is 'Publisher' print the text of the value <span> (contents[3]).
for item in book_details.find_all('li'):
    item_name = ''.join(filter(str.isalpha, str(item.span.span.contents[0])))
    if item_name == 'Publisher':
        item_value = item.span.contents[3].contents[0]
        print(item_value)
where book_details is the BS object from my previous post.
Any suggestions are welcome.
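For comparison, here is a minimal sketch that builds a dict of all label/value pairs with get_text(strip=True) instead of index-based contents access (assuming, as above, that book_details is the parsed <ul> from the first post; the details name is just for illustration):

# Assumption: book_details is the parsed <ul class="...detail-bullet-list"> shown in the first post.
details = {}
for item in book_details.find_all('li'):
    # Each row holds exactly two spans inside span.a-list-item: the bold label and the value.
    label_span, value_span = item.select('span.a-list-item > span')
    # Trim the bidi marks, colon and whitespace that trail the label text.
    label = label_span.get_text(strip=True).rstrip('\u200e\u200f: \n')
    details[label] = value_span.get_text(strip=True)

print(details.get('Publisher'))  # Springer; 1st ed. 2020 edition (April 27, 2020)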
from bs4 import BeautifulSoup

html = '''\
<li>
  <span class="a-list-item">
    <span class="a-text-bold">
      Publisher
    </span>
    <span>
      Springer; 1st ed. 2020 edition (April 27, 2020)
    </span>
  </span>
</li>

<li>
  <span class="a-list-item">
    <span class="a-text-bold">
      Publication date
    </span>
    <span>
      April 27, 2020
    </span>
  </span>'''

soup = BeautifulSoup(html, 'lxml')
>>> tag = soup.find('span', class_="a-list-item")
>>> tag.find_all('span')[0].text.strip()
'Publisher'
>>> tag.find_all('span')[1].text.strip()
'Springer; 1st ed. 2020 edition (April 27, 2020)'
Also remember that BS supports CSS selectors.
>>> tag = soup.select_one('span > span:nth-child(2)')
>>> tag.text.strip()
'Springer; 1st ed. 2020 edition (April 27, 2020)'
This is what I use most, as when you look in the browser (Inspect) you can copy the selector and get the path automatically, as shown here.
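For example, a selector copied from the browser for the Publisher row might look like the one below (the exact path is an assumption; it depends on the markup you inspect, here the <ul> from the first post):

# Assumption: soup is parsed from the full <ul> detail list in the first post;
# the 2nd <li> is the Publisher row, and its 2nd child <span> holds the value.
copied_selector = 'li:nth-child(2) > span > span:nth-child(2)'
tag = soup.select_one(copied_selector)
print(tag.text.strip())  # Springer; 1st ed. 2020 edition (April 27, 2020)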
(May-15-2022, 02:27 PM)snippsat Wrote: [see the post above]

Thanks.
Well ... the task is a little bit more complicated.
First, in the given BS object I have to find the section that contains "Publisher" (in the original HTML it is not the first one).
Then, once the "Publisher" section is found, I have to find the "value" section associated with it, i.e. the section that contains "Springer ..."
You can do a text search; once the string is found, the next tag will be the Springer tag.
>>> import re
>>> tag = soup.find(string=re.compile('Publisher'))
>>> tag
'\n      Publisher\n    '
>>> tag.find_next()
<span>
      Springer; 1st ed. 2020 edition (April 27, 2020)
    </span>
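Building on that, a small sketch of a reusable helper under the same assumption (the name get_detail is hypothetical; soup is parsed from the detail list):

import re

def get_detail(soup, label):
    """Return the stripped text of the tag that follows the given label, or None."""
    hit = soup.find(string=re.compile(label))
    if hit is None:
        return None
    # The value sits in the next <span> after the label's text node.
    return hit.find_next('span').get_text(strip=True)

# Usage, assuming `soup` was built from the <ul> in the first post:
# print(get_detail(soup, 'Publisher'))  # Springer; 1st ed. 2020 edition (April 27, 2020)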
Hello,

One more problem, related to this topic.
Here is the URL:
https://www.amazon.com/Advanced-Artifici...461&sr=8-1

On this page I try to find the book title.
Here is the fragment of the page where the book title is located:
[Image: amaton-page-exploring.jpg]

I've tried multiple ways to find it (e.g. the snippet below), but all of them failed.
    for tag in soup.find_all('span', class_='a-size-extra-large'):
        print(tag)
Any suggestions?
Thanks.
If you set a User-Agent with Requests (and use it with BS), it did work one or two times, then Amazon locked it out.
Quote: To discuss automated access to Amazon data please contact [email protected].
So with a site like this you usually have to use other methods like Selenium, or look at what their API can give back.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
# pip install webdriver-manager
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
import time
import logging
#logging.getLogger('WDM').setLevel(logging.NOTSET)

#--| Setup
options = Options()
options.add_argument("--headless")
options.add_argument("--window-size=1920,1080")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
#--| Parse or automation
browser.get("https://www.amazon.com/Advanced-Artificial-Intelligence-Robo-Justice-Georgios-ebook/dp/B0B1H2MZKX/ref=sr_1_1?keywords=9783030982058&qid=1653563461&sr=8-1")
time.sleep(3)
title = browser.find_element(By.CSS_SELECTOR, '#productTitle')
print(title.text)
Output:
Advanced Artificial Intelligence and Robo-Justice
(May-27-2022, 10:42 AM)snippsat Wrote: If you set a User-Agent with Requests (and use it with BS), it did work one or two times, then Amazon locked it out.
Quote: To discuss automated access to Amazon data please contact [email protected].

Thanks. I'll try.
By suggesting Selenium, do you mean that BeautifulSoup is not capable of handling such tasks?
Please note that I actually have no problem with the Amazon lock.
Every time I make a request on the Amazon site, I check the return status code.
When it works, this code is 200.
In my experience, to get locked you have to make about 60...100 requests from the same IP.
The lock is held for about 2 hours, then it releases.
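If the goal is simply to stay under that threshold, a minimal sketch of spacing requests out (the 5-15 second delay is a guess, not a documented limit):

import random
import time

import requests

user_agent = {'User-agent': 'Mozilla/5.0'}
urls = []  # Assumption: your own list of product-page URLs goes here.

for url in urls:
    response = requests.get(url, headers=user_agent)
    # ... parse response.content with BeautifulSoup here ...
    # Sleep between requests to stay well below the ~60-100 that triggered the lock.
    time.sleep(random.uniform(5, 15))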
By using Requests and BS you get 200 back, but you have to look at the content.
There you see that you get detected and access is denied.
import requests
from bs4 import BeautifulSoup

user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'}
url = 'https://www.amazon.com/Advanced-Artificial-Intelligence-Robo-Justice-Georgios-ebook/dp/B0B1H2MZKX/ref=sr_1_1?keywords=9783030982058&qid=1653563461&sr=8-1'
response = requests.get(url, headers=user_agent)
soup = BeautifulSoup(response.content, 'lxml')
title = soup.select_one('#productTitle')
print(response.status_code)
print(title)
print('-' * 20)
print(soup.find('body'))
Output:
200
None
--------------------
<body> <!-- To discuss automated access to Amazon data please contact [email protected]. For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_c_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases. --> <!-- Correios.DoNotSend --> .....
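A minimal sketch of checking for that soft block before trusting the parse (using the marker string from the output above; treating its presence as "blocked" is an assumption):

import requests
from bs4 import BeautifulSoup

user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'}
url = 'https://www.amazon.com/Advanced-Artificial-Intelligence-Robo-Justice-Georgios-ebook/dp/B0B1H2MZKX/ref=sr_1_1?keywords=9783030982058&qid=1653563461&sr=8-1'

response = requests.get(url, headers=user_agent)
# The status code is still 200, so look for the bot-detection comment in the body instead.
if 'To discuss automated access to Amazon data' in response.text:
    print('Soft-blocked by Amazon: fall back to Selenium or their API.')
else:
    soup = BeautifulSoup(response.content, 'lxml')
    title = soup.select_one('#productTitle')
    print(title.get_text(strip=True) if title else 'No title found')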