Extract tag body without leading/trailing spaces

Extract tag body without leading/trailing spaces - Pavel_47 - Nov-25-2019

Hello,
Executing this code:

import requests
from bs4 import BeautifulSoup

isbn = 9783319988412
url = f'http://www.amazon.com/s?k={isbn}&ref=nb_sb_noss'
 
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,*/*',
    'Accept-Language': 'en,en-US;q=0.7,en;q=0.3',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive'}
 
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')
 
# using find
a = soup.find('a', {'class': 'a-link-normal a-text-normal'})
print(a.get('href'))

url = 'https://www.amazon.com' + a.get('href')
print(url)
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')
span = soup.find('span', {'id': 'ebooksProductTitle'})
print(span.text)

results in this:

Output:
/Cybersecurity-Intelligent-Systems-Reference-Library-ebook/dp/B07FKL34B8/ref=sr_1_1?keywords=9783319988412&qid=1574687282&sr=8-1
https://www.amazon.com/Cybersecurity-Intelligent-Systems-Reference-Library-ebook/dp/B07FKL34B8/ref=sr_1_1?keywords=9783319988412&qid=1574687282&sr=8-1

                
                    
                    
                        AI in Cybersecurity (Intelligent Systems Reference Library Book 151)

$
How can I avoid the leading/trailing whitespace before/after "AI in Cybersecurity (Intelligent Systems Reference Library Book 151)"?
Thanks.

RE: Extract tag body without leading/trailing spaces - snippsat - Nov-25-2019

Use strip().

print(span.text.strip())
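
To see what strip() does without hitting Amazon, here is a minimal sketch. The string raw_title is a hypothetical stand-in for span.text, built to resemble the output in the question; it is not fetched from the page.

```python
# Hypothetical stand-in for span.text from the question: the title wrapped
# in the newlines and indentation that the page markup carries.
raw_title = (
    "\n\n                \n                    \n"
    "                        AI in Cybersecurity "
    "(Intelligent Systems Reference Library Book 151)\n\n"
)

# str.strip() with no arguments removes leading and trailing whitespace,
# which includes spaces, tabs and newlines.
title = raw_title.strip()

# If runs of whitespace inside the text also need collapsing,
# split() followed by join() reduces each run to a single space.
normalized = " ".join(raw_title.split())

print(repr(title))
print(repr(normalized))
```

With BeautifulSoup, span.get_text(strip=True) should give the same result for a tag that holds a single text node, since it strips each string before joining them.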

RE: Extract tag body without leading/trailing spaces - buran - Nov-25-2019

Now, the question here is: do you need to make yet another request to get the book title? And the answer is NO. The link found on the search results page already carries the title as its text.

import requests
from bs4 import BeautifulSoup
 
isbn = 9783319988412
url = f'http://www.amazon.com/s?k={isbn}&ref=nb_sb_noss'
  
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,*/*',
    'Accept-Language': 'en,en-US;q=0.7,en;q=0.3',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive'}
  
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')
  
# using find
a = soup.find('a', {'class': 'a-link-normal a-text-normal'})
print(a.get('href'))
print(a.text.strip())

RE: Extract tag body without leading/trailing spaces - Pavel_47 - Nov-26-2019

The problem is that the page returned by the 1st URL doesn't contain all the information; e.g. there is no publisher.
That's why I proceed in two steps. The 1st request searches for the book by its ISBN; on the results page I look for the link to the page dedicated to that particular book (note that the 1st request doesn't return a "dedicated" page; other books are listed on it as well). Then, using the 2nd URL (the book's dedicated page), I extract all the necessary info: title, author, publisher, year, ratings ...
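
For the second step, the relative href taken from the search page has to become an absolute URL. The code in the first post does this by string concatenation; a possible alternative is urllib.parse.urljoin from the standard library, sketched below. The names BASE_URL and build_book_url are made up for illustration, and the href is the one printed in the first post.

```python
from urllib.parse import urljoin

# Assumed base, matching the prefix used in the first post.
BASE_URL = "https://www.amazon.com"


def build_book_url(href, base=BASE_URL):
    """Turn the href of a search-result link into an absolute URL.

    urljoin handles both relative hrefs (starting with "/") and hrefs
    that are already absolute, which plain concatenation does not.
    """
    return urljoin(base, href)


# href as printed in the first post
href = (
    "/Cybersecurity-Intelligent-Systems-Reference-Library-ebook/dp/"
    "B07FKL34B8/ref=sr_1_1?keywords=9783319988412&qid=1574687282&sr=8-1"
)

book_url = build_book_url(href)
print(book_url)

# An href that is already absolute is returned unchanged.
print(build_book_url("https://www.amazon.com/dp/B07FKL34B8"))
```

The resulting book_url can then be passed to requests.get() for the 2nd request, exactly as in the original code.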