Extract tag body without leading/trailing spaces

Pavel_47 · (This post was last modified: Nov-25-2019, 03:24 PM by buran.)

Hello,
Executing this code:

import requests
from bs4 import BeautifulSoup

isbn = 9783319988412
url = f'http://www.amazon.com/s?k={isbn}&ref=nb_sb_noss'
 
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,*/*',
    'Accept-Language': 'en,en-US;q=0.7,en;q=0.3',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive'}
 
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')
 
# using find
a = soup.find('a', {'class': 'a-link-normal a-text-normal'})
print(a.get('href'))

url = 'https://www.amazon.com' + a.get('href')
print(url)
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')
span = soup.find('span', {'id': 'ebooksProductTitle'})
print(span.text)

results in this:

Output:
/Cybersecurity-Intelligent-Systems-Reference-Library-ebook/dp/B07FKL34B8/ref=sr_1_1?keywords=9783319988412&qid=1574687282&sr=8-1
https://www.amazon.com/Cybersecurity-Intelligent-Systems-Reference-Library-ebook/dp/B07FKL34B8/ref=sr_1_1?keywords=9783319988412&qid=1574687282&sr=8-1

                
                    
                    
                        AI in Cybersecurity (Intelligent Systems Reference Library Book 151)

$
How can I remove the leading/trailing whitespace around AI in Cybersecurity (Intelligent Systems Reference Library Book 151)?
Thanks.

***snippsat*** · Nov-25-2019, 01:59 PM

Use strip().

print(span.text.strip())
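
To illustrate what strip() does here, below is a minimal sketch using only plain strings. The sample values of raw and messy are made up to resemble the output above; they are not taken from a live page.

```python
# Raw text as it might come back from span.text: the title surrounded by
# newlines and indentation (made-up sample, shaped like the output above).
raw = (
    "\n\n                \n                    \n"
    "                        AI in Cybersecurity "
    "(Intelligent Systems Reference Library Book 151)\n\n"
)

# str.strip() with no arguments removes leading and trailing whitespace,
# including newlines and tabs, and leaves the inner spacing alone.
title = raw.strip()
print(repr(title))

# If the text may also contain runs of whitespace in the middle,
# splitting and re-joining collapses them to single spaces.
messy = "  AI in\n   Cybersecurity  "
normalized = " ".join(messy.split())
print(repr(normalized))
```

If only one side should be trimmed, lstrip() and rstrip() work the same way on the left and right ends respectively.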

**buran** · (This post was last modified: Nov-25-2019, 03:24 PM by buran.)

Now the question is: do you need to make yet another request to get the book title? The answer is no. The title is already in the text of the link on the search results page:

import requests
from bs4 import BeautifulSoup
 
isbn = 9783319988412
url = f'http://www.amazon.com/s?k={isbn}&ref=nb_sb_noss'
  
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,*/*',
    'Accept-Language': 'en,en-US;q=0.7,en;q=0.3',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive'}
  
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')
  
# using find
a = soup.find('a', {'class': 'a-link-normal a-text-normal'})
print(a.get('href'))
print(a.text.strip())

Pavel_47 · Nov-26-2019, 08:29 AM

The problem is that the page from the first URL doesn't contain all the information; for example, there is no publisher.
That's why I proceed in two steps. The first request searches for the book by its ISBN, and on the resulting HTML page I look for the link to the page dedicated to that particular book (note that the first request doesn't return a dedicated page; other books are listed on it too). Then, using the second URL (the book's dedicated page), I look for all the necessary info: title, author, publisher, year, ratings ...
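
A rough sketch of that two-step flow, run against inline stand-in HTML instead of live requests so it can be tried offline. The helper names first_book_link and book_details, the "publisher" id, and the sample markup are all hypothetical; only the link class and the ebooksProductTitle id come from the code above, and the real Amazon markup may differ or change.

```python
from bs4 import BeautifulSoup

# Stand-in HTML for the two pages (hypothetical, not real Amazon markup).
SEARCH_HTML = """
<div>
  <a class="a-link-normal a-text-normal" href="/Some-Book/dp/B07FKL34B8">
      AI in Cybersecurity
  </a>
</div>
"""

DETAIL_HTML = """
<div>
  <span id="ebooksProductTitle">
      AI in Cybersecurity (Intelligent Systems Reference Library Book 151)
  </span>
  <span id="publisher">  Example Publisher  </span>
</div>
"""


def first_book_link(search_html):
    """Step 1: return the absolute URL of the first result link, or None."""
    soup = BeautifulSoup(search_html, "html.parser")
    a = soup.find("a", {"class": "a-link-normal a-text-normal"})
    if a is None or not a.get("href"):
        return None
    return "https://www.amazon.com" + a["href"]


def book_details(detail_html):
    """Step 2: collect stripped field values from the book's own page."""
    soup = BeautifulSoup(detail_html, "html.parser")

    def text_of(element_id):
        # get_text(strip=True) trims the whitespace around the tag's text;
        # a missing tag yields None instead of raising AttributeError.
        tag = soup.find(id=element_id)
        return tag.get_text(strip=True) if tag is not None else None

    return {
        "title": text_of("ebooksProductTitle"),
        "publisher": text_of("publisher"),  # hypothetical id
    }


print(first_book_link(SEARCH_HTML))
print(book_details(DETAIL_HTML))
```

In the real script, the two HTML strings would be the resp.text of the first and second requests.get() calls. Checking for None before reading attributes avoids a crash when the search page comes back without the expected link.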
