Python Forum

Extract tag body without leading/trailing spaces
Hello,
Running this code:
import requests
from bs4 import BeautifulSoup

isbn = 9783319988412
url = f'http://www.amazon.com/s?k={isbn}&ref=nb_sb_noss'
 
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,*/*',
    'Accept-Language': 'en,en-US;q=0.7,en;q=0.3',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive'}
 
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')
 
# using find: take the first result link on the search page
a = soup.find('a', {'class': 'a-link-normal a-text-normal'})
print(a.get('href'))

url = 'https://www.amazon.com' + a.get('href')
print(url)
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')
span = soup.find('span', {'id': 'ebooksProductTitle'})
print(span.text)
results in this:

Output:
/Cybersecurity-Intelligent-Systems-Reference-Library-ebook/dp/B07FKL34B8/ref=sr_1_1?keywords=9783319988412&qid=1574687282&sr=8-1
https://www.amazon.com/Cybersecurity-Intelligent-Systems-Reference-Library-ebook/dp/B07FKL34B8/ref=sr_1_1?keywords=9783319988412&qid=1574687282&sr=8-1
AI in Cybersecurity (Intelligent Systems Reference Library Book 151)
$
How can I remove the leading/trailing whitespace around "AI in Cybersecurity (Intelligent Systems Reference Library Book 151)"?
Thanks.
Use strip().
print(span.text.strip())
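To show what strip() does here without hitting the site, below is a minimal sketch. The string raw_title is an assumption: it only simulates the kind of padded text that span.text might return, since the exact whitespace on the real page is not shown in this thread.

```python
# Simulated value of span.text (assumption: the real page pads the title
# with newlines and spaces; the exact whitespace is not known).
raw_title = "\n\n        AI in Cybersecurity (Intelligent Systems Reference Library Book 151)\n\n    "

# str.strip() removes whitespace (spaces, tabs, newlines) from both ends only.
clean = raw_title.strip()
print(repr(clean))

# If the text also contains line breaks or runs of spaces in the middle,
# split() + join() collapses every whitespace run into a single space.
collapsed = " ".join("AI in\n   Cybersecurity".split())
print(repr(collapsed))
```

With BeautifulSoup itself, span.get_text(strip=True) should give a similar result to span.text.strip() for a tag that holds a single text node.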
Now, the question is whether you need to make yet another request just to get the book title. The answer is no: the link on the search results page already contains it.
import requests
from bs4 import BeautifulSoup
 
isbn = 9783319988412
url = f'http://www.amazon.com/s?k={isbn}&ref=nb_sb_noss'
  
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,*/*',
    'Accept-Language': 'en,en-US;q=0.7,en;q=0.3',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive'}
  
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')
  
# using find: the first result link also carries the book title as its text
a = soup.find('a', {'class': 'a-link-normal a-text-normal'})
print(a.get('href'))
print(a.text.strip())
The problem is that the first URL doesn't contain all the information; for example, there is no publisher.
That's why I proceed in two steps. The first request searches for the book by its ISBN, and on the resulting page I look for the link to the page dedicated to that particular book (note that the first request doesn't return a "dedicated" page; other books are listed on it as well). Then, using the second URL (the book's dedicated page), I extract all the information I need: title, author, publisher, year, ratings, and so on.
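The URL handling for the two steps described above could be sketched like this. The helper names build_search_url and build_product_url are hypothetical, and using https for the search URL is an assumption (the code earlier in the thread uses http). No request is made here; the href is the one printed in the output above.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://www.amazon.com"  # assumption: https for both steps


def build_search_url(isbn):
    """Step 1: URL of the search results page for an ISBN (hypothetical helper)."""
    query = urlencode({"k": isbn, "ref": "nb_sb_noss"})
    return f"{BASE_URL}/s?{query}"


def build_product_url(href):
    """Step 2: turn the href found on the search page into an absolute URL."""
    # urljoin handles both relative hrefs and hrefs that are already absolute.
    return urljoin(BASE_URL, href)


href = ("/Cybersecurity-Intelligent-Systems-Reference-Library-ebook"
        "/dp/B07FKL34B8/ref=sr_1_1?keywords=9783319988412&qid=1574687282&sr=8-1")

search_url = build_search_url(9783319988412)
product_url = build_product_url(href)
print(search_url)
print(product_url)
```

Compared with plain string concatenation, urljoin avoids a doubled host if the site ever returns an absolute link in the href.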