Python Forum
Extract tag body without leading/trailing spaces
#1
Hello,
Executing this code:
import requests
from bs4 import BeautifulSoup

isbn = 9783319988412
url = f'http://www.amazon.com/s?k={isbn}&ref=nb_sb_noss'
 
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,*/*',
    'Accept-Language': 'en,en-US;q=0.7,en;q=0.3',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive'}
 
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')
 
# using find
a = soup.find('a', {'class': 'a-link-normal a-text-normal'})
print(a.get('href'))

url = 'https://www.amazon.com' + a.get('href')
print(url)
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')
span = soup.find('span', {'id': 'ebooksProductTitle'})
print(span.text)
results in this:

Output:
/Cybersecurity-Intelligent-Systems-Reference-Library-ebook/dp/B07FKL34B8/ref=sr_1_1?keywords=9783319988412&qid=1574687282&sr=8-1
https://www.amazon.com/Cybersecurity-Intelligent-Systems-Reference-Library-ebook/dp/B07FKL34B8/ref=sr_1_1?keywords=9783319988412&qid=1574687282&sr=8-1
AI in Cybersecurity (Intelligent Systems Reference Library Book 151)
$
How can I avoid the leading/trailing whitespace that is printed before and after "AI in Cybersecurity (Intelligent Systems Reference Library Book 151)"?
Thanks.
#2
Use strip().
print(span.text.strip())
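To illustrate what strip() does here, below is a minimal sketch using only built-in string methods. The raw_title and messy values are made-up stand-ins for whatever span.text returns on the real page, not the actual page content. strip() only removes whitespace (including newlines) at both ends; if there are also runs of whitespace inside the text, splitting and re-joining is one way to collapse them.

```python
# Hypothetical raw value, standing in for span.text from the book page
raw_title = "\n\n   AI in Cybersecurity (Intelligent Systems Reference Library Book 151)\n  "

# strip() removes leading/trailing whitespace, including newlines
stripped = raw_title.strip()
print(repr(stripped))

# Hypothetical value with whitespace runs inside the text as well
messy = "  AI in\n   Cybersecurity  "

# split() with no arguments splits on any whitespace run and drops empty
# pieces, so joining with a single space normalizes the whole string
normalized = " ".join(messy.split())
print(repr(normalized))
```

BeautifulSoup tags also accept get_text(strip=True), which strips each text fragment inside the tag; for a span with a single text node it should give the same result as span.text.strip().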
#3
Now, the question here is: do you need to make yet another request to get the book title? The answer is no. The link text on the search results page already contains it:
import requests
from bs4 import BeautifulSoup
 
isbn = 9783319988412
url = f'http://www.amazon.com/s?k={isbn}&ref=nb_sb_noss'
  
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,*/*',
    'Accept-Language': 'en,en-US;q=0.7,en;q=0.3',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive'}
  
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')
  
# using find
a = soup.find('a', {'class': 'a-link-normal a-text-normal'})
print(a.get('href'))
print(a.text.strip())
#4
The problem is that the first URL doesn't return all the information; for example, there is no publisher.
That's why I proceed in two steps. The first request searches for the book by its ISBN, and on the resulting HTML page I look for the link to the page dedicated to that particular book (note that the search results page is not a "dedicated" page: other books are listed on it too). Then, using the second URL (the book's dedicated page), I extract all the necessary info: title, author, publisher, year, ratings ...
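For the two-step approach, here is a rough sketch of the two small pieces that tend to go wrong: building the second URL from the href, and reading text from a tag that may not exist. The helper names book_page_url and clean_text are hypothetical, and the sketch makes no request; it only shows the URL and text handling. It relies on the fact that soup.find() returns None when nothing matches, which would make span.text raise an AttributeError.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.amazon.com"


def book_page_url(href):
    """Turn the href taken from the search results into an absolute URL.

    urljoin handles both relative hrefs ("/.../dp/...") and hrefs that
    are already absolute, unlike plain string concatenation.
    """
    return urljoin(BASE_URL, href)


def clean_text(tag):
    """Return the stripped text of a tag, or None if find() found nothing."""
    return tag.text.strip() if tag is not None else None


# href as printed in the first post
href = ("/Cybersecurity-Intelligent-Systems-Reference-Library-ebook/dp/"
        "B07FKL34B8/ref=sr_1_1?keywords=9783319988412&qid=1574687282&sr=8-1")
print(book_page_url(href))

# A missing element no longer crashes the script
print(clean_text(None))
```

With these in place, the second step would look like clean_text(soup.find('span', {'id': 'ebooksProductTitle'})), and the same helper could be reused for author, publisher and the other fields once their selectors are known.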