Extract tag body without leading/trailing spaces
Python Forum (https://python-forum.io), Web Scraping & Web Development, thread /thread-22746.html
Extract tag body without leading/trailing spaces - Pavel_47 - Nov-25-2019

Hello,

Executing this code:

    import requests
    from bs4 import BeautifulSoup

    isbn = 9783319988412
    url = f'http://www.amazon.com/s?k={isbn}&ref=nb_sb_noss'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
        'Accept': 'text/html,*/*',
        'Accept-Language': 'en,en-US;q=0.7,en;q=0.3',
        'X-Requested-With': 'XMLHttpRequest',
        'Connection': 'keep-alive'}

    # 1st request: search results page for the ISBN
    resp = requests.get(url, headers=headers)
    soup = BeautifulSoup(resp.text, 'lxml')

    # using find
    a = soup.find('a', {'class': 'a-link-normal a-text-normal'})
    print(a.get('href'))

    # 2nd request: the page dedicated to the book
    url = 'https://www.amazon.com' + a.get('href')
    print(url)
    resp = requests.get(url, headers=headers)
    soup = BeautifulSoup(resp.text, 'lxml')
    span = soup.find('span', {'id': 'ebooksProductTitle'})
    print(span.text)

prints the title surrounded by leading/trailing whitespace.

How can I avoid the leading/trailing spaces before/after "AI in Cybersecurity (Intelligent Systems Reference Library Book 151)"?

Thanks.

RE: Extract tag body without leading/trailing spaces - snippsat - Nov-25-2019

Use strip().

    print(span.text.strip())

RE: Extract tag body without leading/trailing spaces - buran - Nov-25-2019

Now, the question here is: do you need to make yet another request to get the book title?
And the answer is: NO.

    import requests
    from bs4 import BeautifulSoup

    isbn = 9783319988412
    url = f'http://www.amazon.com/s?k={isbn}&ref=nb_sb_noss'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
        'Accept': 'text/html,*/*',
        'Accept-Language': 'en,en-US;q=0.7,en;q=0.3',
        'X-Requested-With': 'XMLHttpRequest',
        'Connection': 'keep-alive'}

    resp = requests.get(url, headers=headers)
    soup = BeautifulSoup(resp.text, 'lxml')

    # using find
    a = soup.find('a', {'class': 'a-link-normal a-text-normal'})
    print(a.get('href'))
    # the link text on the search page already holds the title
    print(a.text.strip())

RE: Extract tag body without leading/trailing spaces - Pavel_47 - Nov-26-2019

The problem is that the 1st URL doesn't contain all the information, e.g. there is no publisher. That's why I proceed in two steps. With the 1st request the book is searched by its ISBN, and on the HTML page that opens I look for the link to the page dedicated to that particular book (note that the 1st request doesn't return a "dedicated" page; there are also other books on it). Then, using the 2nd URL (the book's dedicated page), I extract all the necessary info: title, author, publisher, year, ratings ...
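
Since the second step pulls several fields (title, author, publisher, ...) from the dedicated page, a small cleanup helper may save repeating strip() everywhere. The sketch below is only an illustration: clean_text is a hypothetical helper name, and the sample string is a made-up stand-in for span.text, not real output from the page. It relies on str.split() with no arguments, which splits on any run of whitespace, so it also collapses newlines or repeated spaces inside the text, something strip() alone does not do.

```python
def clean_text(raw: str) -> str:
    """Strip leading/trailing whitespace and collapse internal runs to one space."""
    # str.split() with no arguments splits on any whitespace run and drops
    # empty strings, so joining with a single space normalizes the text.
    return " ".join(raw.split())


# Hypothetical sample standing in for span.text from the book page.
sample = "\n\n   AI in Cybersecurity (Intelligent Systems Reference Library Book 151)\n\n"

print(repr(sample.strip()))      # only the ends are trimmed
print(repr(clean_text(sample)))  # ends trimmed and inner whitespace collapsed
```

With BeautifulSoup itself, span.get_text(strip=True) gives a similar result for a simple tag; for tags with nested children, span.get_text(' ', strip=True) keeps a space between the pieces. Note also that soup.find() returns None when nothing matches, so checking the result before calling .text avoids an AttributeError if the page layout differs from what is assumed here.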