Hi All,
I am fetching product descriptions (with HTML tags) from a site using BeautifulSoup and Python 3.6.
My code is below:
import logging

import requests
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)


def get_soup(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            html = response.content
            return BeautifulSoup(html, "html.parser")
    except Exception as ex:
        print("error from " + url + ": " + str(ex))


def get_product_details(url):
    try:
        soup = get_soup(url)
        prod_details = {}
        # Every <ul> that is a later sibling of a <p>
        desc_list = soup.select('p ~ ul')
        prod_details['description'] = ''.join([str(i) for i in desc_list])
        return prod_details
    except Exception as ex:
        logger.warning('%s - %s', ex, url)


if __name__ == '__main__':
    print("product1 description:")
    get_product_details("http://www.aprisin.com.sg/p-748-littletikespoptunesguitar.html")
    print("product2 description:")
    get_product_details("http://www.aprisin.com.sg/p-1052-172083littletikesclassiccastle.html")
The problem with the code above is that it cannot fetch the description for some product URLs.
For example, the product2 description comes back blank. Sample output:
product1 description:
<ul>
<li>Freestyle</li>
<li>Play along with 5 pre-set tunes: </li>
</ul><ul>
<li>Each string will play a note</li>
<li>Guitar has a whammy bar</li>
<li>2-in-1 volume control and power button </li>
<li>Simple and easy to use </li>
<li>Helps develop music appreciation </li>
<li>Requires 3 "AA" alkaline batteries (included)</li>
</ul>
product2 description:
So what changes do I need to make so that it works for all types of product?
Here bs4 is inserting closing tags before the end of the HTML source.
So I did this. It's only an idea:
def get_product_details(url):
    try:
        soup = get_soup(url)
        desc_list = []
        # Get the product Name before the closing tags inserted by bs4
        desc_list.append(soup.select('#detail-right'))
        start_tag = '<input id="itemPrice1" name="nuPrice1" type="hidden" value=""/>'
        end_tag = '<div id="qty">'
        # Loop through tags and append those between start_tag and end_tag
        flag_append = False
        for content in soup.findAll():
            if start_tag in str(content):
                flag_append = True
            if end_tag in str(content):
                break
            if flag_append:
                desc_list.append(content.contents)
        prod_details = {}
        prod_details['description'] = ''.join([str(i) for i in desc_list])
        print(prod_details)
        return prod_details
    except Exception as ex:
        logger.warning('%s - %s', ex, url)
The problem is with the way you try to select the description. Note that even the one that you *think* works is in fact incomplete.
Implement the following, which may be more successful:
1. find div with id="detail-right"
2. iterate over child tags until div with id="qty". Process each p or ul tag accordingly
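The two steps above can be sketched roughly as follows. This is only an illustration: SAMPLE_HTML is a made-up, well-formed stand-in for the real page, and collect_description is a hypothetical helper name, not something from the original code. It also assumes the description tags really are children of div#detail-right in the parsed tree.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, well-formed stand-in for the product page markup.
SAMPLE_HTML = """
<div id="detail-right">
  <h1 id="detail-name">Sample product</h1>
  <input id="itemPrice1" name="nuPrice1" type="hidden" value=""/>
  <p>Features:</p>
  <ul><li>First feature</li></ul>
  <ul><li>Second feature</li></ul>
  <div id="qty">Quantity</div>
  <ul><li>Not part of the description</li></ul>
</div>
"""


def collect_description(soup):
    """Return the p/ul children of #detail-right that precede div#qty, as HTML."""
    container = soup.find("div", id="detail-right")  # step 1
    if container is None:
        return ""
    parts = []
    for child in container.children:  # step 2
        if not isinstance(child, Tag):
            continue  # skip whitespace and other text nodes
        if child.name == "div" and child.get("id") == "qty":
            break  # stop at the quantity block
        if child.name in ("p", "ul"):
            parts.append(str(child))
    return "".join(parts)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
description = collect_description(soup)
print(description)
```

If the parser closes div#detail-right early, the loop will find nothing, so it is worth checking what the container actually holds before relying on this.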
Quote:
1. find div with id="detail-right"
2. iterate over child tags until div with id="qty". Process each p or ul tag accordingly
I tried this, but with the extra closing tags that end up in the soup I still can't get the right tags' contents.
(Jun-28-2018, 11:57 AM)gontajones Wrote: Here bs4 is inserting closing tags before the end of the HTML source.
There seems to be a problem here: when I ran your code, no description was returned for any product!
The complete code:
import logging

import requests
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)


def get_soup(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            html = response.content
            return BeautifulSoup(html, "html.parser")
    except Exception as ex:
        print("error from " + url + ": " + str(ex))


def get_product_details(url):
    try:
        soup = get_soup(url)
        desc_list = []
        # Get the product Name before the closing tags inserted by bs4
        desc_list.append(soup.select('#detail-right'))
        start_tag = '<input id="itemPrice1" name="nuPrice1" type="hidden" value=""/>'
        end_tag = '<div id="qty">'
        # Loop through tags and append those between start_tag and end_tag
        flag_append = False
        for content in soup.findAll():
            if start_tag in str(content):
                flag_append = True
            if end_tag in str(content):
                break
            if flag_append:
                desc_list.append(content.contents)
        prod_details = {}
        prod_details['description'] = ''.join([str(i) for i in desc_list])
        for item in desc_list:
            if item:
                print(item)
        return prod_details
    except Exception as ex:
        logger.warning('%s - %s', ex, url)


if __name__ == '__main__':
    print("product1 description:")
    get_product_details("http://www.aprisin.com.sg/p-748-littletikespoptunesguitar.html")
    print("\n\nproduct2 description:")
    get_product_details("http://www.aprisin.com.sg/p-1052-172083littletikesclassiccastle.html")
You'll have to parse every single line (item) to extract only the text from the HTML tags.
BTW, remember that this is a poor solution. The right one would be to get BeautifulSoup to parse the HTML source without the extra closing tags.
Here, if I print the content of soup
soup = get_soup(url)
print(soup)
I'm getting this (just one part of the soup content):
Output:
<div id="detail-right">
<h1 id="detail-name">LIttle Tikes PopTunes™ GUITAR </h1>
<span style="color: #000; font-size: 12px; font-weight: normal;">Product Code : LT636226</span><br/> <!-- price update attributes begin -->
<span class="price"><span class="linethrough">S$49.00</span> S$39.00</span></div></div></div></form></div></div></div></div></div></body></html>
bs4 is closing the <body> and <html> tags in the middle of the original HTML source code.
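One thing that might be worth trying is a different parser, since BeautifulSoup can also use lxml or html5lib when they are installed, and the parsers differ in how they repair broken markup. A rough sketch for comparing them is below. BROKEN_HTML is a made-up snippet, not the real page, and uls_inside_body is a hypothetical helper; whether another parser actually fixes this site is an assumption that needs checking against the real source.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical malformed markup: the closing tags arrive before the list.
BROKEN_HTML = (
    '<html><body><div id="detail-right"><h1>Sample</h1>'
    '</div></body></html>'
    '<ul><li>Feature</li></ul>'
)


def uls_inside_body(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each installed parser to the number of <ul> tags it keeps inside <body>."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            continue  # this parser is not installed, skip it
        body = soup.body
        counts[name] = len(body.find_all("ul")) if body else 0
    return counts


result = uls_inside_body(BROKEN_HTML)
print(result)
```

A parser that reports the list inside <body> is keeping the description within the page structure, which makes selectors based on the container usable again.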
(Jun-29-2018, 09:52 AM)gontajones Wrote: The bs4 is closing the <body> and <html> in the middle of the original HTML source code.
Please note that the product description I am fetching is in the red highlighted area of the attached screenshot. I don't need the name, product code, or price in it; just the description with HTML tags.
Try this:
def get_product_details(url):
    try:
        soup = get_soup(url)
        desc_list = soup.select('input ~ ul')
        prod_details = {}
        prod_details['description'] = ''.join([str(i) for i in desc_list])
        print(prod_details)
        return prod_details
    except Exception as ex:
        logger.warning('%s - %s', ex, url)
More robust:
desc_list = soup.select('#itemPrice1 ~ ul')
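To show why the id-based selector is the safer of the two, here is a small sketch on a made-up fragment (SAMPLE_HTML and the extra "search" input are assumptions for illustration, not taken from the real page). "A ~ B" matches every B that is a later sibling of A, so any other input on the same level widens the match.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two inputs and three lists share one parent.
SAMPLE_HTML = """
<div>
  <input id="search" type="text"/>
  <ul><li>Before the price input</li></ul>
  <input id="itemPrice1" name="nuPrice1" type="hidden" value=""/>
  <p>Features:</p>
  <ul><li>First feature</li></ul>
  <ul><li>Second feature</li></ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only the lists that follow the hidden price input.
by_id = soup.select("#itemPrice1 ~ ul")
# Every list that follows any input, including the unrelated one.
by_tag = soup.select("input ~ ul")

description = "".join(str(ul) for ul in by_id)
print(len(by_id), len(by_tag))  # prints "2 3"
print(description)
```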
(Jun-29-2018, 11:14 AM)gontajones Wrote: More robust: desc_list = soup.select('#itemPrice1 ~ ul')
It's strange: with this one I am also getting a blank description.
Output:
product1 description:
product2 description:
Process finished with exit code 0
Here it is returning this:
Output:
product1 description:
{'description': '<ul>\n<li>Freestyle</li>\n<li>Play along with 5 pre-set tunes:\xa0</li>\n</ul><ul>\n<li>Each string will play a note</li>\n<li>Guitar has a whammy bar</li>\n<li>2-in-1 volume control and power button\xa0</li>\n<li>Simple and easy to use\xa0</li>\n<li>Helps develop music appreciation\xa0</li>\n<li>Requires 3 "AA" alkaline batteries (included)</li>\n</ul>'}
product2 description:
{'description': '<ul>\n<li>Authentic castle design features realistic stone facade</li>\n<li>Lookout tower with platform</li>\n<li>This pretend castle has a secret crawl-through door behind a pretend fireplace for kids to crawl in and out</li>\n<li>Castle climber also has a built-in slide for quick exits</li>\n<li>When assembled:</li>\n<ul>\n<li>Height(cm): 139</li>\n<li>Width(cm): 182</li>\n<li>Depth(cm): 156</li>\n</ul>\n<li>Battery: n/a</li>\n</ul>'}
Check what is coming in your soup variable:
soup = get_soup(url)
print(soup)
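Printing the whole soup can be a lot to read, so a small summary might help narrow things down. The sketch below uses a hypothetical helper, describe_soup, and a made-up sample page. Note that get_soup() as posted returns None when the status code is not 200 or when the request raises, so that case is handled explicitly.

```python
from bs4 import BeautifulSoup


def describe_soup(soup):
    """Summarize what a soup holds (hypothetical diagnostic helper)."""
    if soup is None:
        # get_soup() falls through and returns None on a non-200 response
        # or after a caught exception.
        return {"parsed": False, "ul_count": 0, "has_item_price": False}
    return {
        "parsed": True,
        "ul_count": len(soup.find_all("ul")),
        "has_item_price": soup.find(id="itemPrice1") is not None,
    }


# Hypothetical stand-in for a fetched page.
sample = BeautifulSoup(
    '<input id="itemPrice1" type="hidden"/><ul><li>Feature</li></ul>',
    "html.parser",
)
print(describe_soup(sample))
print(describe_soup(None))
```

If "parsed" is False, the request itself is the problem rather than the selector; if "ul_count" is zero, the server is probably sending different HTML than the browser shows.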