Jun-29-2018, 09:52 AM
The complete code:
    import logging

    import requests
    from bs4 import BeautifulSoup

    logger = logging.getLogger(__name__)


    def get_soup(url):
        # Returns None on a non-200 status or a request error
        try:
            response = requests.get(url)
            if response.status_code == 200:
                html = response.content
                return BeautifulSoup(html, "html.parser")
        except Exception as ex:
            print("error from " + url + ": " + str(ex))


    def get_product_details(url):
        try:
            soup = get_soup(url)
            desc_list = []
            # Get the product name before the closing tags inserted by bs4
            desc_list.append(soup.select('#detail-right'))
            start_tag = '<input id="itemPrice1" name="nuPrice1" type="hidden" value=""/>'
            end_tag = '<div id="qty">'
            # Loop through tags and append those between start_tag and end_tag
            flag_append = False
            for content in soup.find_all():
                if start_tag in str(content):
                    flag_append = True
                if end_tag in str(content):
                    break
                if flag_append:
                    desc_list.append(content.contents)
            prod_details = {}
            prod_details['description'] = ''.join([str(i) for i in desc_list])
            for item in desc_list:
                if item:
                    print(item)
            return prod_details
        except Exception as ex:
            logger.warning('%s - %s', ex, url)


    if __name__ == '__main__':
        print("product1 description:")
        get_product_details("http://www.aprisin.com.sg/p-748-littletikespoptunesguitar.html")
        print("\n\nproduct2 description:")
        get_product_details("http://www.aprisin.com.sg/p-1052-172083littletikesclassiccastle.html")

You'll have to parse every single line (item) to extract only the text from the HTML tags.
BTW, remember that this is a poor solution. The right one would be to get BeautifulSoup to parse the HTML source without the extra closing tags.
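For the text-extraction step, something along these lines might work. It is only a sketch: items_to_text is a made-up helper name, the sample markup is invented rather than taken from the real page, and it assumes each item in desc_list is a list of bs4 nodes, as in the code above.

```python
from bs4 import BeautifulSoup, Comment, Tag


def item_to_text(item):
    """Join the visible text of the nodes in one item (a list of bs4 nodes)."""
    parts = []
    for node in item:
        if isinstance(node, Comment):
            # Skip HTML comments such as <!-- price update attributes begin -->
            continue
        if isinstance(node, Tag):
            parts.append(node.get_text(" ", strip=True))
        else:
            # Plain strings between tags
            parts.append(str(node).strip())
    return " ".join(part for part in parts if part)


def items_to_text(items):
    """Return one text line per non-empty item."""
    lines = []
    for item in items:
        text = item_to_text(item)
        if text:
            lines.append(text)
    return lines


# Invented sample shaped like desc_list (not the real page source)
sample = BeautifulSoup(
    '<div id="detail-right"><h1 id="detail-name">Guitar</h1>'
    '<span>Product Code : LT636226</span></div>',
    "html.parser",
)
desc_list = [sample.select("#detail-right"), []]
print(items_to_text(desc_list))
```

In get_product_details this could replace the final print loop, with the returned lines joined into prod_details['description'] instead of the raw markup.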
Here, if I print the content of soup:

    soup = get_soup(url)
    print(soup)

I'm getting this (just one part of the soup content):
Output:
<div id="detail-right">
<h1 id="detail-name">LIttle Tikes PopTunes™ GUITAR </h1>
<span style="color: #000; font-size: 12px; font-weight: normal;">Product Code : LT636226</span><br/> <!-- price update attributes begin -->
<span class="price"><span class="linethrough">S$49.00</span> S$39.00</span></div></div></div></form></div></div></div></div></div></body></html>
bs4 is closing the <body> and <html> tags in the middle of the original HTML source code.
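To narrow down where the early close comes from, it may help to look at the raw source before it is parsed. The sketch below uses only the standard library; closing_tag_offsets is a made-up name and the raw string is an invented stand-in for response.text. If the closing tags show up only at the very end of the raw text, the early close is probably being produced during parsing rather than being present in the page itself.

```python
import re


def closing_tag_offsets(html, names=("body", "html")):
    """Map each tag name to the character offsets of its closing tags in html."""
    offsets = {}
    for name in names:
        pattern = re.compile(r"</\s*" + re.escape(name) + r"\s*>", re.IGNORECASE)
        offsets[name] = [match.start() for match in pattern.finditer(html)]
    return offsets


# Invented raw source; with the code above this would be response.text
raw = '<html><body><div>price</div></div><div id="qty">1</div></body></html>'
print(len(raw), closing_tag_offsets(raw))
```

Comparing the reported offsets with len(raw) shows how far from the end of the document each closing tag sits.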