how to make my product description fetching function generic? - Printable Version
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development
Thread: how to make my product description fetching function generic? (/thread-11209.html)
how to make my product description fetching function generic? - PrateekG - Jun-28-2018

Hi All,
I am fetching a product description (with HTML tags) from a site using BeautifulSoup and Python 3.6. My code is below:

    import logging

    import requests
    from bs4 import BeautifulSoup

    logger = logging.getLogger(__name__)


    def get_soup(url):
        # Returns a parsed page, or None if the request fails or is not HTTP 200.
        try:
            response = requests.get(url)
            if response.status_code == 200:
                html = response.content
                return BeautifulSoup(html, "html.parser")
        except Exception as ex:
            print("error from " + url + ": " + str(ex))


    def get_product_details(url):
        try:
            soup = get_soup(url)
            prod_details = {}
            # Every <ul> that follows a <p> under the same parent.
            desc_list = soup.select('p ~ ul')
            prod_details['description'] = ''.join([str(i) for i in desc_list])
            return prod_details
        except Exception as ex:
            logger.warning('%s - %s', ex, url)


    if __name__ == '__main__':
        print("product1 description:")
        get_product_details("http://www.aprisin.com.sg/p-748-littletikespoptunesguitar.html")
        print("product2 description:")
        get_product_details("http://www.aprisin.com.sg/p-1052-172083littletikesclassiccastle.html")

The problem with the code above is that it cannot fetch the description for some product URLs. For example, the product2 description comes back blank.

Sample output:

    product1 description:
    <ul>
    <li>Freestyle</li>
    <li>Play along with 5 pre-set tunes: </li>
    </ul><ul>
    <li>Each string will play a note</li>
    <li>Guitar has a whammy bar</li>
    <li>2-in-1 volume control and power button </li>
    <li>Simple and easy to use </li>
    <li>Helps develop music appreciation </li>
    <li>Requires 3 "AA" alkaline batteries (included)</li>
    </ul>
    product2 description:

So what changes do I need to make so that it works for all types of product?

RE: how to make my product description fetching function generic? - gontajones - Jun-28-2018

Here bs4 is inserting closing tags before the end of the HTML source.
So I did this; it's only an idea:

    def get_product_details(url):
        try:
            soup = get_soup(url)
            desc_list = []

            # Get the product name before the closing tags inserted by bs4
            desc_list.append(soup.select('#detail-right'))

            start_tag = '<input id="itemPrice1" name="nuPrice1" type="hidden" value=""/>'
            end_tag = '<div id="qty">'

            # Loop through tags and append those between start_tag and end_tag
            flag_append = False
            for content in soup.find_all():
                if start_tag in str(content):
                    flag_append = True
                if end_tag in str(content):
                    break
                if flag_append:
                    desc_list.append(content.contents)

            prod_details = {}
            prod_details['description'] = ''.join([str(i) for i in desc_list])
            print(prod_details)
            return prod_details
        except Exception as ex:
            logger.warning('%s - %s', ex, url)

RE: how to make my product description fetching function generic? - buran - Jun-28-2018

The problem is with the way you try to select the description. Note that even the one that you *think* works is in fact incomplete. Implement the following, which may be more successful:
1. Find the div with id="detail-right".
2. Iterate over its child tags until the div with id="qty". Process each p or ul tag accordingly.

RE: how to make my product description fetching function generic? - gontajones - Jun-28-2018

Quote: 1. find div with id="detail-right"

I tried this, but with the extra closing tags coming into "soup" I still can't get the right tag's contents.

RE: how to make my product description fetching function generic? - PrateekG - Jun-29-2018

(Jun-28-2018, 11:57 AM)gontajones Wrote: Here bs4 is inserting closing tags before the end of the HTML source.

There seems to be a problem here! When I ran your code, no description was returned for any product!

RE: how to make my product description fetching function generic? - gontajones - Jun-29-2018

The complete code:

    import logging

    import requests
    from bs4 import BeautifulSoup

    logger = logging.getLogger(__name__)


    def get_soup(url):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                html = response.content
                return BeautifulSoup(html, "html.parser")
        except Exception as ex:
            print("error from " + url + ": " + str(ex))


    def get_product_details(url):
        try:
            soup = get_soup(url)
            desc_list = []

            # Get the product name before the closing tags inserted by bs4
            desc_list.append(soup.select('#detail-right'))

            start_tag = '<input id="itemPrice1" name="nuPrice1" type="hidden" value=""/>'
            end_tag = '<div id="qty">'

            # Loop through tags and append those between start_tag and end_tag
            flag_append = False
            for content in soup.find_all():
                if start_tag in str(content):
                    flag_append = True
                if end_tag in str(content):
                    break
                if flag_append:
                    desc_list.append(content.contents)

            prod_details = {}
            prod_details['description'] = ''.join([str(i) for i in desc_list])

            # Print only the non-empty pieces that were collected
            for item in desc_list:
                if item:
                    print(item)
            return prod_details
        except Exception as ex:
            logger.warning('%s - %s', ex, url)


    if __name__ == '__main__':
        print("product1 description:")
        get_product_details("http://www.aprisin.com.sg/p-748-littletikespoptunesguitar.html")
        print("\n\nproduct2 description:")
        get_product_details("http://www.aprisin.com.sg/p-1052-172083littletikesclassiccastle.html")

You'll have to parse every single line (item) to extract only the text from the HTML tags.
BTW, remember that this is a poor solution. The right one would be getting BeautifulSoup to parse the HTML source without the extra closing tags.

Here, if I print the content of soup:

    soup = get_soup(url)
    print(soup)

I'm getting this (just one part of the soup content):

bs4 is closing the <body> and <html> in the middle of the original HTML source code.
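
One way to combine buran's start/end idea with the observation that the nesting is broken is to walk the document in source order instead of through child tags. The following is only a sketch: it assumes the `#itemPrice1` input and the `div` with id="qty" mentioned in the posts above are present, while `SAMPLE_HTML` and `extract_description` are made-up names and markup, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the symptom described in the thread:
# the body and html tags close early, so the list ends up outside the div.
SAMPLE_HTML = """
<html><body>
<div id="detail-right">
<h1>Sample product</h1>
<input id="itemPrice1" name="nuPrice1" type="hidden" value=""/>
<p>Intro paragraph</p>
</div></body></html>
<ul><li>Feature one</li><li>Feature two</li></ul>
<div id="qty">Quantity</div>
<ul><li>Unrelated footer list</li></ul>
"""


def extract_description(soup):
    """Collect <p> and <ul> tags between #itemPrice1 and div#qty, in source order."""
    start = soup.select_one("#itemPrice1")
    if start is None:
        return ""
    kept = []
    # find_all_next(True) yields every tag after `start` in document order,
    # regardless of how the parser nested them.
    for tag in start.find_all_next(True):
        if tag.name == "div" and tag.get("id") == "qty":
            break
        if tag.name not in ("p", "ul"):
            continue
        # Skip tags nested inside something already kept, to avoid duplicates.
        if any(parent is k for parent in tag.parents for k in kept):
            continue
        kept.append(tag)
    return "".join(str(tag) for tag in kept)


if __name__ == "__main__":
    soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
    print(extract_description(soup))
```

Because the walk follows document order, it does not matter whether the `<ul>` ended up as a sibling, a child, or a top-level node. Whether the real pages keep these two markers on every product is an assumption that would need checking.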
RE: how to make my product description fetching function generic? - PrateekG - Jun-29-2018

(Jun-29-2018, 09:52 AM)gontajones Wrote: The complete code:

Please note that the product description I am fetching is in the red highlighted area of the attached screenshot. I don't need the name, product code, or price in it; just the description with HTML tags.

RE: how to make my product description fetching function generic? - gontajones - Jun-29-2018

Try this:

    def get_product_details(url):
        try:
            soup = get_soup(url)
            # Every <ul> that follows an <input> under the same parent
            desc_list = soup.select('input ~ ul')
            prod_details = {}
            prod_details['description'] = ''.join([str(i) for i in desc_list])
            print(prod_details)
            return prod_details
        except Exception as ex:
            logger.warning('%s - %s', ex, url)

More robust:

    desc_list = soup.select('#itemPrice1 ~ ul')

RE: how to make my product description fetching function generic? - PrateekG - Jun-29-2018

(Jun-29-2018, 11:14 AM)gontajones Wrote: Try this:

It's strange; with this one I am also getting a blank description.
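
A blank result from `#itemPrice1 ~ ul` is what one would expect if the `<ul>` elements are not siblings of the input in the parsed tree, since `~` only matches later elements under the same parent. Whether that is what happens on the real site is not confirmed here. The sketch below uses two made-up HTML strings and a hypothetical helper, `count_sibling_lists`, just to show the difference:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the list is a later sibling of the input.
SIBLING_HTML = (
    '<div id="detail-right">'
    '<input id="itemPrice1" type="hidden" value=""/>'
    '<ul><li>Feature</li></ul>'
    '</div>'
)

# Hypothetical markup: the container closes before the list,
# so the list is no longer a sibling of the input.
BROKEN_HTML = (
    '<html><body><div id="detail-right">'
    '<input id="itemPrice1" type="hidden" value=""/>'
    '</div></body></html>'
    '<ul><li>Feature</li></ul>'
)


def count_sibling_lists(html):
    """Return how many <ul> tags the sibling selector finds in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select("#itemPrice1 ~ ul"))


if __name__ == "__main__":
    print("sibling case:", count_sibling_lists(SIBLING_HTML))
    print("broken case:", count_sibling_lists(BROKEN_HTML))
```

If the second case matches what the live page looks like after parsing, a sibling selector will stay blank no matter which element it starts from.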
RE: how to make my product description fetching function generic? - gontajones - Jun-29-2018

Here it is returning this:

Check what is coming in your soup variable:

    soup = get_soup(url)
    print(soup)
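
When checking the soup variable, it helps to remember that `get_soup` as posted returns `None` whenever the status code is not 200 or the request raises, so "nothing was fetched" and "fetched, but the selector misses" can look the same from the outside. A small sketch of a report that separates the two; `inspect_soup` and its dictionary keys are hypothetical names, and the markers are the ones quoted earlier in the thread:

```python
from bs4 import BeautifulSoup


def inspect_soup(soup):
    """Summarise what get_soup() returned, without printing the whole page."""
    if soup is None:
        # get_soup() falls through to None on a non-200 status or an exception.
        return {"fetched": False}
    return {
        "fetched": True,
        "has_start_marker": soup.select_one("#itemPrice1") is not None,
        "has_end_marker": soup.select_one("div#qty") is not None,
        "ul_total": len(soup.find_all("ul")),
        "ul_siblings_of_start": len(soup.select("#itemPrice1 ~ ul")),
    }


if __name__ == "__main__":
    # Made-up markup standing in for a fetched page.
    demo = BeautifulSoup(
        '<div><input id="itemPrice1"/><ul><li>x</li></ul><div id="qty"></div></div>',
        "html.parser",
    )
    print(inspect_soup(None))
    print(inspect_soup(demo))
```

If `ul_total` is positive while `ul_siblings_of_start` is zero, the lists are in the page but not where the sibling selector looks for them.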