How to make my product description fetching function generic?

PrateekG · Jun-28-2018, 09:05 AM

Hi All,

I am fetching product descriptions (with HTML tags) from a site using BeautifulSoup and Python 3.6.

My code is below:

import logging

import requests
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)


def get_soup(url):
    # Returns a parsed document, or None if the request fails or is not a 200.
    try:
        response = requests.get(url)
        if response.status_code == 200:
            html = response.content
            return BeautifulSoup(html, "html.parser")
    except Exception as ex:
        print("error from " + url + ": " + str(ex))


def get_product_details(url):
    try:
        soup = get_soup(url)
        prod_details = {}
        # Select every <ul> that follows a sibling <p>
        desc_list = soup.select('p ~ ul')
        prod_details['description'] = ''.join([str(i) for i in desc_list])
        return prod_details
    except Exception as ex:
        logger.warning('%s - %s', ex, url)


if __name__ == '__main__':
    print("product1 description:")
    get_product_details("http://www.aprisin.com.sg/p-748-littletikespoptunesguitar.html")
    print("product2 description:")
    get_product_details("http://www.aprisin.com.sg/p-1052-172083littletikesclassiccastle.html")

The problem with my code above is that it cannot fetch the description for some product URLs.
For example, in the code above the description for product2 is blank. Sample output:

product1 description:
<ul>
<li>Freestyle</li>
<li>Play along with 5 pre-set tunes: </li>
</ul><ul>
<li>Each string will play a note</li>
<li>Guitar has a whammy bar</li>
<li>2-in-1 volume control and power button </li>
<li>Simple and easy to use </li>
<li>Helps develop music appreciation </li>
<li>Requires 3 "AA" alkaline batteries (included)</li>
</ul>
product2 description:

So what changes do I need to make so that it works for all types of products?

gontajones · Jun-28-2018, 11:57 AM

Here bs4 is inserting closing tags before the end of the HTML source.
So I did this; it's only an idea:

def get_product_details(url):

    try:
        soup = get_soup(url)

        desc_list = []
        # Get the product Name before the closing tags inserted by bs4
        desc_list.append(soup.select('#detail-right'))
        start_tag = '<input id="itemPrice1" name="nuPrice1" type="hidden" value=""/>'
        end_tag = '<div id="qty">'

        # Loop through tags and append those between start_tag and end_tag
        flag_append = False
        for content in soup.findAll():
            if(start_tag in str(content)):
                flag_append = True
            if(end_tag in str(content)):
                break
            if(flag_append):
                desc_list.append(content.contents)

        prod_details = {}
        prod_details['description'] = ''.join([str(i) for i in desc_list])
        print(prod_details)
        return prod_details
    except Exception as ex:
        logger.warning('%s - %s', ex, url)

**buran** · Jun-28-2018, 12:11 PM

The problem is with the way you try to select the description. Note that even the one that you *think* works is in fact incomplete.
Implement the following, which may be more successful:
1. Find the div with id="detail-right".
2. Iterate over its child tags until the div with id="qty", processing each p or ul tag accordingly.
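
The two steps above can be sketched as follows. This is only an illustration on made-up markup: SAMPLE_HTML and the helper name extract_description are assumptions, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, simplified markup; not copied from the real product page.
SAMPLE_HTML = """
<div id="detail-right">
  <h1 id="detail-name">Sample product</h1>
  <input id="itemPrice1" type="hidden" value=""/>
  <p>Features:</p>
  <ul><li>First feature</li><li>Second feature</li></ul>
  <div id="qty">Quantity</div>
  <ul><li>Not part of the description</li></ul>
</div>
"""


def extract_description(soup):
    """Collect <p> and <ul> children of #detail-right that come before #qty."""
    container = soup.find("div", id="detail-right")
    if container is None:
        return ""
    parts = []
    for child in container.children:
        if not isinstance(child, Tag):
            continue  # skip whitespace and other text nodes
        if child.name == "div" and child.get("id") == "qty":
            break  # everything from here on is outside the description
        if child.name in ("p", "ul"):
            parts.append(str(child))
    return "".join(parts)


description = extract_description(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(description)
```

This only works when the parser keeps the description and the div with id="qty" inside the container. If the container gets closed early, the loop never reaches the later children.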

gontajones · Jun-28-2018, 12:38 PM

Quote: 1. find div with id="detail-right"
2. iterate over child tags until div with id="qty". Process each p or ul tag accordingly

I tried this, but with the extra closing tags coming into "soup" I still can't get the right tags' contents.

PrateekG · Jun-29-2018, 05:09 AM

(Jun-28-2018, 11:57 AM)gontajones Wrote: Here bs4 are inserting closing tags before the end of the html source.

There seems to be a problem here. When I ran your code, no description was returned for any product!

gontajones · Jun-29-2018, 09:52 AM

The complete code:

import logging

import requests
from bs4 import BeautifulSoup

# logger is used in get_product_details below
logger = logging.getLogger(__name__)


def get_soup(url):

    try:
        response = requests.get(url)
        if response.status_code == 200:
            html = response.content
            return BeautifulSoup(html, "html.parser")
    except Exception as ex:
        print("error from " + url + ": " + str(ex))


def get_product_details(url):

    try:
        soup = get_soup(url)

        desc_list = []
        # Get the product Name before the closing tags inserted by bs4
        desc_list.append(soup.select('#detail-right'))
        start_tag = '<input id="itemPrice1" name="nuPrice1" type="hidden" value=""/>'
        end_tag = '<div id="qty">'

        # Loop through tags and append those between start_tag and end_tag
        flag_append = False
        for content in soup.findAll():
            if(start_tag in str(content)):
                flag_append = True
            if(end_tag in str(content)):
                break
            if(flag_append):
                desc_list.append(content.contents)

        prod_details = {}
        prod_details['description'] = ''.join([str(i) for i in desc_list])
        for item in desc_list:
            if item:
                print(item)
        return prod_details
    except Exception as ex:
        logger.warning('%s - %s', ex, url)


if __name__ == '__main__':
    print("product1 description:")
    get_product_details("http://www.aprisin.com.sg/p-748-littletikespoptunesguitar.html")
    print("\n\nproduct2 description:")
    get_product_details("http://www.aprisin.com.sg/p-1052-172083littletikesclassiccastle.html")

You'll have to parse every single item to extract only the text from the HTML tags.
BTW, remember that this is a poor solution. The right one would be getting BeautifulSoup to parse the HTML source without the extra closing tags.

Here, if I print the content of soup

soup = get_soup(url)
print(soup)

I'm getting this (just one part of the soup content):

Output:
<div id="detail-right">
<h1 id="detail-name">LIttle Tikes PopTunes™ GUITAR                              </h1>
<span style="color: #000; font-size: 12px; font-weight: normal;">Product Code : LT636226</span><br/> <!-- price update attributes begin -->
<span class="price"><span class="linethrough">S$49.00</span> S$39.00</span></div></div></div></form></div></div></div></div></div></body></html>

bs4 is closing the <body> and <html> tags in the middle of the original HTML source code.
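
One way this kind of early closing can happen with "html.parser" is a closing tag that arrives while inner tags are still open. The sketch below uses made-up markup (BROKEN_HTML is an assumption, not the real page source) in which </form> appears before the div is closed. It is not confirmed that the real page has this exact defect.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed markup: </form> arrives while #detail-right is still open.
BROKEN_HTML = (
    '<html><body><form><div id="detail-right"><h1>Sample product</h1>'
    '<input id="itemPrice1" type="hidden" value=""/></form>'
    '<ul><li>Feature</li></ul><div id="qty"></div></div></body></html>'
)

soup = BeautifulSoup(BROKEN_HTML, "html.parser")

# The <ul> was written inside the div, but the early </form> closed the div too.
inside = soup.select("#detail-right ul")
# In the parsed tree the <ul> ends up as a later sibling of the form instead.
after_form = soup.select("form ~ ul")

print(len(inside), len(after_form))
print(soup)
```

Printing the soup, as above, shows where the tree actually closed each tag. A different parser backend (for example html5lib, if it is installed) may recover from such markup differently, so it can be worth comparing the output of more than one.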

PrateekG · (This post was last modified: Jun-29-2018, 10:23 AM by PrateekG.)

(Jun-29-2018, 09:52 AM)gontajones Wrote: The complete code:

Please note that the product description I am fetching is under the red highlighted area of the attached screenshot. I don't need the name, product code, or price in it; just the description with HTML tags.

gontajones · (This post was last modified: Jun-29-2018, 11:21 AM by gontajones.)

Try this:

def get_product_details(url):

    try:
        soup = get_soup(url)
        desc_list = soup.select('input ~ ul')
        prod_details = {}
        prod_details['description'] = ''.join([str(i) for i in desc_list])
        print(prod_details)
        return prod_details
    except Exception as ex:
        logger.warning('%s - %s', ex, url)

A more robust selector:

desc_list = soup.select('#itemPrice1 ~ ul')
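
For reference, "A ~ B" selects every B that comes after a sibling A. One plausible reason 'p ~ ul' returns nothing for some products is that their list has no preceding p sibling. The sketch below illustrates this on two made-up fragments; WITH_P, WITHOUT_P and select_lists are hypothetical names, and the real pages may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragments: one has a <p> before its list, the other does not.
WITH_P = '<div><input id="itemPrice1" type="hidden"/><p>Intro</p><ul><li>A</li></ul></div>'
WITHOUT_P = '<div><input id="itemPrice1" type="hidden"/><ul><li>B</li></ul></div>'


def select_lists(html, selector):
    """Return the matched elements as HTML strings."""
    soup = BeautifulSoup(html, "html.parser")
    return [str(tag) for tag in soup.select(selector)]


for name, html in (("with <p>", WITH_P), ("without <p>", WITHOUT_P)):
    print(name, "| p ~ ul:", select_lists(html, "p ~ ul"))
    print(name, "| #itemPrice1 ~ ul:", select_lists(html, "#itemPrice1 ~ ul"))
```

Anchoring on the id is also narrower than 'input ~ ul', which matches a list following any input element on the page.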

PrateekG · Jun-29-2018, 12:32 PM

(Jun-29-2018, 11:14 AM)gontajones Wrote: Try this:

It's strange; with this one I am also getting a blank description:

Output:
product1 description:
product2 description:

Process finished with exit code 0

gontajones · Jun-29-2018, 12:36 PM

Here it returns this:

Output:
product1 description:
{'description': '<ul>\n<li>Freestyle</li>\n<li>Play along with 5 pre-set tunes:\xa0</li>\n</ul><ul>\n<li>Each string will play a note</li>\n<li>Guitar has a whammy bar</li>\n<li>2-in-1 volume control and power button\xa0</li>\n<li>Simple and easy to use\xa0</li>\n<li>Helps develop music appreciation\xa0</li>\n<li>Requires 3 "AA" alkaline batteries (included)</li>\n</ul>'}


product2 description:
{'description': '<ul>\n<li>Authentic castle design features realistic stone facade</li>\n<li>Lookout tower with platform</li>\n<li>This pretend castle has a secret crawl-through door behind a pretend fireplace for kids to crawl in and out</li>\n<li>Castle climber also has a built-in slide for quick exits</li>\n<li>When assembled:</li>\n<ul>\n<li>Height(cm): 139</li>\n<li>Width(cm): 182</li>\n<li>Depth(cm): 156</li>\n</ul>\n<li>Battery: n/a</li>\n</ul>'}

Check what is coming in your soup variable:

soup = get_soup(url)
print(soup)
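
Note that get_soup returns None when the status code is not 200, in which case soup.select raises an error that is only logged as a warning and nothing is printed. Beyond printing the whole soup, a small check like the one below can narrow things down. It is only a sketch: summarize_soup is a hypothetical helper, and it takes the HTML text directly so that it can be tried without a network request.

```python
from bs4 import BeautifulSoup


def summarize_soup(html, selectors=("#detail-right", "#itemPrice1", "#itemPrice1 ~ ul")):
    """Return how many elements each selector matches in the given HTML."""
    if not html:
        return {}  # nothing was downloaded, so there is nothing to match
    soup = BeautifulSoup(html, "html.parser")
    return {selector: len(soup.select(selector)) for selector in selectors}


# Made-up fragment standing in for response.content
sample = '<div id="detail-right"><input id="itemPrice1"/><ul><li>x</li></ul></div>'
print(summarize_soup(sample))
```

If every count is zero for the real page, the downloaded HTML probably differs from what is shown in this thread (for example, a different response for your request), rather than the selector being wrong.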
