Python Forum

How to make my product description fetching function generic?
Hi All,

I am fetching product descriptions (with HTML tags) from a site using BeautifulSoup and Python 3.6.

My code is as below:

import logging

import requests
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)


def get_soup(url):
    # Return the parsed page, or None if the request fails or is not a 200.
    try:
        response = requests.get(url)
        if response.status_code == 200:
            html = response.content
            return BeautifulSoup(html, "html.parser")
    except Exception as ex:
        print("error from " + url + ": " + str(ex))


def get_product_details(url):
    try:
        soup = get_soup(url)
        # Select every <ul> that follows a <p> sibling.
        desc_list = soup.select('p ~ ul')
        prod_details = {}
        prod_details['description'] = ''.join([str(i) for i in desc_list])
        print(prod_details['description'])
        return prod_details
    except Exception as ex:
        logger.warning('%s - %s', ex, url)


if __name__ == '__main__':
    print("product1 description:")
    get_product_details("http://www.aprisin.com.sg/p-748-littletikespoptunesguitar.html")
    print("product2 description:")
    get_product_details("http://www.aprisin.com.sg/p-1052-172083littletikesclassiccastle.html")
The problem with the code above is that it cannot fetch the description for some product URLs.
For example, the description for product2 comes back blank. Sample output:
product1 description:
<ul>
<li>Freestyle</li>
<li>Play along with 5 pre-set tunes: </li>
</ul><ul>
<li>Each string will play a note</li>
<li>Guitar has a whammy bar</li>
<li>2-in-1 volume control and power button </li>
<li>Simple and easy to use </li>
<li>Helps develop music appreciation </li>
<li>Requires 3 "AA" alkaline batteries (included)</li>
</ul>
product2 description:
So what changes do I need to make so that it works for all types of products?
Here bs4 is inserting closing tags before the end of the HTML source.
So I did this; it's only an idea:

def get_product_details(url):

    try:
        soup = get_soup(url)

        desc_list = []
        # Get the product Name before the closing tags inserted by bs4
        desc_list.append(soup.select('#detail-right'))
        start_tag = '<input id="itemPrice1" name="nuPrice1" type="hidden" value=""/>'
        end_tag = '<div id="qty">'

        # Loop through tags and append those between start_tag and end_tag
        flag_append = False
        for content in soup.findAll():
            if(start_tag in str(content)):
                flag_append = True
            if(end_tag in str(content)):
                break
            if(flag_append):
                desc_list.append(content.contents)

        prod_details = {}
        prod_details['description'] = ''.join([str(i) for i in desc_list])
        print(prod_details)
        return prod_details
    except Exception as ex:
        logger.warning('%s - %s', ex, url)
The problem is with the way you try to select the description. Note that even the one that you *think* works is in fact incomplete.
Implementing the following may be more successful:
1. find div with id="detail-right"
2. iterate over child tags until div with id="qty". Process each p or ul tag accordingly
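
The two steps above could be sketched roughly as follows. This is only an illustration on hypothetical, well-formed sample markup (SAMPLE_HTML and extract_description are made-up names, not from the site or the thread); on the real page the parser may close #detail-right early, in which case the loop would not reach the lists.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, well-formed sample markup; the real page may differ.
SAMPLE_HTML = """
<div id="detail-right">
  <h1 id="detail-name">Sample product</h1>
  <input id="itemPrice1" name="nuPrice1" type="hidden" value=""/>
  <p>Features:</p>
  <ul><li>First feature</li><li>Second feature</li></ul>
  <div id="qty">Quantity</div>
  <ul><li>Not part of the description</li></ul>
</div>
"""


def extract_description(soup):
    """Join the p and ul children of #detail-right that come before #qty."""
    container = soup.find("div", id="detail-right")
    if container is None:
        return ""
    parts = []
    for child in container.children:
        if not isinstance(child, Tag):
            continue  # skip whitespace strings and comments
        if child.name == "div" and child.get("id") == "qty":
            break  # step 2: stop at the quantity block
        if child.name in ("p", "ul"):
            parts.append(str(child))
    return "".join(parts)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
description = extract_description(soup)
print(description)
```

Iterating over direct children (rather than all descendants) keeps nested lists inside their parent list instead of repeating them.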
Quote:
1. find div with id="detail-right"
2. iterate over child tags until div with id="qty". Process each p or ul tag accordingly

I tried this, but with the extra closing tags ending up in "soup" I still can't get the right tag's contents.
(Jun-28-2018, 11:57 AM)gontajones Wrote: Here bs4 are inserting closing tags before the end of the html source. [...]

Something seems wrong here! When I ran your code, no description was returned for any product.
The complete code:

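import logging

import requests
from bs4 import BeautifulSoup

# logger is used in get_product_details below, so it has to be defined.
logger = logging.getLogger(__name__)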


def get_soup(url):

    try:
        response = requests.get(url)
        if response.status_code == 200:
            html = response.content
            return BeautifulSoup(html, "html.parser")
    except Exception as ex:
        print("error from " + url + ": " + str(ex))


def get_product_details(url):

    try:
        soup = get_soup(url)

        desc_list = []
        # Get the product Name before the closing tags inserted by bs4
        desc_list.append(soup.select('#detail-right'))
        start_tag = '<input id="itemPrice1" name="nuPrice1" type="hidden" value=""/>'
        end_tag = '<div id="qty">'

        # Loop through tags and append those between start_tag and end_tag
        flag_append = False
        for content in soup.findAll():
            if(start_tag in str(content)):
                flag_append = True
            if(end_tag in str(content)):
                break
            if(flag_append):
                desc_list.append(content.contents)

        prod_details = {}
        prod_details['description'] = ''.join([str(i) for i in desc_list])
        for item in desc_list:
            if item:
                print(item)
        return prod_details
    except Exception as ex:
        logger.warning('%s - %s', ex, url)


if __name__ == '__main__':
    print("product1 description:")
    get_product_details("http://www.aprisin.com.sg/p-748-littletikespoptunesguitar.html")
    print("\n\nproduct2 description:")
    get_product_details("http://www.aprisin.com.sg/p-1052-172083littletikesclassiccastle.html")
You'll have to parse every single line (item) to extract only the text from the HTML tags.
BTW, remember that this is a poor solution. The right one would be to get BeautifulSoup to parse the HTML source without the extra closing tags.

Here, if I print the content of soup

soup = get_soup(url)
print(soup)
I'm getting this (just one part of the soup content):
Output:
<div id="detail-right"> <h1 id="detail-name">LIttle Tikes PopTunes™ GUITAR </h1> <span style="color: #000; font-size: 12px; font-weight: normal;">Product Code : LT636226</span><br/> <!-- price update attributes begin --> <span class="price"><span class="linethrough">S$49.00</span> S$39.00</span></div></div></div></form></div></div></div></div></div></body></html>
bs4 is closing the <body> and <html> tags in the middle of the original HTML source code.
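
One way to investigate this is to parse the same markup with each available parser and compare the trees, since different parsers repair malformed HTML differently. The sketch below is an assumption-laden illustration: BROKEN_HTML is hypothetical markup with one stray closing tag, parse_with_each_parser is a made-up helper, and "lxml" and "html5lib" are optional third-party parsers that may not be installed. The result can also vary with the Beautiful Soup version, so no particular output is claimed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical markup with one stray </div>, loosely imitating the problem page.
BROKEN_HTML = (
    '<html><body><div id="outer"><div id="detail-right">'
    '<h1>Sample product</h1></div></div></div>'
    '<ul><li>Feature</li></ul></body></html>'
)


def parse_with_each_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: soup} for every parser that is installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = BeautifulSoup(html, name)
        except FeatureNotFound:
            continue  # this parser library is not installed; skip it
    return results


soups = parse_with_each_parser(BROKEN_HTML)
for name, soup in soups.items():
    ul = soup.find("ul")
    parent = ul.parent.name if ul is not None else None
    # A parent other than "body" suggests the parser closed <body> early.
    print(name, "-> parent of <ul>:", parent)
```

If one parser keeps the description lists inside the expected container, passing its name to BeautifulSoup in get_soup would be the smaller change.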
(Jun-29-2018, 09:52 AM)gontajones Wrote: The complete code: [...]

Please note that the product description I am fetching is in the red highlighted area of the attached screenshot. I don't need the name, product code, or price in it; just the description with HTML tags.
Try this:

def get_product_details(url):

    try:
        soup = get_soup(url)
        desc_list = soup.select('input ~ ul')
        prod_details = {}
        prod_details['description'] = ''.join([str(i) for i in desc_list])
        print(prod_details)
        return prod_details
    except Exception as ex:
        logger.warning('%s - %s', ex, url)

A more robust selector:

desc_list = soup.select('#itemPrice1 ~ ul')
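
To see how the sibling selectors differ, here is a small sketch on hypothetical markup (SAMPLE_HTML is invented for illustration and only loosely modeled on the page). `A ~ B` matches every B that shares a parent with A and comes after it, so 'p ~ ul' misses a list that has no <p> before it, while '#itemPrice1 ~ ul' anchors on the hidden input instead.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first list follows the hidden input but no <p>.
SAMPLE_HTML = """
<div id="detail-right">
  <input id="itemPrice1" name="nuPrice1" type="hidden" value=""/>
  <ul><li>Outer item<ul><li>Nested item</li></ul></li></ul>
  <p>More details:</p>
  <ul><li>Second list</li></ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only lists that come after a <p> sibling.
after_p = soup.select("p ~ ul")

# Every list that comes after the hidden input and shares its parent.
after_input = soup.select("#itemPrice1 ~ ul")

print(len(after_p))      # 1: only the second list has a <p> before it
print(len(after_input))  # 2: both top-level lists; the nested one is not a sibling
```

The nested list is still part of the result, because it is serialized inside its parent list when that list is converted with str().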
(Jun-29-2018, 11:14 AM)gontajones Wrote: Try this: [...]

Strangely, with this one I am also getting a blank description:
Output:
product1 description:
product2 description:

Process finished with exit code 0
Here it returns this:

Output:
product1 description:
{'description': '<ul>\n<li>Freestyle</li>\n<li>Play along with 5 pre-set tunes:\xa0</li>\n</ul><ul>\n<li>Each string will play a note</li>\n<li>Guitar has a whammy bar</li>\n<li>2-in-1 volume control and power button\xa0</li>\n<li>Simple and easy to use\xa0</li>\n<li>Helps develop music appreciation\xa0</li>\n<li>Requires 3 "AA" alkaline batteries (included)</li>\n</ul>'}
product2 description:
{'description': '<ul>\n<li>Authentic castle design features realistic stone facade</li>\n<li>Lookout tower with platform</li>\n<li>This pretend castle has a secret crawl-through door behind a pretend fireplace for kids to crawl in and out</li>\n<li>Castle climber also has a built-in slide for quick exits</li>\n<li>When assembled:</li>\n<ul>\n<li>Height(cm): 139</li>\n<li>Width(cm): 182</li>\n<li>Depth(cm): 156</li>\n</ul>\n<li>Battery: n/a</li>\n</ul>'}
Check what is coming in your soup variable:

soup = get_soup(url)
print(soup)
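
Printing the whole soup can be a lot to read, so a small summary may help narrow things down. The helper below is a hypothetical sketch (summarize_soup is not part of bs4 or of the code in this thread); it assumes get_soup returns None when the request fails or the status is not 200, as in the code posted earlier.

```python
from bs4 import BeautifulSoup


def summarize_soup(soup):
    """Return a few facts that help explain an empty description."""
    if soup is None:
        # get_soup() returns None on a non-200 status or a request error.
        return {"parsed": False}
    return {
        "parsed": True,
        # Is the anchor element used by the selector present at all?
        "has_item_price": soup.select_one("#itemPrice1") is not None,
        # How many lists exist anywhere in the document?
        "ul_count": len(soup.find_all("ul")),
        # How many of them does the suggested selector actually match?
        "sibling_ul_count": len(soup.select("#itemPrice1 ~ ul")),
    }


# Tiny inline sample so the sketch runs without a network request.
sample = BeautifulSoup(
    '<div><input id="itemPrice1"/><ul><li>x</li></ul></div>', "html.parser"
)
print(summarize_soup(None))
print(summarize_soup(sample))
```

If "parsed" is False the request itself is the problem; if "ul_count" is positive but "sibling_ul_count" is zero, the lists are there but no longer siblings of the input in the parsed tree.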