Python Forum
how to make my product description fetching function generic?
#1
Hi All,

I am fetching product descriptions (with their HTML tags) from a site using BeautifulSoup and Python 3.6.

My code is below:

import logging

import requests
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)


def get_soup(url):
    # Download the page and parse it; returns None if the request fails
    try:
        response = requests.get(url)
        if response.status_code == 200:
            html = response.content
            return BeautifulSoup(html, "html.parser")
    except Exception as ex:
        print("error from " + url + ": " + str(ex))


def get_product_details(url):
    try:
        soup = get_soup(url)
        prod_details = {}
        # Select every <ul> that follows a <p> sibling
        desc_list = soup.select('p ~ ul')
        prod_details['description'] = ''.join([str(i) for i in desc_list])
        print(prod_details['description'])
        return prod_details
    except Exception as ex:
        logger.warning('%s - %s', ex, url)


if __name__ == '__main__':
    print("product1 description:")
    get_product_details("http://www.aprisin.com.sg/p-748-littletikespoptunesguitar.html")
    print("product2 description:")
    get_product_details("http://www.aprisin.com.sg/p-1052-172083littletikesclassiccastle.html")
The problem with the code above is that it cannot fetch the description for some product URLs.
For example, the description for product2 comes back blank. Sample output:
product1 description:
<ul>
<li>Freestyle</li>
<li>Play along with 5 pre-set tunes: </li>
</ul><ul>
<li>Each string will play a note</li>
<li>Guitar has a whammy bar</li>
<li>2-in-1 volume control and power button </li>
<li>Simple and easy to use </li>
<li>Helps develop music appreciation </li>
<li>Requires 3 "AA" alkaline batteries (included)</li>
</ul>
product2 description:
So what changes do I need to make so that it works for all types of products?
Reply
#2
It looks like bs4 is inserting closing tags before the end of the HTML source.
So I tried this; it's only an idea:

def get_product_details(url):
    try:
        soup = get_soup(url)

        desc_list = []
        # Get the product name, which comes before the closing tags inserted by bs4
        desc_list.append(soup.select('#detail-right'))
        start_tag = '<input id="itemPrice1" name="nuPrice1" type="hidden" value=""/>'
        end_tag = '<div id="qty">'

        # Loop through all tags and collect the contents of those between start_tag and end_tag
        flag_append = False
        for content in soup.find_all():
            if start_tag in str(content):
                flag_append = True
            if end_tag in str(content):
                break
            if flag_append:
                desc_list.append(content.contents)

        prod_details = {}
        prod_details['description'] = ''.join([str(i) for i in desc_list])
        print(prod_details)
        return prod_details
    except Exception as ex:
        logger.warning('%s - %s', ex, url)
Reply
#3
The problem is with the way you try to select the description. Note that even the one you *think* works is in fact incomplete.
Try the following instead, which may be more successful:
1. find div with id="detail-right"
2. iterate over child tags until div with id="qty". Process each p or ul tag accordingly
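
The two steps above could be sketched roughly as follows. This is only an illustration: SAMPLE_HTML is made-up markup rather than the real page, collect_description is a hypothetical helper name, and the sketch assumes the description tags are direct children of the div, which the thread does not confirm.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page may be laid out differently.
SAMPLE_HTML = """
<div id="detail-right">
  <h1 id="detail-name">Sample product</h1>
  <input id="itemPrice1" name="nuPrice1" type="hidden" value=""/>
  <p>Intro paragraph</p>
  <ul><li>First feature</li><li>Second feature</li></ul>
  <div id="qty">Quantity</div>
  <ul><li>Not part of the description</li></ul>
</div>
"""


def collect_description(soup):
    """Join the <p> and <ul> children of #detail-right that come before #qty."""
    container = soup.find("div", id="detail-right")
    if container is None:
        return ""
    parts = []
    # recursive=False walks only the direct child tags of the container
    for child in container.find_all(recursive=False):
        if child.name == "div" and child.get("id") == "qty":
            break  # step 2: stop at the quantity block
        if child.name in ("p", "ul"):
            parts.append(str(child))
    return "".join(parts)


description = collect_description(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(description)
```

If the parser closes div#detail-right early, as reported elsewhere in this thread, the loop will find no matching children, so it is worth printing the soup first.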
Reply
#4
Quote: 1. find div with id="detail-right"
2. iterate over child tags until div with id="qty". Process each p or ul tag accordingly

I tried this, but with the extra closing tags ending up in "soup" I still can't get the contents of the right tags.
Reply
#5
(Jun-28-2018, 11:57 AM)gontajones Wrote: Here bs4 are inserting closing tags before the end of the html source.

There seems to be a problem here: when I ran your code, no description was returned for any product!
Reply
#6
The complete code:

import logging

import requests
from bs4 import BeautifulSoup

# logger is used in get_product_details, so it has to be defined
logger = logging.getLogger(__name__)


def get_soup(url):
    # Download the page and parse it; returns None if the request fails
    try:
        response = requests.get(url)
        if response.status_code == 200:
            html = response.content
            return BeautifulSoup(html, "html.parser")
    except Exception as ex:
        print("error from " + url + ": " + str(ex))


def get_product_details(url):
    try:
        soup = get_soup(url)

        desc_list = []
        # Get the product name, which comes before the closing tags inserted by bs4
        desc_list.append(soup.select('#detail-right'))
        start_tag = '<input id="itemPrice1" name="nuPrice1" type="hidden" value=""/>'
        end_tag = '<div id="qty">'

        # Loop through all tags and collect the contents of those between start_tag and end_tag
        flag_append = False
        for content in soup.find_all():
            if start_tag in str(content):
                flag_append = True
            if end_tag in str(content):
                break
            if flag_append:
                desc_list.append(content.contents)

        prod_details = {}
        prod_details['description'] = ''.join([str(i) for i in desc_list])
        # Print every non-empty item that was collected
        for item in desc_list:
            if item:
                print(item)
        return prod_details
    except Exception as ex:
        logger.warning('%s - %s', ex, url)


if __name__ == '__main__':
    print("product1 description:")
    get_product_details("http://www.aprisin.com.sg/p-748-littletikespoptunesguitar.html")
    print("\n\nproduct2 description:")
    get_product_details("http://www.aprisin.com.sg/p-1052-172083littletikesclassiccastle.html")
You'll have to parse every single item to extract only the text from the HTML tags.
BTW, remember that this is a poor solution. The right one would be to get BeautifulSoup to parse the HTML source without the extra closing tags.
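
As a minimal sketch of extracting only the text from collected tags, get_text can be called on each tag. The fragment below is made up to resemble the description lists shown earlier in the thread, not taken from the live page.

```python
from bs4 import BeautifulSoup

# Made-up fragment shaped like the description lists earlier in the thread.
fragment = "<ul><li>Freestyle</li><li>Guitar has a whammy bar </li></ul>"
soup = BeautifulSoup(fragment, "html.parser")

# get_text(strip=True) drops the tags and trims surrounding whitespace
items = [li.get_text(strip=True) for li in soup.find_all("li")]
print(items)
```

This only applies when plain text is wanted; str(tag) keeps the HTML tags instead.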

Here, if I print the content of soup

soup = get_soup(url)
print(soup)
I'm getting this (just one part of the soup content):
Output:
<div id="detail-right"> <h1 id="detail-name">LIttle Tikes PopTunes™ GUITAR </h1> <span style="color: #000; font-size: 12px; font-weight: normal;">Product Code : LT636226</span><br/> <!-- price update attributes begin --> <span class="price"><span class="linethrough">S$49.00</span> S$39.00</span></div></div></div></form></div></div></div></div></div></body></html>
bs4 is closing the <body> and <html> tags in the middle of the original HTML source.
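
One thing that may be worth trying, although nobody in this thread has confirmed that it helps for this site, is parsing the same source with a different parser, since BeautifulSoup's parsers repair broken markup differently. The sketch below assumes html5lib and lxml are optional extras that may or may not be installed; make_soup is a hypothetical helper name and the broken string is made-up markup with a stray closing tag.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, parsers=("html5lib", "lxml", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in parsers:
        try:
            return BeautifulSoup(html, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("no usable parser found")


# Made-up markup with one closing </div> too many
broken = '<div id="detail-right"><h1>Name</h1></div></div><ul><li>Feature</li></ul>'
soup, used_parser = make_soup(broken)
features = [li.get_text() for li in soup.select("ul li")]
print(used_parser, features)
```

Comparing print(soup) across parsers on the real page should show whether the early </body> and </html> still appear.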
Reply
#7
(Jun-29-2018, 09:52 AM)gontajones Wrote: The complete code:


Please note that the product description I am fetching is in the area highlighted in red in the attached screenshot. I don't need the name, product code, or price; just the description with its HTML tags.
[Image: attached screenshot]
Reply
#8
Try this:

def get_product_details(url):

    try:
        soup = get_soup(url)
        desc_list = soup.select('input ~ ul')
        prod_details = {}
        prod_details['description'] = ''.join([str(i) for i in desc_list])
        print(prod_details)
        return prod_details
    except Exception as ex:
        logger.warning('%s - %s', ex, url)

A more robust selector:

desc_list = soup.select('#itemPrice1 ~ ul')
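
To illustrate why the id-based selector is narrower, here is a small sketch on made-up markup. The search form and the ul ids are assumptions added for the example; only the itemPrice1 input and the detail-right div come from the thread.

```python
from bs4 import BeautifulSoup

# Made-up page: a second <input> elsewhere also has a <ul> sibling after it.
SAMPLE_HTML = """
<form id="search"><input name="q"/><ul id="menu"><li>Home</li></ul></form>
<div id="detail-right">
  <input id="itemPrice1" name="nuPrice1" type="hidden" value=""/>
  <ul id="desc"><li>Feature</li></ul>
</div>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# 'input ~ ul' matches a <ul> that follows ANY <input> sibling
broad = [ul["id"] for ul in soup.select("input ~ ul")]
# '#itemPrice1 ~ ul' matches only a <ul> that follows that specific input
narrow = [ul["id"] for ul in soup.select("#itemPrice1 ~ ul")]
print(broad, narrow)
```

Both selectors only see siblings that come after the input, so they return nothing if the parser has moved the lists out of the input's parent.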
Reply
#9
(Jun-29-2018, 11:14 AM)gontajones Wrote: Try this:


Strangely, with this one I am also getting a blank description:
Output:
product1 description:
product2 description:

Process finished with exit code 0
Reply
#10
Here it returns this:

Output:
product1 description:
{'description': '<ul>\n<li>Freestyle</li>\n<li>Play along with 5 pre-set tunes:\xa0</li>\n</ul><ul>\n<li>Each string will play a note</li>\n<li>Guitar has a whammy bar</li>\n<li>2-in-1 volume control and power button\xa0</li>\n<li>Simple and easy to use\xa0</li>\n<li>Helps develop music appreciation\xa0</li>\n<li>Requires 3 "AA" alkaline batteries (included)</li>\n</ul>'}
product2 description:
{'description': '<ul>\n<li>Authentic castle design features realistic stone facade</li>\n<li>Lookout tower with platform</li>\n<li>This pretend castle has a secret crawl-through door behind a pretend fireplace for kids to crawl in and out</li>\n<li>Castle climber also has a built-in slide for quick exits</li>\n<li>When assembled:</li>\n<ul>\n<li>Height(cm): 139</li>\n<li>Width(cm): 182</li>\n<li>Depth(cm): 156</li>\n</ul>\n<li>Battery: n/a</li>\n</ul>'}
Check what is in your soup variable:

soup = get_soup(url)
print(soup)
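
Besides printing the whole soup, a quicker check might be to count how many elements each selector matches. The sketch below uses a hypothetical helper, count_matches, and two made-up fragments: one where the lists are present and one where they are missing.

```python
from bs4 import BeautifulSoup


def count_matches(soup, selectors=("#detail-right", "#itemPrice1", "#itemPrice1 ~ ul")):
    """Map each CSS selector to the number of elements it matches."""
    return {selector: len(soup.select(selector)) for selector in selectors}


# Made-up fragments: one with the description lists, one without them
complete = '<div id="detail-right"><input id="itemPrice1"/><ul><li>a</li></ul><ul><li>b</li></ul></div>'
truncated = '<div id="detail-right"><input id="itemPrice1"/></div>'

complete_counts = count_matches(BeautifulSoup(complete, "html.parser"))
truncated_counts = count_matches(BeautifulSoup(truncated, "html.parser"))
print(complete_counts)
print(truncated_counts)
```

A count of zero for the last selector while the first two are non-zero would suggest the lists are not siblings of the input in the parsed tree.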
Reply

