Python Forum
How to clean html content using BeautifulSoup in Python 3.6?
#1
I have created a pandas DataFrame that stores the HTML content of product descriptions. The HTML looks like this:

<p><img src="//ad.xyz.com/s/files/1/2352/2977/files/logo-3_large.png?v=1512189111" alt="10mois 5 in 1 Convertible Baby Bed &amp; Desk"><br><br></p>\n<h1><strong>10 mois 5 in 1 Convertible Baby Bed &amp; Desk<br><br></strong></h1>

Now I need to write a function that parses the HTML with BeautifulSoup and returns a filtered version containing only whitelisted tags.

The whitelist is simply a list of the desired tags:
whitelist = ['p', 'h1', 'b', 'i', 'u', 'br', 'li']

Can anyone please help me achieve this using Python 3.6?

Thanks!
#2
Please post your code in code tags and any traceback in error tags, and ask specific questions.
I still don't understand why you put HTML in a pandas DataFrame. To me the natural approach would be to request the page, parse the HTML with BeautifulSoup, extract the desired data, and only then, if you are going to process the data, put it in a DataFrame.
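A minimal sketch of that parse-first order, in case it helps. The `products` list and its `id` field are made-up stand-ins for the real response, and `extract_text` is just an illustrative helper name:

```python
from bs4 import BeautifulSoup

# Hypothetical sample records standing in for the real response
products = [
    {"id": 1, "body_html": "<p><img src='url' alt='Bed'><br></p>"
                           "<h1><strong>Baby Bed &amp; Desk</strong></h1>"},
    {"id": 2, "body_html": "<p>Plain <b>bold</b> text</p>"},
]

def extract_text(html):
    """Parse one HTML fragment and return its visible text."""
    soup = BeautifulSoup(html, "html.parser")
    # Join the text nodes with a space and strip surrounding whitespace
    return soup.get_text(separator=" ", strip=True)

# Parse first, then build plain rows
rows = [
    {"id": p["id"], "description": extract_text(p["body_html"])}
    for p in products
]
print(rows)
```

The resulting `rows` list can then be passed to `pandas.DataFrame(rows)` if the data needs further processing.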
#3
Hi Buran,

I am getting a JSON response from an e-commerce site. The response contains many attributes, and the HTML content is in the 'body_html' attribute.
So after getting the response I am storing only the HTML content in a DataFrame.
If there is an alternative approach, please suggest that too.

I am trying the code below:
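from bs4 import BeautifulSoup

# `data` is assumed to be the DataFrame built from the JSON response
product_description = data["body_html"]

def filter_product_description(product_description):
    """Collect every tag whose name is in the whitelist."""
    whitelist = ['p', 'h1', 'b', 'strong', 'span']
    keep = []
    # Iterate over the Series directly; Series.all() is a truth test, not the values
    for html_description in product_description:
        soup = BeautifulSoup(html_description, "html.parser")
        for tag in soup.find_all(True):
            # Compare the tag's name; a Tag object never equals a plain string
            if tag.name in whitelist:
                keep.append(tag)
    return keep

res = filter_product_description(product_description)
print(res)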
I want to use this function to clean up the HTML content, so that it returns only the content inside the tags listed in the whitelist.

Thanks!
#4
from bs4 import BeautifulSoup

html_data = '''\
<html>
<h2>No me</h2>
<p><img src="url" alt="Baby Bed &amp; Desk"><br><br></p>
<h1><strong>Convertible Baby Bed &amp; Desk<br><br></strong></h1>
<footer> Not me </footer>
</html>'''

soup = BeautifulSoup(html_data, 'lxml')
Test:
>>> whitelist = ['p', 'h1', 'b', 'i', 'u','br','li']
>>> clean = [tag for tag in soup.find_all() if tag.name in whitelist]
>>> clean
[<p><img alt="Baby Bed &amp; Desk" src="url"/><br/><br/></p>,
 <br/>,
 <br/>,
 <h1><strong>Convertible Baby Bed &amp; Desk<br/><br/></strong></h1>,
 <br/>,
 <br/>]

>>> clean = set(clean)
>>> clean
{<p><img alt="Baby Bed &amp; Desk" src="url"/><br/><br/></p>,
 <h1><strong>Convertible Baby Bed &amp; Desk<br/><br/></strong></h1>,
 <br/>}

>>> list(clean)[:-1]
[<p><img alt="Baby Bed &amp; Desk" src="url"/><br/><br/></p>,
 <h1><strong>Convertible Baby Bed &amp; Desk<br/><br/></strong></h1>]
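That gives you p and h1 back; the tags nested inside them (img, strong, br) are still there. Note that a set is unordered, so slicing off the last element only happens to drop <br/> in this run.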
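If you'd rather not depend on set ordering, one possible alternative is to keep only the outermost whitelisted tags, meaning those with no whitelisted ancestor. This is only a sketch: `outermost_whitelisted` is a name I made up, and it uses the built-in "html.parser" instead of lxml so nothing extra has to be installed.

```python
from bs4 import BeautifulSoup

html_data = '''\
<html>
<h2>No me</h2>
<p><img src="url" alt="Baby Bed &amp; Desk"><br><br></p>
<h1><strong>Convertible Baby Bed &amp; Desk<br><br></strong></h1>
<footer> Not me </footer>
</html>'''

whitelist = {'p', 'h1', 'b', 'i', 'u', 'br', 'li'}

def outermost_whitelisted(soup, whitelist):
    """Return whitelisted tags that are not nested in another whitelisted tag."""
    result = []
    for tag in soup.find_all(True):
        if tag.name not in whitelist:
            continue
        # Skip tags that already live inside a kept tag (e.g. <br> inside <p>)
        if any(parent.name in whitelist for parent in tag.parents):
            continue
        result.append(tag)
    return result

soup = BeautifulSoup(html_data, "html.parser")
top = outermost_whitelisted(soup, whitelist)
print([tag.name for tag in top])  # document order is preserved
```

As with the set version, the inner tags of p and h1 are left untouched.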
#5
What if I don't want to include <img> and <iframe> tags?
#6
(Apr-27-2018, 07:05 AM)PrateekG Wrote: What if I don't want to include <img> and <iframe> tags?
Then you have to do a second cleaning pass to remove tags nested inside other tags.
Now it starts to get complex; usually this is done the other way around.
That is, you parse out the data you do want, rather than trying, as here, to filter out everything that is not wanted.
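For what it's worth, a rough sketch of such a second pass could look like the following. The function name `clean_html` and the `drop` parameter are my own choices, not anything from the thread. It removes the unwanted tags entirely with `extract()`, then uses `unwrap()` on any other non-whitelisted tag so its text stays in place.

```python
from bs4 import BeautifulSoup

def clean_html(html, whitelist, drop=("img", "iframe")):
    """Return html with `drop` tags removed and other non-whitelisted tags unwrapped."""
    soup = BeautifulSoup(html, "html.parser")

    # Pass 1: remove unwanted tags together with everything inside them
    for tag in soup.find_all(list(drop)):
        tag.extract()

    # Pass 2: for remaining tags outside the whitelist, keep the children
    # (text and nested tags) but discard the tag itself
    for tag in soup.find_all(True):
        if tag.name not in whitelist:
            tag.unwrap()

    return str(soup)

whitelist = ['p', 'h1', 'b', 'i', 'u', 'br', 'li']
sample = ('<p><img src="url" alt="x"><br></p>'
          '<h1><strong>Bed &amp; Desk</strong></h1>')
cleaned = clean_html(sample, whitelist)
print(cleaned)
```

On this sample the img is gone and the strong wrapper is dropped while its text stays inside h1. Attributes on the kept tags are not touched, so that would need another step if they matter.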