Python Forum
How to clean html content using BeautifulSoup in Python 3.6?
from bs4 import BeautifulSoup

html_data = '''\
<html>
<h2>No me</h2>
<p><img src="url" alt="Baby Bed &amp; Desk"><br><br></p>
<h1><strong>Convertible Baby Bed &amp; Desk<br><br></strong></h1>
<footer> Not me </footer>
</html>'''

soup = BeautifulSoup(html_data, 'lxml')
Test:
>>> whitelist = ['p', 'h1', 'b', 'i', 'u', 'br', 'li']
>>> clean = [tag for tag in soup.find_all() if tag.name in whitelist]
>>> clean
[<p><img alt="Baby Bed &amp; Desk" src="url"/><br/><br/></p>,
 <br/>,
 <br/>,
 <h1><strong>Convertible Baby Bed &amp; Desk<br/><br/></strong></h1>,
 <br/>,
 <br/>]

>>> # Drop the tags that are nested inside another whitelisted tag (the stray <br/>).
>>> # set(clean) would also remove the duplicates, but a set is unordered,
>>> # so slicing the result afterwards is not reliable.
>>> clean = [tag for tag in clean if tag.find_parent(whitelist) is None]
>>> clean
[<p><img alt="Baby Bed &amp; Desk" src="url"/><br/><br/></p>,
 <h1><strong>Convertible Baby Bed &amp; Desk<br/><br/></strong></h1>]
That leaves only p and h1. Tags nested inside them (img, strong, br) are still there, even the ones that are not in the whitelist.
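If the inner tags that are not in the whitelist (img and strong here) should go as well, one option is to unwrap them after picking the outer tags. This is only a rough sketch, not the method from the post: clean_html is a made-up helper name, and it uses the built-in 'html.parser' instead of 'lxml' so it runs without extra installs. unwrap() keeps the text and children of a tag and drops the tag itself, so the img (which has no children) just disappears along with its alt text.

```python
from bs4 import BeautifulSoup

html_data = '''\
<html>
<h2>No me</h2>
<p><img src="url" alt="Baby Bed &amp; Desk"><br><br></p>
<h1><strong>Convertible Baby Bed &amp; Desk<br><br></strong></h1>
<footer> Not me </footer>
</html>'''

whitelist = ['p', 'h1', 'b', 'i', 'u', 'br', 'li']


def clean_html(html, allowed):
    """Hypothetical helper: keep outer whitelisted tags, unwrap the rest inside them."""
    soup = BeautifulSoup(html, 'html.parser')  # assumption: html.parser instead of lxml
    # Outer tags: whitelisted and not nested inside another whitelisted tag.
    # A list comprehension keeps document order, unlike a set.
    outer = [tag for tag in soup.find_all(allowed)
             if tag.find_parent(allowed) is None]
    for tag in outer:
        # find_all() returns a list, so it is safe to modify the tree in the loop.
        for inner in tag.find_all(lambda t: t.name not in allowed):
            inner.unwrap()  # drop the tag, keep its text and children
    return ''.join(str(tag) for tag in outer)


cleaned = clean_html(html_data, whitelist)
print(cleaned)
```

If the img should stay, add 'img' to the whitelist. To throw away a tag together with its content, decompose() can be used in place of unwrap().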

Messages In This Thread
RE: How to clean html content using BeautifulSoup in Python 3.6? - by snippsat - Apr-26-2018, 03:46 PM
