Apr-26-2018, 03:46 PM
from bs4 import BeautifulSoup

html_data = '''\
<html>
  <h2>No me</h2>
  <p><img src="url" alt="Baby Bed & Desk"><br><br></p>
  <h1><strong>Convertible Baby Bed & Desk<br><br></strong></h1>
  <footer>
    Not me
  </footer>
</html>'''

soup = BeautifulSoup(html_data, 'lxml')

Test:
>>> whitelist = ['p', 'h1', 'b', 'i', 'u', 'br', 'li']
>>> clean = [tag for tag in soup.find_all() if tag.name in whitelist]
>>> clean
[<p><img alt="Baby Bed & Desk" src="url"/><br/><br/></p>,
 <br/>,
 <br/>,
 <h1><strong>Convertible Baby Bed & Desk<br/><br/></strong></h1>,
 <br/>,
 <br/>]
>>> clean = set(clean)
>>> clean
{<p><img alt="Baby Bed & Desk" src="url"/><br/><br/></p>,
 <h1><strong>Convertible Baby Bed & Desk<br/><br/></strong></h1>,
 <br/>}
>>> list(clean)[:-1]
[<p><img alt="Baby Bed & Desk" src="url"/><br/><br/></p>,
 <h1><strong>Convertible Baby Bed & Desk<br/><br/></strong></h1>]

Then you have
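p and h1 back; the inner tags inside p are still there. Note that a set has no guaranteed order, so list(clean)[:-1] only drops the <br/> when it happens to come last.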
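If relying on the set order feels fragile, one possible alternative is to keep only the whitelisted tags that are not nested inside another whitelisted tag. This is just a sketch, not the only way to do it: the helper name top_level_whitelisted is made up for illustration, and it uses 'html.parser' instead of 'lxml' only so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

html_data = '''\
<html>
  <h2>No me</h2>
  <p><img src="url" alt="Baby Bed & Desk"><br><br></p>
  <h1><strong>Convertible Baby Bed & Desk<br><br></strong></h1>
  <footer>
    Not me
  </footer>
</html>'''

whitelist = {'p', 'h1', 'b', 'i', 'u', 'br', 'li'}


def top_level_whitelisted(soup, whitelist):
    """Hypothetical helper: whitelisted tags with no whitelisted ancestor."""
    return [
        tag for tag in soup.find_all(True)  # True matches every tag
        if tag.name in whitelist
        # Skip tags already contained in a kept tag (e.g. <br> inside <p>).
        and not any(parent.name in whitelist for parent in tag.parents)
    ]


# 'html.parser' is an assumption here; the original post uses 'lxml'.
soup = BeautifulSoup(html_data, 'html.parser')
clean = top_level_whitelisted(soup, whitelist)
print([tag.name for tag in clean])
```

The result stays in document order, and the nested img and br tags are still inside p because only the outer tags are selected.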