Python Forum

Full Version: Beautiful Soup - Delete All HTML - Except Specific Classes
Hi all,

I have been looking everywhere for this concept.

I want to delete all HTML except for elements with the classes I have listed.
The idea is below. The code is not correct.

html = '''\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
'''

soup = BeautifulSoup(html, 'lxml')


for tag in soup():
   if not class in ["1", "2"]:
        tag.decompose()
print(soup)
I can't find any code samples that show this idea.

Basically, delete all HTML except for the classes in the list.

Result:
Everything Deleted except:
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p class="3">article3</p>

Please advise, thank you.
You can use extract() and set class_ to True.
from bs4 import BeautifulSoup

html = '''\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
'''

soup = BeautifulSoup(html, 'lxml')
for tag in soup.find_all(['h2', 'p'], class_=True):
    print(tag.extract())
Output:
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p class="3">article3</p>
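To make the effect of extract() concrete, here is a minimal sketch (not part of the original reply) showing that the matched tags are detached from the tree and returned, while everything else stays behind in the soup. It assumes the built-in 'html.parser' instead of 'lxml' so that no html/body wrapper tags are added; the names kept_html and remaining are only illustrative.

```python
from bs4 import BeautifulSoup

html = '''\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
'''

# Assumption: 'html.parser' is used here so no <html>/<body> wrappers appear.
soup = BeautifulSoup(html, 'html.parser')

# class_=True matches any tag that has a class attribute, whatever its value.
# find_all() returns a finished list, so extracting while looping is safe.
kept = [tag.extract() for tag in soup.find_all(['h2', 'p'], class_=True)]

# The extracted tags are detached from the tree and can be used on their own.
kept_html = [str(tag) for tag in kept]

# The soup now only holds the tags that were not extracted.
remaining = [str(tag) for tag in soup.find_all(True)]

print(kept_html)
print(remaining)
```

Note that class_=True keeps every tag with any class at all, so it does not yet restrict the result to specific class names.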
.select() lets you use CSS selectors, so you can just use that:
>>> import bs4
>>> html = '''
... <h2 class="1">section1</h2>
... <p class="2">article1</p>
... <p>article2</p>
... <p class="3">article3</p>
... <h1> Lorem Ipsum</h1>
... <p> 3 Lorem ipsum dolor </p>
... '''
>>>
>>> soup = bs4.BeautifulSoup(html, 'html.parser')
>>> soup.select(".1, .2")
[<h2 class="1">section1</h2>, <p class="2">article1</p>]
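One caveat with the selector above: class names that start with a digit are not valid in standard CSS class-selector syntax, and newer Beautiful Soup releases (which, as far as I know, hand select() over to the Soup Sieve package) may reject ".1". The sketch below is an alternative under that assumption. It uses the attribute selector [class~="..."], which matches one whitespace-separated class value; the whitelist name is hypothetical.

```python
from bs4 import BeautifulSoup

html = '''\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
'''

soup = BeautifulSoup(html, 'html.parser')

# Hypothetical whitelist of class names to keep.
whitelist = ['1', '2']

# [class~="x"] matches tags whose class list contains "x". Unlike ".x",
# this form does not require the class name to be a valid CSS identifier.
selector = ', '.join(f'[class~="{name}"]' for name in whitelist)

matches = soup.select(selector)
matches_html = [str(tag) for tag in matches]

print(selector)
print(matches_html)
```

Building the selector from a list also means the whitelist can live in one place and grow without editing the selector by hand.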
Thank you for these ideas. Let me do some testing and pop back.

(Jul-12-2018, 04:00 PM) snippsat wrote: Can use extract() and set class to True.

Thank you for this idea.

Is it possible for me to make a list out of this?

for tag in soup.find_all(['h2', 'p'], class_=True):
For example, with a list of classes:
for tag in soup.find_all([['h2', 'p'], class_='1','2']):
So I could store these in a whitelist.

Then I can extract these, or decompose the ones not in my list.

Basically, what I am trying to do is get rid of everything unless it is in my whitelist of classes.
(Jul-12-2018, 04:36 PM) dj99 wrote: What I am trying to do basically is just get rid of everything

Don't think about deleting or removing anything, as that's unnecessarily complex. Just get only what you want. .select() will be much easier for this.
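As a rough illustration of "just get what you want" with a whitelist, the sketch below passes a list of class names to find_all() and joins the matches into a new document instead of pruning the old one. The whitelist and cleaned names are hypothetical, and 'html.parser' is assumed in place of 'lxml'.

```python
from bs4 import BeautifulSoup

html = '''\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
'''

soup = BeautifulSoup(html, 'html.parser')

# Hypothetical whitelist of class names to keep.
whitelist = ['1', '2']

# A list passed as class_ matches tags carrying any one of the listed classes.
wanted = soup.find_all(['h2', 'p'], class_=whitelist)

# Build the cleaned output from the matches; the original soup is untouched.
cleaned = '\n'.join(str(tag) for tag in wanted)

print(cleaned)
```

Because nothing is removed from the tree, the same soup can be queried again with a different whitelist.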
Hello N,

OK, I guess I could do that, if deleting is more complex.

I will do some coding and see if that works better.
(Jul-12-2018, 04:36 PM) dj99 wrote: is it possible for me to make a list out of this
lst = [tag.extract() for tag in soup.find_all(['h2', 'p'], class_=True)]
Or you can also mix in a regex to match just classes 1 and 2 (this needs import re).
lst = [tag.extract() for tag in soup.find_all(['h2', 'p'], class_=re.compile(r"^[12]$"))]
.select(), as mentioned, is also fine for this, or even better; see the CSS Selector Reference.
Also, as mentioned by @nilamo, deleting can be confusing (I have found that too), so just extract what you want.
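If the tree really does have to be pruned in place, as in the original question, a minimal sketch could look like the one below. It is not from any reply in this thread. It assumes the flat markup used above and a hypothetical whitelist of "1", "2" and "3" (to match the expected result in the first post), and is_kept is an illustrative helper.

```python
from bs4 import BeautifulSoup

html = '''\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
'''

soup = BeautifulSoup(html, 'html.parser')

# Hypothetical whitelist, chosen to match the expected result in the question.
whitelist = {'1', '2', '3'}

def is_kept(tag):
    """Return True if the tag carries at least one whitelisted class."""
    return bool(whitelist.intersection(tag.get('class') or []))

# Collect first, delete afterwards, so the tree is not changed while walking it.
# Tags that contain a whitelisted descendant are spared, otherwise a wrapper
# tag would take the wanted tags down with it.
doomed = [
    tag for tag in soup.find_all(True)
    if not is_kept(tag) and tag.find(is_kept) is None
]

for tag in doomed:
    tag.decompose()

remaining = [str(tag) for tag in soup.find_all(True)]
print(remaining)
```

With nested markup this would need more care, since a tag inside an already decomposed parent should not be touched again; for that case, collecting the wanted tags is likely the simpler route.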
Thank you for all the help.

That is another idea I can work with.
Have a great weekend, all.
:)