Beautiful Soup - Delete All HTML - Except Specific Classes

Beautiful Soup - Delete All HTML - Except Specific Classes - dj99 - Jul-12-2018

Hi all,

I have been looking everywhere for this concept.

I want to delete all HTML except for the tags whose classes I have listed.
The idea is below. The code is not correct.

html = '''\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
'''

soup = BeautifulSoup(html, 'lxml')


for tag in soup():
   if not class in ["1", "2"]:
        tag.decompose()
print(soup)

I can't find any code samples that show this idea.

Basically, delete all HTML except for the classes in the list.

Result:
Everything Deleted except:
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p class="3">article3</p>

Please do advise, thank you.

RE: Beautiful Soup - Delete All HTML - Except Specific Classes - snippsat - Jul-12-2018

You can use extract() and set class_ to True.

from bs4 import BeautifulSoup

html = '''\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
'''

soup = BeautifulSoup(html, 'lxml')
for tag in soup.find_all(['h2', 'p'], class_=True):
    print(tag.extract())

Output:
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p class="3">article3</p>

RE: Beautiful Soup - Delete All HTML - Except Specific Classes - nilamo - Jul-12-2018

.select() lets you do css selectors, so you can just use that:

>>> import bs4
>>> html = '''
... <h2 class="1">section1</h2>
... <p class="2">article1</p>
... <p>article2</p>
... <p class="3">article3</p>
... <h1> Lorem Ipsum</h1>
... <p> 3 Lorem ipsum dolor </p>
... '''
>>>
>>> soup = bs4.BeautifulSoup(html, 'html.parser')
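>>> # class names that start with a digit need an attribute selector rather than ".1"
>>> soup.select('[class~="1"], [class~="2"]')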
[<h2 class="1">section1</h2>, <p class="2">article1</p>]
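
A note on the selector, with a small sketch. Class names that begin with a digit are not valid CSS identifiers, and recent Beautiful Soup releases hand .select() to the soupsieve package, which, as far as I know, rejects a selector such as ".1". Matching on the class attribute with [class~="..."] avoids the problem. The helper name whitelist_selector below is made up for illustration, and it assumes the class names contain no quote characters.

```python
from bs4 import BeautifulSoup

html = '''\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
'''


def whitelist_selector(classes):
    # Hypothetical helper: [class~="x"] matches one whitespace-separated
    # token of the class attribute, so it also works for names like "1".
    return ", ".join('[class~="{}"]'.format(name) for name in classes)


soup = BeautifulSoup(html, "html.parser")
selector = whitelist_selector(["1", "2"])
kept = soup.select(selector)

print(selector)
print(kept)
```

Adding "3" to the list passed to whitelist_selector() would also pick up the third paragraph.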

RE: Beautiful Soup - Delete All HTML - Except Specific Classes - dj99 - Jul-12-2018

Thank you for these ideas - let me do some testing and pop back

(Jul-12-2018, 04:00 PM)snippsat Wrote: Can use extract() and set class to True.

Thank you for this idea.

Is it possible for me to make a list out of this?

for tag in soup.find_all(['h2', 'p'], class_=True):

For example, with a list of classes (not valid Python, just the idea):
for tag in soup.find_all([['h2', 'p'], class_='1','2']):

That way I can store the classes in a whitelist.

Then I can extract those tags, or decompose the ones not in my list.

What I am trying to do basically is just get rid of everything unless it is in my whitelist of classes.

RE: Beautiful Soup - Delete All HTML - Except Specific Classes - nilamo - Jul-12-2018

(Jul-12-2018, 04:36 PM)dj99 Wrote: What i am trying to do basically is just get rid of everything

Don't think about deleting or removing anything, as that's unnecessarily complex. Just only get what you want. .select() will be much easier for this.
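
As a rough sketch of "only get what you want": collect the whitelisted tags and join them into a new string, leaving the original soup untouched. The WHITELIST name is an assumption for illustration. It uses find_all() with a list for class_, which matches a tag if any of its classes is in the list, and html.parser so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

html = '''\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
'''

# Hypothetical whitelist of class names to keep.
WHITELIST = ["1", "2", "3"]

soup = BeautifulSoup(html, "html.parser")

# Nothing is deleted: the matching tags are only read from the tree.
kept = soup.find_all(class_=WHITELIST)

# Build the "everything else removed" result from the kept tags.
result = "\n".join(str(tag) for tag in kept)
print(result)
```

One caveat: if a whitelisted tag sits inside another whitelisted tag, it would appear twice in the result, once on its own and once inside its parent.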

RE: Beautiful Soup - Delete All HTML - Except Specific Classes - dj99 - Jul-12-2018

Hello N,

OK, I guess I could do that, if the latter is more complex.

I will do some coding and see if that works better.

RE: Beautiful Soup - Delete All HTML - Except Specific Classes - snippsat - Jul-12-2018

(Jul-12-2018, 04:36 PM)dj99 Wrote: is it possible for me to make a list out of this

lst = [tag.extract() for tag in soup.find_all(['h2', 'p'], class_=True)]

Or you can mix in a regex to match only classes 1 and 2.

import re
lst = [tag.extract() for tag in soup.find_all(['h2', 'p'], class_=re.compile(r"^[12]$"))]

.select(), as mentioned, is also fine for this, or even better; see a CSS selector reference.
Also, as mentioned by @nilamo, deleting can be confusing (which I went along with), so just extract what you want.
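
A small comparison of the regex and list forms of class_, as a sketch only. The sample markup below is made up (it adds a class "12" that is not in the thread's HTML) to show why the pattern matters: Beautiful Soup applies a regex with search(), so an unanchored character class such as [1,2] also matches "12" (and a literal comma), while an anchored pattern or a plain list matches only the exact names.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, not from the original question.
html = '''\
<p class="1">one</p>
<p class="2">two</p>
<p class="12">twelve</p>
<p class="3">three</p>
'''

soup = BeautifulSoup(html, "html.parser")

# Unanchored: any class containing "1", "," or "2" matches.
loose = soup.find_all("p", class_=re.compile("[1,2]"))

# Anchored: the class must be exactly "1" or exactly "2".
strict = soup.find_all("p", class_=re.compile(r"^[12]$"))

# A list works as a whitelist of exact class names.
listed = soup.find_all("p", class_=["1", "2"])

print([tag.get_text() for tag in loose])
print([tag.get_text() for tag in strict])
print([tag.get_text() for tag in listed])
```

For a fixed whitelist the list form is probably the easiest to read and to extend.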

RE: Beautiful Soup - Delete All HTML - Except Specific Classes - dj99 - Jul-13-2018

Thank you for all the help.

That is another idea I can work with.
Have a great weekend, all.
:)