Python Forum
Beautiful Soup - Delete All HTML - Except Specific Classes
#1
Hi all,

I have been looking everywhere for this concept.

I want to delete all HTML except for elements with the classes I have listed.
The idea is below. The code is not correct.

html = '''\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
'''

soup = BeautifulSoup(html, 'lxml')


for tag in soup():
   if not class in ["1", "2"]:
        tag.decompose()
print(soup)
I can't find any code samples that show this idea.

Basically, delete all HTML except for the classes listed in the list.

Result:
Everything deleted except:
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p class="3">article3</p>

Please advise. Thank you.
Reply
#2
You can use extract() and set class_ to True.
from bs4 import BeautifulSoup

html = '''\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
'''

soup = BeautifulSoup(html, 'lxml')
for tag in soup.find_all(['h2', 'p'], class_=True):
    print(tag.extract())
Output:
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p class="3">article3</p>
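Note that class_=True keeps every h2 or p that has any class at all. A minimal sketch of restricting the match to specific classes, assuming a list named whitelist (a hypothetical name, not from the thread) and the sample HTML from the first post:

```python
from bs4 import BeautifulSoup

html = """\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
"""

# Hypothetical whitelist; the thread only mentions classes "1", "2" and "3".
whitelist = ["1", "2", "3"]

soup = BeautifulSoup(html, "html.parser")

# Passing a list to class_ matches tags carrying any one of the listed classes.
kept = soup.find_all(class_=whitelist)
for tag in kept:
    print(tag)
```

Dropping the tag-name filter means any element type with a whitelisted class is kept; pass ['h2', 'p'] as the first argument to narrow it again.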
Reply
#3
.select() lets you do css selectors, so you can just use that:
>>> import bs4
>>> html = '''
... <h2 class="1">section1</h2>
... <p class="2">article1</p>
... <p>article2</p>
... <p class="3">article3</p>
... <h1> Lorem Ipsum</h1>
... <p> 3 Lorem ipsum dolor </p>
... '''
>>>
>>> soup = bs4.BeautifulSoup(html, 'html.parser')
>>> soup.select(".1, .2")
[<h2 class="1">section1</h2>, <p class="2">article1</p>]
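One caveat: CSS identifiers may not start with a digit, so a selector such as ".1" may be rejected by newer Beautiful Soup releases that hand selectors to the soupsieve package. A sketch that sidesteps the issue with attribute selectors, assuming the same sample HTML:

```python
from bs4 import BeautifulSoup

html = """\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
"""

soup = BeautifulSoup(html, "html.parser")

# [class~="1"] matches elements whose space-separated class list contains "1".
# Unlike ".1", it is valid CSS even though the class name starts with a digit.
matches = soup.select('[class~="1"], [class~="2"]')
print(matches)
```

For class names that start with a letter, the shorter ".name" form is fine.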
Reply
#4
Thank you for these ideas - let me do some testing and pop back

(Jul-12-2018, 04:00 PM)snippsat Wrote: Can use extract() and set class to True.
Thank you for this idea.

Is it possible for me to make a list out of this?

for tag in soup.find_all(['h2', 'p'], class_=True):
For example, with a list of classes (pseudocode):
for tag in soup.find_all([['h2', 'p'], class_='1','2']):
That way I can store these in a whitelist.

Then I can extract these, or decompose the ones not in my list.

What I am trying to do, basically, is get rid of everything unless it happens to be in my whitelist of classes.
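For reference, the delete-everything-else idea can be sketched as follows. This is only one possible approach under stated assumptions: whitelist, is_whitelisted and should_keep are hypothetical names, and html.parser is used so that no html/body wrapper tags are added around the fragment.

```python
from bs4 import BeautifulSoup

html = """\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
"""

# Hypothetical whitelist of class names to keep.
whitelist = {"1", "2", "3"}


def is_whitelisted(tag):
    """True if the tag carries at least one whitelisted class."""
    return bool(whitelist.intersection(tag.get("class", [])))


def should_keep(tag):
    """Keep whitelisted tags, anything inside them, and their ancestors."""
    return (
        is_whitelisted(tag)
        or any(is_whitelisted(parent) for parent in tag.parents)
        or tag.find(is_whitelisted) is not None
    )


soup = BeautifulSoup(html, "html.parser")

# Decide what to remove before mutating the tree.
doomed = [tag for tag in soup.find_all(True) if not should_keep(tag)]
doomed_ids = {id(tag) for tag in doomed}

# Only decompose the outermost doomed tags; their children go with them.
top_level = [
    tag for tag in doomed
    if not any(id(parent) in doomed_ids for parent in tag.parents)
]
for tag in top_level:
    tag.decompose()

# Loose text between tags (here just newlines) is left in place.
print(soup)
```

The amount of bookkeeping needed here is a fair illustration of why selecting what you want tends to be simpler than deleting what you don't.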
Reply
#5
(Jul-12-2018, 04:36 PM)dj99 Wrote: What i am trying to do basically is just get rid of everything

Don't think about deleting or removing anything, as that's unnecessarily complex. Just get only what you want. .select() will be much easier for this.
Reply
#6
Hello N,

OK, I guess I could do that, if deleting is the more complex route.

I will do some coding and see if that works better.
Reply
#7
(Jul-12-2018, 04:36 PM)dj99 Wrote: is it possible for me to make a list out of this
lst = [tag.extract() for tag in soup.find_all(['h2', 'p'], class_=True)]
Or you can mix in a regex for just 1 and 2 (this needs import re).
lst = [tag.extract() for tag in soup.find_all(['h2', 'p'], class_=re.compile(r"^[12]$"))]
.select(), as mentioned, is also fine (or better) for this; see the CSS Selector Reference.
Also, as mentioned by @nilamo, deleting can be confusing (as I found too), so just extract what you want.
Reply
#8
Thank you for all the help.

That is another idea I can work with.
Have a great weekend, all.
:)
Reply

