 Beautiful Soup - Delete All HTML - Except Specific Classes
#1
Hi all,

I have been looking everywhere for this concept.

I want to delete all HTML except for the elements with the classes I have listed.
The idea is below. The code is not correct.

html = '''\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
'''

soup = BeautifulSoup(html, 'lxml')


for tag in soup():
   if not class in ["1", "2"]:
        tag.decompose()
print(soup)

I can't find any code samples that show this idea.

Basically, I want to delete all HTML except for the elements whose classes are in the list.

Expected result, with everything deleted except:
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p class="3">article3</p>

Please advise. Thank you.
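
For readers who do want the "remove everything else" behaviour described above, here is a minimal sketch rather than a definitive answer. It assumes the whitelist is 1, 2 and 3 (so that it matches the expected result shown), uses html.parser so the fragment is not wrapped in html/body tags, and uses extract() instead of decompose() to stay on the safe side when a tag's parent has already been removed. The names WHITELIST, is_whitelisted and should_keep are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
"""

# Assumed whitelist: "3" is included so the result matches the expected output.
WHITELIST = {"1", "2", "3"}


def is_whitelisted(tag):
    """Return True if the tag carries at least one whitelisted class."""
    return bool(WHITELIST.intersection(tag.get("class") or []))


def should_keep(tag):
    """Keep whitelisted tags plus their ancestors and descendants."""
    return (
        is_whitelisted(tag)
        or tag.find(is_whitelisted) is not None
        or tag.find_parent(is_whitelisted) is not None
    )


# html.parser does not wrap the fragment in <html>/<body>.
soup = BeautifulSoup(html, "html.parser")

# Decide first, then remove, so the tree is not modified while it is inspected.
doomed = [tag for tag in soup.find_all(True) if not should_keep(tag)]
for tag in doomed:
    tag.extract()

kept = soup.find_all(True)
print(*kept, sep="\n")
```

The leftover newlines between the removed tags stay in the soup as text nodes, so printing the remaining tags individually gives cleaner output than printing the whole soup.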
#2
You can use extract() and set class_=True, which matches any tag that has a class attribute.
from bs4 import BeautifulSoup

html = '''\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
'''

soup = BeautifulSoup(html, 'lxml')
for tag in soup.find_all(['h2', 'p'], class_=True):
    print(tag.extract())
Output:
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p class="3">article3</p>
#3
.select() lets you use CSS selectors, so you can just use that:
>>> import bs4
>>> html = '''
... <h2 class="1">section1</h2>
... <p class="2">article1</p>
... <p>article2</p>
... <p class="3">article3</p>
... <h1> Lorem Ipsum</h1>
... <p> 3 Lorem ipsum dolor </p>
... '''
>>>
>>> soup = bs4.BeautifulSoup(html, 'html.parser')
>>> soup.select(".1, .2")
[<h2 class="1">section1</h2>, <p class="2">article1</p>]
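
A note for anyone trying this on a current install: recent Beautiful Soup versions hand select() to the soupsieve package, which follows CSS syntax more strictly. In CSS, a class selector such as .1 that starts with a digit is not a valid identifier, so it may raise a selector syntax error. One hedged workaround is the attribute selector [class~="1"], which matches a whitespace-separated class token without that restriction. The wanted list and the variable names below are only for illustration.

```python
from bs4 import BeautifulSoup

html = """\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
"""

soup = BeautifulSoup(html, "html.parser")

# Hypothetical list of class names to keep.
wanted = ["1", "2"]

# Build one selector: [class~="1"], [class~="2"]
# ~= matches a single token inside the space-separated class attribute.
selector = ", ".join(f'[class~="{name}"]' for name in wanted)

matches = soup.select(selector)
print(selector)
print(matches)
```

Class names that start with a letter do not need this; for those, the plain .name form works as shown above.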
#4
Thank you for these ideas. Let me do some testing and report back.

(Jul-12-2018, 04:00 PM)snippsat Wrote: Can use extract() and set class to True.
Thank you for this idea.

Is it possible to drive this from a list? Instead of

for tag in soup.find_all(['h2', 'p'], class_=True):

I would like to pass a list of classes, something like this (not valid code, just the idea):

for tag in soup.find_all([['h2', 'p'], class_='1','2']):

That way I can store the classes in a whitelist, then either extract the matching tags or decompose the ones not in my list.

What I am trying to do, basically, is get rid of everything unless it is in my whitelist of classes.
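
The whitelist idea can be written with a plain Python list, since find_all() accepts a list as an attribute filter and matches a tag if any item in the list matches one of its classes. This is a small sketch under that assumption; the whitelist variable name is just for illustration.

```python
from bs4 import BeautifulSoup

html = """\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
"""

soup = BeautifulSoup(html, "html.parser")

# Classes to keep; edit this list to change what is collected.
whitelist = ["1", "2"]

# A list passed to class_ matches tags that have any of the listed classes.
kept = soup.find_all(["h2", "p"], class_=whitelist)

for tag in kept:
    print(tag)
```

Dropping the ['h2', 'p'] argument would match any tag name with a whitelisted class, which may be closer to a pure class whitelist.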
#5
(Jul-12-2018, 04:36 PM)dj99 Wrote: What i am trying to do basically is just get rid of everything

Don't think about deleting or removing anything, as that's unnecessarily complex. Just get only what you want. .select() will be much easier for this.
#6
Hello N,

OK, I guess I could do that if deleting is more complex.

I will do some coding and see if that works better.
#7
(Jul-12-2018, 04:36 PM)dj99 Wrote: is it possible for me to make a list out of this
lst = [tag.extract() for tag in soup.find_all(['h2', 'p'], class_=True)]
Or you can mix in a regex to match just 1 and 2 (this needs import re).
lst = [tag.extract() for tag in soup.find_all(['h2', 'p'], class_=re.compile("[1,2]"))]
As mentioned, .select() is also fine for this, or better; see the CSS Selector Reference.
Also, as @nilamo mentioned, deleting can be confusing (as I found too), so just extract what you want.
#8
Thank you for all the help.

That is another idea I can work with.
Have a great weekend, all :)