Beautiful Soup - Delete All HTML - Except Specific Classes
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html), Thread: /thread-11508.html

Beautiful Soup - Delete All HTML - Except Specific Classes - dj99 - Jul-12-2018

Hi all,

I have been looking everywhere for this concept. I want to delete all HTML except for elements with the classes I have listed. The idea is below; the code is not correct.

    html = '''\
    <h2 class="1">section1</h2>
    <p class="2">article1</p>
    <p>article2</p>
    <p class="3">article3</p>
    <h1> Lorem Ipsum</h1>
    <p> 3 Lorem ipsum dolor </p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    for tag in soup():
        if not class in ["1", "2"]:
            tag.decompose()
    print(soup)

I can't find any code samples that show this idea. Basically: delete all HTML except for the classes listed in the list.

Result: everything deleted except:

    <h2 class="1">section1</h2>
    <p class="2">article1</p>
    <p class="3">article3</p>

Please advise. Thank you.

RE: Beautiful Soup - Delete All HTML - Except Specific Classes - snippsat - Jul-12-2018

You can use extract() and set class_ to True.

    from bs4 import BeautifulSoup

    html = '''\
    <h2 class="1">section1</h2>
    <p class="2">article1</p>
    <p>article2</p>
    <p class="3">article3</p>
    <h1> Lorem Ipsum</h1>
    <p> 3 Lorem ipsum dolor </p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    # class_=True matches every h2 or p tag that has a class attribute
    for tag in soup.find_all(['h2', 'p'], class_=True):
        print(tag.extract())
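
For comparison, the loop from the original question can be made to run roughly like this. It is only a sketch: the name WHITELIST is made up for illustration, 'html.parser' is used so that no html/body wrapper tags get added, and it assumes a flat document like the sample, where no whitelisted tag sits inside a non-whitelisted one (removing the parent would take the child with it).

```python
from bs4 import BeautifulSoup

html = '''\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
'''

# Hypothetical whitelist of class names to keep (not from the thread).
WHITELIST = {"1", "2", "3"}

soup = BeautifulSoup(html, "html.parser")

# find_all(True) returns every tag as a list, so removing tags from
# the tree while looping over that list is safe.
for tag in soup.find_all(True):
    classes = tag.get("class", [])  # class is multi-valued, so this is a list
    if not WHITELIST.intersection(classes):
        # Detach the tag from the tree. In a flat document like this one,
        # tag.decompose() would work here as well.
        tag.extract()

print(soup)
```

What is left in soup are the three whitelisted tags plus the whitespace that was between the original tags.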

RE: Beautiful Soup - Delete All HTML - Except Specific Classes - nilamo - Jul-12-2018

.select() lets you use CSS selectors, so you can just use that:

    >>> import bs4
    >>> html = '''
    ... <h2 class="1">section1</h2>
    ... <p class="2">article1</p>
    ... <p>article2</p>
    ... <p class="3">article3</p>
    ... <h1> Lorem Ipsum</h1>
    ... <p> 3 Lorem ipsum dolor </p>
    ... '''
    >>>
    >>> soup = bs4.BeautifulSoup(html, 'html.parser')
    >>> soup.select(".1, .2")
    [<h2 class="1">section1</h2>, <p class="2">article1</p>]

RE: Beautiful Soup - Delete All HTML - Except Specific Classes - dj99 - Jul-12-2018

Thank you for these ideas. Let me do some testing and pop back.

(Jul-12-2018, 04:00 PM)snippsat Wrote: Can use

Thank you for this idea. Is it possible for me to make a list out of this?

    for tag in soup.find_all(['h2', 'p'], class_=True):

For example, a list of classes:

    for tag in soup.find_all([['h2', 'p'], class_='1','2']):

That way I can store these in a whitelist. Then I can extract these, or decompose the ones not in my list.

What I am trying to do, basically, is get rid of everything unless it happens to be in my whitelist of classes.

RE: Beautiful Soup - Delete All HTML - Except Specific Classes - nilamo - Jul-12-2018

(Jul-12-2018, 04:36 PM)dj99 Wrote: What i am trying to do basically is just get rid of everything

Don't think about deleting or removing anything, as that's unnecessarily complex. Just get only what you want. .select() will be much easier for this.
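
To drive .select() from a whitelist, one option is to build the selector string from a list. This is a minimal sketch; the names whitelist and selector are illustrative, not from the thread. It uses [class~="..."] attribute selectors rather than .1 because a CSS class selector cannot start with a digit, and newer Beautiful Soup releases (which hand .select() to the soupsieve package) may reject ".1" as invalid.

```python
from bs4 import BeautifulSoup

html = '''
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
'''

# Hypothetical whitelist of class names to keep.
whitelist = ["1", "2"]

# [class~="x"] matches any tag whose class list contains the word x,
# and it also works for class names that start with a digit.
selector = ", ".join(f'[class~="{name}"]' for name in whitelist)

soup = BeautifulSoup(html, "html.parser")
kept = soup.select(selector)  # matching tags, in document order

print(kept)
```

Nothing is removed from soup here; kept simply holds the tags that match the whitelist.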

RE: Beautiful Soup - Delete All HTML - Except Specific Classes - dj99 - Jul-12-2018

Hello N,

OK, I guess I could do that if the latter is more complex. I will do some coding and see if that works better.

RE: Beautiful Soup - Delete All HTML - Except Specific Classes - snippsat - Jul-12-2018

(Jul-12-2018, 04:36 PM)dj99 Wrote: is it possible for me to make a list out of this

    lst = [tag.extract() for tag in soup.find_all(['h2', 'p'], class_=True)]

Or you can also mix in a regex for just 1 and 2.

    import re

    # Anchored so that only the classes "1" and "2" match; an unanchored
    # "[1,2]" would also match a comma or any class containing 1 or 2.
    lst = [tag.extract() for tag in soup.find_all(['h2', 'p'], class_=re.compile(r"^[12]$"))]

.select(), as mentioned, is also fine for this, or better; see the CSS Selector Reference.

Also, as mentioned by @nilamo, deleting can be confusing (I found it so too); just extract what you want.

RE: Beautiful Soup - Delete All HTML - Except Specific Classes - dj99 - Jul-13-2018

Thank you for all the help. That is another idea I can work with.

Have a great weekend, all :)
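
One more variant of the list idea, for anyone finding this thread later: class_ also accepts a list of strings, which avoids the regex. A small sketch; whitelist and cleaned are illustrative names, and 'html.parser' is used only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

html = '''\
<h2 class="1">section1</h2>
<p class="2">article1</p>
<p>article2</p>
<p class="3">article3</p>
<h1> Lorem Ipsum</h1>
<p> 3 Lorem ipsum dolor </p>
'''

# Hypothetical whitelist of class names to keep.
whitelist = ["1", "2"]

soup = BeautifulSoup(html, "html.parser")

# A list passed to class_ matches tags that carry any of the listed classes.
kept = soup.find_all(["h2", "p"], class_=whitelist)

# Rebuild a cleaned document from the kept tags only.
cleaned = "\n".join(str(tag) for tag in kept)
print(cleaned)
```

This keeps the "only get what you want" approach: the original soup is left untouched and the cleaned HTML is built from the matches.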