May-01-2021, 02:00 PM
(Apr-30-2021, 04:50 PM)spacedog Wrote: How can I efficiently process the above block of html not knowing its structure? I'm looking for a generic method that can uniformly extract a variety of html structures as shown above.
It's easier to run html2text on the section you have found when different tags are used and you just want the text from it.
import html2text

data = '''\
<dd>
A block of text here.... bla bla bla....
<ul>
<li><p>Item 1. for some reason they wraped this in a p</p></li>
<li><strong>And this item is important</strong>bla bla bla</li>
<li>And just more info here...</li>
</ul>
And finally more stuff here...
</dd>'''

# Configure the converter
text = html2text.HTML2Text()
text.mark_code = True
text.ignore_emphasis = True
text.single_line_break = True
text.ignore_links = True
# handle() returns the converted text; the name text is reused for the result
text = text.handle(data)
print(text.strip())
Output:
A block of text here.... bla bla bla....
* Item 1. for some reason they wraped this in a p
* And this item is importantbla bla bla
* And just more info here...
And finally more stuff here...
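To see why this works without knowing the structure in advance, here is a rough standard-library sketch of the same idea: block-level tags end a line, list items get a bullet, and inline tags such as strong are simply ignored. This is not how html2text is implemented. The names TextExtractor, html_to_lines and BLOCK_TAGS are made up for illustration, and the set of block tags is an assumption you would extend for your own pages.

```python
from html.parser import HTMLParser

# Tags treated as line boundaries (an assumption; extend as needed).
BLOCK_TAGS = {"p", "li", "ul", "ol", "dl", "dd", "dt", "div", "br"}


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text line by line, with '* ' bullets for <li>."""

    def __init__(self):
        super().__init__()
        self.lines = []   # finished output lines
        self.parts = []   # text fragments of the line being built
        self.prefix = ""  # "* " while the next emitted line starts a list item

    def flush(self):
        # Join the fragments, collapse whitespace, and emit a line if non-empty.
        text = " ".join("".join(self.parts).split())
        if text:
            self.lines.append(self.prefix + text)
            self.prefix = ""
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        if tag == "li":
            self.prefix = "* "

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        if tag == "li":
            self.prefix = ""

    def handle_data(self, data):
        self.parts.append(data)


def html_to_lines(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any text that follows the last tag
    return "\n".join(parser.lines)
```

Calling html_to_lines(data) on the block above gives the same five lines as html2text does here, but html2text handles far more cases (tables, links, nested lists), so the sketch is only meant to show the principle.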
There are many options. If you set ignore_emphasis = False, a strong tag will be rendered as **.
So if I, for example, want a new line wherever there is a strong tag:
print(text.strip().replace('**', '\n'))
Output:
A block of text here.... bla bla bla....
* Item 1. for some reason they wraped this in a p
*
And this item is important
bla bla bla
* And just more info here...
And finally more stuff here...
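The plain replace('**', '\n') also hits any stray ** in the text that is not a bold marker. A slightly more careful variant, as a sketch only: match a complete **...** pair with a regular expression and put its content on its own line. break_on_strong is a hypothetical helper name, and the input is assumed to be the Markdown-style text that handle() returns when ignore_emphasis = False.

```python
import re


def break_on_strong(markdown_text):
    """Put every **bold** run on its own line and drop the ** markers."""
    # Only complete pairs are replaced; a lone '**' is left untouched.
    out = re.sub(r"\*\*(.+?)\*\*", r"\n\1\n", markdown_text)
    # Strip trailing spaces and drop the empty lines the replacement can create.
    lines = [line.rstrip() for line in out.splitlines()]
    return "\n".join(line for line in lines if line)


print(break_on_strong("* **And this item is important**bla bla bla"))
```

For that list item it prints the same three lines as the replace version (a bare *, then the bold text, then the rest), while a string like "2 ** 3" passes through unchanged.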