Need help with lxml.html and xpath

***snippsat*** · May-01-2021, 02:00 PM

(Apr-30-2021, 04:50 PM)spacedog Wrote: How can I efficiently process the above block of html not knowing its structure? I'm looking for a generic method that can uniformly extract a variety of html structures as shown above.

It's easier to run html2text on the section you found when it uses a mix of different tags and you just want the text out of it.

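import html2text

# The section already found (e.g. with lxml/xpath), as an HTML string
data = '''\
<dd>
  A block of text here.... bla bla bla....
  <ul>
    <li><p>Item 1.  for some reason they wraped this in a p</p></li>
    <li><strong>And this item is important</strong>bla bla bla</li>
    <li>And just more info here...</li>
  </ul>
  And finally more stuff here...
</dd>'''

# Set up the converter, then convert the HTML to plain text
text = html2text.HTML2Text()
text.mark_code = True
text.ignore_emphasis = True      # no ** or _ markers for strong/em
text.single_line_break = True    # one newline after a block instead of two
text.ignore_links = True         # drop link markup
text = text.handle(data)         # note: text is now the converted string
print(text.strip())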

Output:
A block of text here.... bla bla bla....
  * Item 1. for some reason they wraped this in a p
  * And this item is importantbla bla bla
  * And just more info here...

And finally more stuff here...

There are many options. If you set ignore_emphasis = False, strong tags come out as **.
So if I, for example, want a new line wherever there is a strong tag, I can replace that marker after converting again with that setting.

print(text.strip().replace('**', '\n'))

Output:
A block of text here.... bla bla bla....
  * Item 1. for some reason they wraped this in a p
  * 
And this item is important
 bla bla bla
  * And just more info here...

And finally more stuff here...
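
The plain replace leaves a stray space around the break, as the output above shows. A slightly tidier variant, as a rough sketch: the helper name break_on_strong is hypothetical, and the sample line is what I assume html2text produces for that list item with ignore_emphasis = False.

```python
import re


def break_on_strong(markdown_text):
    """Put each **strong** run on its own line and trim stray spaces."""
    # Replace every '**' marker, plus spaces/tabs touching it, with a newline.
    broken = re.sub(r"[ \t]*\*\*[ \t]*", "\n", markdown_text)
    # Remove trailing whitespace left at the end of any line.
    return "\n".join(line.rstrip() for line in broken.splitlines())


# Assumed html2text output for the list item containing the strong tag
sample = "  * **And this item is important** bla bla bla"
print(break_on_strong(sample))
```

Note that this treats every ** the same, so it only makes sense when ** in the text can only come from strong/b tags.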
