Python Forum
Need help with lxml.html and xpath
(Apr-30-2021, 04:50 PM)spacedog Wrote: How can I efficiently process the above block of html not knowing it's structure? I'm looking for a generic method that can uniformly extract a variety of html structures as shown above.
When the section uses a mix of different tags and you just want the text from it, it's easier to run html2text on the section you found.
import html2text

data = '''\
<dd>
  A block of text here.... bla bla bla....
  <ul>
    <li><p>Item 1.  for some reason they wraped this in a p</p></li>
    <li><strong>And this item is important</strong>bla bla bla</li>
    <li>And just more info here...</li>
  </ul>
  And finally more stuff here...
</dd>'''

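# Keep the converter under its own name so the result string does not overwrite it
h = html2text.HTML2Text()
h.mark_code = True
h.ignore_emphasis = True      # drop the ** markers around <strong>/<em>
h.single_line_break = True    # one line break after a block instead of two
h.ignore_links = True
text = h.handle(data)
print(text.strip())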
Output:
A block of text here.... bla bla bla....
* Item 1. for some reason they wraped this in a p
* And this item is importantbla bla bla
* And just more info here...
And finally more stuff here...
There are many options. If you set ignore_emphasis = False, the strong tag comes out as **.
So if, for example, I want a new line wherever there is a strong tag:
print(text.strip().replace('**', '\n'))
Output:
A block of text here.... bla bla bla....
* Item 1. for some reason they wraped this in a p
*
And this item is important
bla bla bla
* And just more info here...
And finally more stuff here...
Messages In This Thread
Need help with lxml.html and xpath - by spacedog - Apr-29-2021, 10:58 PM
RE: Need help with lxml.html and xpath - by Larz60+ - Apr-30-2021, 12:36 PM
RE: Need help with lxml.html and xpath - by snippsat - May-01-2021, 02:00 PM
