Need help with lxml.html and xpath

spacedog · (This post was last modified: Apr-30-2021, 04:50 PM by spacedog.)

Thank you. If I know the content is a list, I can extract it pretty easily. My problem is that for some of these fields (target areas to extract, such as “business-info” or “products/services”), the content can take a variety of forms, including the sample below:

data = '''
    <dd>
      A block of text here.... bla bla bla....
      <ul>
        <li><p>Item 1.  for some reason they wrapped this in a p</p></li>
        <li><strong>And this item is important</strong>bla bla bla</li>
        <li>And just more info here...</li>
      </ul>
      And finally more stuff here...
    </dd>'''

The trouble I’m having is that I can extract the above section with this:

Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
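
For reference, here is a minimal self-contained reproduction of that lookup. The surrounding <dl>/<dt> wrapper and the variable name page are assumptions made for illustration; only the <dd> body comes from the sample above.

```python
from lxml import html

# Assumed page fragment: the <dl>/<dt> wrapper is hypothetical,
# the <dd> body is the sample from the post.
page = '''
<dl>
  <dt>Services/Products</dt>
  <dd>
    A block of text here.... bla bla bla....
    <ul>
      <li><p>Item 1.  for some reason they wrapped this in a p</p></li>
      <li><strong>And this item is important</strong>bla bla bla</li>
      <li>And just more info here...</li>
    </ul>
    And finally more stuff here...
  </dd>
</dl>'''

tree = html.fromstring(page)

# xpath() always returns a list, even when only one node matches.
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")

# The first (and here only) match is the <dd> element itself.
dd = Services_Product[0]
```

Note that the result is a list of elements, so the text still has to be pulled out of Services_Product[0] in a second step.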

But now that I have it, I’m having trouble parsing out the text so that “p” and “li” elements are separated with \r\n, giving this result:

A block of text here.... bla bla bla....

Item 1. for some reason they wrapped this in a p

And this item is important bla bla bla
And just more info here...
And finally more stuff here...

And I’m sure there are a few other element types that will show up which I did not demonstrate above.

How can I efficiently process the above block of HTML without knowing its structure? I'm looking for a generic method that can uniformly extract text from a variety of HTML structures like the one shown above.
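
One possible direction, as a minimal sketch rather than a tested solution: walk the element tree recursively and force a line break at the start and end of every block-level element, then collapse whitespace and drop empty lines. The helper name block_text and the contents of BLOCK_TAGS are assumptions for illustration, not part of lxml; the short sample string is also made up.

```python
from lxml import html

# Assumption: these tags should begin and end a line.
# Extend the set as other element types show up.
BLOCK_TAGS = {"p", "li", "ul", "ol", "dl", "dt", "dd", "div", "br", "tr", "table"}


def block_text(element, sep="\r\n"):
    """Hypothetical helper: text of `element`, one line per block element."""
    chunks = []

    def walk(node):
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(node.tag, str):
            return
        is_block = node.tag in BLOCK_TAGS
        if is_block:
            chunks.append("\n")
        if node.text:
            chunks.append(node.text)
        for child in node:
            walk(child)
            # Text that follows a child belongs to the parent's flow.
            if child.tail:
                chunks.append(child.tail)
        if is_block:
            chunks.append("\n")

    walk(element)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(chunks).split("\n"))
    return sep.join(line for line in lines if line)


# Made-up sample with the same mix of bare text, <p>, <strong> and <li>.
sample = ("<dl><dt>Services/Products</dt><dd>Intro text<ul>"
          "<li><p>Item 1</p></li><li><strong>Bold</strong> rest</li>"
          "</ul>Closing text</dd></dl>")
dd = html.fromstring(sample).xpath("//dd")[0]
result = block_text(dd)
```

Inline tags such as strong are joined to the text around them without any added space, and this sketch emits a single separator between lines, so extra blank lines around p elements would need their own rule.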

Much thanks for taking time out to look at this!
