Python Forum
Need help with lxml.html and xpath
Thank you. If I know the content is a list, I can extract it pretty easily. My problem is that for some of these fields (target areas to extract, such as “business-info” or “products/services”) the content can take a variety of forms, including the sample below:

data = '''
    <dd>
      A block of text here.... bla bla bla....
      <ul>
        <li><p>Item 1.  for some reason they wrapped this in a p</p></li>
        <li><strong>And this item is important</strong>bla bla bla</li>
        <li>And just more info here...</li>
      </ul>
      And finally more stuff here...
    </dd>'''
The trouble I’m having is that I can extract the above section with this:

Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
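For context, here is a minimal self-contained sketch of that lookup. The surrounding `<dl>`/`<dt>` markup and the "Business Info" entry are my assumptions for illustration, since I only pasted the `<dd>` part above.

```python
from lxml import html

# Hypothetical page fragment: the <dl>/<dt> wrapper and the first entry
# are assumed here, only the <dd> content comes from the real page.
page = '''
<dl>
  <dt>Business Info</dt>
  <dd>Some other field</dd>
  <dt>Services/Products</dt>
  <dd>A block of text here <ul><li>Item 1</li></ul></dd>
</dl>'''

tree = html.fromstring(page)

# xpath() always returns a list, so guard against the label being absent.
matches = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
services_dd = matches[0] if matches else None
```

Note that `xpath()` hands back a list of elements, so the `<dd>` element itself is `matches[0]`.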
But now that I have it, I’m having trouble parsing out the text so that each “p” and “li” element is separated by \r\n, giving this result:

A block of text here.... bla bla bla....

Item 1. for some reason they wrapped this in a p

And this item is important bla bla bla
And just more info here...
And finally more stuff here...



And I’m sure there are a few other element types that will show up which I did not demonstrate above.

How can I efficiently process the above block of HTML without knowing its structure? I'm looking for a generic method that can uniformly extract text from a variety of HTML structures like the one shown above.
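To make the question more concrete, here is a rough sketch of the kind of generic walker I am picturing, not a finished solution. The `block_text` helper and the `BLOCK_TAGS` set are names I made up, and the choice of which tags count as line-breaking is an assumption. It walks the element tree using lxml's `.text` and `.tail` attributes and starts a new line around every block-level tag.

```python
from lxml import html

# Assumption: tags that should start and end a line. Extend as needed.
BLOCK_TAGS = {"p", "li", "ul", "ol", "dl", "dt", "dd", "div", "br", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6"}


def block_text(element, newline="\r\n"):
    """Return the text under element, one line per block-level chunk."""
    parts = []

    def add_text(text):
        # Collapse runs of whitespace (including source newlines) to one space.
        collapsed = " ".join((text or "").split())
        if collapsed:
            parts.append(collapsed)

    def walk(node):
        if not isinstance(node.tag, str):
            return  # skip comments and processing instructions
        is_block = node.tag in BLOCK_TAGS
        if is_block:
            parts.append("\n")      # line break before a block element
        add_text(node.text)
        for child in node:
            walk(child)
            add_text(child.tail)    # text that follows the child's closing tag
        if is_block:
            parts.append("\n")      # line break after a block element

    walk(element)
    # Pieces are joined with a space, so inline tags such as <strong>
    # do not glue their text onto what follows. Empty lines are dropped.
    lines = (line.strip() for line in " ".join(parts).split("\n"))
    return newline.join(line for line in lines if line)


# The sample from this post, wrapped in an assumed <dl>/<dt> pair.
sample = '''<dl><dt>Services/Products</dt>
<dd>
  A block of text here.... bla bla bla....
  <ul>
    <li><p>Item 1.  for some reason they wrapped this in a p</p></li>
    <li><strong>And this item is important</strong>bla bla bla</li>
    <li>And just more info here...</li>
  </ul>
  And finally more stuff here...
</dd></dl>'''

dd = html.fromstring(sample).find(".//dd")
result = block_text(dd)
print(result)
```

Two trade-offs in this sketch: joining pieces with a space would also split a word that has an inline tag in the middle of it, and blank lines are collapsed, so the `<p>` does not get the extra spacing shown in my desired output.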

Much thanks for taking time out to look at this!


Messages In This Thread
Need help with lxml.html and xpath - by spacedog - Apr-29-2021, 10:58 PM
RE: Need help with lxml.html and xpath - by Larz60+ - Apr-30-2021, 12:36 PM
RE: Need help with lxml.html and xpath - by spacedog - Apr-30-2021, 04:50 PM
