Thank you. If I know the content is a list, I can extract that pretty easily. My problem is that for some of these fields (target areas to extract, such as “business-info” or “products/services”) the content can take a variety of forms, including the sample below:
    data = '''<dd>
    A block of text here.... bla bla bla....
    <ul>
    <li><p>Item 1. for some reason they wrapped this in a p</p></li>
    <li><strong>And this item is important</strong>bla bla bla</li>
    <li>And just more info here...</li>
    </ul>
    And finally more stuff here...
    </dd>'''

The trouble I’m having is that I can extract the above section with this:
    Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")

But now that I have it, I’m having trouble parsing out the text so that the contents of each “p” and “li” are separated with \r\n, giving this result:
A block of text here.... bla bla bla....
Item 1. for some reason they wrapped this in a p
And this item is important bla bla bla
And just more info here...
And finally more stuff here...
And I’m sure there are a few other element types that will show up which I did not demonstrate above.
How can I efficiently process the above block of HTML without knowing its structure? I'm looking for a generic method that can uniformly extract text from a variety of HTML structures like the one shown above.
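To make the goal concrete, here is a rough sketch of the direction I imagine, not a finished solution. It uses the standard library's html.parser instead of lxml, so it works on an HTML string rather than on the element returned by the XPath. The names BLOCK_TAGS, BlockTextExtractor, extract_lines and to_text are made up for illustration, and the set of tags treated as line breaks is an assumption that would need extending for other element types.

```python
from html.parser import HTMLParser

# Assumption: these tags mark line boundaries. Extend the set as new
# element types show up in the scraped pages.
BLOCK_TAGS = {"p", "li", "ul", "ol", "dl", "dt", "dd", "div", "br", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockTextExtractor(HTMLParser):
    """Collect text, starting a new line at every block-level tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = [[]]  # one list of text chunks per output line

    def _break_line(self, tag):
        # Open a new line only if the current one already holds something.
        if tag in BLOCK_TAGS and self.lines[-1]:
            self.lines.append([])

    def handle_starttag(self, tag, attrs):
        self._break_line(tag)

    def handle_endtag(self, tag):
        self._break_line(tag)

    def handle_data(self, data):
        self.lines[-1].append(data)


def extract_lines(fragment):
    """Return the non-empty text lines found in an HTML fragment."""
    parser = BlockTextExtractor()
    parser.feed(fragment)
    parser.close()
    # Chunks split by inline tags (e.g. <strong>) are joined with a space,
    # which is an assumption that suits the sample but would split a word
    # that is only partly wrapped in an inline tag. Whitespace runs are
    # collapsed and empty lines dropped.
    joined = (" ".join(" ".join(chunks).split()) for chunks in parser.lines)
    return [line for line in joined if line]


def to_text(fragment, separator="\r\n"):
    """Join the extracted lines with the requested separator."""
    return separator.join(extract_lines(fragment))
```

Calling to_text(data) on the sample string should give the five lines listed above, separated by \r\n. To feed it the dd element from the XPath result, I believe the element (the first item of the returned list) could be serialized first with lxml.html.tostring(element, encoding="unicode"), though I have not checked that end to end.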
Much thanks for taking time out to look at this!