Need help with lxml.html and xpath

spacedog · Apr-29-2021, 10:58 PM

I was using scrapy to build the XPath expressions for a lot of elements I need to scrape. Now that we're using multithreading, I've moved off scrapy and am just using lxml.html on the text coming from response.text:

data = response.text
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")

With scrapy this XPath returned a blob of text, but now it returns a list like this:

Services_Product[]

Which needs more work. This field is a "dd" element, and one of the problems is that sometimes it is:
<dd>some text</dd>
or
<dd>some text</dd>
or
<dd>
<ul>
<li>some text</li>
<li>some text</li>
</ul>
</dd>
or
<dd>
<ul>
<li>some text</li>
<li>some text</li>
</ul>
</dd>
and maybe other things too.

Originally I was using a cloud service where all I had to do was provide an XPath; evidently it did things behind the scenes to make this work, and it seems scrapy does the same.

What is the best practice for extracting text from situations like this where the target field can be a number of different things?

To see what my options are, I ran some test code, starting with this:

file = open('html_01.txt', 'r')
data = file.read()
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
stuff = Services_Product[0].xpath("//li")
for elem in stuff:
    print(elem[0][0].text)

which returned this:
Health
Health
doctors
Health
doctors

but using this expression in google chrome's xpath tool:

//dt[text()='Services/Products']/following-sibling::dd[1]

returns this (similar to scrapy):
Walk In ClinicDNA Paternity TestingGenetic TestingWellness ProgramsSenior Citizen WellnessBlood TestsFree ClinicsHealth CoachesEmployee Health ProgramsToxicology LabsFlu Shots

and this is the URL so you can see for yourself:
https://www.yellowpages.com/nationwide/m...2050417627

What is the best way to approach and do this?
Thanks.

**Larz60+** · Apr-30-2021, 12:36 PM

There are other methods, you should spend a small amount of time here:
web scraping part 1
web scraping part 2

spacedog · Apr-30-2021, 02:49 PM

Thank you @Larz60, but those use BeautifulSoup, which is too slow. I have to build something that is optimized as much as possible.

Any other recommendations for the above question?

***snippsat*** · (This post was last modified: Apr-30-2021, 03:29 PM by snippsat.)

(Apr-30-2021, 02:49 PM)spacedog Wrote: but those use BeautifulSoup, which is too slow.

You can use lxml as the parser in BeautifulSoup; I always do that.

soup = BeautifulSoup(response.content, 'lxml')

For your task, if you find the XPath of the ul under products, you can iterate over all the li elements, which are children of that tag.

from lxml import html
import requests

url = 'https://www.yellowpages.com/nationwide/mip/toxglobal-diagnostics-llc-556885209?lid=1002050417627'
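response = requests.get(url)
tree = html.fromstring(response.content)
# Positional XPath to the <ul> inside the third <dd> of the business-info block
prod = tree.xpath('//*[@id="business-info"]/dl/dd[3]/ul')
# Iterate over the <li> children directly (getchildren() is deprecated)
for tag in prod[0]:
    print(tag.text)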

Output:
Walk In Clinic
DNA Paternity Testing
Genetic Testing
Wellness Programs
Senior Citizen Wellness
Blood Tests
Free Clinics
Health Coaches
Employee Health Programs
Toxicology Labs
Flu Shots
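
A note on the earlier test code, in case it helps: an XPath that starts with // searches from the document root even when it is called on an element, which is likely why Services_Product[0].xpath("//li") returned items from other parts of the page. A leading dot (.//li) keeps the search inside the element. The sketch below keeps the dt-based XPath from the question and falls back to text_content() when the dd holds plain text. The SAMPLE markup and the field_items helper are made up for illustration; they are not taken from the real page.

```python
from lxml import html

# Hypothetical sample markup standing in for the real page.
SAMPLE = """
<dl>
  <dt>General Info</dt>
  <dd>Plain text value</dd>
  <dt>Services/Products</dt>
  <dd><ul><li>Walk In Clinic</li><li>Flu Shots</li></ul></dd>
  <dt>Other</dt>
  <dd><ul><li>Unrelated item</li></ul></dd>
</dl>
"""


def field_items(tree, label):
    """Return the text items of the first <dd> after the <dt> named `label`."""
    # $label is an XPath variable, filled in from the keyword argument.
    found = tree.xpath(
        "//dt[text()=$label]/following-sibling::dd[1]", label=label
    )
    if not found:
        return []
    dd = found[0]
    # The leading dot restricts the search to this <dd>; a bare "//li"
    # would match every <li> in the whole document.
    items = dd.xpath(".//li")
    if items:
        return [li.text_content().strip() for li in items]
    # No list inside: fall back to all the text in the <dd>.
    return [dd.text_content().strip()]


tree = html.fromstring(SAMPLE)
print(field_items(tree, "Services/Products"))
print(field_items(tree, "General Info"))
```

text_content() joins all the text under an element with no separator, which matches the run-together string that the Chrome XPath tool showed for the same expression.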

spacedog · (This post was last modified: Apr-30-2021, 04:50 PM by spacedog.)

Thank you. If I know the content is a list, I can extract that pretty easily. My problem is for some of these fields (target areas to extract such as “business-info” or “products/services”) the content can be a variety including the sample below:

data = '''
    <dd>
      A block of text here.... bla bla bla....
      <ul>
        <li><p>Item 1.  for some reason they wraped this in a p</p></li>
        <li><strong>And this item is important</strong>bla bla bla</li>
        <li>And just more info here...</li>
      </ul>
      And finally more stuff here...
    </dd>'''

The trouble I’m having is that I can extract the above section with this:

Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")

But now that I have it I’m having trouble parsing out the text where “p” and “li” should be separated with \r\n giving this result:

A block of text here.... bla bla bla....

Item 1. for some reason they wrapped this in a p

And this item is important bla bla bla
And just more info here...
And finally more stuff here...

And I’m sure there are a few other element types that will show up which I did not demonstrate above.

How can I efficiently process the above block of html without knowing its structure? I'm looking for a generic method that can uniformly extract text from a variety of html structures like the one shown above.

Much thanks for taking time out to look at this!

***snippsat*** · May-01-2021, 02:00 PM

(Apr-30-2021, 04:50 PM)spacedog Wrote: How can I efficiently process the above block of html without knowing its structure? I'm looking for a generic method that can uniformly extract text from a variety of html structures like the one shown above.

When the section uses a mix of different tags and you just want the text, it's easier to run html2text on the section you found.

import html2text

data = '''\
<dd>
  A block of text here.... bla bla bla....
  <ul>
    <li><p>Item 1.  for some reason they wraped this in a p</p></li>
    <li><strong>And this item is important</strong>bla bla bla</li>
    <li>And just more info here...</li>
  </ul>
  And finally more stuff here...
</dd>'''

text = html2text.HTML2Text()
text.mark_code = True
text.ignore_emphasis = True
text.single_line_break = True
text.ignore_links = True
text = text.handle(data)
print(text.strip())

Output:
A block of text here.... bla bla bla....
  * Item 1. for some reason they wraped this in a p
  * And this item is importantbla bla bla
  * And just more info here...

And finally more stuff here...

There are many options. If you set ignore_emphasis = False, a strong tag is rendered as **.
So if, for example, I want a new line wherever there is a strong tag:

print(text.strip().replace('**', '\n'))

Output:
A block of text here.... bla bla bla....
  * Item 1. for some reason they wraped this in a p
  * 
And this item is important
 bla bla bla
  * And just more info here...

And finally more stuff here...
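
If staying with lxml alone matters for speed, a similar result can probably be had by walking the element tree and breaking lines at block-level tags. This is only a rough sketch, not a replacement for html2text: the BLOCK_TAGS set and the block_text helper are assumptions made up for illustration, and the set would need extending for whatever other tags turn up on the real pages.

```python
from lxml import html

# Assumption: tags that should start and end on their own line.
BLOCK_TAGS = {"p", "li", "ul", "ol", "div", "br", "dd", "dt", "tr"}

DATA = """
<dl>
<dt>Services/Products</dt>
<dd>
  A block of text here.... bla bla bla....
  <ul>
    <li><p>Item 1.  for some reason they wraped this in a p</p></li>
    <li><strong>And this item is important</strong>bla bla bla</li>
    <li>And just more info here...</li>
  </ul>
  And finally more stuff here...
</dd>
</dl>
"""


def block_text(element):
    """Collect the text under `element`, one line per block-level tag."""
    parts = []

    def walk(node, is_root=False):
        is_block = node.tag in BLOCK_TAGS
        if is_block:
            parts.append("\n")
        # Comments have a non-string tag; skip their body, keep their tail.
        if isinstance(node.tag, str):
            if node.text:
                parts.append(node.text)
            for child in node:
                walk(child)
        if is_block:
            parts.append("\n")
        # The tail is the text that follows the closing tag; the root's
        # tail lies outside the element we were asked about.
        if node.tail and not is_root:
            parts.append(node.tail)

    walk(element, is_root=True)
    # Collapse whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parts).split("\n"))
    return "\n".join(line for line in lines if line)


tree = html.fromstring(DATA)
dd = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")[0]
text = block_text(dd)
print(text)
```

Inline tags such as strong add no separator here, so "important" and "bla bla bla" still run together as they do in the html2text output. Empty lines are dropped for simplicity; joining with "\r\n" instead of "\n" would give the line endings asked for earlier in the thread.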
