Python Forum
Need help with lxml.html and xpath
#1
I was using Scrapy to build the XPath expressions for a lot of elements I need to scrape. Now that we're using multithreading, I have moved off Scrapy and am just using lxml.html on the text from response.text:

from lxml import html

data = response.text  # response is the HTTP response object for the page
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
With Scrapy this XPath returned a blob of text, but now it returns a list like this:

Services_Product[]

This needs more work. The field is a "dd" element, and one of the problems is that sometimes it is:
<dd>some text</dd>
or
<dd><p>some text</p></dd>
or
<dd>
<ul>
<li>some text</li>
<li>some text</li>
</ul>
</dd>
or
<dd>
<ul>
<li><p>some text</p></li>
<li><p>some text</p></li>
</ul>
</dd>
and maybe other things too.

Originally I was using a cloud service where all I had to do was provide an XPath; evidently it did things behind the scenes to make this work, and it seems Scrapy does the same thing.

What is the best practice for extracting text from situations like this where the target field can be a number of different things?
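One possible approach, shown here only as a minimal sketch: select the "dd" element as before and then walk all of its descendant text nodes with lxml's itertext(), so the nesting (bare text, p, ul/li, li/p) no longer matters. The SAMPLE markup and the helper name dd_texts are made up for illustration and are not taken from the real page.

```python
from lxml import html

# Hypothetical markup that mixes the <dd> shapes described above.
SAMPLE = """
<dl>
  <dt>Services/Products</dt>
  <dd><ul><li><p>Walk In Clinic</p></li><li>Blood Tests</li></ul></dd>
  <dt>Other</dt>
  <dd><p>some text</p></dd>
</dl>
"""


def dd_texts(tree, label):
    """Return the non-empty text pieces of the first <dd> after the <dt> named `label`."""
    # lxml lets XPath variables ($label) be passed as keyword arguments.
    matches = tree.xpath("//dt[text()=$label]/following-sibling::dd[1]", label=label)
    if not matches:
        return []  # field not present on this page
    # itertext() yields every text node below the element, at any depth.
    return [text.strip() for text in matches[0].itertext() if text.strip()]


tree = html.fromstring(SAMPLE)
services = dd_texts(tree, "Services/Products")
other = dd_texts(tree, "Other")
print(services)
print(other)
```

Returning a list keeps the individual items separate; " ".join(services) gives a single string if a blob of text is wanted instead.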

Running some test code to see what my options are, I started with this:
from lxml import html

# read the saved page from disk
with open('html_01.txt', 'r') as file:
    data = file.read()
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
stuff = Services_Product[0].xpath("//li")
for elem in stuff:
    print(elem[0][0].text)
which returned this:
Health
Health
doctors
Health
doctors

But using this expression in Google Chrome's XPath tool:
//dt[text()='Services/Products']/following-sibling::dd[1]
returns this (similar to scrapy):
Walk In ClinicDNA Paternity TestingGenetic TestingWellness ProgramsSenior Citizen WellnessBlood TestsFree ClinicsHealth CoachesEmployee Health ProgramsToxicology LabsFlu Shots
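A likely reason for the difference, illustrated with made-up markup rather than the real page: in XPath an expression starting with "//" is evaluated from the document root even when it is called on a sub-element, so "//li" can pick up list items outside the "dd". A leading dot (".//li") keeps the search inside the element. The sketch below also uses lxml's text_content(), which concatenates all text under an element and gives a run-together string similar to what Chrome shows. The SAMPLE content and the "nav" list are assumptions for the demonstration.

```python
from lxml import html

# Hypothetical page: a navigation list plus the <dd> we actually want.
SAMPLE = """
<div>
  <ul id="nav"><li>Health</li><li>doctors</li></ul>
  <dl>
    <dt>Services/Products</dt>
    <dd><ul><li>Walk In Clinic</li><li>Flu Shots</li></ul></dd>
  </dl>
</div>
"""

tree = html.fromstring(SAMPLE)
dd = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")[0]

# "//li" starts at the document root, so the nav items are matched too.
absolute = [li.text_content() for li in dd.xpath("//li")]

# ".//li" is relative to dd, so only its own list items are matched.
relative = [li.text_content() for li in dd.xpath(".//li")]

# All text under dd joined together, with no separator between items.
joined = dd.text_content()

print(absolute)
print(relative)
print(joined)
```

On a real page text_content() also keeps any whitespace between the tags, so the result may need stripping or normalising.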

This is the URL so you can see for yourself:
https://www.yellowpages.com/nationwide/m...2050417627

What is the best way to approach this?
Thanks.


Messages In This Thread
Need help with lxml.html and xpath - by spacedog - Apr-29-2021, 10:58 PM
RE: Need help with lxml.html and xpath - by Larz60+ - Apr-30-2021, 12:36 PM
