Python Forum
Need help with lxml.html and xpath
#1
I was using scrapy to create the XPath expressions needed for a lot of elements I have to scrape. Now that we're using multithreading, I have moved off scrapy and am just using lxml.html on the text coming from response.text:

data = response.text
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
In scrapy this XPath returned a blob of text, but now it returns a list like this:

Services_Product[]

Which needs more work. This field is a "dd" element, and one of the problems is that it is sometimes:
<dd>some text</dd>
or
<dd><p>some text</p></dd>
or
<dd>
<ul>
<li>some text</li>
<li>some text</li>
</ul>
</dd>
or
<dd>
<ul>
<li><p>some text</p></li>
<li><p>some text</p></li>
</ul>
</dd>
and maybe other things too.

Originally I was using a cloud service where all I had to do was provide an XPath; evidently it did things behind the scenes to make this work, and it seems scrapy does the same thing.

What is the best practice for extracting text from situations like this where the target field can be a number of different things?
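One possible starting point, offered as a sketch rather than a definitive answer: lxml's `text_content()` (or the XPath `string()` function) returns all descendant text of an element regardless of how it is nested, which is close to what scrapy and browser tools display. The four markup snippets below are hypothetical and only mirror the "dd" shapes listed above; `dd_text` is an illustrative helper name.

```python
from lxml import html

# Hypothetical snippets mirroring the four <dd> shapes listed above.
variants = [
    "<dl><dt>Services/Products</dt><dd>some text</dd></dl>",
    "<dl><dt>Services/Products</dt><dd><p>some text</p></dd></dl>",
    "<dl><dt>Services/Products</dt><dd><ul><li>one</li><li>two</li></ul></dd></dl>",
    "<dl><dt>Services/Products</dt><dd><ul><li><p>one</p></li><li><p>two</p></li></ul></dd></dl>",
]

XPATH = "//dt[text()='Services/Products']/following-sibling::dd[1]"


def dd_text(markup):
    """Return all text inside the first matching <dd>, or None if absent."""
    tree = html.fromstring(markup)
    matches = tree.xpath(XPATH)  # xpath() always returns a list of elements
    if not matches:
        return None
    # text_content() concatenates every descendant text node, with no separator
    return matches[0].text_content().strip()


results = [dd_text(v) for v in variants]
print(results)

# The same idea expressed purely in XPath: string() of the first matching node
as_string = html.fromstring(variants[2]).xpath(f"string({XPATH})")
print(as_string)
```

Note that the list items come back glued together ("onetwo"), because nothing is inserted between text nodes. That is fine for a single blob, but it does not preserve item boundaries.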

Running some test code to see what my options are, I started with this:
file = open('html_01.txt', 'r')
data = file.read()
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
stuff = Services_Product[0].xpath("//li")
for elem in stuff:
    print(elem[0][0].text)
which returned this:
Health
Health
doctors
Health
doctors

But using this expression in Google Chrome's XPath tool:
//dt[text()='Services/Products']/following-sibling::dd[1]
returns this (similar to scrapy):
Walk In ClinicDNA Paternity TestingGenetic TestingWellness ProgramsSenior Citizen WellnessBlood TestsFree ClinicsHealth CoachesEmployee Health ProgramsToxicology LabsFlu Shots

And this is the URL so you can see for yourself:
https://www.yellowpages.com/nationwide/m...2050417627

What is the best way to approach and do this?
Thanks.
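A note on the test code in this post, as a hedged guess at why it printed unrelated values: an XPath that starts with `//` is evaluated from the document root even when it is called on a sub-element, so `Services_Product[0].xpath("//li")` matches every "li" on the page. Prefixing the expression with a dot makes it relative to the element. The page below is a hypothetical stand-in, not the real yellowpages markup.

```python
from lxml import html

# Hypothetical page: the first <ul> has nothing to do with the target <dd>.
page = html.fromstring("""
<html><body>
  <ul><li>Health</li><li>doctors</li></ul>
  <dl>
    <dt>Services/Products</dt>
    <dd><ul><li>Walk In Clinic</li><li>Flu Shots</li></ul></dd>
  </dl>
</body></html>
""")

dd = page.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")[0]

# "//li" searches from the document root, even though it is called on dd
everywhere = [li.text for li in dd.xpath("//li")]

# ".//li" searches only the descendants of dd
inside_dd = [li.text for li in dd.xpath(".//li")]

print(everywhere)
print(inside_dd)
```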
#2
There are other methods; you should spend a small amount of time here:
web scraping part 1
web scraping part 2
#3
Thank you @Larz60, but those use BeautifulSoup, which is too slow. I have to build something that is optimized as much as possible.

Any other recommendations for the above question?
#4
(Apr-30-2021, 02:49 PM)spacedog Wrote: but those use BeautifulSoup which is too slow.
You can use lxml as the parser in BeautifulSoup; I always do that.
soup = BeautifulSoup(response.content, 'lxml')
For your task, if you find the XPath of the ul under products, you can iterate over all the li tags, which are children of that tag.
from lxml import html
import requests

url = 'https://www.yellowpages.com/nationwide/mip/toxglobal-diagnostics-llc-556885209?lid=1002050417627'
response = requests.get(url)
tree = html.fromstring(response.content)
# Locate the <ul> inside the third <dd> of the business-info block
prod = tree.xpath('//*[@id="business-info"]/dl/dd[3]/ul')
# Iterating an element yields its direct children (here the <li> tags)
for tag in prod[0]:
    print(tag.text)
Output:
Walk In Clinic
DNA Paternity Testing
Genetic Testing
Wellness Programs
Senior Citizen Wellness
Blood Tests
Free Clinics
Health Coaches
Employee Health Programs
Toxicology Labs
Flu Shots
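One caveat with reading `tag.text`, shown as a small sketch on hypothetical markup: `.text` only holds the text that comes before an element's first child, so an "li" that wraps its content in a "p" gives `None`. Calling `text_content()` on each item covers both shapes. The XPath here reuses the label-based expression from the first post instead of a positional `dd[3]`; that is an assumption about what is more robust, not something tested against the live page.

```python
from lxml import html

# Hypothetical markup: one plain <li> and one <li> that wraps its text in <p>.
snippet = html.fromstring("""
<dl>
  <dt>Services/Products</dt>
  <dd>
    <ul>
      <li>Blood Tests</li>
      <li><p>Flu Shots</p></li>
    </ul>
  </dd>
</dl>
""")

items = snippet.xpath(
    "//dt[text()='Services/Products']/following-sibling::dd[1]//li"
)

# .text is only the text before the first child element
via_text = [li.text for li in items]

# text_content() gathers the text of the element and all its descendants
via_text_content = [li.text_content().strip() for li in items]

print(via_text)
print(via_text_content)
```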
#5
Thank you. If I know the content is a list, I can extract that pretty easily. My problem is that for some of these fields (target areas to extract, such as “business-info” or “products/services”), the content can take a variety of forms, including the sample below:

data = '''
    <dd>
      A block of text here.... bla bla bla....
      <ul>
        <li><p>Item 1.  for some reason they wraped this in a p</p></li>
        <li><strong>And this item is important</strong>bla bla bla</li>
        <li>And just more info here...</li>
      </ul>
      And finally more stuff here...
    </dd>'''
The trouble I’m having is that I can extract the above section with this:

Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
But now that I have it I’m having trouble parsing out the text where “p” and “li” should be separated with \r\n giving this result:

A block of text here.... bla bla bla....

Item 1. for some reason they wrapped this in a p

And this item is important bla bla bla
And just more info here...
And finally more stuff here...



And I’m sure there are a few other element types that will show up which I did not demonstrate above.

How can I efficiently process the above block of html without knowing its structure? I'm looking for a generic method that can uniformly extract a variety of html structures as shown above.

Much thanks for taking time out to look at this!
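One way this could be approached with lxml alone, as a rough sketch and not a tested general solution: walk the element tree, emit a line break around block-level tags, and collapse whitespace inside each text piece. The `BLOCK_TAGS` set, the `block_text` helper, and the `sep` parameter are all hypothetical names chosen for illustration, and the set of tags would likely need extending for real pages. The sample wraps the "dd" in a "dl" so it can be found with the same XPath.

```python
import re
from lxml import html

# Assumption: these tags should start and end a line. Extend as needed.
BLOCK_TAGS = {"p", "li", "ul", "ol", "dl", "dd", "dt", "div", "br", "tr"}


def block_text(element, sep="\n"):
    """Return the text under element, one line per block-level chunk."""
    parts = []

    def add_text(piece):
        if piece:
            # Collapse source indentation and newlines to single spaces
            parts.append(re.sub(r"\s+", " ", piece))

    def walk(node):
        if not isinstance(node.tag, str):
            return  # skip comments and processing instructions
        is_block = node.tag in BLOCK_TAGS
        if is_block:
            parts.append("\n")
        add_text(node.text)
        for child in node:
            walk(child)
            add_text(child.tail)  # text that follows the child element
        if is_block:
            parts.append("\n")

    walk(element)
    lines = (line.strip() for line in "".join(parts).split("\n"))
    return sep.join(line for line in lines if line)


data = """
<dl><dt>Services/Products</dt>
<dd>
  A block of text here.... bla bla bla....
  <ul>
    <li><p>Item 1.  for some reason they wrapped this in a p</p></li>
    <li><strong>And this item is important</strong>bla bla bla</li>
    <li>And just more info here...</li>
  </ul>
  And finally more stuff here...
</dd></dl>
"""

tree = html.fromstring(data)
dd = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")[0]
result = block_text(dd)
print(result)
```

Passing `sep="\r\n"` gives the separator asked for above. Inline tags such as "strong" add no separator, so "importantbla" stays joined exactly as it is in the source markup; whether to insert a space there is a separate decision.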
#6
(Apr-30-2021, 04:50 PM)spacedog Wrote: How can I efficiently process the above block of html without knowing its structure? I'm looking for a generic method that can uniformly extract a variety of html structures as shown above.
When different tags are used and you just want the text, it's easier to run html2text on the section you found.
import html2text

data = '''\
<dd>
  A block of text here.... bla bla bla....
  <ul>
    <li><p>Item 1.  for some reason they wraped this in a p</p></li>
    <li><strong>And this item is important</strong>bla bla bla</li>
    <li>And just more info here...</li>
  </ul>
  And finally more stuff here...
</dd>'''

text = html2text.HTML2Text()
text.mark_code = True
text.ignore_emphasis = True
text.single_line_break = True
text.ignore_links = True
text = text.handle(data)
print(text.strip())
Output:
A block of text here.... bla bla bla.... * Item 1. for some reason they wraped this in a p * And this item is importantbla bla bla * And just more info here... And finally more stuff here...
There are many options. If you set ignore_emphasis = False, the strong tag is rendered as **.
So if, for example, I want a new line wherever there is a strong tag:
print(text.strip().replace('**', '\n'))
Output:
A block of text here.... bla bla bla.... * Item 1. for some reason they wraped this in a p * And this item is important bla bla bla * And just more info here... And finally more stuff here...