Apr-29-2021, 10:58 PM
I was using scrapy to create the needed xpath for a lot of elements to scrape. Now that we're using multithreading, I moved off scrapy and am just using lxml.html with the text coming from response.text:
from lxml import html

data = response.text
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
With scrapy this xpath returned a blob of text, but now it returns a list like this:
Services_Product[]
That needs more work. The field is a "dd" element, and one of the problems is that it is sometimes:
<dd>some text</dd>
or
<dd><p>some text</p></dd>
or
<dd>
<ul>
<li>some text</li>
<li>some text</li>
</ul>
</dd>
or
<dd>
<ul>
<li><p>some text</p></li>
<li><p>some text</p></li>
</ul>
</dd>
and maybe other things too.
Originally I was using a cloud service and all I had to do was provide an xpath and evidently it did things behind the scenes to make this work, and it seems scrapy does the same thing.
What is the best practice for extracting text from situations like this where the target field can be a number of different things?
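One possible approach, as a minimal sketch rather than an established best practice: take one entry per "li" when the "dd" wraps a list, otherwise treat the whole "dd" as a single entry, and let lxml's text_content() flatten any nested "p" tags. The helper name extract_dd_items and the sample markup are made up for illustration; only the four shapes come from the list above.

```python
from lxml import html


def extract_dd_items(dd):
    """Return a list of cleaned text strings from a <dd> element."""
    # Relative search (leading dot) so only <li> nodes inside this <dd> match.
    items = dd.xpath(".//li")
    # No list inside: fall back to the <dd> itself as a single entry.
    nodes = items if items else [dd]
    # text_content() gathers text from all descendants; split/join collapses whitespace.
    texts = (" ".join(node.text_content().split()) for node in nodes)
    return [t for t in texts if t]


# Hypothetical sample markup covering the four shapes described above.
samples = [
    "<dl><dt>Services/Products</dt><dd>some text</dd></dl>",
    "<dl><dt>Services/Products</dt><dd><p>some text</p></dd></dl>",
    "<dl><dt>Services/Products</dt><dd><ul><li>one</li><li>two</li></ul></dd></dl>",
    "<dl><dt>Services/Products</dt><dd><ul><li><p>one</p></li><li><p>two</p></li></ul></dd></dl>",
]

for markup in samples:
    tree = html.fromstring(markup)
    dd = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")[0]
    print(extract_dd_items(dd))
```

Returning a list in every case keeps the calling code uniform; joining the items with a separator afterwards gives a single string if that is what the rest of the pipeline expects.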
Running some test code to see what my options are, I started with this:
with open('html_01.txt', 'r') as file:
    data = file.read()
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
stuff = Services_Product[0].xpath("//li")
for elem in stuff:
    print(elem[0][0].text)
which returned this:
Health
Health
doctors
Health
doctors
but using this expression in Google Chrome's xpath tool:
//dt[text()='Services/Products']/following-sibling::dd[1]
returns this (similar to scrapy):
Walk In ClinicDNA Paternity TestingGenetic TestingWellness ProgramsSenior Citizen WellnessBlood TestsFree ClinicsHealth CoachesEmployee Health ProgramsToxicology LabsFlu Shots
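A likely explanation for the difference, sketched on made-up markup since the real page is not reproduced here: in XPath, an expression starting with "//" is evaluated from the document root even when it is called on a sub-element, so "//li" matches every "li" on the page, while ".//li" stays inside the "dd". The concatenated blob that Chrome shows corresponds to what lxml's text_content() returns for the "dd". The page string below is hypothetical.

```python
from lxml import html

# Hypothetical page: an unrelated list plus the target <dd>.
page = """
<html><body>
<ul><li>Health</li></ul>
<dl>
  <dt>Services/Products</dt>
  <dd><ul><li>Walk In Clinic</li><li>Flu Shots</li></ul></dd>
</dl>
</body></html>
"""

tree = html.fromstring(page)
dd = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")[0]

# "//li" ignores the context element and searches the whole document.
absolute = [li.text_content().strip() for li in dd.xpath("//li")]
# ".//li" searches only the descendants of this <dd>.
relative = [li.text_content().strip() for li in dd.xpath(".//li")]
# All descendant text of the <dd>, concatenated.
blob = dd.text_content()

print(absolute)  # ['Health', 'Walk In Clinic', 'Flu Shots']
print(relative)  # ['Walk In Clinic', 'Flu Shots']
print(blob)      # Walk In ClinicFlu Shots
```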
and this is the URL so you can see for yourself:
https://www.yellowpages.com/nationwide/m...2050417627
What is the best way to approach this?
Thanks.