Hi guys,
I am trying to write Python 3 code (using the lxml module) to extract some specific data from a web page.
A sample of the HTML on the page is shown below.
______________________________________________________________
<tr>
<td><span class="number blue">xx</span></td>
<td>001</td>
<td>002</td>
</tr>
______________________________________________________________
My code:
from lxml import html
import requests
page = requests.get("http://some_website.aspx")
tree = html.fromstring(page.content)
var_1 = tree.xpath('//span[@class="number blue"]/text()')
print(var_1)
______________________________________________________________
I am able to extract the first value (i.e. xx) and store it in "var_1". However, I also need to extract the values inside the other <td> tags in the same row as the "number blue" span, and store them.
I would appreciate it if someone could advise on this problem. Thank you.
Sorry mate. I didn’t notice the Python code tags feature, as I was using an iPhone to post this question (so the screen is quite small).
I will take note of your advice, but right now, can anyone please advise me on how to solve this issue?
Remove text() from the XPath; you can use .text from lxml instead.
You can also read the CSS class through .attrib.
from lxml import etree
# Simulate a web page
html = '''\
<html>
<head>
<title>foo</title>
</head>
<body>
<tr>
<td><span class="number blue">xx</span></td>
<td>001</td>
<td>002</td>
</tr>
</body>
</html>'''
tree = etree.fromstring(html)
span_tag = tree.xpath("//span[@class='number blue']")
print(span_tag[0].text)
print(span_tag[0].attrib.get('class'))
Output:
xx
number blue
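To make the difference between text() in the XPath and .text on the element concrete, here is a minimal sketch. It assumes lxml.html rather than etree, since real pages are rarely well-formed XML, and the <table> wrapper around the sample row is my own assumption, not part of the original page.

```python
from lxml import html

# Sample markup modeled on the snippet in the question.
# The <table> wrapper is an assumption added for illustration.
SAMPLE = """
<table>
  <tr>
    <td><span class="number blue">xx</span></td>
    <td>001</td>
    <td>002</td>
  </tr>
</table>
"""

tree = html.fromstring(SAMPLE)

# With text() in the XPath, xpath() returns a list of strings.
texts = tree.xpath('//span[@class="number blue"]/text()')

# Without text(), xpath() returns Element objects;
# the text and attributes are then read from each element.
spans = tree.xpath('//span[@class="number blue"]')
first = spans[0]

print(texts)               # list of strings
print(first.text)          # text of the first matching span
print(first.get("class"))  # same value as first.attrib.get("class")
```

Keeping the element around is usually more flexible, because you can still navigate to its parent or siblings afterwards.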
@nilmao, the link you posted is about not using regex to parse XML/HTML.
lxml is an XML and HTML parser.
lxml Wrote: lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.
(Aug-23-2018, 08:01 PM)snippsat Wrote: @nilmao, the link you posted is about not using regex to parse XML/HTML.
lxml is an XML and HTML parser.
I’m a little confused here. Can I use the lxml module to parse HTML and extract specific data from inside the HTML td tags?
Could you please provide a code snippet showing how to do that? It would help me understand better, as I’m a beginner who has just started learning Python.
Thank you.
(Aug-23-2018, 11:47 PM)Takeshio Wrote: I’m a little confused here. Can I use the lxml module to parse HTML and extract specific data from inside the HTML td tags?
Could you please provide a code snippet showing how to do that?
Yes, you can.
Web-Scraping part-1 shows both BeautifulSoup and lxml being used to parse HTML.
There is more in Web-scraping part-2, such as using Selenium.
(Aug-23-2018, 07:01 PM)snippsat Wrote: Remove text() from the XPath; you can use .text from lxml instead.
You can also read the CSS class through .attrib.
Thanks for your reply. However, I want to get the two values (i.e. 001 and 002) inside the <td> tags. They are in the same row as the span with the class "number blue".
Any idea how to get these values neatly?
(Aug-23-2018, 11:56 PM)snippsat Wrote: Yes, you can.
Web-Scraping part-1 shows both BeautifulSoup and lxml being used to parse HTML.
There is more in Web-scraping part-2, such as using Selenium.
Thanks snippsat! I have managed to parse the HTML data using BeautifulSoup and lxml, and stored it in a list. Now I need to find a way to process and structure the data in the list. Hope it’s not too hard. :-)
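One possible way to structure the data, as a rough sketch only: build one record per table row instead of a flat list. The second row, the <table> wrapper, the helper name rows_to_records and the "label"/"values" keys are all assumptions made up for illustration; the real page may be laid out differently.

```python
from lxml import html

# Hypothetical two-row sample; only the first row comes from the question.
SAMPLE = """
<table>
  <tr>
    <td><span class="number blue">xx</span></td>
    <td>001</td>
    <td>002</td>
  </tr>
  <tr>
    <td><span class="number blue">yy</span></td>
    <td>003</td>
    <td>004</td>
  </tr>
</table>
"""


def rows_to_records(markup):
    """Return one dict per <tr> that contains a 'number blue' span."""
    tree = html.fromstring(markup)
    records = []
    for row in tree.xpath('//tr[td/span[@class="number blue"]]'):
        # string() yields the text of the first matching span in this row.
        label = row.xpath('string(td/span[@class="number blue"])').strip()
        # Cells without a span child hold the plain values.
        values = [td.text_content().strip() for td in row.xpath('td[not(span)]')]
        records.append({"label": label, "values": values})
    return records


records = rows_to_records(SAMPLE)
for record in records:
    print(record["label"], record["values"])
```

Grouping by row keeps each label next to its own values, which tends to be easier to process later than slicing a flat list by position.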
Using the xpath() method of the element returned by html.fromstring(), you could query all td elements without a span child like this:
from lxml import html
html_text = """<html>
<head>
<title>foo</title>
</head>
<body>
<tr>
<td><span class="number blue">xx</span></td>
<td>001</td>
<td>002</td>
</tr>
</body>
</html>"""
et = html.fromstring(html_text)
spans = et.xpath('//tr/td/span[@class="number blue"]')
print(spans[0].text)
# Print the text of every td in a row that has no span child
for e in et.xpath('//tr/td[not(span)]'):
    print(e.text)
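Note that the query above matches td cells in every row of the page. If only the cells in the same row as the "number blue" span are wanted, the following-sibling axis may be a better fit. This is a sketch under assumptions: the second row, the <table> wrapper and the helper name values_after_span are made up for illustration.

```python
from lxml import html

# The first row is from the question; the second row is a hypothetical
# unrelated row, added to show that it is not picked up.
SAMPLE = """
<table>
  <tr>
    <td><span class="number blue">xx</span></td>
    <td>001</td>
    <td>002</td>
  </tr>
  <tr>
    <td>unrelated</td>
    <td>999</td>
  </tr>
</table>
"""


def values_after_span(root, css_class="number blue"):
    """Return the text of each <td> that follows the <td> holding the span."""
    # $cls is an XPath variable, supplied as a keyword argument to xpath().
    query = '//td[span[@class=$cls]]/following-sibling::td/text()'
    return [text.strip() for text in root.xpath(query, cls=css_class)]


tree = html.fromstring(SAMPLE)
values = values_after_span(tree)
print(values)
```

If several rows contain the span, this returns the values from all of them in document order, so group by row first when the rows need to stay separate.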