[Python 3] - Extract specific data from a web page using lxml module

[Python 3] - Extract specific data from a web page using lxml module - Takeshio - Aug-23-2018

Hi guys,

I am trying to write Python 3 code (using the lxml module) to extract some specific data from a web page.

A sample of the HTML in the page is shown below.
______________________________________________________________

<tr>
<td><span class="number blue">xx</span></td>
<td>001</td>
<td>002</td>
</tr>

______________________________________________________________

My code:

from lxml import html
import requests

page = requests.get("http://some_website.aspx")
tree = html.fromstring(page.content)

var_1 = tree.xpath('//span[@class="number blue"]/text()')
print(var_1)

______________________________________________________________

I am able to extract the first value (i.e. xx) and store it in "var_1". However, I also need to extract the values inside the other <td> tags in the same row as the "number blue" span (i.e. 001 and 002), and store them.

I would appreciate any advice on this problem. Thank you.

RE: [Python 3] - Extract specific data from a web page using lxml module - Takeshio - Aug-23-2018

Sorry mate. I didn’t notice the Python code tags feature, as I was using an iPhone to post this question (so the screen is quite small).

I will take note of your advice, but right now, can anyone please advise me on how to solve this issue?

RE: [Python 3] - Extract specific data from a web page using lxml module - snippsat - Aug-23-2018

Remove text() from the XPath; you can then use .text from lxml on the returned element.
You can now also read the CSS class through .attrib.

from lxml import etree

# Simulate a web page
html = '''\
<html>
  <head>
    <title>foo</title>
  </head>
  <body>
    <tr>
      <td><span class="number blue">xx</span></td>
      <td>001</td>
      <td>002</td>
    </tr>
  </body>
</html>'''

tree = etree.fromstring(html)
span_tag = tree.xpath("//span[@class='number blue']")
print(span_tag[0].text)
print(span_tag[0].attrib.get('class'))

Output:
xx
number blue
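
To make the difference between text() and .text concrete, here is a minimal sketch. The one-line snippet (including the <table> wrapper) is an assumption modelled on the sample row in this thread, not the real page. It also shows text_content(), which is available on elements parsed with lxml.html.

```python
from lxml import html

# Hypothetical fragment modelled on the sample row in this thread.
snippet = (
    '<table><tr>'
    '<td><span class="number blue">xx</span></td>'
    '<td>001</td><td>002</td>'
    '</tr></table>'
)
tree = html.fromstring(snippet)

# 1. With text() in the XPath, xpath() returns a list of strings.
as_strings = tree.xpath('//span[@class="number blue"]/text()')

# 2. Without text(), xpath() returns elements; .text holds the text
#    that appears before the element's first child.
spans = tree.xpath('//span[@class="number blue"]')
first_text = spans[0].text

# 3. text_content() (lxml.html elements) joins all text inside the
#    element, including text held by descendants.
first_td = tree.xpath('//td')[0]
td_text = first_td.text_content()

print(as_strings)     # list containing the span text
print(first_text)     # the span text as a single string
print(first_td.text)  # None, because the <td> starts with a child element
print(td_text)        # the text found inside the nested span
```

Keeping the element instead of the bare string also leaves .attrib and the parent/sibling relationships available for later queries.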

RE: [Python 3] - Extract specific data from a web page using lxml module - nilamo - Aug-23-2018

https://stackoverflow.com/a/1732454

That's HTML. HTML isn't XML, so why would you use an XML parser for HTML? Maybe it works some of the time, but will it work all the time?

RE: [Python 3] - Extract specific data from a web page using lxml module - snippsat - Aug-23-2018

That link you posted, @nilamo, is about not using regex with XML/HTML.
lxml is both an XML and an HTML parser.

lxml Wrote:lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.
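
As a rough illustration of the point about parsers, the sketch below feeds the same string to lxml's XML parser (etree.fromstring) and to its HTML parser (lxml.html.fromstring). The markup is a made-up variation of the sample row: the unclosed <br> and the <table> wrapper are assumptions added to show what happens when a page is not well-formed XML.

```python
from lxml import etree, html

# Hypothetical markup: the <br> is never closed, which is fine in HTML
# but not well-formed XML.
messy = (
    '<table><tr>'
    '<td><span class="number blue">xx</span></td>'
    '<td>001<br></td><td>002</td>'
    '</tr></table>'
)

# The strict XML parser rejects the unclosed tag.
try:
    etree.fromstring(messy)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser treats <br> as an empty element and carries on.
tree = html.fromstring(messy)
cells = [td.text for td in tree.xpath('//td[not(span)]')]

print(xml_ok)  # whether the XML parser accepted the markup
print(cells)   # text of the <td> cells that have no <span> child
```

For pages fetched from the web, lxml.html (or etree with an HTMLParser) is usually the safer choice, since real pages are rarely valid XML.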

RE: [Python 3] - Extract specific data from a web page using lxml module - Takeshio - Aug-23-2018

I’m a little confused here. Can I use the lxml module to parse HTML and extract the specific data within the HTML <td> tags?

Could you please provide a code snippet showing how to achieve that? It would help me understand better, as I’m really a noob who has just started to learn Python.

Thank you.

RE: [Python 3] - Extract specific data from a web page using lxml module - snippsat - Aug-23-2018

Yes you can.
In Web-Scraping part-1 you can see both BeautifulSoup and lxml used to parse HTML.
There is more in Web-scraping part-2, such as using Selenium.

RE: [Python 3] - Extract specific data from a web page using lxml module - Takeshio - Aug-24-2018

Thanks for your reply. However, I want to get the two values (i.e. 001 and 002) within the <td> tags. They are in the same row as the span with the class "number blue".

Any idea how to get these values neatly?

RE: [Python 3] - Extract specific data from a web page using lxml module - Takeshio - Aug-24-2018

Thanks snippsat! I have managed to parse the HTML data using BeautifulSoup and lxml, and stored the values in a list. Now I need to find a way to process and structure the data in the list. Hope it’s not too hard. :-)

RE: [Python 3] - Extract specific data from a web page using lxml module - leotrubach - Aug-25-2018

Using the xpath() method of the parsed tree, you can query all td elements that have no span child like this:

from lxml import html

html_text = """<html>
  <head>
    <title>foo</title>
  </head>
  <body>
    <tr>
      <td><span class="number blue">xx</span></td>
      <td>001</td>
      <td>002</td>
    </tr>
  </body>
</html>"""


et = html.fromstring(html_text)
spans = et.xpath('//tr/td/span[@class="number blue"]')
print(spans[0].text)
for e in et.xpath('//tr/td[not(span)]'):
    print(e.text)
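
If the page has several such rows, one possible way to keep each "number blue" value together with the other cells of its own row is the following-sibling axis. This is only a sketch: the second row, the <table> wrapper, and the names page and rows are assumptions for illustration, not taken from the real page.

```python
from lxml import html

# Hypothetical page with two rows; the second row and the <table>
# wrapper are assumptions added for this example.
page = """
<table>
  <tr>
    <td><span class="number blue">xx</span></td>
    <td>001</td>
    <td>002</td>
  </tr>
  <tr>
    <td><span class="number blue">yy</span></td>
    <td>003</td>
    <td>004</td>
  </tr>
</table>
"""

tree = html.fromstring(page)

rows = {}
for span in tree.xpath('//td/span[@class="number blue"]'):
    # From the span, step up to its <td>, then collect the text of the
    # <td> elements that follow it inside the same <tr>.
    values = span.xpath('../following-sibling::td/text()')
    rows[span.text] = [v.strip() for v in values]

print(rows)
```

Querying relative to each span keeps the values grouped per row, which may be easier to work with than one flat list of every td on the page.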