[Python 3] - Extract specific data from a web page using lxml module

Takeshio · (This post was last modified: Aug-23-2018, 05:40 PM by j.crater.)

Hi guys,

I am trying to write Python 3 code (using the lxml module) to extract some specific data from a web page.

A sample of the HTML on the page is shown below.
______________________________________________________________

<tr>
<td><span class="number blue">xx</span></td>
<td>001</td>
<td>002</td>
</tr>

______________________________________________________________

My code:

from lxml import html
import requests

page = requests.get("http://some_website.aspx")
tree = html.fromstring(page.content)

var_1 = tree.xpath('//span[@class="number blue"]/text()')
print(var_1)

______________________________________________________________

I am able to extract the first value (i.e. xx) and store it in "var_1". However, I also need to extract and store the values inside the <td> tags that sit in the same row as the span with class "number blue".

I would appreciate it if someone could advise on this problem. Thank you.

Takeshio · Aug-23-2018, 06:41 PM

Sorry mate. I didn’t notice the Python code tags feature, as I was using an iPhone to post this question (so the screen is quite small).

I will take note of your advice, but right now, can anyone please advise me on how to solve this issue?

***snippsat*** · (This post was last modified: Aug-23-2018, 07:02 PM by snippsat.)

Remove text() from the XPath; you can use .text from lxml instead.
With the element itself returned, you can also read the CSS class from .attrib.

from lxml import etree

# Simulate a web page
html = '''\
<html>
  <head>
    <title>foo</title>
  </head>
  <body>
    <tr>
      <td><span class="number blue">xx</span></td>
      <td>001</td>
      <td>002</td>
    </tr>
  </body>
</html>'''

tree = etree.fromstring(html)
span_tag = tree.xpath("//span[@class='number blue']")
print(span_tag[0].text)
print(span_tag[0].attrib.get('class'))

Output:
xx
number blue
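
To make the difference between the two query styles concrete, here is a minimal sketch. It assumes the sample row is wrapped in a <table> (my addition, so the markup is valid HTML) and uses lxml.html rather than etree; the variable names are only illustrative.

```python
from lxml import html

# Sample row from the thread, wrapped in <table> (an assumption, to keep the HTML valid).
doc = html.fromstring("""
<table>
  <tr>
    <td><span class="number blue">xx</span></td>
    <td>001</td>
    <td>002</td>
  </tr>
</table>
""")

# With /text() the query returns a list of strings.
as_strings = doc.xpath('//span[@class="number blue"]/text()')

# Without /text() the query returns elements, so .text and .attrib are available.
as_elements = doc.xpath('//span[@class="number blue"]')
first = as_elements[0]

print(as_strings)
print(first.text)
print(first.attrib.get("class"))
```

Returning elements is usually the more flexible choice, since you can still reach the text, the attributes, and the neighbouring tags from each one.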

**nilamo** · Aug-23-2018, 07:46 PM

https://stackoverflow.com/a/1732454

That's html. HTML isn't xml, so why would you use an xml parser for html? Maybe it works some of the time, but will it work all the time?

***snippsat*** · (This post was last modified: Aug-23-2018, 08:01 PM by snippsat.)

That link you posted, @nilamo, is about not using regex with XML/HTML.
lxml is an XML and HTML parser.

lxml Wrote:lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.
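
A small sketch of why the choice of parser matters, assuming a made-up snippet with an unclosed <br> tag (not taken from the original page): the XML parser in lxml.etree rejects it, while the HTML parser in lxml.html accepts it.

```python
from lxml import etree, html

# Hypothetical real-world style HTML: <br> is never closed, so it is not well-formed XML.
snippet = "<div><p>first line<br>second line</p></div>"

# The XML parser is strict and raises on the mismatched tags.
try:
    etree.fromstring(snippet)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser is lenient and builds a tree anyway.
doc = html.fromstring(snippet)
paragraph = doc.xpath("//p")[0]

print(xml_ok)
print(paragraph.text_content())
```

So for pages fetched from the web, lxml.html.fromstring() is generally the safer entry point; etree.fromstring() only works when the markup happens to be well-formed XML.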

Takeshio · Aug-23-2018, 11:47 PM

(Aug-23-2018, 08:01 PM)snippsat Wrote: That link you posted, @nilamo, is about not using regex with XML/HTML.
lxml is an XML and HTML parser.

lxml Wrote:lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.

I’m a little confused here. Can I use the lxml module to parse HTML data and extract the specific data within the HTML td tags?

Could you please provide a code snippet showing how to achieve that? It would help me understand better, as I’m really a noob who has just started to learn Python.

Thank you.

***snippsat*** · (This post was last modified: Aug-23-2018, 11:56 PM by snippsat.)

(Aug-23-2018, 11:47 PM)Takeshio Wrote: I’m a little confused here. Can I use the lxml module to parse HTML data and extract the specific data within the HTML td tags?

Can you guys please provide me a code snippet of how to achieve that

Yes, you can.
In Web-Scraping part-1 you can see both BeautifulSoup and lxml used to parse HTML.
There is more in Web-scraping part-2, such as using Selenium.

Takeshio · Aug-24-2018, 02:13 AM

(Aug-23-2018, 07:01 PM)snippsat Wrote: Remove text() from the XPath; you can use .text from lxml instead.
With the element itself returned, you can also read the CSS class from .attrib.

Thanks for your reply. However, I want to get the two values (i.e. 001 and 002) within the <td> tags. They all belong to the same span class (i.e. number blue).

Any idea how to get these values neatly?

Takeshio · Aug-24-2018, 07:19 AM

(Aug-23-2018, 11:56 PM)snippsat Wrote:
(Aug-23-2018, 11:47 PM)Takeshio Wrote: I’m a little confused here. Can I use the lxml module to parse HTML data and extract the specific data within the HTML td tags?

Can you guys please provide me a code snippet of how to achieve that
Yes, you can.
In Web-Scraping part-1 you can see both BeautifulSoup and lxml used to parse HTML.
There is more in Web-scraping part-2, such as using Selenium.

Thanks snippsat! I have managed to parse the HTML data using BeautifulSoup and lxml, and have stored it in a list. Now I need to find a way to process and structure the data in the list. Hope it’s not too hard. :-)

leotrubach · Aug-25-2018, 08:46 AM

Using the xpath() method of the parsed element, you can query all td elements that have no span child, like this:

from lxml import html

html_text = """<html>
  <head>
    <title>foo</title>
  </head>
  <body>
    <tr>
      <td><span class="number blue">xx</span></td>
      <td>001</td>
      <td>002</td>
    </tr>
  </body>
</html>"""


et = html.fromstring(html_text)
spans = et.xpath('//tr/td/span[@class="number blue"]')
print(spans[0].text)
for e in et.xpath('//tr/td[not(span)]'):
    print(e.text)
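
If the page has several such rows, one way to keep each label together with its values is to start from every span and walk to the td siblings that follow it. This is only a sketch: the second row (yy, 003, 004), the surrounding <table>, and the "label"/"values" keys are made up for illustration and are not from the original page.

```python
from lxml import html

# Two rows in the same shape as the sample; the second row is hypothetical.
page = html.fromstring("""
<table>
  <tr>
    <td><span class="number blue">xx</span></td>
    <td>001</td>
    <td>002</td>
  </tr>
  <tr>
    <td><span class="number blue">yy</span></td>
    <td>003</td>
    <td>004</td>
  </tr>
</table>
""")

rows = []
for span in page.xpath('//span[@class="number blue"]'):
    # Go up to the <td> holding the span, then take the text of the <td> siblings after it.
    values = span.xpath("parent::td/following-sibling::td/text()")
    rows.append({"label": span.text, "values": [v.strip() for v in values]})

print(rows)
```

Each entry in rows is then a plain dict, which should be easier to process further than one flat list of strings.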
