Python Forum
[Python 3] - Extract specific data from a web page using lxml module
#1
Hi guys,

I am trying to write Python 3 code (using the lxml module) to extract specific data from a web page.

A sample of the HTML on the page is shown below.
______________________________________________________________
<tr>
<td><span class="number blue">xx</span></td>
<td>001</td>
<td>002</td>
</tr>
______________________________________________________________

My code:
from lxml import html
import requests

page = requests.get("http://some_website.aspx")
tree = html.fromstring(page.content)

var_1 = tree.xpath('//span[@class="number blue"]/text()')
print(var_1)
______________________________________________________________

I am able to extract the first value (i.e. xx) and store it in "var_1". However, I also need to extract and store the values in the other <td> tags of the same row as the span with class "number blue".

I would appreciate any advice on this problem. Thank you.
Reply
#2
Sorry mate. I didn't notice the Python code tags feature because I was posting this question from an iPhone (so the screen is quite small).

I will take note of your advice, but for now, can anyone please advise me on how to solve this issue?
Reply
#3
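Remove text() from the XPath expression; you can use the element's .text attribute from lxml instead.
With the element itself you can also read its attributes, such as the CSS class, through .attrib.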
from lxml import etree

# Simulate a web page
html = '''\
<html>
  <head>
    <title>foo</title>
  </head>
  <body>
    <tr>
      <td><span class="number blue">xx</span></td>
      <td>001</td>
      <td>002</td>
    </tr>
  </body>
</html>'''

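# Parse the markup with lxml's XML parser
tree = etree.fromstring(html)
# Select the span element itself rather than its text node
span_tag = tree.xpath("//span[@class='number blue']")
print(span_tag[0].text)                 # text inside the span
print(span_tag[0].attrib.get('class'))  # value of the class attribute
Output:
xx
number blue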
Reply
#4
https://stackoverflow.com/a/1732454

That's HTML. HTML isn't XML, so why would you use an XML parser for HTML? Maybe it works some of the time, but will it work all the time?
Reply
#5
@nilmao, the link you posted is about not using regex to parse XML/HTML.
lxml is an XML and HTML parser.
lxml Wrote:lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.
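
To illustrate the difference between the two parsers, here is a minimal sketch. The markup string is made up for this example (it is not from the original page) and is deliberately sloppy, so treat it as an illustration under that assumption rather than a statement about the real site. It feeds the same string to the strict XML parser (lxml.etree) and to the lenient HTML parser (lxml.html).

```python
from lxml import etree, html

# Hypothetical, deliberately sloppy markup: <br> and the <td> tags are never closed.
broken = "<table><tr><td>001<br><td>002</tr></table>"

# The strict XML parser rejects markup that is not well-formed.
try:
    etree.fromstring(broken)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser repairs the markup and still yields both cells.
tree = html.fromstring(broken)
cells = [td.text for td in tree.xpath("//td")]

print("parsed as XML:", xml_ok)
print("cells found by the HTML parser:", cells)
```

For real web pages, which are often not well-formed, lxml.html is usually the safer entry point.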
Reply
#6
I'm a little confused here. Can I use the lxml module to parse HTML and extract specific data from within the HTML <td> tags?

Could you please provide a code snippet showing how to do that? It would help me understand better, as I'm a beginner who has just started learning Python.

Thank you.
Reply
#7
(Aug-23-2018, 11:47 PM)Takeshio Wrote: I'm a little confused here. Can I use the lxml module to parse HTML and extract specific data from within the HTML <td> tags?

Could you please provide a code snippet showing how to do that?
Yes, you can.
See Web-Scraping part-1, which shows both BeautifulSoup and lxml being used to parse HTML.
Web-scraping part-2 covers more, such as using Selenium.
Reply
#8
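(Aug-23-2018, 07:01 PM)snippsat Wrote: Remove text() from the XPath expression; you can use the element's .text attribute from lxml instead.
With the element itself you can also read its attributes, such as the CSS class, through .attrib.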

Thanks for your reply. However, I want to get the two values (i.e. 001 and 002) inside the plain <td> tags. They are in the same row as the span with class "number blue".

Any idea how to get these values neatly?
Reply
#9
Thanks, snippsat! I have managed to parse the HTML data using BeautifulSoup and lxml, and stored it in a list. Now I need to find a way to process and structure the data in the list. Hope it's not too hard. :-)
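
One possible way to structure such a list, as a rough sketch only: the list contents and the names flat, ROW_WIDTH and records below are assumptions made for illustration, since the actual list is not shown in this thread. It assumes the scraped values arrive as one flat list in which every row contributes a label followed by two values.

```python
# Hypothetical flat list of scraped cell texts: one label plus two values per row.
flat = ["xx", "001", "002", "yy", "003", "004"]

ROW_WIDTH = 3  # assumed number of cells per table row

# Cut the flat list into row-sized chunks and name the parts of each chunk.
records = [
    {"label": flat[i], "values": flat[i + 1:i + ROW_WIDTH]}
    for i in range(0, len(flat), ROW_WIDTH)
]

for record in records:
    print(record["label"], record["values"])
```

The values are kept as strings so that leading zeros such as 001 are not lost.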
Reply
#10
Using the xpath() method of an lxml element, you can select all td elements that have no span child like this:

from lxml import html

html_text = """<html>
  <head>
    <title>foo</title>
  </head>
  <body>
    <tr>
      <td><span class="number blue">xx</span></td>
      <td>001</td>
      <td>002</td>
    </tr>
  </body>
</html>"""


et = html.fromstring(html_text)
spans = et.xpath('//tr/td/span[@class="number blue"]')
print(spans[0].text)
for e in et.xpath('//tr/td[not(span)]'):
    print(e.text)
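
If the page has several rows, a variant that ties the values to one particular span may be useful. The following is a minimal sketch, assuming the rows sit inside a <table> and that a second row with a different class exists; that second row, the "number red" class and the rows dictionary are invented for the example. It starts from each matching span and walks to the td cells that follow its parent cell with the following-sibling axis.

```python
from lxml import html

# Sample markup; the second row is hypothetical and only there to show
# that cells from other rows are not picked up.
page = """
<table>
  <tr>
    <td><span class="number blue">xx</span></td>
    <td>001</td>
    <td>002</td>
  </tr>
  <tr>
    <td><span class="number red">yy</span></td>
    <td>003</td>
    <td>004</td>
  </tr>
</table>
"""

tree = html.fromstring(page)

rows = {}
for span in tree.xpath('//td/span[@class="number blue"]'):
    # From the span, go up to its <td>, then take the text of the <td> cells after it.
    values = span.xpath('../following-sibling::td/text()')
    rows[span.text] = [v.strip() for v in values]

print(rows)  # {'xx': ['001', '002']}
```

Unlike //tr/td[not(span)], which returns every plain cell on the page, this keeps only the cells that share a row with the matching span.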
Reply