 [Python 3] - Extract specific data from a web page using lxml module
#1
Hi guys,

I am trying to write Python 3 code (using the lxml module) to extract some specific data from a webpage.

A sample of the HTML data presented in the webpage is as below.
______________________________________________________________
<tr>
<td><span class="number blue">xx</span></td>
<td>001</td>
<td>002</td>
</tr>
______________________________________________________________

My code:
from lxml import html
import requests

page = requests.get("http://some_website.aspx")
tree = html.fromstring(page.content)

var_1 = tree.xpath('//span[@class="number blue"]/text()')
print(var_1)
______________________________________________________________

I am able to extract the first value (i.e. xx) and store it in "var_1". However, I also need to extract and store the values in the other <td> tags of the same row as the "number blue" span (i.e. 001 and 002).

Appreciate it if someone can help to advise on this problem. Thank you.
j.crater wrote Aug-23-2018, 05:40 PM:
Use Python code tags next time when posting code.
#2
Sorry mate. I didn't notice the Python code tags feature because I was posting this question from an iPhone (the screen is quite small).

I will take note of your advice but right now, can anyone please advise me on how to solve this issue?
#3
Remove text() from the XPath expression; you can then read the text with the element's .text attribute.
With the element in hand you can also read its CSS class through .attrib.
from lxml import etree

# Simulate a web page
html = '''\
<html>
  <head>
    <title>foo</title>
  </head>
  <body>
    <tr>
      <td><span class="number blue">xx</span></td>
      <td>001</td>
      <td>002</td>
    </tr>
  </body>
</html>'''

tree = etree.fromstring(html)
span_tag = tree.xpath("//span[@class='number blue']")
print(span_tag[0].text)
print(span_tag[0].attrib.get('class'))
Output:
xx
number blue
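
To make the difference concrete, here is a minimal sketch comparing the two styles of query. The inline `snippet` string is only a stand-in for the real page, and wrapping the row in a `<table>` is an assumption about how the original markup looks.

```python
from lxml import html

# Stand-in for the real page (assumed markup, based on the sample row above)
snippet = (
    '<table><tr>'
    '<td><span class="number blue">xx</span></td>'
    '<td>001</td><td>002</td>'
    '</tr></table>'
)
tree = html.fromstring(snippet)

# An XPath ending in text() returns a list of strings
texts = tree.xpath('//span[@class="number blue"]/text()')

# Without text(), the same XPath returns a list of elements
spans = tree.xpath('//span[@class="number blue"]')

print(texts)                   # list of strings
print(spans[0].text)           # text of the first matching element
print(spans[0].get('class'))   # attribute lookup on the element
```

Keeping the element rather than the bare string is what makes it possible to read attributes or move to neighbouring elements afterwards.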
#4
https://stackoverflow.com/a/1732454

That's html. HTML isn't xml, so why would you use an xml parser for html? Maybe it works some of the time, but will it work all the time?
#5
@nilmao, the link you posted is about not using regex to parse XML/HTML.
lxml is an XML and HTML parser.
lxml Wrote:lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.
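
As a rough illustration of why the choice of parser matters, the sketch below feeds the same deliberately sloppy markup to lxml's XML parser (`etree.fromstring`) and to its HTML parser (`lxml.html.fromstring`). The `messy` string is made up for this example; real pages will be broken in other ways.

```python
from lxml import etree, html

# Made-up markup with unclosed <td> tags, common in real-world HTML
messy = '<table><tr><td>001<td>002</tr></table>'

# The strict XML parser rejects markup that is not well-formed
try:
    etree.fromstring(messy)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser recovers and still yields a usable tree
cells = [td.text for td in html.fromstring(messy).xpath('//td')]

print(xml_ok)
print(cells)
```

So `etree.fromstring` is fine for a well-formed sample, but `lxml.html` is probably the safer default for pages fetched from the web.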
#6
(Aug-23-2018, 08:01 PM)snippsat Wrote: That link you posted @nilmao is for not using regex with XML/HTML.
lxml is an XML and HTML parser.
lxml Wrote:lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.

I'm a little confused here. Can I use the lxml module to parse HTML data and extract the specific data within the HTML td tags?

Could you please provide a code snippet showing how to achieve that? It would help me understand better, as I'm a beginner who has just started learning Python.

Thank you.
#7
(Aug-23-2018, 11:47 PM)Takeshio Wrote: I’m a little confuse here and can I use lxml module to parse HTML data and extract the specific data within the HTML td tag.

Can you guys please provide me a code snippet of how to achieve that
Yes you can.
See Web-Scraping part-1, which shows both BeautifulSoup and lxml used to parse HTML.
Web-scraping part-2 covers more, such as using Selenium.
#8
(Aug-23-2018, 07:01 PM)snippsat Wrote: Remove text() from Xpath,can use .text from lxml.
Now can also take out .attrib from CSS class.

Thanks for your reply. However, I want to get the two values (i.e. 001 and 002) within the <td> tags. They sit in the same row as the span with class "number blue".

Any idea how to get these values neatly?
#9
(Aug-23-2018, 11:56 PM)snippsat Wrote:
(Aug-23-2018, 11:47 PM)Takeshio Wrote: I’m a little confuse here and can I use lxml module to parse HTML data and extract the specific data within the HTML td tag.

Can you guys please provide me a code snippet of how to achieve that
Yes you can.
See Web-Scraping part-1, which shows both BeautifulSoup and lxml used to parse HTML.
Web-scraping part-2 covers more, such as using Selenium.

Thanks snippsat! I have managed to parse the HTML data using BeautifulSoup and lxml, and stored the values in a list. Now I need to find a way to process and structure the data in that list. Hope it's not too hard. :-)
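
One possible way to structure the extracted values, sketched under assumptions: the two-row `page` string is invented, and the dictionary keys `label` and `values` are hypothetical names chosen only for illustration.

```python
from lxml import html

# Invented two-row sample modelled on the row from the first post
page = """<table>
  <tr><td><span class="number blue">xx</span></td><td>001</td><td>002</td></tr>
  <tr><td><span class="number blue">yy</span></td><td>003</td><td>004</td></tr>
</table>"""

tree = html.fromstring(page)

records = []
for row in tree.xpath('//tr'):
    # text_content() also picks up text nested inside the <span>
    cells = [td.text_content().strip() for td in row.xpath('./td')]
    if not cells:
        continue  # skip rows without data cells
    # "label" and "values" are hypothetical field names
    records.append({'label': cells[0], 'values': cells[1:]})

for record in records:
    print(record['label'], record['values'])
```

The values are kept as strings here so that leading zeros such as 001 are not lost; convert with int() only if the numbers are really needed.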
#10
Using the xpath() method of the parsed element, you could query all td elements that have no span child like this:

from lxml import html

html_text = """<html>
  <head>
    <title>foo</title>
  </head>
  <body>
    <tr>
      <td><span class="number blue">xx</span></td>
      <td>001</td>
      <td>002</td>
    </tr>
  </body>
</html>"""


et = html.fromstring(html_text)
spans = et.xpath('//tr/td/span[@class="number blue"]')
print(spans[0].text)
for e in et.xpath('//tr/td[not(span)]'):
    print(e.text)
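
An alternative sketch, in case the page has several such rows: start from each span and walk to the sibling cells with the XPath `following-sibling` axis, so the values stay grouped with the span they belong to. The `html_text` sample and the `result` dictionary are assumptions made for this illustration.

```python
from lxml import html

# Assumed sample markup, modelled on the row from the first post
html_text = """<table><tr>
  <td><span class="number blue">xx</span></td>
  <td>001</td>
  <td>002</td>
</tr></table>"""

tree = html.fromstring(html_text)

result = {}
for span in tree.xpath('//td/span[@class="number blue"]'):
    # ".." steps up to the enclosing <td>, then following-sibling::td
    # selects the cells that come after it in the same row
    siblings = span.xpath('../following-sibling::td/text()')
    result[span.text] = [s.strip() for s in siblings]

print(result)
```

Note that `@class="number blue"` is an exact string comparison, so it would miss a span whose class attribute lists the same names in a different order or with extras.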
