Python Forum
help with lxml
#1
Hi, anyone here good with lxml?
I'm trying to learn it.


from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
also another snippet...
expr = "//*[local-name() = $name]"
print(root.xpath(expr, name = "foo")[0].tag)
output = foo

The code within the ' ' for buyers, and the " " for expr is a mystery to me.

What is // ?
What is * ?
What are the [ ] brackets for?
What is @x ?
What is local-name() ?
What is text() ?

Is this HTML embedded in Python? Is there a webpage that explains all this stuff?
The webpages I'm trying to learn from assume I know all this already:
http://python-docs.readthedocs.io/en/lat...crape.html
http://lxml.de/xpathxslt.html
#2
I am not good with lxml, but basically all your questions are about XPath. A gentle overview is the XPath syntax page at w3schools.com; the official documentation is much more detailed.
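
To make that concrete, here is a minimal sketch that maps each symbol from the question onto one XPath expression. The sample markup is made up for illustration (it only imitates the page from the question), and the only assumption is that lxml is installed. The string passed to xpath() is not HTML embedded in Python; it is a query written in the XPath language.

```python
from lxml import html

# Hypothetical sample markup, standing in for the page in the question.
doc = html.fromstring("""
<html><body>
  <div title="buyer-name">Carson Busses</div>
  <div title="buyer-name">Earl E. Byrd</div>
  <span class="item-price">$29.95</span>
</body></html>
""")

# //      : search at any depth in the document
# *       : match an element with any tag name
# [...]   : a predicate, i.e. a filter on the nodes matched so far
# @title  : the attribute named "title"
# text()  : the text nodes directly inside the matched element
buyers = doc.xpath('//div[@title="buyer-name"]/text()')
print(buyers)

# Same idea with * instead of a tag name: any element whose class matches.
any_tag = doc.xpath('//*[@class="item-price"]')
print(any_tag[0].tag)
```

xpath() returns a list, so the first call gives a list of strings and the second a list of elements.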
#3
I'm not good with lxml either. This...

Quote:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
I know is the equivalent of getting the .text of each element returned by soup.find_all('div', {'title':'buyer-name'}) in BeautifulSoup syntax. I'm not sure what the other snippet is doing, so I guess I am not much help.

I prefer to use lxml as the parser for BeautifulSoup:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
soup = BeautifulSoup(page.text, 'lxml')
buyers = soup.find_all('div', {'title':'buyer-name'})
for buyer in buyers:
    print(buyer.text)
Output:
Carson Busses
Earl E. Byrd
Patty Cakes
Derri Anne Connecticut
Moe Dess
Leda Doggslife
Dan Druff
Al Fresco
Ido Hoe
Howie Kisses
Len Lease
Phil Meup
Ira Pent
Ben D. Rules
Ave Sectomy
Gary Shattire
Bobbi Soks
Sheila Takya
Rose Tattoo
Moe Tell
#4
thx for the answers, you've got me unstuck and I'm moving forward again
#5
I have a two-part tutorial about this: 1, 2.
I also find BeautifulSoup (with lxml as the parser) simpler to work with.
lxml supports both XPath and CSS selectors; BeautifulSoup supports CSS selectors.
Both also have their own methods, like find() and find_all(), to find things in HTML/XML.

I like a setup like this for learning,
where I have written some sample HTML:
from lxml import etree

html = '''\
<html>
  <head>
    <title>foo</title>
  </head>
  <body>
    <div id="hero">Superman</div>
    <div class="superfoo">
      <p>Hulk</p>
    </div>
  </body>
</html>'''

# lxml using XPath
tree = etree.fromstring(html)
lxml_soup = tree.xpath('//title')
Usage:
>>> etree.tostring(lxml_soup[0])
b'<title>foo</title>\n  '
>>> lxml_soup[0].text
'foo'
So etree.tostring() shows the matched element as markup,
and .text gets its text.
More XPath testing, this time using find() on the result:
>>> lxml_soup = tree.xpath('//div[@class="superfoo"]')
>>> etree.tostring(lxml_soup[0])
b'<div class="superfoo">\n      <p>Hulk</p>\n    </div>\n  '
>>> lxml_soup[0].find('p')
<Element p at 0x3a06260>
>>> lxml_soup[0].find('p').text
'Hulk'

# Or add text()
>>> lxml_soup = tree.xpath('//div[@class="superfoo"]/p/text()')
>>> lxml_soup[0]
'Hulk'
A cheat sheet is helpful for both XPath and CSS selectors.
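
As a small extension of the session above, here is a sketch that puts xpath() next to lxml's own find-style methods on a trimmed copy of the same sample HTML. It is only meant to show what type each call returns, not a recommendation of one over the other.

```python
from lxml import etree

# Trimmed copy of the sample HTML from this post.
html = '''\
<html>
  <body>
    <div id="hero">Superman</div>
    <div class="superfoo">
      <p>Hulk</p>
    </div>
  </body>
</html>'''
tree = etree.fromstring(html)

# xpath() can return attribute values or text directly, as lists of strings.
ids = tree.xpath('//div/@id')
hero = tree.xpath('//div[@id="hero"]/text()')

# findall()/findtext() accept a smaller path syntax (ElementPath);
# note the leading "." for a path relative to the element.
divs = tree.findall('.//div')
first_p = tree.findtext('.//p')

print(ids, hero, len(divs), first_p)
```

findall() always returns elements, so text() and @attr steps are only available through xpath().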
#6
wow that cheat sheet is nice.

@snippsat
Do you know what this does?
Quote:
expr = "//*[local-name() = $name]"
print(root.xpath(expr, name = "foo")[0].tag)
I was researching it yesterday, but all I found was a thread talking about namespaces. I assume local-name() is putting your variable into the lxml namespace?
#7
local-name() returns the element name without its namespace prefix. That example uses variable substitution, so it is equal to root.xpath("//*[local-name()='foo']")[0].tag and basically searches for a foo tag.

With snippsat's example tree:
Output:
In [5]: tree.xpath("//*[local-name()=$name]", name='div')[0].text
Out[5]: 'Superman'
it just finds 'div' tags. I guess that using local-name() instead of the tag name has its uses with more sophisticated XML documents; I have never used it. Actually I have very rarely used XPath, more often CSS select with BeautifulSoup. A line like
soup.select("table li div.ItemContent a")
is much better for picking out deeply nested content than finding all tables, then the lists in those tables, etc.
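
Namespaced XML is the usual case where local-name() earns its keep. The following is a minimal sketch with a made-up Atom-style document (the markup and variable names are illustrative, not from the thread): a plain name test finds nothing once a default namespace is declared, while local-name() with a $name variable still matches.

```python
from lxml import etree

# Hypothetical document that declares a default namespace.
root = etree.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<entry><title>first</title></entry>'
    '</feed>'
)

# A plain name test only matches elements in no namespace.
plain = root.xpath('//title')

# local-name() ignores the namespace; $name is filled in from the
# keyword argument passed to xpath().
by_local = root.xpath('//*[local-name() = $name]', name='title')

# Alternative: bind a prefix of your choice to the namespace URI.
by_prefix = root.xpath('//a:title',
                       namespaces={'a': 'http://www.w3.org/2005/Atom'})

print(plain)
print(by_local[0].tag)    # lxml reports the tag as {namespace}localname
print(by_prefix[0].text)
```

Passing the value as a variable also avoids building the XPath string by hand with quotes around user-supplied names.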
#8
Here's a document you might want to download and read.
First I'd like to vouch for the author, John W. Shipman (of New Mexico Tech).
He also has a reference manual for tkinter, which is excellent.

I expect the lxml manual to be excellent as well.
Download here: http://www.nmt.edu/tcc/help/pubs/pylxml/pylxml.pdf
#9
(Apr-28-2017, 05:18 PM)zivoni Wrote: I have very rarely used xpath, more often css select with beautiful soup. Line like
Yes, I use CSS select more often.

A couple of tips: in the browser, right-click and choose Inspect, then in the HTML source right-click on a tag or text and copy its XPath or selector.
Here I target the "cheat sheet" link text in my post.
The cool thing about XPath is that you can add text().
So then it looks like this:
from lxml import html
import requests

url = 'https://python-forum.io/Thread-help-with-lxml'
url_get = requests.get(url)
tree = html.fromstring(url_get.content)
lxml_soup = tree.xpath('//*[@id="pid_16425"]/a[3]/text()')[0]
print(lxml_soup)
Output:
cheat sheet
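
The same expression shape can be tried offline. This is a sketch against a made-up fragment (the id pid_1 and the links are hypothetical, not the real forum markup), showing that a[3] is a 1-based position among the a children and that xpath() returns a list that may be empty.

```python
from lxml import html

# Hypothetical stand-in for one forum post.
post = html.fromstring(
    '<div id="pid_1">'
    '<a href="#1">1</a> <a href="#2">2</a> <a href="#cs">cheat sheet</a>'
    '</div>'
)

# a[3] means the third <a> child (XPath positions start at 1, not 0).
third = post.xpath('//*[@id="pid_1"]/a[3]/text()')

# Guard against an empty result instead of indexing blindly; a copied
# XPath can stop matching if the page layout changes.
missing = post.xpath('//*[@id="pid_1"]/a[4]/text()')
link_text = third[0] if third else None
print(link_text, missing)
```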

You can also use the browser console, e.g. in Chrome.
Execute $x("some_xpath") or $$("css-selectors") in the Console panel; both evaluate the expression and so validate it.

In the Elements panel, press Ctrl + F to enable DOM searching.
Type in an XPath or CSS selector to evaluate it.
If there are matching elements, they will be highlighted in the DOM.