help with lxml

meems · (This post was last modified: Apr-28-2017, 12:43 AM by metulburr.)

Hi, anyone here good with lxml?
I'm trying to learn it.

from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
buyers = tree.xpath('//div[@title="buyer-name"]/text()')

also another snippet...

expr = "//*[local-name() = $name]"
print(root.xpath(expr, name = "foo")[0].tag)

output = foo

The code within the ' ' for buyers, and the " " for expr is a mystery to me.

what is // ?
what is * ?
what is [ ] brackets for ?
what is @x ?
what is local ?
what is text() ?

Is this HTML embedded in Python? Is there a webpage that explains all this stuff?
The webpages I'm trying to learn from assume I know all this already:
http://python-docs.readthedocs.io/en/lat...crape.html
http://lxml.de/xpathxslt.html

***zivoni*** · (This post was last modified: Apr-27-2017, 10:57 PM by zivoni.)

I am not good with lxml, but basically all your questions are about XPath. A gentle overview is the XPath syntax page on w3schools.com; the official documentation is much more detailed.
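
To make the pieces from the question concrete, here is a minimal sketch that annotates each part of the XPath expression. The sample HTML is made up for illustration (it is not the page from the thread), and the variable names are my own.

```python
from lxml import html

# Hypothetical sample page, invented for this example.
page = """
<html><body>
  <div title="buyer-name">Carson Busses</div>
  <span class="item-price">$29.95</span>
  <div title="buyer-name">Earl E. Byrd</div>
  <span class="item-price">$8.37</span>
</body></html>
"""
tree = html.fromstring(page)

# //       search at any depth in the document
# div      element name to match (* would match any element)
# [ ... ]  predicate: a filter applied to the elements matched so far
# @title   the attribute named "title"
# text()   the text nodes directly inside the matched element
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
print(buyers)

# * matches any element; here, every element that has a title attribute.
any_with_title = tree.xpath('//*[@title]')
print([el.tag for el in any_with_title])

# Ending the path with @title returns the attribute values themselves.
titles = tree.xpath('//div/@title')
print(titles)
```

So the string passed to xpath() is not HTML embedded in Python; it is a small query language (XPath) that lxml evaluates against the parsed tree.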

***metulburr*** · (This post was last modified: Apr-28-2017, 12:08 AM by metulburr.)

I'm not good with lxml either. This...

Quote:

buyers = tree.xpath('//div[@title="buyer-name"]/text()')

I know is roughly equivalent to [div.text for div in soup.find_all('div', {'title':'buyer-name'})] in BeautifulSoup syntax. I'm not sure what the other one is doing, so I guess I am not much help.

I prefer to use lxml as the parser for BeautifulSoup:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
soup = BeautifulSoup(page.text, 'lxml')
buyers = soup.find_all('div', {'title':'buyer-name'})
for buyer in buyers:
    print(buyer.text)

Output:
Carson Busses
Earl E. Byrd
Patty Cakes
Derri Anne Connecticut
Moe Dess
Leda Doggslife
Dan Druff
Al Fresco
Ido Hoe
Howie Kisses
Len Lease
Phil Meup
Ira Pent
Ben D. Rules
Ave Sectomy
Gary Shattire
Bobbi Soks
Sheila Takya
Rose Tattoo
Moe Tell

meems · Apr-28-2017, 09:09 AM

thx for the answers, you've got me unstuck and I'm moving forward again

***snippsat*** · (This post was last modified: Apr-28-2017, 02:15 PM by snippsat.)

I have a two-part tutorial about this: 1, 2.
I also find BeautifulSoup (with lxml as the parser) simpler to work with.
lxml supports both XPath and CSS selectors; BS supports CSS selectors.
Both also have their own methods, like find() and find_all(), to find things in HTML/XML.

I like a setup like this for learning,
where I have written some sample HTML.

from lxml import etree

html = '''\
<html>
  <head>
    <title>foo</title>
  </head>
  <body>
    <div id="hero">Superman</div>
    <div class="superfoo">
      <p>Hulk</p>
    </div>
  </body>
</html>'''

# lxml using XPath
tree = etree.fromstring(html)
lxml_soup = tree.xpath('//title')

Usage:

>>> etree.tostring(lxml_soup[0])
b'<title>foo</title>\n  '
>>> lxml_soup[0].text
'foo'

So etree.tostring() shows the result.
You can use find() to get the text.
More XPath testing:

>>> lxml_soup = tree.xpath('//div[@class="superfoo"]')
>>> etree.tostring(lxml_soup[0])
b'<div class="superfoo">\n      <p>Hulk</p>\n    </div>\n  '
>>> lxml_soup[0].find('p')
<Element p at 0x3a06260>
>>> lxml_soup[0].find('p').text
'Hulk'

# Or add text()
>>> lxml_soup = tree.xpath('//div[@class="superfoo"]/p/text()')
>>> lxml_soup[0]
'Hulk'

A cheat sheet is helpful for both XPath and CSS selectors.

***metulburr*** · Apr-28-2017, 04:29 PM

wow that cheat sheet is nice.

@snippsat
Do you know what this does?

Quote:

expr = "//*[local-name() = $name]"
print(root.xpath(expr, name = "foo")[0].tag)

I was researching it yesterday, but all I found was a thread talking about namespaces. I assume local-name() is putting your variable into the lxml namespace?

***zivoni*** · Apr-28-2017, 05:18 PM

local-name() returns the element name without its namespace prefix. That example uses variable substitution, so it is equivalent to root.xpath("//*[local-name()='foo']")[0].tag and basically searches for a foo tag ...
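
Where local-name() starts to matter is with namespaced XML. The following is a minimal sketch; the document and the namespace URI http://example.com/a are made up for illustration.

```python
from lxml import etree

# Hypothetical namespaced document, invented for this example.
xml = b"""<root xmlns:a="http://example.com/a">
  <a:foo>prefixed</a:foo>
  <foo>plain</foo>
</root>"""
root = etree.fromstring(xml)

# $name is an XPath variable; lxml fills it in from the keyword argument.
expr = "//*[local-name() = $name]"
matches = root.xpath(expr, name="foo")

# Both elements match, because local-name() ignores the namespace.
print([el.tag for el in matches])

# A plain name test only matches the element that is in no namespace.
plain_only = root.xpath("//foo")
print([el.tag for el in plain_only])
```

Passing the value as a variable, rather than building the expression with string formatting, also avoids quoting problems when the name comes from user input.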

With snippsat's example tree

Output:
In [5]: tree.xpath("//*[local-name()=$name]", name='div')[0].text
Out[5]: 'Superman'

it just finds 'div' tags. I guess that using local-name() instead of the tag name has its uses with more sophisticated XML documents; I have never used it. Actually, I have very rarely used XPath, more often CSS select with Beautiful Soup. A line like

soup.select("table li div.ItemContent a")

is much better for picking deeply nested content than finding all tables, then lists in these tables etc...
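
To illustrate the comparison above, here is a minimal sketch that runs the same selector next to the step-by-step approach. The HTML is invented for this example, and I use the built-in "html.parser" only to keep the snippet dependency-light; "lxml" as the parser should give the same result here.

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup, invented for this example.
doc = """
<table><tr><td>
  <ul>
    <li><div class="ItemContent"><a href="/one">One</a></div></li>
    <li><div class="Other"><a href="/skip">Skip</a></div></li>
    <li><div class="ItemContent"><a href="/two">Two</a></div></li>
  </ul>
</td></tr></table>
"""
soup = BeautifulSoup(doc, "html.parser")

# One CSS selector: links inside div.ItemContent inside li inside table.
links = soup.select("table li div.ItemContent a")
selected = [a["href"] for a in links]
print(selected)

# The same search done step by step with find_all().
manual = []
for table in soup.find_all("table"):
    for li in table.find_all("li"):
        for div in li.find_all("div", class_="ItemContent"):
            manual.extend(a["href"] for a in div.find_all("a"))
print(manual)
```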

**Larz60+** · Apr-28-2017, 05:21 PM

Here's a document you might want to download and read.
First I'd like to vouch for the author, John W. Shipman (at New Mexico Tech).
He also has a reference manual for tkinter,
which is excellent.

I expect the lxml manual to be excellent as well.
Download here: http://www.nmt.edu/tcc/help/pubs/pylxml/pylxml.pdf

***snippsat*** · Apr-28-2017, 06:36 PM

(Apr-28-2017, 05:18 PM)zivoni Wrote: I have very rarely used xpath, more often css select with beautiful soup. Line like

Yes, I use CSS select more often.

A couple of tips: in the browser, right-click (Inspect), then in the HTML source right-click a tag or text and choose Copy XPath or Copy selector.
Here I target the "cheat sheet" text in my post.
A nice thing about XPath is that you can add text().
So then it looks like this:

from lxml import html
import requests

url = 'https://python-forum.io/Thread-help-with-lxml'
url_get = requests.get(url)
tree = html.fromstring(url_get.content)
lxml_soup = tree.xpath('//*[@id="pid_16425"]/a[3]/text()')[0]
print(lxml_soup)

Output:
cheat sheet

You can also use the console in the browser, e.g. Chrome.
Execute $x("some_xpath") or $$("css-selectors") in the Console panel; both will evaluate and validate the expression.

Press Ctrl + F in the Elements panel to enable DOM searching.
Type in an XPath or CSS selector to evaluate it.
If there are matching elements, they will be highlighted in the DOM.
