Hi, anyone here good with lxml?
I'm trying to learn it.
from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
also another snippet...
expr = "//*[local-name() = $name]"
print(root.xpath(expr, name = "foo")[0].tag)
output =
foo
The code within the ' ' for buyers, and the " " for expr is a mystery to me.
what is // ?
what is * ?
what is [ ] brackets for ?
what is @x ?
what is local ?
what is text() ?
is this html embedded in python? Is there a webpage that explains all this stuff?
The webpages I'm trying to learn off assume I know all this already
http://python-docs.readthedocs.io/en/lat...crape.html
http://lxml.de/xpathxslt.html
I am not good with lxml, but basically all your questions are about XPath. A gentle overview is the
XPath syntax page from w3schools.com;
the official documentation is much more detailed.
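To make the pieces asked about above a bit more concrete, here is a minimal sketch of what //, *, [ ], @x and text() each do. The sample document is made up for illustration (it is not the page from the original post), and it assumes lxml is installed.

```python
from lxml import etree

# Hypothetical sample document, made up for this sketch.
doc = etree.fromstring(
    '<shop>'
    '<div title="buyer-name">Carson Busses</div>'
    '<div title="item-price">$29.95</div>'
    '<section><div title="buyer-name">Earl E. Byrd</div></section>'
    '</shop>'
)

# //  : search at any depth below the starting point
all_divs = doc.xpath('//div')

# *   : wildcard, matches an element with any tag name
children = doc.xpath('/shop/*')

# [ ] : a predicate, i.e. a filter on the matched elements
# @x  : the attribute named x (here: title)
buyer_divs = doc.xpath('//div[@title="buyer-name"]')

# text() : the text nodes inside the matched elements
buyer_names = doc.xpath('//div[@title="buyer-name"]/text()')

print(len(all_divs))                # 3
print([el.tag for el in children])  # ['div', 'div', 'section']
print(buyer_names)                  # ['Carson Busses', 'Earl E. Byrd']
```

So the string inside tree.xpath(...) is not HTML embedded in Python; it is an XPath expression, a small query language of its own that lxml evaluates against the parsed tree.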
I'm not good with lxml either. This...
Quote:buyers = tree.xpath('//div[@title="buyer-name"]/text()')
I know is getting
[div.text for div in soup.find_all('div', {'title':'buyer-name'})]
in BeautifulSoup syntax. I'm not sure what the other one is doing, so I guess I am not much help.
I prefer to use it as the parser for BeautifulSoup
from bs4 import BeautifulSoup
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
soup = BeautifulSoup(page.text, 'lxml')
buyers = soup.find_all('div', {'title':'buyer-name'})
for buyer in buyers:
    print(buyer.text)
Output:
Carson Busses
Earl E. Byrd
Patty Cakes
Derri Anne Connecticut
Moe Dess
Leda Doggslife
Dan Druff
Al Fresco
Ido Hoe
Howie Kisses
Len Lease
Phil Meup
Ira Pent
Ben D. Rules
Ave Sectomy
Gary Shattire
Bobbi Soks
Sheila Takya
Rose Tattoo
Moe Tell
thx for the answers, you've got me unstuck and I'm moving forward again
I have a two-part tutorial about this:
1,
2.
I also find BeautifulSoup (with lxml as parser) simpler to work with.
lxml supports both XPath and CSS selectors; BS supports CSS selectors.
They also have their own methods, like
find()
find_all() (spelled findall() in lxml)
to find stuff in HTML/XML.
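As a rough sketch of those built-in methods on the lxml side: elements have find(), findall() and findtext(), which take a small path syntax rather than full XPath. The markup below is a made-up sample for illustration.

```python
from lxml import etree

# Hypothetical sample markup for this sketch.
tree = etree.fromstring(
    '<body>'
    '<div id="hero">Superman</div>'
    '<div class="superfoo"><p>Hulk</p></div>'
    '</body>'
)

first_div = tree.find('.//div')     # first matching element, or None
all_divs = tree.findall('.//div')   # list of all matching elements
hulk = tree.findtext('.//div/p')    # text of the first matching element

print(first_div.text)  # Superman
print(len(all_divs))   # 2
print(hulk)            # Hulk
```

Note that these methods do not understand things like text() or functions such as local-name(); for those you still need xpath().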
I like a setup like this for learning,
where I have written some sample HTML code.
from lxml import etree
html = '''\
<html>
<head>
<title>foo</title>
</head>
<body>
<div id="hero">Superman</div>
<div class="superfoo">
<p>Hulk</p>
</div>
</body>
</html>'''
# lxml using XPath
tree = etree.fromstring(html)
lxml_soup = tree.xpath('//title')
Usage:
>>> etree.tostring(lxml_soup[0])
b'<title>foo</title>\n '
>>> lxml_soup[0].text
'foo'
So
etree.tostring
shows the result.
You can use
find()
to reach a child element and .text to get its text.
Other XPath testing:
>>> lxml_soup = tree.xpath('//div[@class="superfoo"]')
>>> etree.tostring(lxml_soup[0])
b'<div class="superfoo">\n <p>Hulk</p>\n </div>\n '
>>> lxml_soup[0].find('p')
<Element p at 0x3a06260>
>>> lxml_soup[0].find('p').text
'Hulk'
# Or add text()
>>> lxml_soup = tree.xpath('//div[@class="superfoo"]/p/text()')
>>> lxml_soup[0]
'Hulk'
For both XPath and CSS selectors, this
cheat sheet is helpful.
wow that cheat sheet is nice.
@snippsat
Do you know what this does?
Quote:expr = "//*[local-name() = $name]"
print(root.xpath(expr, name = "foo")[0].tag)
I was researching it yesterday, but all i found was a thread talking about namespaces. I assume local-name() is putting your variable into lxml namespace?
local-name() returns the element name without the namespace prefix. That example uses variable substitution, so it's equal to
root.xpath("//*[local-name()='foo']")[0].tag
and basically searches for the foo tag ...
With snippsat's example tree
Output:
In [5]: tree.xpath("//*[local-name()=$name]", name='div')[0].text
Out[5]: 'Superman'
It just finds 'div' tags. I guess that using local-name() instead of the tag name has its uses with more sophisticated XML documents; I have never used it. Actually I have very rarely used xpath, more often css select with beautiful soup. A line like
soup.select("table li div.ItemContent a")
is much better for picking deeply nested content than finding all tables, then lists in these tables etc...
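One case where local-name() seems to earn its keep is XML with namespaces. The sketch below uses a made-up document and a made-up namespace URI (http://example.com/a), so treat it as an illustration rather than anything from the lxml docs: a plain //foo only matches foo elements with no namespace, while the local-name() version matches foo regardless of namespace.

```python
from lxml import etree

# Hypothetical XML with one namespaced <foo> and one plain <foo>.
xml = b'''<root xmlns:a="http://example.com/a">
  <a:foo>namespaced</a:foo>
  <foo>plain</foo>
</root>'''
root = etree.fromstring(xml)

# Plain name test: only matches <foo> elements that have no namespace.
plain = root.xpath('//foo')

# local-name() ignores the namespace; $name is filled in from the
# keyword argument, as in the snippet quoted above.
any_ns = root.xpath('//*[local-name() = $name]', name='foo')

print([el.text for el in plain])   # ['plain']
print([el.tag for el in any_ns])   # ['{http://example.com/a}foo', 'foo']
```

Passing the name as a variable instead of building the string yourself also avoids quoting problems when the value comes from user input.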
Here's a document you might want to download and read.
First I'd like to vouch for the author, John W. Shipman
(at New Mexico Tech). He also has a reference manual for tkinter
which is excellent.
I expect the lxml manual to be excellent as well.
Download here:
http://www.nmt.edu/tcc/help/pubs/pylxml/pylxml.pdf
(Apr-28-2017, 05:18 PM) zivoni wrote: I have very rarely used xpath, more often css select with beautiful soup. Line like
Yes, I use CSS selectors more often.
A couple of tips: in the browser, right-click (Inspect), then in the HTML source right-click on a tag or text and choose Copy XPath or Copy selector.
For example, target the
cheat sheet
text in my post.
A cool thing about XPath is that you can add
text()
so it then looks like this:
from lxml import html
import requests
url = 'https://python-forum.io/Thread-help-with-lxml'
url_get = requests.get(url)
tree = html.fromstring(url_get.content)
lxml_soup = tree.xpath('//*[@id="pid_16425"]/a[3]/text()')[0]
print(lxml_soup)
Output:
cheat sheet
You can also use the console in the browser, e.g. Chrome.
Execute
$x("some_xpath")
or
$$("css-selectors")
in the Console panel, which will both evaluate and validate the expression.
Press Ctrl + F to enable DOM searching in the panel.
Type in XPath or CSS selectors to evaluate.
If there are matched elements, they will be highlighted in DOM.