Python Forum
lxml tutorial?
#1
Since I can't post or reply in the Tutorials forum, I am writing here.

I would like a tutorial about scraping web pages using lxml alone. All I have seen a while ago in the internet space doesn't have enough explanations for basic things.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#2
I cover the basics in Web-Scraping part-1.
There I show how to connect and scrape with XPath and CSS selectors in lxml (together with Requests).
lxml has its own connector (it can read web pages like Requests/urllib), but it's better to use Requests in all cases.
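
To make the two routes concrete, here is a small offline sketch. The page bytes and the temporary file are made up for illustration; in real use the bytes would come from `requests.get(url).content`. `html.parse()` is lxml's own loader and also accepts a URL, but how well that works depends on the underlying libxml2 build (HTTPS in particular is commonly unsupported), which is one reason to fetch with Requests instead.

```python
import os
import tempfile

from lxml import html

# Hypothetical page content; normally this would be requests.get(url).content
PAGE = b"<html><head><title>Demo page</title></head><body><p>Hi</p></body></html>"

# Route 1: parse bytes that were already fetched by something else
tree_from_bytes = html.fromstring(PAGE)

# Route 2: let lxml read the source itself (here a local file)
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(PAGE)
    path = tmp.name
try:
    tree_from_file = html.parse(path).getroot()
finally:
    os.remove(path)

# Both routes end up with the same kind of element tree
title_a = tree_from_bytes.findtext(".//title")
title_b = tree_from_file.findtext(".//title")
print(title_a, title_b)
```

Either way you get an element tree back, so the XPath and CSS selector calls are the same afterwards.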

With this setup you can scrape pretty much everything out there except JavaScript-generated content,
which I cover in more detail in the updated part-2.

A couple of examples: copy the setup from my tutorial and scrape some content from this forum.
Fix the missing new line in your post with XPath ;)
from lxml import html
import requests

url = 'https://python-forum.io/Thread-lxml-tutorial'
url_get = requests.get(url)
tree = html.fromstring(url_get.content)
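# Third text node of post #1 (this id is specific to the thread)
reply = tree.xpath('//*[@id="pid_43585"]/text()[3]')[0]
# Print each sentence on its own line
for new_line in reply.split('.'):
    print(new_line.strip())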
Output:
I would like a tutorial about scraping web pages using lxml alone
All I have seen a while ago in the internet space doesn't have enough explanations for basic things
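
The `text()[3]` part of that XPath is probably the least obvious bit, so here is a minimal offline sketch of what it does. The markup and the id `pid_1` are invented for illustration and are not the forum's real HTML. Each `<br>` splits the text of the `div` into separate text nodes, and `text()[3]` picks the third one (XPath counts from 1).

```python
from lxml import html

# Hypothetical markup standing in for a forum post body
SAMPLE = """<html><body>
<div id="pid_1">
  First line.<br>
  Second line.<br>
  Third line, the one we want.
</div>
</body></html>"""

tree = html.fromstring(SAMPLE)
post = tree.xpath('//*[@id="pid_1"]')[0]

# text() returns every text node that is a direct child of the div
text_nodes = post.xpath('text()')

# text()[3] selects only the third text node (XPath indexes start at 1)
third = post.xpath('text()[3]')[0].strip()

print(len(text_nodes))
print(third)
```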
Scrape forum titles with CSS Selectors.
# pip install cssselect  (tree.cssselect() needs this package)
from lxml import html
import requests

url = 'https://python-forum.io/index.php'
url_get = requests.get(url)
tree = html.fromstring(url_get.content)
# Rows 2-7 of the category table hold the forum links
for forum_name in range(2, 8):
    name = tree.cssselect(f'#cat_7_e > tr:nth-child({forum_name}) > td:nth-child(2) > strong > a')[0]
    print(name.text)
Output:
General Coding Help
Homework
GUI
Game Development
Networking
Web Development
I mostly use lxml through BeautifulSoup(url_get.content, 'lxml'),
but if you like XPath you have to use lxml alone as shown; BS does not support XPath, only CSS selectors.
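
For anyone who wants to stay with lxml alone and skip the cssselect package, the CSS selector from the index-page example can be written as an XPath instead. This is only a sketch on invented markup (the table below imitates the structure the selector implies, it is not the real forum page). One caveat: `tr[2]` counts among `tr` siblings while `:nth-child(2)` counts among all element siblings, so the two only match when the parent contains nothing but rows, as it does here.

```python
from lxml import html

# Hypothetical table shaped like the one the CSS selector walks through
SAMPLE = """<html><body>
<table>
  <tbody id="cat_7_e">
    <tr><td colspan="2">Category header</td></tr>
    <tr><td>icon</td><td><strong><a href="#">General Coding Help</a></strong></td></tr>
    <tr><td>icon</td><td><strong><a href="#">Homework</a></strong></td></tr>
    <tr><td>icon</td><td><strong><a href="#">GUI</a></strong></td></tr>
  </tbody>
</table>
</body></html>"""

tree = html.fromstring(SAMPLE)

# Rough XPath counterpart of
# '#cat_7_e > tr:nth-child(n) > td:nth-child(2) > strong > a' for n >= 2
links = tree.xpath('//*[@id="cat_7_e"]/tr[position() >= 2]/td[2]/strong/a')
forum_names = [a.text for a in links]

print(forum_names)
```

One XPath call returns all rows at once, so the `range()` loop is not needed in this version.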
#3
XPath is like hardcoding the data, so if something changes on the web page it might not work anymore. However, I have seen a lot of articles where people say that they prefer lxml to bs4. I want to know why.
#4
I think the reason many recommend lxml is speed (it uses the C libraries libxml2 and libxslt)
and support for both XPath and CSS selectors.
I do find the lxml help docs confusing, though.

Beautiful Soup 4 is easy to use in almost all cases,
and it can also plug in lxml as its parser for speed: BeautifulSoup(markup, "lxml").
They both work fine; whether you use one or the other can just be a matter of preference.
#5
(Apr-01-2018, 05:20 PM)wavic Wrote: XPath is like hardcoding the data, so if something changes on the web page it might not work anymore. However, I have seen a lot of articles where people say that they prefer lxml to bs4. I want to know why.
The one reason I like XPath is that all I have to do is copy the XPath of a highlighted element. I don't really have to think about the page structure. Sometimes that's good and sometimes it's not, depending on the case.

If the web page's code changes, I actually often find it easier to fix with XPath, since by then I won't remember the structure anymore and I just want to fix the issue quickly.
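
As a rough illustration of the trade-off discussed in this thread, the sketch below compares a copied, position-based XPath with one anchored on an `id`. Both pages and both expressions are invented for the example; the point is only that a layout change (a new banner here) can break the positional path while the attribute-based one keeps working.

```python
from lxml import html

# Hypothetical page before a redesign
OLD_PAGE = """<html><body>
<div class="wrapper"><div class="post" id="pid_1">Hello from the post</div></div>
</body></html>"""

# Same page after a banner was added above the wrapper
NEW_PAGE = """<html><body>
<div class="banner">New banner</div>
<div class="wrapper"><div class="post" id="pid_1">Hello from the post</div></div>
</body></html>"""

# Position-based path, similar to what a browser's "copy XPath" can produce
ABSOLUTE = '/html/body/div[1]/div[1]/text()'
# Path anchored on an attribute instead of on position
BY_ID = '//div[@id="pid_1"]/text()'


def extract(page, xpath):
    """Parse the page and return the list of matches for the XPath."""
    return html.fromstring(page).xpath(xpath)


for label, page in (("old", OLD_PAGE), ("new", NEW_PAGE)):
    print(label, extract(page, ABSOLUTE), extract(page, BY_ID))
```

On the old page both expressions find the post; on the new page the positional one returns an empty list because `div[1]` is now the banner.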

I also tend to use XPath more with Selenium than with requests/bs4 for some reason, probably because I always expect my Selenium scripts to break at some point.
#6
I also find the lxml documentation confusing. A while ago I really wanted to learn it, but that stopped me. Perhaps I will try again soon. I have more experience now, so maybe this time I will get it. Finally...