Since I can't post or reply in the Tutorials forum, I am writing here.
I would like a tutorial about scraping web pages using lxml alone. All I have seen a while ago in the internet space doesn't have enough explanations for basic things.
I cover the basics in Web-Scraping part-1.
There I show how to connect and scrape with XPath and CSS Selectors in lxml (and Requests).
lxml has its own connector (it can read web pages like Requests/urllib), but it's better to use Requests in all cases.
With this setup you can scrape pretty much everything out there except JavaScript-generated content, which I cover in more detail in the updated part-2.
Here are a couple of examples: copy the setup from my tutorial and scrape something on this forum.
Fix the line breaks in your post with XPath:
from lxml import html
import requests

url = 'https://python-forum.io/Thread-lxml-tutorial'
url_get = requests.get(url)
tree = html.fromstring(url_get.content)
# Third text node inside the element with this post id
reply = tree.xpath('//*[@id="pid_43585"]/text()[3]')[0]
# Print one sentence per line
for new_line in reply.split('.'):
    print(new_line.strip())
Output:
I would like a tutorial about scraping web pages using lxml alone
All I have seen a while ago in the internet space doesn't have enough explanations for basic things
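To make the text()[3] part of that XPath less mysterious, here is a small offline sketch. The markup and the id post_1 are made up for illustration and are not the forum's real HTML; the point is only that each <br> splits the element's text into separate text nodes, and XPath counts them from 1.

```python
from lxml import html

# Hypothetical markup standing in for a forum post body (not the real page).
markup = (
    '<div id="post_1">First line.<br>'
    'Second line.<br>'
    'Third sentence. Fourth sentence.</div>'
)

tree = html.fromstring(markup)

# text() returns the direct text nodes of the element; the two <br> tags
# split the text into three nodes.
nodes = tree.xpath('//*[@id="post_1"]/text()')

# XPath positions start at 1, so [3] is the last of the three nodes here.
third = tree.xpath('//*[@id="post_1"]/text()[3]')[0]

# Same split-on-period idea as above, skipping empty pieces.
sentences = [part.strip() for part in third.split('.') if part.strip()]
for sentence in sentences:
    print(sentence)
```

If the post had a different number of line breaks, the index would have to change, which is why this kind of XPath is tied to one specific page layout.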
Scrape forum titles with CSS Selectors.
# pip install cssselect
from lxml import html
import requests

url = 'https://python-forum.io/index.php'
url_get = requests.get(url)
tree = html.fromstring(url_get.content)
# Rows 2-7 of the category table hold the forum links
for forum_name in range(2, 8):
    name = tree.cssselect(f'#cat_7_e > tr:nth-child({forum_name}) > td:nth-child(2) > strong > a')[0]
    print(name.text)
Output:
General Coding Help
Homework
GUI
Game Development
Networking
Web Development
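As an alternative to indexing the rows one at a time, a single XPath query can return every matching link at once. This is only a sketch on made-up markup (the id cat_1 and the table layout are assumptions, not the forum's actual HTML), and it uses XPath rather than CSS so that it runs without the cssselect package.

```python
from lxml import html

# Hypothetical table loosely modelled on a forum index (not the real page).
markup = (
    '<table><tbody id="cat_1">'
    '<tr><td colspan="2">Category header</td></tr>'
    '<tr><td>icon</td><td><strong><a href="/f1">General Coding Help</a></strong></td></tr>'
    '<tr><td>icon</td><td><strong><a href="/f2">Homework</a></strong></td></tr>'
    '<tr><td>icon</td><td><strong><a href="/f3">GUI</a></strong></td></tr>'
    '</tbody></table>'
)

tree = html.fromstring(markup)

# td[2] is the second cell of each row; the header row has only one cell,
# so it drops out of the result without any explicit range().
names = tree.xpath('//*[@id="cat_1"]/tr/td[2]/strong/a/text()')
for name in names:
    print(name)
```

This avoids hardcoding how many rows the category has, at the cost of still depending on the cell and tag structure.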
I mostly use lxml through BeautifulSoup(url_get.content, 'lxml'),
but if you like XPath you have to use lxml alone as shown; BS does not support XPath, only CSS Selectors.
Xpath is like hardcoding the data so if something is changed on the web page it could not work anymore. However, I have seen a lot of articles where the people are saying that they prefer lxml instead of bs4. I want to know why.
I think the reasons many recommend lxml are speed (it uses the C libraries libxml2 and libxslt)
and support for both XPath and CSS selectors.
I do find the lxml documentation confusing.
Beautiful Soup 4 is easy to use in almost all cases,
and it can also plug in lxml as its parser for speed:
BeautifulSoup(markup, "lxml")
They both work fine; whether you use one or the other can just be a matter of preference.
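A rough side-by-side of the two approaches on the same markup might look like this. The HTML snippet and the id forums are invented for the example; it assumes both beautifulsoup4 and lxml are installed.

```python
from bs4 import BeautifulSoup
from lxml import html

# Hypothetical markup used only for this comparison.
markup = (
    '<ul id="forums">'
    '<li><a href="/f1">Homework</a></li>'
    '<li><a href="/f2">GUI</a></li>'
    '</ul>'
)

# Beautiful Soup with lxml as the parser, queried with a CSS selector.
soup = BeautifulSoup(markup, 'lxml')
bs_names = [a.get_text() for a in soup.select('#forums > li > a')]

# lxml alone, queried with XPath.
tree = html.fromstring(markup)
lxml_names = tree.xpath('//ul[@id="forums"]/li/a/text()')

print(bs_names)
print(lxml_names)
```

Both queries pull the same two link texts here, which fits the point that the choice is largely about which query style you prefer.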
(Apr-01-2018, 05:20 PM)wavic Wrote: Xpath is like hardcoding the data so if something is changed on the web page it could not work anymore. However, I have seen a lot of articles where the people are saying that they prefer lxml instead of bs4. I want to know why.
The one reason I like XPath is that all I have to do is copy the XPath of a highlighted element. I don't really have to think about the page structure. Sometimes that's good and sometimes it's not, depending on the case.
If the web page changes its code, I actually often find it easier to fix with XPath, as I won't remember the structure anymore at that point and I just want to fix the issue quickly.
I also tend to use XPath more with Selenium than with requests/bs4 for some reason. Probably because I am always expecting my Selenium scripts to break at some point.
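To illustrate the "hardcoding" concern with something concrete, here is a small sketch comparing a copied, position-based XPath with one anchored on an attribute. Both pages and the class name post_body are made up; the second page just adds one extra div at the top.

```python
from lxml import html

# Two hypothetical versions of the same page; "after" gains a banner div.
before = (
    '<html><body><div>nav</div>'
    '<div><p class="post_body">Hello</p></div></body></html>'
)
after = (
    '<html><body><div>banner</div><div>nav</div>'
    '<div><p class="post_body">Hello</p></div></body></html>'
)

# Position-based path, like one copied from browser dev tools.
copied = '/html/body/div[2]/p/text()'
# Path anchored on an attribute instead of on position.
anchored = '//p[@class="post_body"]/text()'

results = {}
for label, page in (('before', before), ('after', after)):
    tree = html.fromstring(page)
    results[label] = (tree.xpath(copied), tree.xpath(anchored))
    print(label, results[label])
```

On the changed page the copied path comes back empty while the attribute-based one still finds the text, though an attribute-based path would break just as easily if the class name changed.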
I also find the lxml documentation confusing. I had a huge desire a while ago to learn it, but this stopped me. Perhaps I will try again soon. I have more experience now, so maybe this time I will get it. Finally...