I cover the basics in Web-Scraping part-1. There I show how to connect and scrape with XPath and CSS Selectors in lxml (and Requests).
lxml has its own connector (it can read web pages like Requests/urllib), but it's better to use Requests in all cases.
With this setup you can scrape pretty much everything out there except JavaScript-rendered content, which I cover in the updated part-2.
A couple of examples: I copy the setup from my tutorial and scrape a bit of this forum.
Fix the new lines in your post with XPath.
from lxml import html
import requests

url = 'https://python-forum.io/Thread-lxml-tutorial'
url_get = requests.get(url)
tree = html.fromstring(url_get.content)
# Take the third text node of the post, then print one sentence per line
reply = tree.xpath('//*[@id="pid_43585"]/text()[3]')[0]
for new_line in reply.split('.'):
    print(new_line.strip())
Output:
I would like a tutorial about scraping web pages using lxml alone
All I have seen a while ago in the internet space doesn't have enough explanations for basic things
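To see what the `text()[3]` step in that XPath does, here is a small offline sketch. The HTML string and the id `pid_1` are made up for illustration; the real forum markup may look different. It assumes a post where lines are separated by `<br>` tags, so each line becomes its own text node.

```python
from lxml import html

# Hypothetical post markup (an assumption, not the real forum HTML)
sample = (
    '<html><body>'
    '<div id="pid_1">'
    'First line<br>'
    'Second line<br>'
    'One sentence. Another sentence.'
    '</div>'
    '</body></html>'
)

tree = html.fromstring(sample)
# text() returns every text node that is a direct child of the div
text_nodes = tree.xpath('//*[@id="pid_1"]/text()')
# text()[3] keeps only the third of those nodes (XPath counts from 1)
third = tree.xpath('//*[@id="pid_1"]/text()[3]')[0]
# Split into sentences and drop the empty piece after the last dot
sentences = [part.strip() for part in third.split('.') if part.strip()]
print(text_nodes)
print(sentences)
```

Filtering out empty pieces avoids the blank line that a trailing dot would otherwise produce after `split('.')`.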
Scrape forum titles with CSS Selectors.
# pip install cssselect (needed by tree.cssselect())
from lxml import html
import requests

url = 'https://python-forum.io/index.php'
url_get = requests.get(url)
tree = html.fromstring(url_get.content)
# Rows 2 to 7 of the category table hold the forum links
for forum_name in range(2, 8):
    name = tree.cssselect(f'#cat_7_e > tr:nth-child({forum_name}) > td:nth-child(2) > strong > a')[0]
    print(name.text)
Output:
General Coding Help
Homework
GUI
Game Development
Networking
Web Development
I mostly use lxml through BeautifulSoup(url_get.content, 'lxml'), but if you like XPath you have to use lxml alone as shown; BS does not support XPath, only CSS Selectors.
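A quick side-by-side sketch of the two approaches, run on a made-up HTML string (the markup and class name are assumptions for illustration). It assumes both `beautifulsoup4` and `lxml` are installed.

```python
# pip install beautifulsoup4 lxml
from bs4 import BeautifulSoup
from lxml import html

# Hypothetical markup, only for this comparison
sample = (
    '<html><body><div class="post">'
    '<a href="/t1">Thread one</a>'
    '<a href="/t2">Thread two</a>'
    '</div></body></html>'
)

# BeautifulSoup with lxml as the parser: CSS selectors via select()
soup = BeautifulSoup(sample, 'lxml')
bs_titles = [a.text for a in soup.select('div.post > a')]

# lxml alone: the same links via XPath
tree = html.fromstring(sample)
lxml_titles = tree.xpath('//div[@class="post"]/a/text()')

print(bs_titles)
print(lxml_titles)
```

Both lists should hold the same two link texts; the difference is only in how the elements are addressed.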