Learning advanced lxml

Python Forum (https://python-forum.io), Web Scraping & Web Development, thread /thread-2806.html
Learning advanced lxml - Larz60+ - Apr-12-2017

I started digging into lxml about two hours ago, for no other reason than that I want a better knowledge of what it's capable of.

I'm getting a UTF-8 encoding error. I tried etree.tostring, but that didn't seem to do the trick:

```python
from lxml import etree
import requests
import socket


class TryLxml2:
    def __init__(self, url=None):
        try:
            if socket.gethostbyname(socket.gethostname()) != '127.0.0.1':
                with open('data\\rfc-index.xml', 'wb') as f:
                    self.response = requests.get(url, stream=True)
        except Exception as ex:
            # todo -- Use tkinter.messagebox here
            template = "An exception of type {0} occurred. arguments: \n{1!r}"
            message = template.format(type(ex).__name__, ex.args)
            print(message)
            # raise Exception

        try:
            # etree.tostring(self.xml, encoding='UTF-8', xml_declaration=False)
            doc = etree.XML(self.response.text.strip())
            rfc_entry = doc.findtext('rfc-entry')
        except etree.XMLSyntaxError as ex:
            template = "An exception of type {0} occurred. arguments: \n{1!r}"
            message = template.format(type(ex).__name__, ex.args)
            print(message)


def main(url):
    TryLxml2(url)


if __name__ == '__main__':
    filename = 'data\\dev_data.xml'
    main('https://www.rfc-editor.org/rfc-index.xml')
```

By the way, I found a good reference for this by John Shipman, who has an excellent tkinter reference as well.
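The error described in the post above can most likely be reproduced without touching the network. The following is a minimal sketch, assuming the failure comes from handing lxml a `str` that still carries an XML declaration; `TEXT` is a hand-written stand-in for `response.text`, and `describe_parse` is a hypothetical helper, neither is from the original post.

```python
from lxml import etree

# Hand-written stand-in for response.text (an assumption, not the real index):
# a str that still starts with an XML declaration naming an encoding.
TEXT = '<?xml version="1.0" encoding="UTF-8"?>\n<rfc-index><rfc-entry/></rfc-index>'


def describe_parse(data):
    """Hypothetical helper: report which exception etree.XML raises, or 'ok'."""
    try:
        etree.XML(data)
    except etree.XMLSyntaxError:
        # The only exception type the post's second try block handles.
        return 'XMLSyntaxError'
    except ValueError:
        # lxml refuses str input that carries an encoding declaration.
        return 'ValueError'
    return 'ok'


if __name__ == '__main__':
    print(describe_parse(TEXT))                  # str input
    print(describe_parse(TEXT.encode('utf-8')))  # bytes input
```

If that assumption holds, the exception is a `ValueError`, which the `except etree.XMLSyntaxError` clause in the post does not catch, so it would surface as an uncaught traceback rather than the formatted message.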
RE: Learning advanced lxml - wavic - Apr-12-2017

Cut off the encoding declaration from the document.

RE: Learning advanced lxml - snippsat - Apr-12-2017

```python
doc = etree.XML(self.response.text.strip())
# Change to
doc = etree.XML(self.response.content.strip())
```

I often find it easier to call the lxml parser from BS.

```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.rfc-editor.org/rfc-index.xml'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml-xml')
entry = soup.find('rfc-entry')
```

Test:

```python
>>> entry
<rfc-entry>
    <doc-id>RFC0001</doc-id>
    <title>Host Software</title>
    <author>
        <name>S. Crocker</name>
    </author>
    <date>
        <month>April</month>
        <year>1969</year>
    </date>
    <format>
        <file-format>ASCII</file-format>
        <char-count>21088</char-count>
        <page-count>11</page-count>
    </format>
    <current-status>UNKNOWN</current-status>
    <publication-status>UNKNOWN</publication-status>
    <stream>Legacy</stream>
    <doi>10.17487/RFC0001</doi>
</rfc-entry>
>>> entry.find('name')
<name>S. Crocker</name>
>>> entry.find('name').text
'S. Crocker'
```

RE: Learning advanced lxml - Larz60+ - Apr-12-2017

Thanks again, snippsat. You're very knowledgeable in this area, and you're teaching an old dog a lot of new tricks! I have used it from bs4, and was doing an exercise without it. Kind of like removing training wheels. Ultimately I will always be using bs4 and lxml together. I'll forget about going bareback for now!
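For anyone who wants to stay with plain lxml, the two suggestions in this thread (pass bytes, or cut off the declaration) can be sketched as follows. This is only an illustration under stated assumptions: `SAMPLE` is a hand-written fragment modeled on the entry shown above, the `xmlns` value on its root is an assumption about the real index, and `parse_bytes`, `parse_text` and `first_entry_author` are hypothetical helper names. The `{*}` wildcard is used so the lookup works whether or not the document declares a default namespace.

```python
import re
from lxml import etree

# Hand-written sample modeled on the entry shown in the thread.
# The xmlns value is an assumption, not taken from the thread.
SAMPLE = b'''<?xml version="1.0" encoding="UTF-8"?>
<rfc-index xmlns="http://www.rfc-editor.org/rfc-index">
  <rfc-entry>
    <doc-id>RFC0001</doc-id>
    <title>Host Software</title>
    <author><name>S. Crocker</name></author>
  </rfc-entry>
</rfc-index>'''


def parse_bytes(raw):
    """snippsat's suggestion: give lxml bytes (response.content), not str."""
    return etree.XML(raw)


def parse_text(text):
    """wavic's suggestion: cut the XML declaration off before parsing a str."""
    stripped = re.sub(r'^\s*<\?xml[^>]*\?>', '', text, count=1).lstrip()
    return etree.XML(stripped)


def first_entry_author(root):
    """Return the author name of the first rfc-entry, ignoring namespaces."""
    entry = root.find('{*}rfc-entry')
    return entry.findtext('{*}author/{*}name')


if __name__ == '__main__':
    print(first_entry_author(parse_bytes(SAMPLE)))
    print(first_entry_author(parse_text(SAMPLE.decode('utf-8'))))
```

Passing bytes is usually the simpler route, since lxml then reads the encoding from the declaration itself instead of relying on whatever encoding requests guessed for `response.text`.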