Learning advanced lxml - Printable Version

Learning advanced lxml - Larz60+ - Apr-12-2017

I started (about two hours ago) digging into lxml, for no other reason than
I want to develop a better knowledge of what it's capable of.

I'm getting a UTF-8 encoding error. I tried etree.tostring, but that
didn't seem to do the trick:

from lxml import etree
import requests
import socket


class TryLxml2:
   def __init__(self, url=None):
       try:
           if socket.gethostbyname(socket.gethostname()) != '127.0.0.1':
               with open('data\\rfc-index.xml', 'wb') as f:
                   self.response = requests.get(url, stream=True)

       except Exception as ex:
           # todo -- Use tkinter.messagebox here
           template = "An exception of type {0} occurred. arguments: \n{1!r}"
           message = template.format(type(ex).__name__, ex.args)
           print(message)
           # raise Exception

       try:
           # etree.tostring(self.xml, encoding='UTF-8', xml_declaration=False)
           doc = etree.XML(self.response.text.strip())
           rfc_entry = doc.findtext('rfc-entry')
       except etree.XMLSyntaxError as ex:
           template = "An exception of type {0} occurred. arguments: \n{1!r}"
           message = template.format(type(ex).__name__, ex.args)
           print(message)

def main(url):
   TryLxml2(url)

if __name__ == '__main__':
   filename = r'data\dev_data.xml'  # raw string, so \d is not read as an escape
   main('https://www.rfc-editor.org/rfc-index.xml')

Error:
Traceback (most recent call last):
 File "M:/python/q-t/r/RFC_Library/src/TryLxml2.py", line 35, in <module>
   main('https://www.rfc-editor.org/rfc-index.xml')
 File "M:/python/q-t/r/RFC_Library/src/TryLxml2.py", line 31, in main
   TryLxml2(url)
 File "M:/python/q-t/r/RFC_Library/src/TryLxml2.py", line 23, in __init__
   doc = etree.XML(self.response.text.strip())
 File "src\lxml\lxml.etree.pyx", line 3192, in lxml.etree.XML (src\lxml\lxml.etree.c:78747)
 File "src\lxml\parser.pxi", line 1843, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:118266)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

By the way, I found a good lxml reference by John Shipman, who has written an excellent tkinter reference as well.

RE: Learning advanced lxml - wavic - Apr-12-2017

Strip the encoding declaration from the document before parsing it.
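
A minimal sketch of that idea, assuming the declaration sits at the very start of the text. The helper name strip_xml_declaration and the sample string are made up for illustration; they are not part of lxml or of the original script.

```python
import re

# Matches a leading XML declaration such as
# <?xml version="1.0" encoding="UTF-8"?> plus surrounding whitespace.
_XML_DECL = re.compile(r'^\s*<\?xml\s[^>]*\?>\s*')


def strip_xml_declaration(text):
    """Return text without its leading XML declaration, if it has one."""
    return _XML_DECL.sub('', text, count=1)


# Made-up sample standing in for response.text
sample = '<?xml version="1.0" encoding="UTF-8"?>\n<rfc-index><rfc-entry/></rfc-index>'
cleaned = strip_xml_declaration(sample)
print(cleaned)
```

The cleaned string no longer carries an encoding declaration, so it should be acceptable to etree.XML as a str.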

RE: Learning advanced lxml - snippsat - Apr-12-2017

doc = etree.XML(self.response.text.strip())
# Change to
doc = etree.XML(self.response.content.strip())
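
The difference is that response.text is a decoded str, while response.content is the raw bytes. A small offline sketch of both cases, using a made-up sample string in place of the downloaded file:

```python
from lxml import etree

# Made-up sample standing in for the downloaded document
xml_text = '<?xml version="1.0" encoding="UTF-8"?><rfc-index><rfc-entry/></rfc-index>'

# str input that still carries an encoding declaration: lxml refuses it
try:
    etree.XML(xml_text)
    str_result = 'parsed'
except ValueError:
    str_result = 'ValueError'

# bytes input: lxml reads the encoding from the declaration itself
doc = etree.XML(xml_text.encode('utf-8'))
root_tag = doc.tag
print(str_result, root_tag)
```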

I often find it easier to call the lxml parser through BeautifulSoup.

import requests
from bs4 import BeautifulSoup

url = 'https://www.rfc-editor.org/rfc-index.xml'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml-xml')
entry = soup.find('rfc-entry')

Test:

>>> entry
<rfc-entry>
<doc-id>RFC0001</doc-id>
<title>Host Software</title>
<author>
<name>S. Crocker</name>
</author>
<date>
<month>April</month>
<year>1969</year>
</date>
<format>
<file-format>ASCII</file-format>
<char-count>21088</char-count>
<page-count>11</page-count>
</format>
<current-status>UNKNOWN</current-status>
<publication-status>UNKNOWN</publication-status>
<stream>Legacy</stream>
<doi>10.17487/RFC0001</doi>
</rfc-entry>

>>> entry.find('name')
<name>S. Crocker</name>
>>> entry.find('name').text
'S. Crocker'
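
For comparison, here is a rough lxml-only sketch of the same lookup. It assumes the root element declares a default namespace; the URI below is a placeholder, not taken from the real file, so check the actual document before relying on it. The sample data is trimmed from the entry shown above. Note that findtext('rfc-entry') would only return the whitespace between the child tags, so find is probably what is wanted here.

```python
from lxml import etree

# Made-up sample; the namespace URI is a placeholder (assumption)
xml_bytes = b'''<?xml version="1.0" encoding="UTF-8"?>
<rfc-index xmlns="http://example.com/rfc-index">
  <rfc-entry>
    <doc-id>RFC0001</doc-id>
    <title>Host Software</title>
    <author><name>S. Crocker</name></author>
  </rfc-entry>
</rfc-index>'''

doc = etree.XML(xml_bytes)
ns = {'r': 'http://example.com/rfc-index'}

# An unqualified tag name does not match namespaced elements
unqualified = doc.find('rfc-entry')

# Qualify each path step with the prefix mapped in ns
entry = doc.find('r:rfc-entry', namespaces=ns)
name = entry.findtext('r:author/r:name', namespaces=ns)
print(unqualified, name)
```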

RE: Learning advanced lxml - Larz60+ - Apr-12-2017

Thanks again Snippsat, you're very knowledgeable in this area, and
teaching an old dog a lot of new tricks!

I have used lxml through bs4 before, and this was an exercise in doing
without it. Kind of like removing training wheels. Ultimately I will
always be using bs4 and lxml together.

I'll forget about going bareback for now!