Learning advanced lxml

**Larz60+** · Apr-12-2017, 04:34 AM

I started (about two hours ago) digging into lxml, for no other reason than
I want to develop a better knowledge of what it's capable of.

I'm getting a UTF-8 encoding error. I tried etree.tostring, but that
didn't seem to do the trick:

from lxml import etree
import requests
import socket


class TryLxml2:
   def __init__(self, url=None):
       try:
           if socket.gethostbyname(socket.gethostname()) != '127.0.0.1':
               with open('data\\rfc-index.xml', 'wb') as f:
                   self.response = requests.get(url, stream=True)

       except Exception as ex:
           # todo -- Use tkinter.messagebox here
           template = "An exception of type {0} occurred. arguments: \n{1!r}"
           message = template.format(type(ex).__name__, ex.args)
           print(message)
           # raise Exception

       try:
           # etree.tostring(self.xml, encoding='UTF-8', xml_declaration=False)
           doc = etree.XML(self.response.text.strip())
           rfc_entry = doc.findtext('rfc-entry')
       except etree.XMLSyntaxError as ex:
           template = "An exception of type {0} occurred. arguments: \n{1!r}"
           message = template.format(type(ex).__name__, ex.args)
           print(message)

def main(url):
   TryLxml2(url)

if __name__ == '__main__':
   filename = r'data\dev_data.xml'  # raw string avoids the invalid '\d' escape; currently unused
   main('https://www.rfc-editor.org/rfc-index.xml')

Error:
Traceback (most recent call last):
 File "M:/python/q-t/r/RFC_Library/src/TryLxml2.py", line 35, in <module>
   main('https://www.rfc-editor.org/rfc-index.xml')
 File "M:/python/q-t/r/RFC_Library/src/TryLxml2.py", line 31, in main
   TryLxml2(url)
 File "M:/python/q-t/r/RFC_Library/src/TryLxml2.py", line 23, in __init__
   doc = etree.XML(self.response.text.strip())
 File "src\lxml\lxml.etree.pyx", line 3192, in lxml.etree.XML (src\lxml\lxml.etree.c:78747)
 File "src\lxml\parser.pxi", line 1843, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:118266)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

By the way, I found a good lxml reference by John Shipman, who has written an excellent tkinter reference as well.

**wavic** · (This post was last modified: Apr-12-2017, 01:06 PM by wavic.)

Cut the encoding declaration (the <?xml ... ?> line at the top) out of the document before passing it to etree.XML as a string.
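
A minimal sketch of that suggestion, assuming the text starts with a standard <?xml ... ?> declaration. The sample string and the helper name strip_xml_declaration are made up for illustration; they are not part of the original script.

```python
import re
from lxml import etree

# Hypothetical sample standing in for the downloaded text (response.text).
xml_text = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<rfc-index><rfc-entry><doc-id>RFC0001</doc-id></rfc-entry></rfc-index>'
)


def strip_xml_declaration(text):
    """Remove a leading <?xml ... ?> declaration from a str, if present."""
    stripped = re.sub(r'^\s*<\?xml[^>]*\?>', '', text, count=1)
    return stripped.lstrip()


# With the declaration gone, etree.XML accepts the str input.
doc = etree.XML(strip_xml_declaration(xml_text))
print(doc.tag)
```

This is a workaround for cases where only a str is available. Text without a declaration passes through the helper unchanged.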

**snippsat** · Apr-12-2017, 12:57 PM

doc = etree.XML(self.response.text.strip())
# Change to
doc = etree.XML(self.response.content.strip())
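
The change works because response.content is bytes, so lxml can read the encoding from the XML declaration itself, whereas response.text is an already-decoded str and lxml refuses a str that still carries an encoding declaration. Note also that the original except clause only catches etree.XMLSyntaxError, which is why the ValueError escaped as a traceback. A small sketch with an in-memory sample (the sample data and the helper parse_error are hypothetical):

```python
from lxml import etree

# Hypothetical stand-ins for response.content (bytes) and response.text (str).
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<rfc-index><rfc-entry/></rfc-index>'
)
xml_text = xml_bytes.decode('utf-8')


def parse_error(data):
    """Return the name of the exception etree.XML raises, or None on success."""
    try:
        etree.XML(data)
    except (ValueError, etree.XMLSyntaxError) as ex:
        return type(ex).__name__
    return None


print(parse_error(xml_text))   # str with an encoding declaration is rejected
print(parse_error(xml_bytes))  # bytes input is accepted
```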

I often find it easier to call the lxml parser through BeautifulSoup.

import requests
from bs4 import BeautifulSoup

url = 'https://www.rfc-editor.org/rfc-index.xml'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml-xml')
entry = soup.find('rfc-entry')

Test:

>>> entry
<rfc-entry>
<doc-id>RFC0001</doc-id>
<title>Host Software</title>
<author>
<name>S. Crocker</name>
</author>
<date>
<month>April</month>
<year>1969</year>
</date>
<format>
<file-format>ASCII</file-format>
<char-count>21088</char-count>
<page-count>11</page-count>
</format>
<current-status>UNKNOWN</current-status>
<publication-status>UNKNOWN</publication-status>
<stream>Legacy</stream>
<doi>10.17487/RFC0001</doi>
</rfc-entry>

>>> entry.find('name')
<name>S. Crocker</name>
>>> entry.find('name').text
'S. Crocker'
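
For comparison, a rough lxml-only version of the same lookup. This is a sketch on a trimmed, made-up sample, and it assumes the document declares a default namespace (the URI below is an assumption). In that case plain tag names such as 'rfc-entry' do not match in lxml, and a prefix map is needed:

```python
from lxml import etree

# Assumed namespace URI, used here for illustration only.
NS = 'http://www.rfc-editor.org/rfc-index'

# Hypothetical, trimmed sample standing in for response.content.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<rfc-index xmlns="http://www.rfc-editor.org/rfc-index">'
    b'<rfc-entry><doc-id>RFC0001</doc-id>'
    b'<author><name>S. Crocker</name></author></rfc-entry>'
    b'</rfc-index>'
)

doc = etree.XML(xml_bytes)

# Map a prefix of our choosing to the document's default namespace.
ns = {'r': NS}
entry = doc.find('r:rfc-entry', namespaces=ns)
name = entry.findtext('r:author/r:name', namespaces=ns)
print(name)

# Without the namespace, the lookup finds nothing.
print(doc.find('rfc-entry'))
```

BeautifulSoup's find matches on the local tag name, which is part of why the bs4 version above needs no namespace handling.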

**Larz60+** · Apr-12-2017, 05:22 PM

Thanks again, Snippsat. You're very knowledgeable in this area, and
you're teaching an old dog a lot of new tricks!

I have used lxml through bs4 before; this was an exercise in using it
without bs4, kind of like removing training wheels. Ultimately I will
always be using bs4 and lxml together.

I'll forget about going bareback for now!
