Python Forum
Learning advanced lxml
#1
I started (about two hours ago) digging into lxml, for no other reason than
I want to develop a better knowledge of what it's capable of.

I'm getting a UTF-8 encoding error.
I tried etree.tostring, but that didn't seem to do the trick.


from lxml import etree
import requests
import socket


class TryLxml2:
   def __init__(self, url=None):
       try:
           if socket.gethostbyname(socket.gethostname()) != '127.0.0.1':
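               self.response = requests.get(url, stream=True)
               # Save a local copy; the file was previously opened but never written to
               with open('data\\rfc-index.xml', 'wb') as f:
                   f.write(self.response.content)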

       except Exception as ex:
           # todo -- Use tkinter.messagebox here
           template = "An exception of type {0} occurred. arguments: \n{1!r}"
           message = template.format(type(ex).__name__, ex.args)
           print(message)
           # raise Exception

       try:
           # etree.tostring(self.xml, encoding='UTF-8', xml_declaration=False)
           doc = etree.XML(self.response.text.strip())
           rfc_entry = doc.findtext('rfc-entry')
       except etree.XMLSyntaxError as ex:
           template = "An exception of type {0} occurred. arguments: \n{1!r}"
           message = template.format(type(ex).__name__, ex.args)
           print(message)

def main(url):
   TryLxml2(url)

if __name__ == '__main__':
   filename = r'data\dev_data.xml'  # raw string avoids the invalid '\d' escape
   main('https://www.rfc-editor.org/rfc-index.xml')
Error:
Traceback (most recent call last):
  File "M:/python/q-t/r/RFC_Library/src/TryLxml2.py", line 35, in <module>
    main('https://www.rfc-editor.org/rfc-index.xml')
  File "M:/python/q-t/r/RFC_Library/src/TryLxml2.py", line 31, in main
    TryLxml2(url)
  File "M:/python/q-t/r/RFC_Library/src/TryLxml2.py", line 23, in __init__
    doc = etree.XML(self.response.text.strip())
  File "src\lxml\lxml.etree.pyx", line 3192, in lxml.etree.XML (src\lxml\lxml.etree.c:78747)
  File "src\lxml\parser.pxi", line 1843, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:118266)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
By the way, I found a good lxml reference by John Shipman, who has an excellent tkinter reference as well.
#2
Cut off the encoding declaration from the document.
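A minimal sketch of that idea, assuming the declaration sits at the very start of the text: strip the leading `<?xml ...?>` with a regular expression before calling etree.XML. The helper name strip_xml_declaration and the sample document are made up for illustration.

```python
import re
from lxml import etree

# Matches a leading XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
XML_DECLARATION = re.compile(r'^\s*<\?xml[^>]*\?>')


def strip_xml_declaration(text):
    """Return text without its leading XML declaration (hypothetical helper)."""
    return XML_DECLARATION.sub('', text, count=1).lstrip()


# Small stand-in document; the real rfc-index.xml is much larger
sample = '<?xml version="1.0" encoding="UTF-8"?>\n<root><item>one</item></root>'

doc = etree.XML(strip_xml_declaration(sample))
print(doc.findtext('item'))
```

This only makes sense when you already hold a decoded str. If the raw bytes are available, passing those to the parser is usually simpler.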
#3
doc = etree.XML(self.response.text.strip())
# Change to
doc = etree.XML(self.response.content.strip())
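Why this works: response.text is a decoded str, and lxml refuses a str that still carries an encoding declaration. response.content is the raw bytes, so the parser reads the declared encoding itself. A small sketch of the difference without touching the network, using a made-up sample document:

```python
from lxml import etree

# Stand-in for the downloaded document (the real index is much larger)
xml_bytes = b'<?xml version="1.0" encoding="UTF-8"?><root><item>one</item></root>'
xml_text = xml_bytes.decode('utf-8')  # roughly what response.text gives you

try:
    etree.XML(xml_text)  # str with an encoding declaration raises ValueError
    str_parsed = True
except ValueError as ex:
    str_parsed = False
    print(ex)

doc = etree.XML(xml_bytes)  # bytes input is accepted
print(doc.findtext('item'))
```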
I often find it easier to call the lxml parser through BeautifulSoup.
import requests
from bs4 import BeautifulSoup

url = 'https://www.rfc-editor.org/rfc-index.xml'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml-xml')
entry = soup.find('rfc-entry')
Test:
>>> entry
<rfc-entry>
<doc-id>RFC0001</doc-id>
<title>Host Software</title>
<author>
<name>S. Crocker</name>
</author>
<date>
<month>April</month>
<year>1969</year>
</date>
<format>
<file-format>ASCII</file-format>
<char-count>21088</char-count>
<page-count>11</page-count>
</format>
<current-status>UNKNOWN</current-status>
<publication-status>UNKNOWN</publication-status>
<stream>Legacy</stream>
<doi>10.17487/RFC0001</doi>
</rfc-entry>

>>> entry.find('name')
<name>S. Crocker</name>
>>> entry.find('name').text
'S. Crocker'
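For comparison, roughly the same lookup with lxml alone. One thing to watch for: if the document declares a default namespace, an unqualified doc.find('rfc-entry') returns None and the tag has to be qualified. The namespace URI and the trimmed sample below are assumptions for illustration, so check the xmlns on the root element of the real file.

```python
from lxml import etree

# Assumed namespace URI; verify against the xmlns on the real root element
NS = {'r': 'http://www.rfc-editor.org/rfc-index'}

# Trimmed stand-in for the index, passed as bytes so the declaration is accepted
sample = b'''<?xml version="1.0" encoding="UTF-8"?>
<rfc-index xmlns="http://www.rfc-editor.org/rfc-index">
  <rfc-entry>
    <doc-id>RFC0001</doc-id>
    <title>Host Software</title>
    <author><name>S. Crocker</name></author>
  </rfc-entry>
</rfc-index>'''

doc = etree.XML(sample)

# Qualify each step of the path with the prefix mapped in NS
entry = doc.find('r:rfc-entry', namespaces=NS)
name = entry.findtext('r:author/r:name', namespaces=NS)
print(name)

# An unqualified tag does not match a namespaced element
unqualified = doc.find('rfc-entry')
print(unqualified)
```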
#4
Thanks again Snippsat, you're very knowledgeable in this area, and
teaching an old dog a lot of new tricks!

I have used it from bs4, and was doing an exercise without. Kind of
like removing training wheels. Ultimately I will always be using bs4 and
lxml together.

I'll forget about going bareback for now!