Python Forum

Full Version: Parse XML with Namespaces
Hello All -

I am trying to parse some xml files that are saturated with Namespaces. I have been able to parse XML files without namespaces. I have found several articles online on parsing with Namespaces, but the namespaces in my file don't seem like most of the examples I have found online.

Below is an example of the XML. I am trying to get to the ConnectionString property.

I have tried using findall with the full hierarchy like
conn_mgrs = root.findall('ConnectionManagers/ConnectionManager/ObjectData/ConnectionManager')
And, also using the namespace argument like:
ns = {'XYZ': 'www.example.com/myExample/Xyz'}
conn_mgrs = root.findall('ConnectionManagers/ConnectionManager/ObjectData/ConnectionManager', ns)
Both just return an empty list.

My next move is probably to strip out the namespace prefixes and then parse the file, but I figured I'd check with others first to see if someone knows a way to resolve this.

Thanks for any help

The XML looks like this:

<XYZ:Executable xmlns:XYZ="www.example.com/myExample/Xyz"
  XYZ:Id="Package"
  XYZ:CreationDate="2/21/2018 11:11:48 AM"
  XYZ:XYZID="{FB8BE06B-76B6-44DA-B2C7-043BD0989CBF}"
  XYZ:ObjectName="MyTestProject"
  XYZ:VersionGUID="{8D9F7CDA-590E-44C3-8896-786D27167F7D}">
  <XYZ:Property
    XYZ:Name="PackageFormatVersion">6</XYZ:Property>
  <XYZ:ConnectionManagers>
    <XYZ:ConnectionManager
      XYZ:refId="Package.ConnectionManagers[RTG093939BB.AdminDB]"
      XYZ:CreationName="OLEDB"
      XYZ:XYZID="{C67B6283-781F-4B0E-A9A7-376A157B6F16}"
      XYZ:ObjectName="RTG093939BB.AdminDB">
      <XYZ:ObjectData>
        <XYZ:ConnectionManager
          XYZ:ConnectionString="Data Source=RTG093939BB;Initial Catalog=AdminDB;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
      </XYZ:ObjectData>
    </XYZ:ConnectionManager>
    <XYZ:ConnectionManager
      XYZ:refId="Package.ConnectionManagers[RTG093955XT.Stage]"
      XYZ:CreationName="OLEDB"
      XYZ:XYZID="{8B4F57EA-03EA-49FA-B4BD-828A89FE5A32}"
      XYZ:ObjectName="RTG093955XT.Stage">
      <XYZ:ObjectData>
        <XYZ:ConnectionManager
          XYZ:ConnectionString="Data Source=RTG093955XT;Initial Catalog=Stage;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
      </XYZ:ObjectData>
    </XYZ:ConnectionManager>
  </XYZ:ConnectionManagers>
 </XYZ:Executable>
I've had some limited success parsing your XML with etree,
but I'm having difficulty with the 'connection managers' info, as
its attributes show up as a dictionary without any keys.

Here's what I did:
  • copied your XML into a file named 'ziggy.xml'
  • you can change the name to whatever you want in the tryit function
  • Started parsing nodes, had no problem with the root node, or the Property node
  • Dictionary problem (stated previously) shows up with ConnectionManager node.

Play with it, perhaps you can figure it out, I need a break.

The code:
import xml.etree.ElementTree as et

class ParseXmlWithNamespace:
    def __init__(self, xml_filename):
        self.tree = et.parse(xml_filename)
        self.root = self.tree.getroot()
        self.show_root_info()
        self.show_child_info()
        self.parser = et.XMLPullParser(['start', 'end'])

    def show_root_info(self):
        for item in self.root.items():
            for n, field in enumerate(item):
                if n == 0:
                    p = field.index('}')
                    print(f'{field[p+1:] :20}: ', end='')
                else:
                    print(field)

    def show_child_info(self):
        root = self.root
        for child in root:
            print(f'\ntag type: {type(child.tag)}')
            print(f'tag value: {child.tag}')
            print(f'attrib type: {type(child.attrib)}')
            print(f'attrib value: {child.attrib}')
            if isinstance(child.attrib, dict):
                print(f'    attrib keys: {child.attrib.keys()}')
            else:
                print(f'    attrib: {child.attrib}')

def tryit():
    px = ParseXmlWithNamespace('ziggy.xml')

if __name__ == '__main__':
    tryit()
results so far:
Output:
Id                  : Package
CreationDate        : 2/21/2018 11:11:48 AM
XYZID               : {FB8BE06B-76B6-44DA-B2C7-043BD0989CBF}
ObjectName          : MyTestProject
VersionGUID         : {8D9F7CDA-590E-44C3-8896-786D27167F7D}

tag type: <class 'str'>
tag value: {www.example.com/myExample/Xyz}Property
attrib type: <class 'dict'>
attrib value: {'{www.example.com/myExample/Xyz}Name': 'PackageFormatVersion'}
    attrib keys: dict_keys(['{www.example.com/myExample/Xyz}Name'])

tag type: <class 'str'>
tag value: {www.example.com/myExample/Xyz}ConnectionManagers
attrib type: <class 'dict'>
attrib value: {}
    attrib keys: dict_keys([])
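A note on the empty dictionary: the outer ConnectionManagers element has no attributes of its own, so an empty dict is what ElementTree should report for it. The attributes sit on the nested ConnectionManager elements, and ElementTree stores every namespaced tag and attribute key as '{uri}name'. Here is a rough sketch of walking the whole tree instead of only the root's children. It assumes a trimmed inline copy of the XML from the first post; SAMPLE and local_name are names made up for this example.

```python
import xml.etree.ElementTree as et

# Trimmed copy of the XML from the first post (one connection manager only).
SAMPLE = """\
<XYZ:Executable xmlns:XYZ="www.example.com/myExample/Xyz" XYZ:ObjectName="MyTestProject">
  <XYZ:ConnectionManagers>
    <XYZ:ConnectionManager XYZ:CreationName="OLEDB" XYZ:ObjectName="RTG093939BB.AdminDB">
      <XYZ:ObjectData>
        <XYZ:ConnectionManager XYZ:ConnectionString="Data Source=RTG093939BB;Initial Catalog=AdminDB;" />
      </XYZ:ObjectData>
    </XYZ:ConnectionManager>
  </XYZ:ConnectionManagers>
</XYZ:Executable>
"""


def local_name(qualified):
    """Drop the '{uri}' part ElementTree puts in front of namespaced names."""
    return qualified.rpartition('}')[2]


root = et.fromstring(SAMPLE)

# iter() visits the root and every descendant, not just direct children.
for elem in root.iter():
    attrs = {local_name(key): value for key, value in elem.attrib.items()}
    print(local_name(elem.tag), attrs)
```

Stripping the namespace like this is fine for inspection, but it can hide clashes if a file mixes several namespaces that reuse the same local names.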
Hi -
Thanks so much for the help. I will give this a try and update this post as things happen.

thanks...
(Apr-11-2018, 09:51 PM)dwill Wrote: And, also using the namespace argument like:
ns = {'XYZ': 'www.example.com/myExample/Xyz'}
conn_mgrs = root.findall('ConnectionManagers/ConnectionManager/ObjectData/ConnectionManager', ns)
Both just return a null element.

In this code, you're defining your namespace, but you're not actually using it: every step in the path needs the XYZ: prefix so that it can be looked up in ns.
This gets the elements you want:
>>> root.findall('XYZ:ConnectionManagers/XYZ:ConnectionManager/XYZ:ObjectData/XYZ:ConnectionManager', ns)
[<Element '{www.example.com/myExample/Xyz}ConnectionManager' at 0x000001FADD450818>, <Element '{www.example.com/myExample/Xyz}ConnectionManager' at 0x000001FADD450908>]
I'd also suggest using lxml instead of the built-in xml.etree, as it gives you full XPath 1.0 support and is generally faster.
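To make the difference concrete, here is a small stdlib-only sketch comparing the two calls. It assumes a shortened inline copy of the XML from the first post (SAMPLE and BASE are names invented for this example). The prefix in the path only has to match a key in the ns dictionary; it does not have to be the prefix used in the file.

```python
import xml.etree.ElementTree as et

# Shortened copy of the XML from the first post (attributes trimmed).
SAMPLE = """\
<XYZ:Executable xmlns:XYZ="www.example.com/myExample/Xyz">
  <XYZ:ConnectionManagers>
    <XYZ:ConnectionManager XYZ:ObjectName="RTG093939BB.AdminDB">
      <XYZ:ObjectData>
        <XYZ:ConnectionManager XYZ:ConnectionString="Data Source=RTG093939BB;Initial Catalog=AdminDB;" />
      </XYZ:ObjectData>
    </XYZ:ConnectionManager>
    <XYZ:ConnectionManager XYZ:ObjectName="RTG093955XT.Stage">
      <XYZ:ObjectData>
        <XYZ:ConnectionManager XYZ:ConnectionString="Data Source=RTG093955XT;Initial Catalog=Stage;" />
      </XYZ:ObjectData>
    </XYZ:ConnectionManager>
  </XYZ:ConnectionManagers>
</XYZ:Executable>
"""

ns = {'XYZ': 'www.example.com/myExample/Xyz'}
root = et.fromstring(SAMPLE)

BASE = 'ConnectionManagers/ConnectionManager/ObjectData/ConnectionManager'

# No prefixes: the path asks for elements in no namespace, so nothing matches.
unprefixed = root.findall(BASE, ns)

# Every step prefixed: each 'XYZ:name' is expanded to '{uri}name' via ns.
prefixed = root.findall('/'.join('XYZ:' + step for step in BASE.split('/')), ns)

# A different prefix works too, as long as it maps to the same URI.
other = root.findall('/'.join('x:' + step for step in BASE.split('/')),
                     {'x': 'www.example.com/myExample/Xyz'})

print(len(unprefixed), len(prefixed), len(other))
```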
Hello -
Thanks so much. I just did a quick test and this works great. I will try it on a few other parts of the xml file as well.

One question, as a test, I changed to using lxml and the same code works. But, you also suggested using xpath for my searching/parsing. In my example, would it be something like:
conn_mgrs = root.xpath('XYZ:ConnectionManagers/XYZ:ConnectionManager/XYZ:ObjectData/XYZ:ConnectionManager', namespaces=ns)
I did try that and the output was the same as using the builtin xml.etree. Just wanted to make sure I was understanding your advice.

thank you!
The difference is that the built-in module only lets you use a subset of XPath, making certain things more complicated, or impossible.
For example, to get the ConnectionString attribute in lxml, you can simply do this:
>>> root.xpath('XYZ:ConnectionManagers/XYZ:ConnectionManager/XYZ:ObjectData/XYZ:ConnectionManager/@XYZ:ConnectionString', namespaces=ns)
['Data Source=RTG093939BB;Initial Catalog=AdminDB;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;', 'Data Source=RTG093955XT;Initial Catalog=Stage;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;']
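If you need to stay with the built-in module, one possible equivalent is to select the elements and then read the attribute from each one. This is only a sketch under some assumptions: the XML is a shortened inline copy of the one in the first post, and SAMPLE, URI, CONN_ATTR and PATH are names invented here.

```python
import xml.etree.ElementTree as et

URI = 'www.example.com/myExample/Xyz'
ns = {'XYZ': URI}

# Shortened copy of the XML from the first post.
SAMPLE = """\
<XYZ:Executable xmlns:XYZ="www.example.com/myExample/Xyz">
  <XYZ:ConnectionManagers>
    <XYZ:ConnectionManager XYZ:ObjectName="RTG093939BB.AdminDB">
      <XYZ:ObjectData>
        <XYZ:ConnectionManager XYZ:ConnectionString="Data Source=RTG093939BB;Initial Catalog=AdminDB;" />
      </XYZ:ObjectData>
    </XYZ:ConnectionManager>
    <XYZ:ConnectionManager XYZ:ObjectName="RTG093955XT.Stage">
      <XYZ:ObjectData>
        <XYZ:ConnectionManager XYZ:ConnectionString="Data Source=RTG093955XT;Initial Catalog=Stage;" />
      </XYZ:ObjectData>
    </XYZ:ConnectionManager>
  </XYZ:ConnectionManagers>
</XYZ:Executable>
"""

root = et.fromstring(SAMPLE)

# ElementTree's XPath subset has no '/@attr' step, so select the elements
# and read the attribute by its fully qualified '{uri}name' key instead.
CONN_ATTR = f'{{{URI}}}ConnectionString'
PATH = 'XYZ:ConnectionManagers/XYZ:ConnectionManager/XYZ:ObjectData/XYZ:ConnectionManager'

conn_strings = [elem.get(CONN_ATTR) for elem in root.findall(PATH, ns)]
print(conn_strings)
```

The lxml one-liner is shorter, but this version avoids a third-party dependency.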
Hi -

This is much better than what I was using. I had a for loop and was getting each connection string, but I was navigating down the XML hierarchy step by step. Your method gets right to the place I need, and because there could be multiple connection strings, at least my for loop is much smaller.

I am also trying to get one of the values inside the connection string, "Initial Catalog". I am close, but still missing something. I will post my code so you can see what I am trying and if it makes sense...in a pythonic world.

Thank you again for all of your help
Hi -
So, this is what I ended up with to get two of the items from this connection manager list:

cnxn_string = root.xpath('XYZ:ConnectionManagers/XYZ:ConnectionManager/XYZ:ObjectData/XYZ:ConnectionManager/@XYZ:ConnectionString', namespaces=ns)
# print(cnxn_string)
for item in cnxn_string:
    # print(item)
    c = item.split(';')
    for q in c:
        # print(q)
        if q.startswith('Data Source'):
            ds = q[q.find('=') + 1:]
            print(ds)
        if q.startswith('Initial Catalog'):
            ic = q[q.find('=') + 1:]
            print(ic)
Output:
RTG093939BB
AdminDB
RTG093955XT
Stage
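One possible tidy-up, offered only as a sketch: turn each connection string into a dictionary once, then look up whichever keys are needed. parse_connection_string is a made-up helper name, and it assumes simple 'key=value;' pairs like the ones in this file; it does not handle quoted values that themselves contain ';' or '='.

```python
def parse_connection_string(conn):
    """Split 'key=value;key=value;' into a dict of stripped keys and values."""
    result = {}
    for part in conn.split(';'):
        key, sep, value = part.partition('=')
        if sep:  # skips empty pieces, e.g. the one after the trailing ';'
            result[key.strip()] = value.strip()
    return result


# One of the connection strings from the XML in the first post.
conn = ('Data Source=RTG093939BB;Initial Catalog=AdminDB;Provider=SQLNCLI11.1;'
        'Integrated Security=SSPI;Auto Translate=False;')

info = parse_connection_string(conn)
print(info['Data Source'])
print(info['Initial Catalog'])
```

With the dictionary in hand, the two startswith checks collapse into plain key lookups, and other keys such as 'Provider' come for free.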