Best way to process large/complex XML/schema ? - Printable Version
+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Best way to process large/complex XML/schema ? (/thread-33644.html)
Best way to process large/complex XML/schema ? - MDRI - May-13-2021

Thanks for reviewing this thread. I would like to figure out a way to process large, complex XML and push the XML data to a flat file or database. Here is the high-level view:

1) Input: large/huge XML with complex nested/choice structures etc. (similar to HL7)
2) The above input XML messages need to be validated against an XML Schema (.xsd) for schema compliance
3) The validated XML messages are to be parsed and the data extracted
4) The extracted data is to be pushed to a flat file or database

We know Python is an interpreted language. Is Python the right one to do the above, performance-wise? What options do we have? Thanks for your guidance.

RE: Best way to process large/complex XML/schema ? - snippsat - May-13-2021

(May-13-2021, 01:44 AM)MDRI Wrote: We know Python is an interpreted language.

Performance is no problem, as e.g. lxml runs at C speed.

lxml Wrote: The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt.

(May-13-2021, 01:44 AM)MDRI Wrote: The above input XML messages need to be validated against an XML Schema (.xsd) for schema compliance.

Validation with lxml.

Quote: The validated XML messages are to be parsed and the data extracted.

I like to use BeautifulSoup (BS) for parsing; it is still the same speed, as here lxml is used as the parser.

```python
import requests
from bs4 import BeautifulSoup

url = 'http://httpbin.org/xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
title = soup.select_one('title')
print(title)
print(title.text)
```
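To make the validation step concrete: a minimal sketch of XSD validation with `lxml.etree.XMLSchema`. The schema and document here are tiny inline stand-ins (hypothetical content, for illustration only); with real files you would use `etree.parse('your.xsd')` and `etree.parse('your.xml')` instead.

```python
from lxml import etree

# Tiny inline schema and documents as stand-ins for real .xsd/.xml files
xsd = b'''<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="SubmissionId" type="xs:string"/>
</xs:schema>'''

schema = etree.XMLSchema(etree.fromstring(xsd))

good = etree.fromstring(b'<SubmissionId>00000000000000002222</SubmissionId>')
bad = etree.fromstring(b'<WrongTag/>')

print(schema.validate(good))  # True
print(schema.validate(bad))   # False
# After a failed validate(), error_log holds one entry per violation
for error in schema.error_log:
    print(error.message)
```

`validate()` just returns a boolean; `schema.error_log` is where you find the line numbers and messages for each violation, which matters when a huge message fails compliance.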
RE: Best way to process large/complex XML/schema ? - MDRI - May-14-2021

Thanks for your response. I would like to upload a complex XML message to share here, so we can look at how the above code works. What is the upload limit in this forum? I did not see an upload option. Is it feasible? Shall I upload to Google Drive and share the link here? Thanks for your guidance.

RE: Best way to process large/complex XML/schema ? - snippsat - May-15-2021

(May-14-2021, 06:42 PM)MDRI Wrote: Shall I upload to Google Drive and share the link here?

That is best if the file is large. Usually you can shorten the XML file so you have a sample file to try different stuff on. You could also do some training on a smaller sample file if you are new to lxml/BS.

RE: Best way to process large/complex XML/schema ? - MDRI - May-15-2021

Thanks for your reply. I am attaching a .xml file extracted from the big XML to show. It has a complexly nested structure, and it has its .xsd as well. The header portion looks like:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!--Sample XML file generated by XMLSpy v2016 rel. 2 sp1 (x64) (http://www.altova.com)-->
<DataFileForEFDS>
  <ReturnCount>0</ReturnCount>
  <DataRecord>
    <SubmissionId>00000000000000002222</SubmissionId>
    <DLN>12345678901234</DLN>
    <ElectronicPostmark>2001-12-17T09:30:47Z</ElectronicPostmark>
    <ETIN>String</ETIN>
    <TransmitterIPAddress>String</TransmitterIPAddress>
    <TransmitterTimestamp>2001-12-17T09:30:47Z</TransmitterTimestamp>
    <ReservedIPAddressCd>String</ReservedIPAddressCd>
    <MeFRuleNumber>String</MeFRuleNumber>
    <MeFRuleNumber>String</MeFRuleNumber>
    <Return returnVersion="String">
      <ReturnData documentCnt="2">
```

Are you able to point me to a code snippet/template to process this XML? Thanks for your guidance.

RE: Best way to process large/complex XML/schema ? - snippsat - May-15-2021

Here is an example of how I would read it and parse some data.
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('W2Testfile.xml', encoding='utf-8'), 'lxml')
# Just copy the tag from the doc and lower() the search, or write it lower-case
sub_id = soup.find('SubmissionId'.lower())
# Tag and text
print(sub_id)
print(sub_id.text)
# Take out a part, e.g. doc 2, then find_all() of a tag that there are several of
doc_2 = soup.find('returndata', {'documentcnt': '2'})
dep_detail = doc_2.find_all('DependentDetail'.lower())
print('-' * 30)
print(dep_detail[0].find('dependentrelationshipcd'))
print(dep_detail[0].find('dependentrelationshipcd').text)
```
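For step 4 of the original question (pushing the extracted data to a flat file), a minimal sketch using the stdlib `csv` module. The record dicts and field names here are hypothetical stand-ins for whatever you collect while parsing; substitute the tags you actually extract.

```python
import csv

# Hypothetical records, e.g. collected while parsing the XML above
records = [
    {'SubmissionId': '00000000000000002222', 'DependentRelationshipCd': 'SON'},
    {'SubmissionId': '00000000000000002222', 'DependentRelationshipCd': 'STEPCHILD'},
]

# DictWriter maps each record dict onto one CSV row
with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['SubmissionId', 'DependentRelationshipCd'])
    writer.writeheader()
    writer.writerows(records)

print(open('output.csv').read())
```

The same list of dicts can go straight into a database instead, e.g. with `sqlite3.executemany()`.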
RE: Best way to process large/complex XML/schema ? - MDRI - May-16-2021

(May-15-2021, 06:14 PM)snippsat Wrote: Here is an example of how I would read it and parse some data.

Thanks for your guidance. As I mentioned, this is a big XML; if I go element by element with explicit navigation as above, it is a hard task to pull off. We may have 45K to 50K XML elements to traverse this way. Does lxml work as DOM or serial parsing? How will it address pulling all this big XML into a DOM? Is there a way to pull elements using XPath in lxml? Are there any options in Python to do parallel parsing like SAX (Java)? Thanks for your guidance.

RE: Best way to process large/complex XML/schema ? - snippsat - May-16-2021

Here is a demo of using lxml with XPath.

```python
from lxml import etree

root = etree.parse('W2Testfile.xml')
sub = root.xpath('//SubmissionId')[0]
print(sub.text)
for tag in root.xpath('//DependentRelationshipCd'):
    print(tag.text)
```

Output:
00000000000000002222
SON
HALF BROTHER
HALF BROTHER
STEPCHILD

MDRI Wrote: We may have 45K to 50K XML elements to traverse this way.

Not that large; it would not cause any problems.

MDRI Wrote: Does lxml work as DOM or serial parsing? How will it address pulling all this big XML into a DOM?

lxml can operate in either mode; it just depends on how your code uses it. It can either parse the full file into a DOM, or use SAX-style callbacks to parse serially.

MDRI Wrote: Are there any options in Python to do parallel parsing like SAX (Java)?

Yes, several ways, e.g. concurrent.futures. Try some stuff and see how it goes.
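To illustrate the serial (SAX-style) mode mentioned above: `lxml.etree.iterparse` streams through the document and hands you one element at a time, so the whole file never needs to sit in memory as a DOM. A minimal sketch; the in-memory `BytesIO` document here is a tiny stand-in for a huge file, where you would pass the file path instead.

```python
import io
from lxml import etree

# Small in-memory stand-in for a huge file; with a real file pass its path
xml = io.BytesIO(b'''<DataFileForEFDS>
  <DataRecord><SubmissionId>1</SubmissionId></DataRecord>
  <DataRecord><SubmissionId>2</SubmissionId></DataRecord>
</DataFileForEFDS>''')

ids = []
# iterparse yields each matching element as its end tag is parsed
for event, elem in etree.iterparse(xml, tag='DataRecord'):
    ids.append(elem.findtext('SubmissionId'))
    elem.clear()  # free the subtree we are done with, keeping memory flat

print(ids)  # ['1', '2']
```

Clearing each element after use is what keeps memory usage constant regardless of file size, which is the point of choosing this mode over a full DOM parse for 45K-50K elements.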