Python Forum
Best way to process large/complex XML/schema ?
#1
Best way to process large/complex XML/schema ?
==============================================

Thanks for reviewing this thread.

I'd like to figure out a way to process large, complex XML and push the XML data to a flat file or database.

Here is the high level view.

1) Input: large/huge XML with complex nesting, choices, etc. (similar to HL7)
2) The above input XML messages need to be validated against an XML Schema (.xsd) for schema compliance.
3) The validated XML messages are to be parsed and the data extracted.
4) The extracted data is to be pushed to a flat file or database.

We know Python is an interpreted language.

Is Python the right one to do the above, performance-wise?

What options do we have?

Thanks for your guidance.
#2
(May-13-2021, 01:44 AM)MDRI Wrote: We know Python is an interpreted language.

Is Python the right one to do the above, performance-wise?

What options do we have?
Performance is no problem, as e.g. lxml runs at C speed.
lxml Wrote: The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt.
It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API.
(May-13-2021, 01:44 AM)MDRI Wrote: The above input XML messages need to be validated against an XML Schema (.xsd) for schema compliance.
Validation with lxml
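A minimal sketch of schema validation with lxml's etree.XMLSchema. It uses tiny in-memory strings so it is self-contained; for real files you would pass the filenames to etree.parse() instead.

```python
from lxml import etree

# Tiny schema: ReturnCount must be an integer
xsd = '''<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ReturnCount" type="xs:integer"/>
</xs:schema>'''
schema = etree.XMLSchema(etree.fromstring(xsd))

good = etree.fromstring('<ReturnCount>0</ReturnCount>')
bad = etree.fromstring('<ReturnCount>zero</ReturnCount>')

print(schema.validate(good))   # True
print(schema.validate(bad))    # False
# schema.error_log holds the reasons a document failed validation
```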

Quote: The validated XML messages are to be parsed and the data extracted.
I like to use BeautifulSoup for parsing; it's still the same speed, as it uses lxml as the parser here.
import requests
from bs4 import BeautifulSoup

url = 'http://httpbin.org/xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
title = soup.select_one('title')
print(title)
print(title.text)
Output:
<title>Wake up to WonderWidgets!</title>
Wake up to WonderWidgets!
#3
Thanks for your response.

I'd like to upload a complex XML message to share here, so we can look at how the above code works.

What is the upload limitation on this forum?

I did not see an upload option. Is it feasible?

Shall I upload to google drive and share link here?

Thanks for your guidance.
#4
(May-14-2021, 06:42 PM)MDRI Wrote: Shall I upload to google drive and share link here?
That's best if the file is large.
Usually you can shorten the XML file down so you have a sample file to try different stuff on.
You could also do some training on a smaller sample file if you're new to lxml/BS.
#5
Thanks for your reply.

I am attaching a .xml file extracted from the big XML to show. It has a complex nested structure, and it has its .xsd as well.

The header portion looks like

<?xml version="1.0" encoding="UTF-8"?>
<!--Sample XML file generated by XMLSpy v2016 rel. 2 sp1 (x64) (http://www.altova.com)-->
<DataFileForEFDS>
  <ReturnCount>0</ReturnCount>
  <DataRecord>
    <SubmissionId>00000000000000002222</SubmissionId>
    <DLN>12345678901234</DLN>
    <ElectronicPostmark>2001-12-17T09:30:47Z</ElectronicPostmark>
    <ETIN>String</ETIN>
    <TransmitterIPAddress>String</TransmitterIPAddress>
    <TransmitterTimestamp>2001-12-17T09:30:47Z</TransmitterTimestamp>
    <ReservedIPAddressCd>String</ReservedIPAddressCd>
    <MeFRuleNumber>String</MeFRuleNumber>
    <MeFRuleNumber>String</MeFRuleNumber>
    <Return returnVersion="String">
      <ReturnData documentCnt="2">
Are you able to point me to a code snippet/template to process this XML?

Thanks for your guidance.
#6
Here's an example of how I would read it and parse some data.
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('W2Testfile.xml', encoding='utf-8'), 'lxml')

# The lxml parser lowercases tag names, so copy a tag from the doc and lower it,
# or just write it in lower case
sub_id = soup.find('SubmissionId'.lower())
# Tag and text
print(sub_id)
print(sub_id.text)

# ---| Take out a part, e.g. doc 2, then find_all() a tag that occurs several times
doc_2 = soup.find('returndata', {'documentcnt': '2'})
dep_detail = doc_2.find_all('DependentDetail'.lower())
print('-' * 30)
print(dep_detail[0].find('dependentrelationshipcd'))
print(dep_detail[0].find('dependentrelationshipcd').text)
Output:
<submissionid>00000000000000002222</submissionid>
00000000000000002222
------------------------------
<dependentrelationshipcd>SON</dependentrelationshipcd>
SON
#7
(May-15-2021, 06:14 PM)snippsat Wrote: Here's an example of how I would read it and parse some data.
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('W2Testfile.xml', encoding='utf-8'), 'lxml')

# The lxml parser lowercases tag names, so copy a tag from the doc and lower it,
# or just write it in lower case
sub_id = soup.find('SubmissionId'.lower())
# Tag and text
print(sub_id)
print(sub_id.text)

# ---| Take out a part, e.g. doc 2, then find_all() a tag that occurs several times
doc_2 = soup.find('returndata', {'documentcnt': '2'})
dep_detail = doc_2.find_all('DependentDetail'.lower())
print('-' * 30)
print(dep_detail[0].find('dependentrelationshipcd'))
print(dep_detail[0].find('dependentrelationshipcd').text)
Output:
<submissionid>00000000000000002222</submissionid>
00000000000000002222
------------------------------
<dependentrelationshipcd>SON</dependentrelationshipcd>
SON

Thanks for your guidance.

As I mentioned, this is big XML; if I go element by element with explicit navigation as above, it is a hard task to pull off.

We may have 45K to 50K XML elements to traverse this way.

Does lxml work as DOM or serial parsing? How will it handle pulling all this big XML into a DOM?

Is there a way to pull elements using XPath in lxml?

Are there any options in Python to do parallel parsing, like SAX in Java?

Thanks for your guidance.
#8
Here's a demo of using lxml with XPath.
from lxml import etree

root = etree.parse('W2Testfile.xml')
sub = root.xpath('//SubmissionId')[0]
print(sub.text)
for tag in root.xpath('//DependentRelationshipCd'):
    print(tag.text)
Output:
00000000000000002222
SON
HALF BROTHER
HALF BROTHER
STEPCHILD
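To cover the last step in your original list (pushing the extracted data to a flat file), here's a minimal sketch with the csv module. The in-memory XML stands in for the real file, and records.csv is a hypothetical output name.

```python
import csv
from lxml import etree

# In-memory stand-in for the real file; pass a filename to etree.parse() in practice
xml = b'''<DataFileForEFDS>
  <DataRecord>
    <SubmissionId>00000000000000002222</SubmissionId>
    <DLN>12345678901234</DLN>
  </DataRecord>
</DataFileForEFDS>'''

root = etree.fromstring(xml)
with open('records.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['SubmissionId', 'DLN'])
    # One CSV row per DataRecord, pulled out with XPath
    for rec in root.xpath('//DataRecord'):
        writer.writerow([rec.findtext('SubmissionId'), rec.findtext('DLN')])
```

The same loop could feed an INSERT statement instead of a csv writer if the target is a database.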
MDRI Wrote:We may have 45K to 50K xml elements to traverse this way.
That's not that large; it would not cause any problems.
MDRI Wrote: Does lxml work as DOM or serial parsing? How will it handle pulling all this big XML into a DOM?
lxml can operate in either mode; it just depends on how your code uses it.
It can either parse the full file into a DOM, or use SAX callbacks to parse serially.
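For the serial side, a small sketch using etree.iterparse, which streams matching elements instead of building the whole tree; the in-memory sample stands in for the real file on disk.

```python
from io import BytesIO
from lxml import etree

# Sample data; for a real file, pass the filename to iterparse instead of BytesIO
xml = b'''<DataFileForEFDS>
  <DataRecord><SubmissionId>00000000000000002222</SubmissionId></DataRecord>
  <DataRecord><SubmissionId>00000000000000003333</SubmissionId></DataRecord>
</DataFileForEFDS>'''

ids = []
# Only fires on the tags we care about, element by element
for event, elem in etree.iterparse(BytesIO(xml), tag='SubmissionId'):
    ids.append(elem.text)
    elem.clear()  # free memory as we go; important for huge files

print(ids)  # ['00000000000000002222', '00000000000000003333']
```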
MDRI Wrote: Are there any options in Python to do parallel parsing, like SAX in Java?
Yes, several ways, e.g. concurrent.futures.
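A rough sketch of one such approach with concurrent.futures; lxml releases the GIL while parsing, so even threads can overlap the work. The two byte strings here stand in for separate XML files on disk.

```python
from concurrent.futures import ThreadPoolExecutor
from lxml import etree

# Two in-memory documents standing in for separate XML files
docs = [
    b'<DataRecord><SubmissionId>1111</SubmissionId></DataRecord>',
    b'<DataRecord><SubmissionId>2222</SubmissionId></DataRecord>',
]

def extract(data):
    # Parse one document and pull out the field we care about
    root = etree.fromstring(data)
    return root.findtext('SubmissionId')

# Fan the parsing out across worker threads, one document each
with ThreadPoolExecutor() as pool:
    results = list(pool.map(extract, docs))

print(results)  # ['1111', '2222']
```

For CPU-heavy per-file work on real files, swapping in ProcessPoolExecutor (and passing filenames instead of bytes) follows the same pattern.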

Try some stuff and see how it goes.