Python Forum
Best way to process large/complex XML/schema ?
#1
Best way to process large/complex XML/schema ?
==============================================

Thanks for reviewing this thread.

I'd like to figure out a way to process large, complex XML and push the XML data to a flat file or database.

Here is the high level view.

1) Input: large/huge XML with complex nesting, choices, etc. (similar to HL7)
2) The above input XML messages need to be validated against an XML Schema (.xsd) for schema compliance.
3) The validated XML messages are to be parsed and the data extracted.
4) The extracted data is to be pushed to a flat file or database.

We know Python is an interpreted language.

Is Python the right one to do the above, performance-wise?

What options do we have?

Thanks for your guidance.
#2
(May-13-2021, 01:44 AM)MDRI Wrote: We know Python is an interpreted language.

Is Python the right one to do the above, performance-wise?

What options do we have?
Performance is no problem, as e.g. lxml runs at C speed.
lxml Wrote: The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt.
It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API.
(May-13-2021, 01:44 AM)MDRI Wrote: The above input XML messages need to be validated against an XML Schema (.xsd) for schema compliance.
Validation with lxml
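A minimal sketch of schema validation with lxml's etree.XMLSchema. It uses tiny in-memory strings so it is self-contained; for real files you would pass the filenames to etree.parse() instead.

```python
from lxml import etree

# Tiny schema: ReturnCount must be an integer
xsd = '''<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ReturnCount" type="xs:integer"/>
</xs:schema>'''
schema = etree.XMLSchema(etree.fromstring(xsd))

good = etree.fromstring('<ReturnCount>0</ReturnCount>')
bad = etree.fromstring('<ReturnCount>zero</ReturnCount>')

print(schema.validate(good))   # True
print(schema.validate(bad))    # False
# schema.error_log holds the reasons a document failed validation
```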

Quote: The validated XML messages are to be parsed and the data extracted.
I like to use BeautifulSoup for parsing; it's still the same speed, as it uses lxml as the parser here.
import requests
from bs4 import BeautifulSoup

url = 'http://httpbin.org/xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
title = soup.select_one('title')
print(title)
print(title.text)
Output:
<title>Wake up to WonderWidgets!</title>
Wake up to WonderWidgets!
#3
Thanks for your response.

I'd like to upload a complex XML message to share here, so we can look at how the above code works.

What is the upload limitation on this forum?

I did not see an upload option. Is it feasible?

Shall I upload to google drive and share link here?

Thanks for your guidance.
#4
(May-14-2021, 06:42 PM)MDRI Wrote: Shall I upload to google drive and share link here?
That's best if the file is large.
Usually you can shorten the XML file down so you have a sample file to try different stuff on.
You could also do some training on a smaller sample file if you're new to lxml/BS.
#5
Thanks for your reply.

I am attaching a .xml file extracted from the big XML to show. It has a complex nested structure, and it has its .xsd as well.

The header portion looks like

<?xml version="1.0" encoding="UTF-8"?>
<!--Sample XML file generated by XMLSpy v2016 rel. 2 sp1 (x64) (http://www.altova.com)-->
<DataFileForEFDS>
  <ReturnCount>0</ReturnCount>
  <DataRecord>
    <SubmissionId>00000000000000002222</SubmissionId>
    <DLN>12345678901234</DLN>
    <ElectronicPostmark>2001-12-17T09:30:47Z</ElectronicPostmark>
    <ETIN>String</ETIN>
    <TransmitterIPAddress>String</TransmitterIPAddress>
    <TransmitterTimestamp>2001-12-17T09:30:47Z</TransmitterTimestamp>
    <ReservedIPAddressCd>String</ReservedIPAddressCd>
    <MeFRuleNumber>String</MeFRuleNumber>
    <MeFRuleNumber>String</MeFRuleNumber>
    <Return returnVersion="String">
      <ReturnData documentCnt="2">
Are you able to point me to a code snippet/template to process this XML?

Thanks for your guidance.
#6
Here's an example of how I would read it and parse some data.
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('W2Testfile.xml', encoding='utf-8'), 'lxml')

# The lxml parser lowercases tag names, so copy a tag from the doc and lower it,
# or just write it in lower case
sub_id = soup.find('SubmissionId'.lower())
# Tag and text
print(sub_id)
print(sub_id.text)

# ---| Take out a part, e.g. doc 2, then find_all() a tag that occurs several times
doc_2 = soup.find('returndata', {'documentcnt': '2'})
dep_detail = doc_2.find_all('DependentDetail'.lower())
print('-' * 30)
print(dep_detail[0].find('dependentrelationshipcd'))
print(dep_detail[0].find('dependentrelationshipcd').text)
Output:
<submissionid>00000000000000002222</submissionid>
00000000000000002222
------------------------------
<dependentrelationshipcd>SON</dependentrelationshipcd>
SON
#7
(May-15-2021, 06:14 PM)snippsat Wrote: Here's an example of how I would read it and parse some data.
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('W2Testfile.xml', encoding='utf-8'), 'lxml')

# The lxml parser lowercases tag names, so copy a tag from the doc and lower it,
# or just write it in lower case
sub_id = soup.find('SubmissionId'.lower())
# Tag and text
print(sub_id)
print(sub_id.text)

# ---| Take out a part, e.g. doc 2, then find_all() a tag that occurs several times
doc_2 = soup.find('returndata', {'documentcnt': '2'})
dep_detail = doc_2.find_all('DependentDetail'.lower())
print('-' * 30)
print(dep_detail[0].find('dependentrelationshipcd'))
print(dep_detail[0].find('dependentrelationshipcd').text)
Output:
<submissionid>00000000000000002222</submissionid>
00000000000000002222
------------------------------
<dependentrelationshipcd>SON</dependentrelationshipcd>
SON

Thanks for your guidance.

As I mentioned, this is big XML; if I go element by element with explicit navigation as above, it is a hard task to pull off.

We may have 45K to 50K XML elements to traverse this way.

Does lxml work as DOM or serial parsing? How will it handle pulling all this big XML into a DOM?

Is there a way to pull elements using XPath in lxml?

Are there any options in Python to do parallel parsing, like SAX in Java?

Thanks for your guidance.
#8
Here's a demo of using lxml with XPath.
from lxml import etree

root = etree.parse('W2Testfile.xml')
sub = root.xpath('//SubmissionId')[0]
print(sub.text)
for tag in root.xpath('//DependentRelationshipCd'):
    print(tag.text)
Output:
00000000000000002222
SON
HALF BROTHER
HALF BROTHER
STEPCHILD
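To cover the last step in your original list (pushing the extracted data to a flat file), here's a minimal sketch with the csv module. The in-memory XML stands in for the real file, and records.csv is a hypothetical output name.

```python
import csv
from lxml import etree

# In-memory stand-in for the real file; pass a filename to etree.parse() in practice
xml = b'''<DataFileForEFDS>
  <DataRecord>
    <SubmissionId>00000000000000002222</SubmissionId>
    <DLN>12345678901234</DLN>
  </DataRecord>
</DataFileForEFDS>'''

root = etree.fromstring(xml)
with open('records.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['SubmissionId', 'DLN'])
    # One CSV row per DataRecord, pulled out with XPath
    for rec in root.xpath('//DataRecord'):
        writer.writerow([rec.findtext('SubmissionId'), rec.findtext('DLN')])
```

The same loop could feed an INSERT statement instead of a csv writer if the target is a database.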
MDRI Wrote:We may have 45K to 50K xml elements to traverse this way.
That's not that large; it would not cause any problems.
MDRI Wrote: Does lxml work as DOM or serial parsing? How will it handle pulling all this big XML into a DOM?
lxml can operate in either mode; it just depends on how your code uses it.
It can either parse the full file into a DOM, or use SAX callbacks to parse serially.
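For the serial side, a small sketch using etree.iterparse, which streams matching elements instead of building the whole tree; the in-memory sample stands in for the real file on disk.

```python
from io import BytesIO
from lxml import etree

# Sample data; for a real file, pass the filename to iterparse instead of BytesIO
xml = b'''<DataFileForEFDS>
  <DataRecord><SubmissionId>00000000000000002222</SubmissionId></DataRecord>
  <DataRecord><SubmissionId>00000000000000003333</SubmissionId></DataRecord>
</DataFileForEFDS>'''

ids = []
# Only fires on the tags we care about, element by element
for event, elem in etree.iterparse(BytesIO(xml), tag='SubmissionId'):
    ids.append(elem.text)
    elem.clear()  # free memory as we go; important for huge files

print(ids)  # ['00000000000000002222', '00000000000000003333']
```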
MDRI Wrote: Are there any options in Python to do parallel parsing, like SAX in Java?
Yes, several ways, e.g. concurrent.futures.
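A rough sketch of one such approach with concurrent.futures; lxml releases the GIL while parsing, so even threads can overlap the work. The two byte strings here stand in for separate XML files on disk.

```python
from concurrent.futures import ThreadPoolExecutor
from lxml import etree

# Two in-memory documents standing in for separate XML files
docs = [
    b'<DataRecord><SubmissionId>1111</SubmissionId></DataRecord>',
    b'<DataRecord><SubmissionId>2222</SubmissionId></DataRecord>',
]

def extract(data):
    # Parse one document and pull out the field we care about
    root = etree.fromstring(data)
    return root.findtext('SubmissionId')

# Fan the parsing out across worker threads, one document each
with ThreadPoolExecutor() as pool:
    results = list(pool.map(extract, docs))

print(results)  # ['1111', '2222']
```

For CPU-heavy per-file work on real files, swapping in ProcessPoolExecutor (and passing filenames instead of bytes) follows the same pattern.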

Try some stuff and see how it goes.