Best way to process large/complex XML/schema ? - Printable Version
+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Best way to process large/complex XML/schema ? (/thread-33644.html)
Best way to process large/complex XML/schema ? - MDRI - May-13-2021

Thanks for reviewing this thread. I would like to figure out a way to process large, complex XML and push the XML data to a flat file or database. Here is the high-level view:

1) Input: large/huge XML with complex nested/choice structures etc. (similar to HL7)
2) The above input XML messages need to be validated against an XML Schema (.xsd) for schema compliance
3) The validated XML messages are to be parsed and the data extracted
4) The extracted data is to be pushed to a flat file or database

We know Python is an interpreted language. Is Python the right one to do the above, performance-wise? What options do we have? Thanks for your guidance.

RE: Best way to process large/complex XML/schema ? - snippsat - May-13-2021

(May-13-2021, 01:44 AM)MDRI Wrote: We know Python is an interpreted language.

Performance is no problem, as e.g. lxml runs at C speed.

lxml Wrote: The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt.

(May-13-2021, 01:44 AM)MDRI Wrote: The above input XML messages need to be validated against an XML Schema (.xsd) for schema compliance.

Validation with lxml.

Quote: The validated XML messages are to be parsed and the data extracted.

I like to use BeautifulSoup (BS) for parsing; it is still the same speed, as here lxml is used as the parser.

```python
import requests
from bs4 import BeautifulSoup

url = 'http://httpbin.org/xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
title = soup.select_one('title')
print(title)
print(title.text)
```
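To make the validation step concrete: a minimal sketch of XSD validation with `lxml.etree.XMLSchema`. The schema and document here are tiny inline stand-ins (hypothetical content, for illustration only); with real files you would use `etree.parse('your.xsd')` and `etree.parse('your.xml')` instead.

```python
from lxml import etree

# Tiny inline schema and documents as stand-ins for real .xsd/.xml files
xsd = b'''<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="SubmissionId" type="xs:string"/>
</xs:schema>'''

schema = etree.XMLSchema(etree.fromstring(xsd))

good = etree.fromstring(b'<SubmissionId>00000000000000002222</SubmissionId>')
bad = etree.fromstring(b'<WrongTag/>')

print(schema.validate(good))  # True
print(schema.validate(bad))   # False
# After a failed validate(), error_log holds one entry per violation
for error in schema.error_log:
    print(error.message)
```

`validate()` just returns a boolean; `schema.error_log` is where you find the line numbers and messages for each violation, which matters when a huge message fails compliance.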
RE: Best way to process large/complex XML/schema ? - MDRI - May-14-2021

Thanks for your response. I would like to upload a complex XML message to share here, so we can look at how the above code works. What is the upload limit in this forum? I did not see an upload option. Is it feasible? Shall I upload to Google Drive and share the link here? Thanks for your guidance.

RE: Best way to process large/complex XML/schema ? - snippsat - May-15-2021

(May-14-2021, 06:42 PM)MDRI Wrote: Shall I upload to Google Drive and share the link here?

That is best if the file is large. Usually you can shorten the XML file so you have a sample file to try different stuff on. You could also do some training on a smaller sample file if you are new to lxml/BS.

RE: Best way to process large/complex XML/schema ? - MDRI - May-15-2021

Thanks for your reply. I am attaching a .xml file extracted from the big XML to show. It has a complexly nested structure, and it has its .xsd as well. The header portion looks like:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!--Sample XML file generated by XMLSpy v2016 rel. 2 sp1 (x64) (http://www.altova.com)-->
<DataFileForEFDS>
  <ReturnCount>0</ReturnCount>
  <DataRecord>
    <SubmissionId>00000000000000002222</SubmissionId>
    <DLN>12345678901234</DLN>
    <ElectronicPostmark>2001-12-17T09:30:47Z</ElectronicPostmark>
    <ETIN>String</ETIN>
    <TransmitterIPAddress>String</TransmitterIPAddress>
    <TransmitterTimestamp>2001-12-17T09:30:47Z</TransmitterTimestamp>
    <ReservedIPAddressCd>String</ReservedIPAddressCd>
    <MeFRuleNumber>String</MeFRuleNumber>
    <MeFRuleNumber>String</MeFRuleNumber>
    <Return returnVersion="String">
      <ReturnData documentCnt="2">
```

Are you able to point me to a code snippet/template to process this XML? Thanks for your guidance.

RE: Best way to process large/complex XML/schema ? - snippsat - May-15-2021

Here is an example of how I would read it and parse some data.
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('W2Testfile.xml', encoding='utf-8'), 'lxml')
# Just copy the tag from the doc and lower() the search, or write it lower-case
sub_id = soup.find('SubmissionId'.lower())
# Tag and text
print(sub_id)
print(sub_id.text)
# Take out a part, e.g. doc 2, then find_all() of a tag that there are several of
doc_2 = soup.find('returndata', {'documentcnt': '2'})
dep_detail = doc_2.find_all('DependentDetail'.lower())
print('-' * 30)
print(dep_detail[0].find('dependentrelationshipcd'))
print(dep_detail[0].find('dependentrelationshipcd').text)
```
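For step 4 of the original question (pushing the extracted data to a flat file), a minimal sketch using the stdlib `csv` module. The record dicts and field names here are hypothetical stand-ins for whatever you collect while parsing; substitute the tags you actually extract.

```python
import csv

# Hypothetical records, e.g. collected while parsing the XML above
records = [
    {'SubmissionId': '00000000000000002222', 'DependentRelationshipCd': 'SON'},
    {'SubmissionId': '00000000000000002222', 'DependentRelationshipCd': 'STEPCHILD'},
]

# DictWriter maps each record dict onto one CSV row
with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['SubmissionId', 'DependentRelationshipCd'])
    writer.writeheader()
    writer.writerows(records)

print(open('output.csv').read())
```

The same list of dicts can go straight into a database instead, e.g. with `sqlite3.executemany()`.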
RE: Best way to process large/complex XML/schema ? - MDRI - May-16-2021

(May-15-2021, 06:14 PM)snippsat Wrote: Here is an example of how I would read it and parse some data.

Thanks for your guidance. As I mentioned, this is a big XML; if I go element by element with explicit navigation as above, it is a hard task to pull off. We may have 45K to 50K XML elements to traverse this way. Does lxml work as DOM or serial parsing? How will it address pulling all this big XML into a DOM? Is there a way to pull elements using XPath in lxml? Are there any options in Python to do parallel parsing like SAX (Java)? Thanks for your guidance.

RE: Best way to process large/complex XML/schema ? - snippsat - May-16-2021

Here is a demo of using lxml with XPath.

```python
from lxml import etree

root = etree.parse('W2Testfile.xml')
sub = root.xpath('//SubmissionId')[0]
print(sub.text)
for tag in root.xpath('//DependentRelationshipCd'):
    print(tag.text)
```

Output:
00000000000000002222
SON
HALF BROTHER
HALF BROTHER
STEPCHILD

MDRI Wrote: We may have 45K to 50K XML elements to traverse this way.

Not that large; it would not cause any problems.

MDRI Wrote: Does lxml work as DOM or serial parsing? How will it address pulling all this big XML into a DOM?

lxml can operate in either mode; it just depends on how your code uses it. It can either parse the full file into a DOM, or use SAX-style callbacks to parse serially.

MDRI Wrote: Are there any options in Python to do parallel parsing like SAX (Java)?

Yes, several ways, e.g. concurrent.futures. Try some stuff and see how it goes.
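To illustrate the serial (SAX-style) mode mentioned above: `lxml.etree.iterparse` streams through the document and hands you one element at a time, so the whole file never needs to sit in memory as a DOM. A minimal sketch; the in-memory `BytesIO` document here is a tiny stand-in for a huge file, where you would pass the file path instead.

```python
import io
from lxml import etree

# Small in-memory stand-in for a huge file; with a real file pass its path
xml = io.BytesIO(b'''<DataFileForEFDS>
  <DataRecord><SubmissionId>1</SubmissionId></DataRecord>
  <DataRecord><SubmissionId>2</SubmissionId></DataRecord>
</DataFileForEFDS>''')

ids = []
# iterparse yields each matching element as its end tag is parsed
for event, elem in etree.iterparse(xml, tag='DataRecord'):
    ids.append(elem.findtext('SubmissionId'))
    elem.clear()  # free the subtree we are done with, keeping memory flat

print(ids)  # ['1', '2']
```

Clearing each element after use is what keeps memory usage constant regardless of file size, which is the point of choosing this mode over a full DOM parse for 45K-50K elements.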