Python Forum
Formatting generated .data file to XML
#1
I have some generated data files I want to format to XML:

    1234=>item1:something11:
    
    something11<COMMA>item4:something12:
    
    12something<END_OF_OBJECT_LINE>
    1238=>item8:something12:
    
    something11:<END_OF_OBJECT_LINE>
    2345=>item2:something12:
    
    something11:<END_OF_OBJECT_LINE>
    123=>item1:something1:
    
    something11<COMMA>item2:something:
    
    11something<COMMA>item4:something:
    
    11something<END_OF_OBJECT_LINE>
What I tried to do is replace certain substrings so that the result looks like XML:

    with open("OGfile.data", "r") as f:
        with open("tempfile.data", "w") as fo:
            # formatting file to XML format
            contents = f.readlines()
            contents.insert(0, "<?xml version='1.0' encoding='UTF-8'?>\n<Module>\n<Object id='")
            contents = [w.replace("<END_OF_OBJECT_LINE>\n", "'/>\n</Object>\n<Object id='") for w in contents]
            contents = [w.replace("=>", "'>\n     <Attribute name='") for w in contents]
            contents = [w.replace('<COMMA>', "'/>\n     <Attribute name='") for w in contents]
            contents = [w.replace(':something', "' value='something") for w in contents]
            # saving formatted file to new file
            contents = "".join(contents)
            fo.write(contents)

    # fixing invalid last line of the formatted file
    with open("tempfile.data", "r") as f2:
        with open("finalfile.data", "w") as fo2:
            contents2 = f2.readlines()
            contents2 = [w.replace("<END_OF_OBJECT_LINE>", "'/>\n</Object>\n</Module>") for w in contents2]
            contents2 = "".join(contents2)
            fo2.write(contents2)
and it works fine. I turned it into:

<?xml version='1.0' encoding='UTF-8'?>
    <Module>
    <Object id='1234'>
         <Attribute name='item1' value='something11:
    
    something11'/>
         <Attribute name='item4' value='something12:
    
    12something'/>
    </Object>
    <Object id='1238'>
         <Attribute name='item8' value='something12:
    
    something11:'/>
    </Object>
    <Object id='2345'>
         <Attribute name='item2' value='something12:
    
    something11:'/>
    </Object>
    <Object id='123'>
         <Attribute name='item1' value='something1:
    
    something11'/>
         <Attribute name='item2' value='something:
    
    11something'/>
         <Attribute name='item4' value='something:
    
    11something'/>
    </Object>
    </Module>
BUT there is one problem: I am doing the replacement contents = [w.replace(':something', "' value='something") for w in contents] by matching this literal value, so if a value started with anything other than "something" I would be doomed. I have been thinking about using a regex to take the string between the attribute name and "<COMMA>" or "<END_OF_OBJECT_LINE>", but my attempts failed miserably because I am quite new to programming and Python. It could also be done much better if I could somehow convert this .data file into a dictionary and then turn it into XML in a proper way, but I have no idea how to split it correctly into a dictionary. Do you have any suggestions?
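
One way to get such a dictionary without depending on the value starting with "something" is to split on the markers themselves and cut each attribute at its first colon. This is only a minimal sketch: it assumes attribute names never contain a colon and that the marker strings never occur inside a value. `parse_records` and `sample` are hypothetical names used for illustration.

```python
def parse_records(text):
    """Return {object_id: {attribute_name: value}} for the .data format above."""
    records = {}
    for chunk in text.split("<END_OF_OBJECT_LINE>"):
        chunk = chunk.strip("\n")
        if not chunk:
            continue  # skip the empty piece after the last marker
        obj_id, _, body = chunk.partition("=>")
        attributes = {}
        for part in body.split("<COMMA>"):
            # partition() cuts at the FIRST colon only, so colons inside the value survive
            name, _, value = part.partition(":")
            attributes[name] = value
        records[obj_id] = attributes
    return records


# Hypothetical sample data in the same shape as the file shown above.
sample = (
    "1234=>item1:something11:\n\nsomething11<COMMA>item4:something12:\n\n12something<END_OF_OBJECT_LINE>\n"
    "1238=>item8:something12:\n\nsomething11:<END_OF_OBJECT_LINE>"
)
records = parse_records(sample)
print(records["1234"]["item4"])
```

Because the value is simply "everything after the first colon", it no longer matters what the value starts with.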
Reply
#2
(Apr-13-2022, 05:04 PM)malcoverc Wrote: I have some generated data files I want to format to XML: [...]
See section 3.3.3 (Attribute-Value Normalization) of the XML specification. Be aware that it says line breaks and tabs inside attribute values are replaced by spaces and, for attributes whose declared type is not CDATA, that sequences of spaces are then reduced to a single space. So, if I have read the spec correctly, you may not end up with what you expect. See the example table right before section 3.4.
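
The effect can be checked with a short experiment. This sketch uses the standard-library `xml.etree.ElementTree` parser (not something used elsewhere in this thread) and assumes no DTD declares the attribute, so it is treated as CDATA. The element and attribute names just mirror the output above.

```python
import xml.etree.ElementTree as ET

# A literal line break written inside an attribute value is normalized to a space on parsing.
parsed = ET.fromstring("<Attribute name='item1' value='something11:\nsomething11'/>")
print(repr(parsed.get("value")))

# A value set through the API is written out as a character reference instead,
# so the line break survives a write/read round trip.
element = ET.Element("Attribute", name="item1", value="something11:\nsomething11")
round_trip = ET.fromstring(ET.tostring(element))
print(repr(round_trip.get("value")))
```

So XML assembled by string replacement loses the line breaks inside values when it is parsed, while XML written by a library that escapes them keeps them.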

You have not shown an example of the regular expressions you tried. Regular expression syntax is fairly straightforward; the key is using parentheses to mark the parts of the pattern you want to capture. What you want to capture is unclear to me, though: you refer to :something as your desired pattern, but I do not see where :something appears in the input. If you could show the string before and after the replace (print statements are very good for this), as well as the pattern you are using, it would make things a lot clearer.
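
As an illustration of capturing with parentheses and of printing the string before and after, here is a small sketch on a single line of the input. The pattern is only an assumption about what is wanted (id, attribute name, start of the value), not the pattern the original poster tried.

```python
import re

line = "1234=>item1:something11:"

# Each pair of parentheses is a capture group:
#   (\d+)    the id before "=>"
#   ([^:]+)  the attribute name, up to the first colon
#   (.*)     the rest of the line, i.e. the start of the value
pattern = re.compile(r"(\d+)=>([^:]+):(.*)")

match = pattern.match(line)
print("before:", line)
print("groups:", match.groups())

# In the replacement string the groups are referred to as \1, \2 and \3.
after = pattern.sub(r"<Object id='\1'><Attribute name='\2' value='\3", line)
print("after: ", after)
```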
Reply
#3
I know that XML parsing normalizes the structure of the file and removes some whitespace and extra characters, but that is not so important here. Thanks for pointing it out.

My idea was to use a regex to take the number (id) before "=>" and the string (attribute name) between "=>" and ":". The problem is taking the next string (value) between the attribute name and "<COMMA>" or "<END_OF_OBJECT_LINE>". But as I said, it is only an idea, and I am not sure it is the best possible solution. Another problem would occur once I manage to split the file the way I want, because I would need to save it correctly. I have never worked with regex before, so it is quite difficult for me to understand.
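
The idea described here can be sketched on one record. The sketch assumes names contain no colon and values contain neither marker. `record` is a hypothetical sample in the same shape as the input file.

```python
import re

record = "123=>item1:something1:\n\nsomething11<COMMA>item2:something:\n\n11something<END_OF_OBJECT_LINE>"

# Split off the id, then scan the remainder for name:value pairs.
obj_id, _, body = record.partition("=>")

# ([^:]+)  attribute name, up to the first colon
# (.*?)    value, as short as possible ...
# (?:<COMMA>|<END_OF_OBJECT_LINE>)  ... up to either terminator
# re.DOTALL lets "." match the line breaks inside a value.
attribute = re.compile(r"([^:]+):(.*?)(?:<COMMA>|<END_OF_OBJECT_LINE>)", re.DOTALL)

pairs = attribute.findall(body)
print(obj_id, pairs)
```

The non-greedy `(.*?)` followed by the alternation is what makes the value stop at whichever terminator comes first.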
Reply
#4
Hello, I would like to post an update. I managed to find some kind of solution with regex, but I have one problem. In the code below, I can only get as many attributes as the number of times I wrote "(?:(<COMMA>)([^:]+):(.+?))?", so the regex expects up to 2 <COMMA> instances per "record", which produces up to 3 Attribute elements. But there might be a situation where my file has 100+ attributes separated by <COMMA>. Do you have any idea how to make it find every attribute without writing a mile-long regex?

    import re
    from lxml import etree

    root = etree.Element("Module")

    with open("datafile.data", "r") as f:
        df = f.read()

    # One record: id, first attribute, up to two optional <COMMA> attributes, end marker.
    # The trailing \n is optional so a last record without a final line break still matches.
    result = re.finditer(r'(?s)\n?(\d{1,5})=>(?:([^:]+):(.+?))(?:(<COMMA>)([^:]+):(.+?))?(?:(<COMMA>)([^:]+):(.+?))?(<END_OF_OBJECT_LINE>\n?)', df)
    for m in result:
        groups = m.groups()
        obj = etree.SubElement(root, "Object")
        obj.set("id", groups[0])
        # The first attribute is always present.
        at = etree.SubElement(obj, "Attribute")
        at.set("name", groups[1])
        at.set("value", groups[2])
        # Each captured <COMMA> is followed by the name and value groups of one more attribute.
        for idx, group in enumerate(groups):
            if group == '<COMMA>':
                at = etree.SubElement(obj, "Attribute")
                at.set("name", groups[idx + 1])
                at.set("value", groups[idx + 2])

    print(etree.tostring(root, pretty_print=True).decode("utf-8"))
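
One possible way around the fixed number of groups is to let the regex capture the whole attribute list of a record and then split that list on <COMMA>, which works for any number of attributes. The sketch below is not the code from this thread: it swaps in the standard-library `xml.etree.ElementTree` for lxml, uses `\d+` for the id, and assumes names contain no colon and values contain no marker strings. `build_module` and `sample` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# One record: id, "=>", then everything up to the end-of-object marker.
RECORD = re.compile(r"(\d+)=>(.*?)<END_OF_OBJECT_LINE>", re.DOTALL)


def build_module(text):
    """Build a <Module> tree with one <Object> per record."""
    root = ET.Element("Module")
    for record in RECORD.finditer(text):
        obj = ET.SubElement(root, "Object", id=record.group(1))
        # Splitting the captured body handles any number of <COMMA> separators.
        for part in record.group(2).split("<COMMA>"):
            name, _, value = part.partition(":")
            ET.SubElement(obj, "Attribute", name=name, value=value)
    return root


# Hypothetical sample: one record with two attributes, one with three.
sample = (
    "1234=>item1:a<COMMA>item4:b:c<END_OF_OBJECT_LINE>\n"
    "123=>item1:x<COMMA>item2:y<COMMA>item4:z<END_OF_OBJECT_LINE>"
)
root = build_module(sample)
print(ET.tostring(root, encoding="unicode"))
```

The regex stays the same length no matter how many attributes a record has, because the repetition is handled by `split` rather than by repeated optional groups.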
Reply

