Python Forum

Full Version: Formatting generated .data file to XML
I have some generated data files I want to format to XML:

    1234=>item1:something11:
    
    something11<COMMA>item4:something12:
    
    12something<END_OF_OBJECT_LINE>
    1238=>item8:something12:
    
    something11:<END_OF_OBJECT_LINE>
    2345=>item2:something12:
    
    something11:<END_OF_OBJECT_LINE>
    123=>item1:something1:
    
    something11<COMMA>item2:something:
    
    11something<COMMA>item4:something:
    
    11something<END_OF_OBJECT_LINE>
What I tried to do is replace certain substrings to make it look like XML:

    with open("OGfile.data", "r") as f:
        with open("tempfile.data", "w") as fo:
            # formatting file to XML format
            contents = f.readlines()
            contents.insert(0, "<?xml version='1.0' encoding='UTF-8'?>\n<Module>\n<Object id='")
            contents = [w.replace("<END_OF_OBJECT_LINE>\n", "'/>\n</Object>\n<Object id='") for w in contents]
            contents = [w.replace("=>", "'>\n     <Attribute name='") for w in contents]
            contents = [w.replace('<COMMA>', "'/>\n     <Attribute name='") for w in contents]
            contents = [w.replace(':something', "' value='something") for w in contents]
            # saving formatted file to new file
            contents = "".join(contents)
            fo.write(contents)

    # fixing invalid last line from formatted file
    with open("tempfile.data", "r") as f2:
        with open("finalfile.data", "w") as fo2:
            contents2 = f2.readlines()
            contents2 = [w.replace("<END_OF_OBJECT_LINE>", "'/>\n</Object>\n</Module>") for w in contents2]
            contents2 = "".join(contents2)
            fo2.write(contents2)
and it works fine. It turns the file into:

    <?xml version='1.0' encoding='UTF-8'?>
    <Module>
    <Object id='1234'>
         <Attribute name='item1' value='something11:
    
    something11'/>
         <Attribute name='item4' value='something12:
    
    12something'/>
    </Object>
    <Object id='1238'>
         <Attribute name='item8' value='something12:
    
    something11:'/>
    </Object>
    <Object id='2345'>
         <Attribute name='item2' value='something12:
    
    something11:'/>
    </Object>
    <Object id='123'>
         <Attribute name='item1' value='something1:
    
    something11'/>
         <Attribute name='item2' value='something:
    
    11something'/>
         <Attribute name='item4' value='something:
    
    11something'/>
    </Object>
    </Module>
However, there is one problem. The line contents = [w.replace(':something', "' value='something") for w in contents] only works because every value happens to start with "something"; if a value started with anything else, I would be doomed. I have been thinking about using a regex to capture the string between the attribute name and "<COMMA>" or "<END_OF_OBJECT_LINE>", but my attempts failed miserably because I am quite new to programming and Python. It could also be done much better if I could somehow convert this .data file into a dictionary and then generate the XML properly, but I have no idea how to split it correctly into a dictionary. Do you have any suggestions?
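One possible way to approach the dictionary idea, as a minimal sketch rather than a definitive parser: split the text on <END_OF_OBJECT_LINE>, split each record on "=>", split the remainder on <COMMA>, and cut each piece at its first ":". The name parse_records and the SAMPLE string are made up for illustration. The sketch assumes that ids are unique, that attribute names are unique within an object and never contain ":", and that values may contain colons and newlines.

```python
# Hypothetical sample mirroring the first two records shown in the post.
SAMPLE = (
    "1234=>item1:something11:\n\nsomething11<COMMA>item4:something12:\n\n12something<END_OF_OBJECT_LINE>\n"
    "1238=>item8:something12:\n\nsomething11:<END_OF_OBJECT_LINE>\n"
)


def parse_records(text):
    """Return {object_id: {attribute_name: value}} for the whole text."""
    records = {}
    for chunk in text.split("<END_OF_OBJECT_LINE>"):
        # Drop the newline that follows the previous terminator.
        chunk = chunk.lstrip("\n")
        if not chunk:
            continue  # nothing left after the last terminator
        object_id, _, body = chunk.partition("=>")
        attributes = {}
        for pair in body.split("<COMMA>"):
            # Only the first ":" separates name from value; any later
            # colons or newlines stay inside the value.
            name, _, value = pair.partition(":")
            attributes[name] = value
        records[object_id] = attributes
    return records


records = parse_records(SAMPLE)
for object_id, attributes in records.items():
    print(object_id, attributes)
```

Because the split happens on the markers themselves, the value no longer has to start with "something". A dictionary silently overwrites repeated keys, so a list of (name, value) pairs would be the safer choice if names can repeat.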
(Apr-13-2022, 05:04 PM) malcoverc wrote: I have some generated data files I want to format to XML: [...]
See section 3.3.3 (Attribute-Value Normalization) of the XML specification. Be aware that it says each newline in an attribute value is replaced by a space and, for attributes not declared as CDATA, that sequences of spaces are then reduced to a single space, so if I have read the spec correctly, you may not end up with what you expect. See the example table right before section 3.4.
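To make the normalization point concrete, here is a small sketch using the standard library's xml.etree.ElementTree (an assumption for illustration; the thread itself has not named a parser at this point). It parses an attribute whose value contains two literal newlines, then builds the same attribute through the API so that the serializer can escape the newlines instead.

```python
import xml.etree.ElementTree as ET

# An attribute value with two literal newlines, like the generated output.
doc = "<Attribute name='item1' value='something11:\n\nsomething11'/>"

# The parser normalizes each literal newline in the attribute to a space.
parsed_value = ET.fromstring(doc).get("value")
print(repr(parsed_value))

# Setting the value through the API lets the serializer write the newlines
# as character references, which are not normalized when read back.
elem = ET.Element("Attribute", name="item1", value="something11:\n\nsomething11")
serialized = ET.tostring(elem, encoding="unicode")
round_tripped = ET.fromstring(serialized).get("value")
print(serialized)
print(repr(round_tripped))
```

If the line breaks inside values matter, building the elements through an XML library rather than by string replacement is one way to keep them.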

You have not shown an example of the regular expressions you tried. Regular expression syntax is fairly straightforward; the key is using parentheses to capture the part of the pattern you are looking for. What that part is remains unclear to me, though: you refer to :something as your desired pattern, but I see no place in the input where :something appears. If you could show the string before and after the replace (print statements are very good for this), as well as the pattern you are using, it would make things a lot clearer.

I know that XML parsing normalizes the structure of the file and removes some whitespace and extra characters, but that is not so important here. Thanks for pointing it out, though.

My idea was to use a regex to capture the number (the id) before "=>" and the string (the attribute name) between "=>" and ":". The problem is capturing the other string (the value) between the attribute name and "<COMMA>" or "<END_OF_OBJECT_LINE>". But as I said, it was only an idea, and I am not sure it is the best possible solution. Another problem would occur once I have split the file the way I want, because I would need to save it correctly, and I have never worked with regex before, so it is quite difficult for me to understand.
Hello, I would like to post an update. I managed to find some kind of solution with regex, but I have one problem. In the code below, I can only get as many attributes as the number of times I wrote "(?:(<COMMA>)([^:]+):(.+?))?", so the regex expects up to 2 <COMMA> instances per "record", which produces up to 3 Attribute elements. However, there might be a situation where my file has more than 100 attributes separated by <COMMA>. Do you have any idea how to make it find every attribute without writing a mile-long regex?

import re
from lxml import etree

root = etree.Element("Module")

with open("datafile.data", "r") as f:
    df = f.read() 
    result = re.finditer(r'(?s)\n?(\d{1,5})=>(?:([^:]+):(.+?))(?:(<COMMA>)([^:]+):(.+?))?(?:(<COMMA>)([^:]+):(.+?))?(<END_OF_OBJECT_LINE>\n)', df)
    for m in result:
        obj = etree.SubElement(root, "Object")
        obj.set("id", m.groups()[0])
        at = etree.SubElement(obj, "Attribute")
        at.set("name",m.groups()[1])
        at.set("value",m.groups()[2])
        for idx in range(len(m.groups())):
            if m.groups()[idx] == '<COMMA>':
                at = etree.SubElement(obj, "Attribute")
                at.set("name",m.groups()[idx + 1])
                at.set("value",m.groups()[idx + 2])

print(etree.tostring(root, pretty_print=True).decode("utf-8"))
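
One way to avoid repeating the optional group, offered as a sketch under assumptions rather than a tested replacement: match each record with one regex, then run a second, smaller regex with finditer over the record body, so the number of attributes no longer depends on the pattern's length. The names RECORD_RE, ATTRIBUTE_RE, build_module and SAMPLE are hypothetical. The sketch swaps lxml for the standard library's xml.etree.ElementTree so it runs without extra packages, and it uses \d+ for the id instead of \d{1,5}.

```python
import re
import xml.etree.ElementTree as ET

# One record: id, "=>", then everything up to the terminator.
RECORD_RE = re.compile(r"(?s)(\d+)=>(.*?)<END_OF_OBJECT_LINE>")
# One attribute inside a record body: the name runs to the first ":",
# the value runs lazily to the next <COMMA> or to the end of the body.
ATTRIBUTE_RE = re.compile(r"(?s)([^:]+):(.*?)(?:<COMMA>|\Z)")

# Hypothetical sample mirroring the last record shown in the first post.
SAMPLE = (
    "123=>item1:something1:\n\nsomething11<COMMA>item2:something:\n\n11something"
    "<COMMA>item4:something:\n\n11something<END_OF_OBJECT_LINE>\n"
)


def build_module(text):
    """Build a <Module> tree with one <Object> per record."""
    root = ET.Element("Module")
    for record in RECORD_RE.finditer(text):
        obj = ET.SubElement(root, "Object", id=record.group(1))
        # The inner finditer yields one match per attribute, however many.
        for attribute in ATTRIBUTE_RE.finditer(record.group(2)):
            ET.SubElement(obj, "Attribute",
                          name=attribute.group(1), value=attribute.group(2))
    return root


root = build_module(SAMPLE)
print(ET.tostring(root, encoding="unicode"))
```

The same two-level loop should carry over to lxml's etree.SubElement and set calls used above, since only the matching strategy changes.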