Formatting generated .data file to XML

malcoverc · Apr-13-2022, 05:04 PM

I have some generated data files I want to format to XML:

    1234=>item1:something11:
    
    something11<COMMA>item4:something12:
    
    12something<END_OF_OBJECT_LINE>
    1238=>item8:something12:
    
    something11:<END_OF_OBJECT_LINE>
    2345=>item2:something12:
    
    something11:<END_OF_OBJECT_LINE>
    123=>item1:something1:
    
    something11<COMMA>item2:something:
    
    11something<COMMA>item4:something:
    
    11something<END_OF_OBJECT_LINE>

What I tried to do is replace some specific substrings to make it look like XML:

    with open("OGfile.data", "r") as f:
        with open("tempfile.data", "w") as fo:
            # formatting file to XML format
            contents = f.readlines()
            contents.insert(0, "<?xml version='1.0' encoding='UTF-8'?>\n<Module>\n<Object id='")
            contents = [w.replace("<END_OF_OBJECT_LINE>\n", "'/>\n</Object>\n<Object id='") for w in contents]
            contents = [w.replace("=>", "'>\n     <Attribute name='") for w in contents]
            contents = [w.replace('<COMMA>', "'/>\n     <Attribute name='") for w in contents]
            contents = [w.replace(':something', "' value='something") for w in contents]
            # saving formatted file to new file
            contents = "".join(contents)
            fo.write(contents)
    
    # fixing invalid last line from formatted file
    with open("tempfile.data", "r") as f2:
        with open("finalfile.data", "w") as fo2:
            contents2 = f2.readlines()
            contents2 = [w.replace("<END_OF_OBJECT_LINE>", "'/>\n</Object>\n</Module>") for w in contents2]
            contents2 = "".join(contents2)
            fo2.write(contents2)

It works fine, and I got this:

    <?xml version='1.0' encoding='UTF-8'?>
    <Module>
    <Object id='1234'>
         <Attribute name='item1' value='something11:
    
    something11'/>
         <Attribute name='item4' value='something12:
    
    12something'/>
    </Object>
    <Object id='1238'>
         <Attribute name='item8' value='something12:
    
    something11:'/>
    </Object>
    <Object id='2345'>
         <Attribute name='item2' value='something12:
    
    something11:'/>
    </Object>
    <Object id='123'>
         <Attribute name='item1' value='something1:
    
    something11'/>
         <Attribute name='item2' value='something:
    
    11something'/>
         <Attribute name='item4' value='something:
    
    11something'/>
    </Object>
    </Module>

But there is one problem. I am changing contents = [w.replace(':something', "' value='something") for w in contents] by matching this specific value, so if a value started with anything other than "something" I would be doomed. I have been thinking about using regex to take the string between the attribute name and "<COMMA>" or "<END_OF_OBJECT_LINE>", but my attempts failed miserably because I am quite new to programming and Python. It could also be done much better if I could somehow convert this .data file into a dictionary and then turn it into XML in a proper way, but I have no idea how to separate it correctly into a dictionary. Do you have any suggestions?
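One way the dictionary idea could be approached, sketched with plain string methods only. The function name parse_records and the sample text are made up for illustration. The sketch assumes that the markers never occur inside a value, that attribute names contain no colon, and that attribute names are unique within one object (a repeated name would overwrite the earlier one).

```python
END = "<END_OF_OBJECT_LINE>"
COMMA = "<COMMA>"


def parse_records(text):
    """Return {object_id: {attribute_name: value}} (hypothetical helper)."""
    records = {}
    for chunk in text.split(END):
        # Drop the newline that separates one record from the next.
        chunk = chunk.strip("\n")
        if not chunk:
            continue  # the empty piece after the last marker
        object_id, _, body = chunk.partition("=>")
        attributes = {}
        for pair in body.split(COMMA):
            # partition() splits at the FIRST colon only, so colons and
            # newlines inside the value are left untouched.
            name, _, value = pair.partition(":")
            attributes[name] = value
        records[object_id] = attributes
    return records


# Made-up sample in the same shape as the data above.
sample = (
    "1234=>item1:something11:\n\nsomething11<COMMA>item4:something12:\n\n12something<END_OF_OBJECT_LINE>\n"
    "1238=>item8:other:\n\nvalue:<END_OF_OBJECT_LINE>\n"
)
records = parse_records(sample)
print(records)
```

Because the name is cut at the first colon, it no longer matters what the value starts with.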

supuflounder · Apr-13-2022, 08:29 PM

See section 3.3.3 of the XML specification (attribute-value normalization). Be aware that it says each newline in an attribute value is replaced by a space and, for attributes whose declared type is not CDATA, that runs of spaces are then reduced to a single space. So if I have read the spec correctly, you may not end up with what you expect. See the example table right before section 3.4.
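A small check of that normalization, using the standard library's xml.etree.ElementTree. The element and its value are made up to mirror the generated output, and no DTD is involved, so the attribute is treated as CDATA.

```python
import xml.etree.ElementTree as ET

# An attribute value containing two literal newlines.
document = "<Attribute name='item1' value='something12:\n\nsomething11'/>"
value = ET.fromstring(document).get("value")
print(repr(value))  # the newlines do not survive parsing

# Newlines written as character references are kept by the parser.
escaped = "<Attribute name='item1' value='something12:&#10;&#10;something11'/>"
kept = ET.fromstring(escaped).get("value")
print(repr(kept))
```

So if the line breaks inside the values matter, they would need to be written as &#10; references, or the value would have to go into element text instead of an attribute.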

You have not shown an example of the regular expressions you tried. Regular expression syntax is fairly straightforward; the key is using parentheses to mark the parts of the pattern you want to capture. What you want to match is unclear to me, though: you refer to :something as your desired pattern, but I see no place in the input where :something appears. If you could show the string before the replace and after the replace (print statements are very good for this) as well as the pattern you are using, it would make things a lot clearer.
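To illustrate what the parentheses do, here is a minimal sketch with a made-up input string. The pattern is an assumption about the format, not something taken from the original post: group 1 is the attribute name (everything up to the first colon) and group 2 is the value (everything up to the next marker). Note that this is pure text substitution and does no XML escaping.

```python
import re

before = "item1:abc:\n\ndef<COMMA>item4:xyz<END_OF_OBJECT_LINE>"

# ( ... ) marks a capture group; (?: ... ) groups without capturing.
# re.DOTALL lets "." match newlines, because values span several lines.
pattern = re.compile(r"([^:]+):(.*?)(?:<COMMA>|<END_OF_OBJECT_LINE>)", re.DOTALL)

# \1 and \2 in the replacement refer back to the captured groups.
after = pattern.sub(r"<Attribute name='\1' value='\2'/>\n", before)

print("before:", repr(before))
print("after: ", repr(after))
```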

malcoverc · Apr-13-2022, 09:16 PM

I know that XML parsing normalizes the structure of the file and removes some whitespace and extra characters, but that is not so important here. Thanks for pointing it out, though.

My idea was to use regex to take the number (id) before "=>" and the string (attribute name) between "=>" and ":". The problem is taking the next string (value) between the attribute name and "<COMMA>" or "<END_OF_OBJECT_LINE>". But as I said, it is only an idea, and I am not sure it is the best possible solution. Another problem would occur once I have separated the file the way I want, because I would need to save it in the correct way. I have never worked with regex before, so it is quite difficult for me to understand.

malcoverc · (This post was last modified: Apr-14-2022, 09:41 PM by malcoverc.)

Hello, I would like to post an update. I managed to find some kind of solution with regex, but I have one problem. In the code below, I can only get as many attributes as the number of times I wrote "(?:(<COMMA>)([^:]+):(.+?))?", so the regex expects up to 2 <COMMA> instances per "record", which produces up to 3 Attribute elements. But there might be a situation where my file has 100+ attributes separated by <COMMA>. Do you have any idea how to make it find every attribute without writing a mile-long regex?

    import re
    from lxml import etree
    
    root = etree.Element("Module")
    
    with open("datafile.data", "r") as f:
        df = f.read()
    
    # One record: id, a first name:value pair, then up to two optional
    # <COMMA>-separated pairs, terminated by <END_OF_OBJECT_LINE>.
    result = re.finditer(r'(?s)\n?(\d{1,5})=>(?:([^:]+):(.+?))(?:(<COMMA>)([^:]+):(.+?))?(?:(<COMMA>)([^:]+):(.+?))?(<END_OF_OBJECT_LINE>\n)', df)
    for m in result:
        groups = m.groups()
        obj = etree.SubElement(root, "Object")
        obj.set("id", groups[0])
        at = etree.SubElement(obj, "Attribute")
        at.set("name", groups[1])
        at.set("value", groups[2])
        # Each captured <COMMA> marker is followed by a name group and a value group.
        for idx, group in enumerate(groups):
            if group == '<COMMA>':
                at = etree.SubElement(obj, "Attribute")
                at.set("name", groups[idx + 1])
                at.set("value", groups[idx + 2])
    
    print(etree.tostring(root, pretty_print=True).decode("utf-8"))
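One possible way around the repeated groups is to split the work in two: let the regex find only the whole record, then split the record body on <COMMA> in ordinary Python. The sketch below is an untested-against-real-data assumption about the format, not a drop-in replacement. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, it accepts ids of any length (\d+), and the name build_module and the sample text are made up.

```python
import re
import xml.etree.ElementTree as ET

# Stage 1: one match per record, capturing the id and the raw body.
RECORD = re.compile(r"(\d+)=>(.*?)<END_OF_OBJECT_LINE>", re.DOTALL)


def build_module(text):
    """Build a <Module> tree from the .data text (hypothetical helper)."""
    root = ET.Element("Module")
    for match in RECORD.finditer(text):
        obj = ET.SubElement(root, "Object", {"id": match.group(1)})
        # Stage 2: any number of attributes, one per <COMMA>-separated piece.
        for pair in match.group(2).split("<COMMA>"):
            name, _, value = pair.partition(":")  # split at the first colon only
            ET.SubElement(obj, "Attribute", {"name": name, "value": value})
    return root


# Made-up sample with four attributes in the first record.
sample = (
    "1234=>item1:a:\n\nb<COMMA>item2:c<COMMA>item3:d<COMMA>item4:e<END_OF_OBJECT_LINE>\n"
    "123=>item1:f<END_OF_OBJECT_LINE>\n"
)
root = build_module(sample)
print(ET.tostring(root, encoding="unicode"))
```

The same two-stage loop should carry over to lxml's etree.SubElement and set calls used in the code above, since only the record regex and the inner split change.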
