Problem formatting output text

aj347 · Sep-09-2017, 04:03 AM

The following text is a small part of a large (4MB) file that I am trying to pull text
from and save to a CSV file to open in Excel.

<Tag Name="C0000" TagType="Base" DataType="DINT" Radix="Decimal" Constant="false" ExternalAccess="Read/Write">
<Description>
<![CDATA[System
Signals]]>
</Description>
<Comments>
<Comment Operand=".0">
<![CDATA[System
Always OFF
Bit]]>
</Comment>
<Comment Operand=".1">
<![CDATA[System
Always ON
Bit]]>
</Comment>
<Comment Operand=".2">
<![CDATA[System
Simulation State]]>
</Comment>
...etc.

The bold face chars are what I want to put together into another file.
The .0, .1, .2 etc. go up to a maximum of .31 but can end before that.
The Tag Name "C0000" is just text, usually a descriptive name.
After .31 (or wherever the list ends) comes another Tag Name (e.g. "C0001", "Timer", etc.) and then it repeats, 0 to 31.

So far I've got a beginning but I am now struggling.

import re

#open file to read (the with block closes it automatically)
with open('E:\\ab-test.txt') as fo:
    #read a line
    for x in fo.read().split("\n"):
        #if we find 'Tag Name' get what follows
        if (re.findall('Tag Name',x)):
            print x[11:16],
        #now look for CDATA and print the rest of that line
        if (re.findall('CDATA',x)):
            print x[9:]

Output:
C0000 System
System
.
.
System

What I want is to have the output look like the following:

Output:
C0000    System Signals
C0000.0 , System Always OFF Bit
C0000.1 , System Always ON Bit
C0000.2 , System Simulation State

Once the output is right it will be saved to newfile.csv.
The forum asks for as much info as possible; I hope I have provided it.
I'm using Python 2.7 on a PC.

Thanks for reading to the end.
And thanks for any help.

***ichabod801*** · Sep-09-2017, 01:20 PM

Regular expressions are not a good way to search XML. Use a dedicated XML parser like lxml.

***snippsat*** · Sep-09-2017, 02:29 PM

Use a parser as mentioned. For CDATA, html.parser is in fact fine to use in BS (BeautifulSoup).
You can plug different parsers into BS; I usually use lxml as the parser.
Example:

from bs4 import BeautifulSoup

cdata = '''\
<Tag Name="C0000" TagType="Base" DataType="DINT" Radix="Decimal" Constant="false" ExternalAccess="Read/Write">
    <Description>
        <![CDATA[System Signals]]>
    </Description>
    <Comments>
        <Comment Operand=".0">
            <![CDATA[System Always OFFBit]]>
        </Comment>
        <Comment Operand=".1">
            <![CDATA[System Always ONBit]]>
        </Comment>
        <Comment Operand=".2">
            <![CDATA[System Simulation State]]>
        </Comment>
     </Comments>'''

soup = BeautifulSoup(cdata, 'html.parser')

Test:

>>> soup.find_all('comment')
[<comment operand=".0">
<![CDATA[System Always OFFBit]]>
</comment>,
 <comment operand=".1">
<![CDATA[System Always ONBit]]>
</comment>,
 <comment operand=".2">
<![CDATA[System Simulation State]]>
</comment>]

>>> soup.find_all('comment')[1].text.strip()
'System Always ONBit'
>>> soup.find_all('comment')[2].text.strip()
'System Simulation State'
>>>

Wanted output:

>>> all_comment = soup.find_all('comment')
>>> for index,item in enumerate(all_comment):
...     print('C0000.{}'.format(index), item.text.strip())
... 
C0000.0 System Always OFFBit
C0000.1 System Always ONBit
C0000.2 System Simulation State
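
The loop above hard-codes C0000 and numbers the comments with enumerate, which only matches the file if the operands run .0, .1, .2 with no gaps. A minimal sketch of one way to generalise it, assuming the real file wraps its Tag elements in a single root element (the `<Tags>` wrapper, the second "Timer" tag and the helper names below are made up for illustration). It swaps in the standard library's xml.etree.ElementTree in place of BS so nothing needs installing, and is written for Python 3:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: a made-up <Tags> root around two <Tag> entries,
# laid out like the excerpt in the original post.
SAMPLE = """<Tags>
<Tag Name="C0000" TagType="Base" DataType="DINT">
<Description>
<![CDATA[System
Signals]]>
</Description>
<Comments>
<Comment Operand=".0">
<![CDATA[System
Always OFF
Bit]]>
</Comment>
<Comment Operand=".2">
<![CDATA[System
Simulation State]]>
</Comment>
</Comments>
</Tag>
<Tag Name="Timer" TagType="Base" DataType="DINT">
<Description>
<![CDATA[Cycle
Timer]]>
</Description>
</Tag>
</Tags>"""


def clean(text):
    """Collapse line breaks and repeated spaces into single spaces."""
    return " ".join((text or "").split())


def tag_rows(root):
    """Yield (name, text) pairs for every Tag and each of its Comments."""
    for tag in root.iter("Tag"):
        name = tag.get("Name")
        yield name, clean(tag.findtext("Description"))
        for comment in tag.iter("Comment"):
            # Use the Operand attribute (".0", ".2", ...) rather than a counter,
            # so gaps in the bit numbering are preserved.
            yield name + comment.get("Operand"), clean(comment.text)


rows = list(tag_rows(ET.fromstring(SAMPLE)))

# Written to an in-memory buffer here; for a real file, pass the csv writer
# a file opened with open("newfile.csv", "w", newline="") instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Letting the csv module do the writing also takes care of quoting if a description ever contains a comma.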

aj347 · Sep-10-2017, 03:15 PM

Thanks to both for your replies.
I have come across BeautifulSoup but never tried it, and I didn't know about lxml, which I'm looking into now.

I will try to reply with a bit more detailed info on my problem; hopefully that will clear up what I am trying to do.
I'm not a programmer, that's issue #1, and I only started using Python when I got my Raspberry Pi.

Question, XML file is still basically a text file, isn't it?

I was hoping for a way that doesn't add more challenges (as in add-on's).

However, I do well with YouTube and hitting pause at each step to allow it to sink in.

Cheers

***snippsat*** · (This post was last modified: Sep-10-2017, 04:27 PM by snippsat.)

(Sep-10-2017, 03:15 PM)aj347 Wrote: Question, XML file is still basically a text file, isn't it?
was hoping for a way that doesn't add more challenges (as in add-on's).

It's a text file written in a markup language, which is what XML is.
Using "add-ons" (third-party libraries) is an important part of using and learning Python.
Using regex on XML/HTML is wrong and difficult.
CDATA is a special case, though, and regex can work better for that.

Quote:CDATA stands for Character Data and it means that the data in between these strings includes data,
that could be interpreted as XML markup, but should not be.
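
Because a CDATA section always starts with `<![CDATA[` and ends with `]]>`, a regex can pull out the raw text between those markers. A rough sketch using only the standard library (the sample string and variable names are just for illustration). Note that on its own it loses track of which Tag or Operand each section belongs to, which is where a parser still helps:

```python
import re

text = """<Comment Operand=".0">
<![CDATA[System
Always OFF
Bit]]>
</Comment>
<Comment Operand=".1">
<![CDATA[System
Always ON
Bit]]>
</Comment>"""

# Non-greedy match between the CDATA markers; re.DOTALL lets "." span newlines.
cdata_pattern = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)

# Collapse the line breaks inside each section into single spaces.
sections = [" ".join(body.split()) for body in cdata_pattern.findall(text)]
print(sections)
```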

Python has pip built in on all newer Python versions,
which makes it simple to install third-party libraries:
pip install beautifulsoup4 lxml
That's all; then you have BS and lxml.

I have a two-part tutorial where I use BS and lxml for parsing:
part1, part2.

**nilamo** · Sep-10-2017, 04:54 PM

(Sep-10-2017, 03:15 PM)aj347 Wrote: Question, XML file is still basically a text file, isn't it?

I was hoping for a way that doesn't add more challenges (as in add-on's).

XML is just a plain text file. However, that also means there are no requirements about where the whitespace goes. If you try doing it yourself, using the method you've been using, then you'll never get the data if the source changes to look like this:

<Tag
  Name="C0000"
  TagType="Base"
  DataType="DINT"
  Radix="Decimal"
  Constant="false"
  ExternalAccess="Read/Write">
<Description>
<![CDATA[
System Signals
]]>
</Description>
</Tag>

An xml parser would be able to handle that just fine, though. If you just want to do something fast, then using a module to do most of the work for you is almost always going to be the better option.
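
To make that concrete, here is a small sketch comparing the two approaches on the reformatted layout above. It uses the standard library's xml.etree.ElementTree as one example of a parser (lxml or BS would do the same job); the variable names are only for illustration:

```python
import xml.etree.ElementTree as ET

reformatted = """<Tag
  Name="C0000"
  TagType="Base"
  DataType="DINT"
  Radix="Decimal"
  Constant="false"
  ExternalAccess="Read/Write">
<Description>
<![CDATA[
System Signals
]]>
</Description>
</Tag>"""

# Line-by-line matching, as in the original script, finds nothing here:
# no single line contains 'Tag Name', so the name would never be printed.
lines_with_tag_name = [line for line in reformatted.split("\n") if "Tag Name" in line]

# The parser reads attributes and text regardless of where the line breaks fall.
tag = ET.fromstring(reformatted)
name = tag.get("Name")
description = " ".join(tag.findtext("Description").split())
print(lines_with_tag_name, name, description)
```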
