Python Forum
Problem formatting output text
#1
The following text is a small part of a large (4 MB) file that I am trying to pull text
from and save to a CSV file to open in Excel.

<Tag Name="C0000" TagType="Base" DataType="DINT" Radix="Decimal" Constant="false" ExternalAccess="Read/Write">
<Description>
<![CDATA[System
Signals
]]>
</Description>
<Comments>
<Comment Operand=".0">
<![CDATA[System
Always OFF
Bit
]]>
</Comment>
<Comment Operand=".1">
<![CDATA[System
Always ON
Bit
]]>
</Comment>
<Comment Operand=".2">
<![CDATA[System
Simulation State
]]>
</Comment>
...etc.


The bold face chars are what I want to put together into another file.
The .0, .1, .2 etc. go on to a max of .31 but can end before that.
The Tag Name "C0000" is just text and usually a descriptive name.
After .31 (or wherever the list ends) comes another Tag Name (e.g. "C0001", "Timer", etc.) and then we do it again, 0 to 31.

So far I've got a beginning, but I am now struggling.

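from __future__ import print_function  # same script runs on Python 2.7 and 3
import re

# open the file to read; the with block closes it automatically
with open('E:\\ab-test.txt') as fo:
    # walk through the file one line at a time
    for x in fo.read().split("\n"):
        # if we find 'Tag Name', print the five characters of the name
        if re.search('Tag Name', x):
            print(x[11:16], end=' ')
        # now look for CDATA and print the rest of that line
        if re.search('CDATA', x):
            print(x[9:])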
Output:
C0000 System System . . System
What I want is to have the output look like the following:
Output:
C0000 System Signals C0000.0 , System Always OFF Bit C0000.1 , System Always ON Bit C0000.2 , System Simulation State
Once I have the output done, it will be saved to newfile.csv.
The forum says to give as much info as possible; I hope I have done this.
I'm using Python 2.7 on a PC.

Thanks for reading to the end.
And thanks for any help.
#2
Regular expressions are not a good way to search XML. Use a dedicated XML parser like lxml.
Craig "Ichabod" O'Brien - xenomind.com
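As a rough sketch of the parser approach: the example below uses the standard library's xml.etree.ElementTree as a stand-in for lxml (lxml is not used here, so nothing needs installing). The sample is the fragment from the question, shortened to two comments and closed with </Comments></Tag> so that it is well-formed; the helper names clean and tag_rows are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Fragment from the question, closed so that it is well-formed XML.
SAMPLE = """<Tag Name="C0000" TagType="Base" DataType="DINT" Radix="Decimal" Constant="false" ExternalAccess="Read/Write">
<Description>
<![CDATA[System
Signals
]]>
</Description>
<Comments>
<Comment Operand=".0">
<![CDATA[System
Always OFF
Bit
]]>
</Comment>
<Comment Operand=".1">
<![CDATA[System
Always ON
Bit
]]>
</Comment>
</Comments>
</Tag>"""


def clean(text):
    """Collapse newlines and runs of whitespace into single spaces."""
    return " ".join((text or "").split())


def tag_rows(tag):
    """Return (name, text) pairs for one <Tag>: its description, then each comment."""
    name = tag.get("Name")
    rows = [(name, clean(tag.findtext("Description")))]
    for comment in tag.iter("Comment"):
        # The Operand attribute (".0", ".1", ...) is appended to the tag name.
        rows.append((name + comment.get("Operand"), clean(comment.text)))
    return rows


rows = tag_rows(ET.fromstring(SAMPLE))
for name, text in rows:
    print(name, ",", text)
```

Because the operand is read from the attribute rather than counted, a tag whose comments stop before .31 or skip a bit should still be labelled correctly. For the full file, the same helper could be applied to every Tag element found in the parsed document.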
#3
Use a parser, as mentioned. For CDATA, html.parser is in fact fine to use in BeautifulSoup (BS).
You can plug different parsers into BS; I usually use lxml as the parser.
Example:
from bs4 import BeautifulSoup

cdata = '''\
<Tag Name="C0000" TagType="Base" DataType="DINT" Radix="Decimal" Constant="false" ExternalAccess="Read/Write">
    <Description>
        <![CDATA[System Signals]]>
    </Description>
    <Comments>
        <Comment Operand=".0">
            <![CDATA[System Always OFFBit]]>
        </Comment>
        <Comment Operand=".1">
            <![CDATA[System Always ONBit]]>
        </Comment>
        <Comment Operand=".2">
            <![CDATA[System Simulation State]]>
        </Comment>
    </Comments>
</Tag>'''

soup = BeautifulSoup(cdata, 'html.parser')
Test:
>>> soup.find_all('comment')
[<comment operand=".0">
<![CDATA[System Always OFFBit]]>
</comment>,
 <comment operand=".1">
<![CDATA[System Always ONBit]]>
</comment>,
 <comment operand=".2">
<![CDATA[System Simulation State]]>
</comment>]

>>> soup.find_all('comment')[1].text.strip()
'System Always ONBit'
>>> soup.find_all('comment')[2].text.strip()
'System Simulation State'
>>> 
Wanted output:
>>> all_comment = soup.find_all('comment')
>>> for index,item in enumerate(all_comment):
...     print('C0000.{}'.format(index), item.text.strip())
... 
C0000.0 System Always OFFBit
C0000.1 System Always ONBit
C0000.2 System Simulation State
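Since the end goal in the question is newfile.csv, here is a minimal sketch of that last step using the standard csv module. The rows list is hand-written sample data shaped like the name/text pairs printed above, and the header names "Name" and "Text" are assumptions, not something the question specifies.

```python
import csv
import io

# Hypothetical sample rows: (name, text) pairs like the ones printed above.
rows = [
    ("C0000", "System Signals"),
    ("C0000.0", "System Always OFF Bit"),
    ("C0000.1", "System Always ON Bit"),
    ("C0000.2", "System Simulation State"),
]


def write_rows(rows, handle):
    """Write a header line and one CSV line per (name, text) pair."""
    writer = csv.writer(handle)
    writer.writerow(["Name", "Text"])  # assumed column headers
    writer.writerows(rows)


# Written to an in-memory buffer here so the result can be inspected.
buffer = io.StringIO()
write_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# For the real file on Python 3, something like:
# with open("newfile.csv", "w", newline="") as handle:
#     write_rows(rows, handle)
```

The csv module quotes any text that happens to contain a comma, which is the main reason to prefer it over joining strings by hand before opening the file in Excel.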
#4
Thanks to both for your replies.
I have come across BeautifulSoup but never tried it, and I didn't know about lxml, which I'm looking into now.

I will try to reply with a bit more detailed info about my problem; hopefully that will clear up what I am trying to do.
I'm not a programmer, that's the #1 issue, and I started using Python when I got my Raspberry Pi.

Question, XML file is still basically a text file, isn't it?

I was hoping for a way that doesn't add more challenges (as in add-ons).

However, I do well with YouTube and hitting pause at each step to allow it to sink in.

Cheers
#5
(Sep-10-2017, 03:15 PM)aj347 Wrote: Question, XML file is still basically a text file, isn't it?
I was hoping for a way that doesn't add more challenges (as in add-ons).
Yes, XML is a text file containing markup.
Using "add-ons" (third-party libraries) is an important part of using and learning Python.
Using regex on XML/HTML in general is error-prone and difficult.
CDATA is a special case, though, and regex can work better for that.
Quote:CDATA stands for Character Data and it means that the data in between these strings includes data,
that could be interpreted as XML markup, but should not be.
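To illustrate the point about CDATA being a case where regex can work, here is a small sketch. It assumes the sections look like the ones in the question and that no section contains the closing marker itself; the pattern and variable names are made up for the example.

```python
import re

# Non-greedy match of everything between <![CDATA[ and ]]>.
# re.DOTALL lets "." match newlines, since each section spans several lines.
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)

text = """<Comment Operand=".0">
<![CDATA[System
Always OFF
Bit
]]>
</Comment>
<Comment Operand=".1">
<![CDATA[System
Always ON
Bit
]]>
</Comment>"""

# Collapse the line breaks inside each section into single spaces.
sections = [" ".join(body.split()) for body in CDATA_RE.findall(text)]
print(sections)
```

Note that this only recovers the text. It does not say which Tag Name or Operand each section belongs to, which is where a parser is still the easier tool.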

Python has pip built in on all newer versions,
which makes it simple to install third-party libraries.
pip install beautifulsoup4 lxml
That's all; then you have BS and lxml.

I have a two-part tutorial where I use BS and lxml for parsing:
part1, part2.
#6
(Sep-10-2017, 03:15 PM)aj347 Wrote: Question, XML file is still basically a text file, isn't it?

I was hoping for a way that doesn't add more challenges (as in add-ons).

XML is just a plain text file. However, that also means there are no requirements about where whitespace goes. If you try doing it yourself, using the method you've been using (fixed character positions on each line), then you'll never get the data if the source changes to look like this:
<Tag
  Name="C0000"
  TagType="Base"
  DataType="DINT"
  Radix="Decimal"
  Constant="false"
  ExternalAccess="Read/Write">
<Description>
<![CDATA[
System Signals
]]>
</Description>
</Tag>
An XML parser would be able to handle that just fine, though. If you just want to get something done fast, then using a module to do most of the work for you is almost always going to be the better option.
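A quick sketch of that difference, using the reformatted snippet above. The standard library's xml.etree.ElementTree stands in for "an XML parser" here, and the [11:16] slice is the one from the original script.

```python
import xml.etree.ElementTree as ET

# Same data as before, but with one attribute per line.
REFORMATTED = """<Tag
  Name="C0000"
  TagType="Base"
  DataType="DINT"
  Radix="Decimal"
  Constant="false"
  ExternalAccess="Read/Write">
<Description>
<![CDATA[
System Signals
]]>
</Description>
</Tag>"""

# Line-based slicing: the first line is now just "<Tag", so the slice is empty.
first_line = REFORMATTED.split("\n")[0]
sliced = first_line[11:16]

# Parser-based lookup: the layout of the attributes does not matter.
tag = ET.fromstring(REFORMATTED)
name = tag.get("Name")
description = " ".join(tag.findtext("Description").split())

print(repr(sliced), name, description)
```

The slice comes back empty while the parser still finds the name and the description, which is the failure mode described above.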