splitting or parsing control characters

splitting or parsing control characters - Skaperen - May-25-2017

i would like to do a split on a string (including bytes) such that it splits control characters as classes. for example if a string is 'foo\nbar' the split result would be ['foo','\n','bar']. note that i want to keep the splitting character. but 2 or more of the same splitting character should be left together so b'\x00\x01\x01\x02' would result in [b'\x00',b'\x01\x01',b'\x02']. any ideas?

RE: splitting or parsing control characters - snippsat - May-25-2017

You can use the string method partition().

>>> s = 'foo\nbar'
>>> s.partition('\n')
('foo', '\n', 'bar')
>>> b = b'\x00\x01\x01\x02'
>>> b.partition(b'\x01\x01')
(b'\x00', b'\x01\x01', b'\x02')

RE: splitting or parsing control characters - Skaperen - May-25-2017

not happening as expected in some more involved cases:

Output:
lt1/forums /home/forums 9> py3
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 'foo\nbar'.partition('\n')
('foo', '\n', 'bar')
>>> 'foo\nbar\nxyzzy'.partition('\n')
('foo', '\n', 'bar\nxyzzy')
>>> 'foo\nbar'.partition('\a\b\t\n\v\f\r')
('foo\nbar', '', '')
>>> 
lt1/forums /home/forums 10>

tuples are OK

in the 2nd case i am expecting ('foo', '\n', 'bar', '\n', 'xyzzy')

in the 3rd case i am expecting ('foo', '\n', 'bar')

basically i need something that works like .split() but can handle any of many characters where splitting happens. i want to be splitting on any character from '\x00' to '\x1f' inclusive.

cs = ''.join([chr(c) for c in range(32)])
bcs = bytes(range(32))

would define the set which will usually be the above, then splitting on that basis:

    b = 'one\ttwo\nthree\tfour\ffive'
    cntrlsplit(b,cs)

-> ['one','\t','two','\n','three','\t','four','\f','five']

does this make sense yet?

RE: splitting or parsing control characters - zivoni - May-25-2017

You can do it with re. There is re.split, which can be used with a capture group; unfortunately I don't think it is possible to do more advanced matching with it.

Output:
In [11]: import re
    ...: cs = "".join(chr(c) for c in range(32))
    ...: text = "foo\n\n\tbar\nboo\r\r\tfive"
    ...: re.split("([" + cs + "]+)", text)
    ...: 
Out[11]: ['foo', '\n\n\t', 'bar', '\n', 'boo', '\r\r\t', 'five']

It would be necessary to do additional processing to split the separators into groups (or you could abandon "+", split on a single control char and then join consecutive identical separators ...).
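The second option (split on a single control character, then join equal neighbours) might look roughly like the sketch below. It is only an illustration, not code from this thread: the helper name `split_then_join` is made up here, and `re.escape()` is added as a precaution in case the separator set ever contains regex metacharacters. It assumes a non-empty separator set.

```python
import re

def split_then_join(text, seps):
    """Split on single separator chars, then merge equal neighbours (hypothetical helper)."""
    parts = []
    for piece in re.split('([' + re.escape(seps) + '])', text):
        if not piece:
            continue  # re.split yields '' between adjacent separators
        if parts and piece in seps and parts[-1][0] == piece:
            parts[-1] += piece  # same separator again: extend the current run
        else:
            parts.append(piece)
    return parts

cs = ''.join(chr(c) for c in range(32))
print(split_then_join('foo\n\n\tbar\nboo\r\r\tfive', cs))
# ['foo', '\n\n', '\t', 'bar', '\n', 'boo', '\r\r', '\t', 'five']
```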

But you can do it directly with re.findall() by searching for either a run of non-separator chars or a run of the same separator char:

Output:
In [15]: pat = r'([^{cs}]+)|(([{cs}])\3*)'.format(cs=cs)
In [16]: re.findall(pat, text)
Out[16]: 
[('foo', '', ''),
 ('', '\n\n', '\n'),
 ('', '\t', '\t'),
 ('bar', '', ''),
 ('', '\n', '\n'),
 ('boo', '', ''),
 ('', '\r\r', '\r'),
 ('', '\t', '\t'),
 ('five', '', '')]

The only thing left to do is take the non-empty string from the first two items of each tuple.

Output:
In [20]: text
Out[20]: 'foo\n\n\tbar\nboo\r\r\tfive'

In [21]: [[item, sep][item==''] for item, sep, _ in re.findall(pat, text)]
Out[21]: ['foo', '\n\n', '\t', 'bar', '\n', 'boo', '\r\r', '\t', 'five']
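Wrapped up as a function, the same idea could look something like the sketch below. The name `cntrlsplit` is borrowed from the earlier post, but the body is only one possible reading of the request. Assumptions made here: the separator set defaults to the characters 0x00 to 0x1f, bytes input gets a bytes pattern, and `re.finditer()` with `group(0)` is used so that the tuples never need unpacking.

```python
import re

def cntrlsplit(data, seps=None):
    """Split str or bytes into text pieces and runs of one repeated separator."""
    is_bytes = isinstance(data, bytes)
    if seps is None:
        # Assumed default: every control character from 0x00 to 0x1f.
        seps = bytes(range(32)) if is_bytes else ''.join(map(chr, range(32)))
    cls = re.escape(seps)
    if is_bytes:
        pattern = b'[^' + cls + b']+|([' + cls + b'])\\1*'
    else:
        pattern = '[^' + cls + ']+|([' + cls + '])\\1*'
    # group(0) is the whole match: either a text piece or a run of one separator.
    return [m.group(0) for m in re.finditer(pattern, data)]

print(cntrlsplit('one\ttwo\nthree\tfour\ffive'))
# ['one', '\t', 'two', '\n', 'three', '\t', 'four', '\x0c', 'five']
print(cntrlsplit(b'\x00\x01\x01\x02'))
# [b'\x00', b'\x01\x01', b'\x02']
```

Note that Python displays '\f' as '\x0c'; it is the same character.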

RE: splitting or parsing control characters - wavic - May-25-2017

Maybe by subclassing str and rewriting the split() method. I don't see how it can be done with less than that. Or with a function which does the same.

RE: splitting or parsing control characters - Skaperen - May-26-2017

(May-25-2017, 11:56 AM)wavic Wrote: Maybe by subclassing str and rewriting the split() method. I don't see how it can be done with less than that. Or with a function which does the same.

right now the only way i can see to do this is making a function that has a loop to step through the string one character at a time, building the desired list, and returning the list when done. very non-pythonic and more like i would do in C. i will look into re.findall as that seems like it might do what i can use.
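For what it's worth, the character-at-a-time loop does not have to be written by hand. A possible sketch, assuming str input only, uses itertools.groupby() to do the stepping. The function name `group_runs` is invented for this example; the key function maps every non-separator character to None, so ordinary text groups together while each separator groups only with copies of itself.

```python
from itertools import groupby

def group_runs(text, seps):
    """Group text into non-separator pieces and runs of one separator (str only)."""
    sepset = set(seps)
    # Key is the character itself for separators, None for everything else.
    key = lambda ch: ch if ch in sepset else None
    return [''.join(run) for _, run in groupby(text, key=key)]

cs = ''.join(chr(c) for c in range(32))
print(group_runs('foo\n\n\tbar', cs))
# ['foo', '\n\n', '\t', 'bar']
```

Iterating over bytes yields integers, so a bytes version would need `bytes(run)` instead of `''.join(run)`.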

RE: splitting or parsing control characters - Skaperen - May-26-2017

i don't understand the pattern used in the re.findall() example

RE: splitting or parsing control characters - zivoni - May-26-2017

For simpler notation, let's consider just the separators \n and \t. Then the pattern used has the form

 pat = r"([^\n\t]+)|(([\n\t])\3*)"

First part - pattern ([^\n\t]+) matches sequence of one or more non-separator characters.

Second part - the pattern (([\n\t])\2*) matches any run of consecutive \n or \t. Group \2 in this pattern is either \n or \t, so the entire pattern (group \1) matches either \n or \t followed by zero or more occurrences of the same character (separator).

When these two patterns are combined (using \3 instead of \2, as the first pattern adds a capturing group), the result matches either consecutive non-separator characters or consecutive identical separators. As this pattern is in some sense "exhaustive" (any character in the processed string belongs to some match), it works as needed.
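To see the group numbering in action, a small experiment along these lines may help. The sample string is made up for the illustration; `re.finditer()` is used so that both the individual groups and the whole match (`group(0)`) can be inspected.

```python
import re

pat = r"([^\n\t]+)|(([\n\t])\3*)"
text = "ab\n\n\tcd"  # sample input chosen for this illustration

for m in re.finditer(pat, text):
    # groups() shows which alternative matched; unused groups are None.
    print(m.groups(), repr(m.group(0)))

pieces = [m.group(0) for m in re.finditer(pat, text)]
print(pieces)
# ['ab', '\n\n', '\t', 'cd']
```

Group 1 is set only for text pieces, while groups 2 and 3 are set only for separator runs (group 3 being the single repeated character).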

RE: splitting or parsing control characters - volcano63 - May-26-2017

cs = ' '
re.findall(r'([^{0}]+)({0}*)'.format(cs), 'Split  this string  and   store   splits')

And the winner is

[('Split', '  '),
 ('this', ' '),
 ('string', '  '),
 ('and', '   '),
 ('store', '   '),
 ('splits', '')]

RE: splitting or parsing control characters - zivoni - May-26-2017

(May-26-2017, 02:00 PM)volcano63 Wrote:
re.findall(r'([^{0}]+)({0}*)'.format(cs), 'Split  this string  and   store   splits')

is only slightly worse than

re.split('([{}]+)'.format(cs), 'Split   this string and   store    splits')

posted before, which gives a list instead of a list of tuples.

Neither of them does what is wanted (splitting on different separators while keeping "runs" of the same separator together) without some additional processing.
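As a rough idea of what that additional processing could be for the re.split() variant: with a single capture group, the captured separator runs sit at the odd indexes of the result, and each run can then be broken into same-character runs. This is a sketch under those assumptions, not a tested solution from the thread; it uses the control-character set and sample text from the earlier posts.

```python
import re
from itertools import groupby

cs = ''.join(chr(c) for c in range(32))
text = 'foo\n\n\tbar\nboo\r\r\tfive'

result = []
for i, piece in enumerate(re.split('([' + re.escape(cs) + ']+)', text)):
    if i % 2:
        # Odd indexes hold captured separator runs such as '\n\n\t';
        # groupby breaks them into runs of one repeated character.
        result.extend(''.join(run) for _, run in groupby(piece))
    elif piece:
        result.append(piece)  # keep non-empty text pieces as they are

print(result)
# ['foo', '\n\n', '\t', 'bar', '\n', 'boo', '\r\r', '\t', 'five']
```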