splitting or parsing control characters

Skaperen · May-25-2017, 02:57 AM

i would like to do a split on a string (including bytes) such that it splits control characters as classes. for example if a string is 'foo\nbar' the split result would be ['foo','\n','bar']. note that i want to keep the splitting character. but 2 or more of the same splitting character should be left together so b'\x00\x01\x01\x02' would result in [b'\x00',b'\x01\x01',b'\x02']. any ideas?

***snippsat*** · May-25-2017, 03:08 AM

You can use the string method partition().

>>> s = 'foo\nbar'
>>> s.partition('\n')
('foo', '\n', 'bar')
>>> b = b'\x00\x01\x01\x02'
>>> b.partition(b'\x01\x01')
(b'\x00', b'\x01\x01', b'\x02')

Skaperen · (This post was last modified: May-25-2017, 07:11 AM by Skaperen.)

not happening as expected in some more involved cases:

lt1/forums /home/forums 9> py3
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 'foo\nbar'.partition('\n')
('foo', '\n', 'bar')
>>> 'foo\nbar\nxyzzy'.partition('\n')
('foo', '\n', 'bar\nxyzzy')
>>> 'foo\nbar'.partition('\a\b\t\n\v\f\r')
('foo\nbar', '', '')
>>> 
lt1/forums /home/forums 10>

tuples are OK

in the 2nd case i am expecting ('foo', '\n', 'bar', '\n', 'xyzzy')

in the 3rd case i am expecting ('foo', '\n', 'bar')

basically i need something that works like .split() but can handle any of many characters where splitting happens. i want to be splitting on any character from '\x00' to '\x1f' inclusive.

cs = ''.join([chr(c) for c in range(32)])
bcs = bytes(range(32))

would define the set which will usually be the above, then splitting on that basis:

    b = 'one\ttwo\nthree\tfour\ffive'
    cntrlsplit(b,cs)

-> ['one','\t','two','\n','three','\t','four','\f','five']

does this make sense yet?

***zivoni*** · (This post was last modified: May-25-2017, 11:55 AM by zivoni.)

You can do it with re. There is re.split, which can be used with a capture group; unfortunately I don't think it can do more advanced matching:

In [11]: import re
    ...: cs = "".join(chr(c) for c in range(32))
    ...: text = "foo\n\n\tbar\nboo\r\r\tfive"
    ...: re.split("([" + cs + "]+)", text)
    ...: 
Out[11]: ['foo', '\n\n\t', 'bar', '\n', 'boo', '\r\r\t', 'five']

It would be necessary to do additional processing to split the separators into groups (or you could abandon the "+", split on only one control char, and then join consecutive separators ...).
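The second option mentioned above (drop the "+", split on single control characters, then join consecutive identical separators) could look roughly like the sketch below. It assumes `cs` and `text` as defined in the session above; the use of itertools.groupby for the joining step is my own choice, not something from the thread.

```python
import re
from itertools import groupby

cs = "".join(chr(c) for c in range(32))
text = "foo\n\n\tbar\nboo\r\r\tfive"

# Split on ONE control character at a time; the capture group keeps it in the result.
# Adjacent separators produce empty strings between them, so drop those.
parts = [p for p in re.split("([" + cs + "])", text) if p]

# Join neighbouring items that are equal. Only repeated separators can be
# adjacent and equal, because text pieces always have a separator between them.
result = ["".join(run) for _, run in groupby(parts)]
print(result)
# ['foo', '\n\n', '\t', 'bar', '\n', 'boo', '\r\r', '\t', 'five']
```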

But you can do it directly with re.findall() by searching for either a run of non-separator chars or a run of the same separator char:

In [15]: pat = r'([^{cs}]+)|(([{cs}])\3*)'.format(cs=cs)
In [16]: re.findall(pat, text)
Out[16]: 
[('foo', '', ''),
 ('', '\n\n', '\n'),
 ('', '\t', '\t'),
 ('bar', '', ''),
 ('', '\n', '\n'),
 ('boo', '', ''),
 ('', '\r\r', '\r'),
 ('', '\t', '\t'),
 ('five', '', '')]

The only thing needed is to take the nonempty string from the first two items of each tuple.

In [20]: text
Out[20]: 'foo\n\n\tbar\nboo\r\r\tfive'

In [21]: [[item, sep][item==''] for item, sep, _ in re.findall(pat, text)]
Out[21]: ['foo', '\n\n', '\t', 'bar', '\n', 'boo', '\r\r', '\t', 'five']
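For reuse, the same idea can be wrapped in a function. The following is only a sketch: the name cntrlsplit comes from the question, but the signature, the use of re.escape and re.finditer, and the bytes handling are assumptions of mine. Taking m.group(0) from each match avoids having to pick the nonempty item out of each tuple. It assumes `seps` is non-empty and of the same type (str or bytes) as `data`.

```python
import re

def cntrlsplit(data, seps):
    """Split data into runs of non-separators and runs of one repeated separator.

    Hypothetical helper: works on str or bytes, as long as seps has the same type.
    """
    cls = re.escape(seps)  # make every separator safe inside a character class
    if isinstance(data, bytes):
        pat = b'[^' + cls + b']+|([' + cls + b'])\\1*'
    else:
        pat = '[^' + cls + ']+|([' + cls + '])\\1*'
    # Every character belongs to exactly one match, so the matches are the pieces.
    return [m.group(0) for m in re.finditer(pat, data)]

cs = ''.join(chr(c) for c in range(32))
bcs = bytes(range(32))

print(cntrlsplit('one\ttwo\nthree\tfour\ffive', cs))
# ['one', '\t', 'two', '\n', 'three', '\t', 'four', '\x0c', 'five']
print(cntrlsplit(b'\x00\x01\x01\x02', bcs))
# [b'\x00', b'\x01\x01', b'\x02']
```

Only one capture group is needed here, so the backreference is \1 rather than \3.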

wavic · May-25-2017, 11:56 AM

Maybe by subclassing str and rewriting the split() method. I don't see how it can be done with less than that. Or a function which does the same.

Skaperen · May-26-2017, 03:06 AM

(May-25-2017, 11:56 AM)wavic Wrote: Maybe by subclassing str and rewriting the split() method. I don't see how it can be done with less than that. Or a function which does the same.

right now the only way i can see to do this is making a function that has a loop to step through the string one character at a time, building the desired list, and returning the list when done. very non-pythonic and more like i would do in C. i will look into re.findall as that seems like it might do what i can use.
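The character-by-character loop described above does not have to be hand-written. As one possible sketch (not something proposed in the thread), itertools.groupby can do the grouping. The function name cntrlsplit_groupby is made up for illustration, and this version handles str only, since iterating over bytes yields ints.

```python
from itertools import groupby

def cntrlsplit_groupby(text, seps):
    """Hypothetical non-regex variant for str input."""
    seps = set(seps)
    # Key is the character itself for separators and None for everything else,
    # so equal consecutive separators group together and ordinary text forms one run.
    key = lambda ch: ch if ch in seps else None
    return [''.join(run) for _, run in groupby(text, key=key)]

cs = ''.join(chr(c) for c in range(32))
print(cntrlsplit_groupby('foo\n\n\tbar', cs))
# ['foo', '\n\n', '\t', 'bar']
```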

Skaperen · May-26-2017, 04:41 AM

i don't understand the pattern used in that re.findall() example

***zivoni*** · May-26-2017, 08:33 AM

For simpler notation let's consider just the separators \n and \t. Then the pattern used has the form

 pat = r"([^\n\t]+)|(([\n\t])\3*)"

First part - pattern ([^\n\t]+) matches a sequence of one or more non-separator characters.

Second part - pattern (([\n\t])\2*), taken on its own, matches a run of one repeated separator: group \2 is either \n or \t, so the entire pattern (group \1) matches either \n or \t followed by zero or more occurrences of that same character.

When these two patterns are combined (using \3 instead of \2, as there is now a capturing group from the first pattern in front), the result matches either consecutive non-separator characters or a run of the same separator. As this pattern is in some sense "exhaustive" (any character in the processed string belongs to some match), it works as needed.
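To see which group captures what, here is a small illustration with the simplified two-separator pattern. The sample text is just an assumption for the demo. Note that re.finditer reports groups that did not take part in a match as None, whereas re.findall shows them as empty strings.

```python
import re

pat = r"([^\n\t]+)|(([\n\t])\3*)"
text = "foo\n\n\tbar"

pieces = []
for m in re.finditer(pat, text):
    # groups() is (non-separator run, separator run, the single separator char)
    print(m.groups())
    pieces.append(m.group(0))  # group(0) is the whole match, whichever branch matched

# Printed tuples:
# ('foo', None, None)
# (None, '\n\n', '\n')
# (None, '\t', '\t')
# ('bar', None, None)
```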

volcano63 · May-26-2017, 02:00 PM

cs = ' '
re.findall(r'([^{0}]+)({0}*)'.format(cs), 'Split  this string  and   store   splits')

And the winner is

[('Split', '  '),
 ('this', ' '),
 ('string', '  '),
 ('and', '   '),
 ('store', '   '),
 ('splits', '')]

***zivoni*** · May-26-2017, 03:07 PM

(May-26-2017, 02:00 PM)volcano63 Wrote:
re.findall(r'([^{0}]+)({0}*)'.format(cs), 'Split  this string  and   store   splits')

is only slightly worse than

re.split('([{}]+)'.format(cs), 'Split   this string and   store    splits')

posted before, which gives a list instead of a list of tuples.

Neither one of them does what is wanted (split different separators while keeping "runs" of the same separator) without some additional processing.
