splitting or parsing control characters

**nilamo** · May-26-2017, 03:16 PM

If we're using regexes, why not just use the whitespace escape character, \s? That seems much easier than matching only newlines and tabs or spaces.

>>> import re
>>> original = "foo\nbar\n\txyzzy"
>>> split = re.findall(r"(\S+)(\s*)", original)
>>> split
[('foo', '\n'), ('bar', '\n\t'), ('xyzzy', '')]
>>> flattened = [sub for pair in split for sub in pair if sub]
>>> flattened
['foo', '\n', 'bar', '\n\t', 'xyzzy']

***zivoni*** · (This post was last modified: May-26-2017, 04:01 PM by zivoni.)

Newlines and tabs or spaces were used only as examples.

I suggest reading the first and third posts, where Skaperen explains what he actually wants: to split a string on different control characters while keeping them, and even to split runs of control characters into groups of identical ones.

Moreover, when all you need is to split on some group of characters (without additional constraints), re.split() is better than re.findall(): there is no need to flatten the result or to worry about what happens when the string starts with a separator.

Output:
>>> original = "foo\nbar\n\txyzzy"
>>> re.split(r"(\s+)", original)
['foo', '\n', 'bar', '\n\t', 'xyzzy']
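To illustrate the point about a leading separator, here is a small sketch comparing the two calls on input that starts with whitespace. The sample string and variable names are my own, not from the thread.

```python
import re

# Hypothetical sample input that starts with a separator.
sample = "\nfoo\nbar"

# findall only reports matches, so the leading newline is silently skipped.
pairs = re.findall(r"(\S+)(\s*)", sample)
found = [sub for pair in pairs for sub in pair if sub]

# split with a capturing group returns every piece, separators included;
# the empty string at index 0 marks the leading separator.
pieces = re.split(r"(\s+)", sample)

print(found)   # ['foo', '\n', 'bar']
print(pieces)  # ['', '\n', 'foo', '\n', 'bar']
```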

Besides using re.findall() with some "smart" pattern, it is possible to do it directly with a monstrous pattern:

cs = ''.join([chr(c) for c in range(32)])  # all 32 control characters
pat = r"[^{}]+|".format(cs) + "+|".join(cs) + "+"
re.findall(pat, text)

But it seems quite dirty: it has 33 different alternatives, one main one and one for each separator.
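A minimal runnable sketch of the same idea follows. It assumes the control characters are spelled as \xNN escapes instead of being embedded literally in the pattern; that spelling and the sample text are choices made here for readability.

```python
import re

# One alternative for runs of non-control characters ...
main = r"[^\x00-\x1f]+"
# ... plus one alternative per control character, each matching a run of
# that single character: \x00+, \x01+, ..., \x1f+
per_char = "|".join(r"\x{:02x}+".format(c) for c in range(32))
pat = main + "|" + per_char

text = "one\x01\x01\x02two\x03three\x04\x05four"
result = re.findall(pat, text)
print(result)
# ['one', '\x01\x01', '\x02', 'two', '\x03', 'three', '\x04', '\x05', 'four']
```

Because the pattern has no capturing groups, findall returns plain strings and no flattening is needed.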

volcano63 · (This post was last modified: May-26-2017, 03:57 PM by volcano63.)

(May-26-2017, 03:07 PM)zivoni Wrote: is only slightly worse than
re.split('([{}]*)'.format(cs), 'Split   this string and   store    splits')
posted before that gives list instead of list of tuples.

Sorry, professor, but split does not give you a "run" of the same separators. I believe the OP asked to preserve each separator with its length.

(May-26-2017, 03:07 PM)zivoni Wrote: Neither one of them does what is wanted (split different separators while keeping "runs" of same separators) without some additional processing.

list(itertools.chain(*re.findall(r'([^{0}]+)({0}*)'.format(cs), 'Split   this   string and   store splits')))

Mission accomplished. Yalla, next question?

Oh, and space was used just as an example

***zivoni*** · (This post was last modified: May-26-2017, 05:00 PM by zivoni.)

(May-26-2017, 03:57 PM)volcano63 Wrote: Mission accomplished. Yalla, next question?

Yes, please. If I understood Skaperen's examples in posts #1 and #3 correctly, for input such as

Output:
>>> text = "one\x01\x01\x02two\x03three\x04\x05four"

he wants output in the form

Output:
>>> [[item, sep][item==''] for item, sep, _ in re.findall(pat, text)]
['one', '\x01\x01', '\x02', 'two', '\x03', 'three', '\x04', '\x05', 'four']
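The pat used in this snippet comes from an earlier post that is not reproduced in this part of the thread. As an assumption only, a pattern with the following three-group shape (non-separator run, separator run, and the single character referenced by the backreference) would be consistent with the three-way unpacking:

```python
import re

# Assumed pattern, not necessarily the one from the earlier post:
#   group 1: a run of non-control characters
#   group 2: a run of one repeated control character
#   group 3: that single control character, repeated via the backreference \3
pat = r"([^\x00-\x1f]+)|(([\x00-\x1f])\3*)"

text = "one\x01\x01\x02two\x03three\x04\x05four"

# Exactly one of item/sep is non-empty for each match, so "or" picks it.
result = [item or sep for item, sep, _ in re.findall(pat, text)]
print(result)
# ['one', '\x01\x01', '\x02', 'two', '\x03', 'three', '\x04', '\x05', 'four']
```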

Somehow I can't see how it can be done with your solution. Indeed, I am not an re expert, and it's possible that I am overlooking something, or I may have misinterpreted Skaperen's intention. Either way, I would like some explanation: either how your solution splits the test text I have just posted, or how I misinterpreted posts #1 and #3.

(May-26-2017, 03:57 PM)volcano63 Wrote: Sorry, professor, but split does not give you "run" of the same separators - I believe OP asked to preserve each separator with its length

Nor do I see how your pattern can give it. Actually, I would like to see an example where

re.findall(r'([^{0}]+)({0}*)'.format(cs), text)

performs "better" than

re.split(r'([{}]*)'.format(cs), text)

in this context (splitting while keeping separators). I believe that neither of these two gives the wanted result, but I might be wrong. Moreover, I think that your findall pattern does not behave too well with multiple different separators.
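One detail worth checking when comparing the two calls: the * quantifier lets the separator group match the empty string, and since Python 3.7 re.split() also splits on empty matches, so results with * depend on the interpreter version. A sketch with + instead, assuming cs is the range of control characters, avoids that dependency:

```python
import re

cs = r"\x00-\x1f"  # assumed definition of the separator class
text = "one\x01\x01\x02two\x03three\x04\x05four"

# '+' requires at least one separator, so the pattern never matches empty.
pieces = re.split(r"([{}]+)".format(cs), text)
print(pieces)
# ['one', '\x01\x01\x02', 'two', '\x03', 'three', '\x04\x05', 'four']
```

Adjacent separators of different kinds still come back as one piece, so this alone does not produce the grouping requested in post #1.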

volcano63 · May-26-2017, 05:07 PM

(May-26-2017, 04:26 PM)zivoni Wrote: Somehow I can't see how it can be done with your solution. Indeed, I am not a big re expert, and it's possible that I am overlooking something, or I may have misinterpreted Skaperen's intention. Either way, I would like some explanation: either how your solution splits the test text I have just posted, or how I misinterpreted posts #1 and #3.

Here is an adjusted solution. It alternately looks for groups of characters that fall outside and inside the control-character range:

text = "one\x01\x01\x02two\x03three\x04\x05four"
sep_range = '\x01-\x20'  # \x20 is decimal 32; '\x32' would be the character '2'
re.findall(r'([^{0}]+)([{0}]*)'.format(sep_range), text)

A range expression is better than

cs = "".join(chr(c) for c in range(32))

And format: well, I am too lazy to write the same string twice.

The result is indeed a list of tuples

[('one', '\x01\x01\x02'), ('two', '\x03'), ('three', '\x04\x05'), ('four', '')]

So itertools.chain flattens the result

[v for v in itertools.chain(('one', '\x01\x01\x02'), ('two', '\x03'), ('three', '\x04\x05'), ('four', '')) if v]
['one', '\x01\x01\x02', 'two', '\x03', 'three', '\x04\x05', 'four']
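The same flattening can be written without spelling out the tuples by hand, for instance with itertools.chain.from_iterable. This is a sketch; the variable names are illustrative, and the \x01-\x20 range is assumed.

```python
import itertools
import re

text = "one\x01\x01\x02two\x03three\x04\x05four"
pairs = re.findall(r"([^\x01-\x20]+)([\x01-\x20]*)", text)

# chain.from_iterable walks each (word, separators) tuple in turn;
# the filter drops the empty separator that follows the last word.
flat = [v for v in itertools.chain.from_iterable(pairs) if v]
print(flat)
# ['one', '\x01\x01\x02', 'two', '\x03', 'three', '\x04\x05', 'four']
```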

I don't count myself as re expert, but I know a trick or two...

***zivoni*** · (This post was last modified: May-26-2017, 05:30 PM by zivoni.)

(May-26-2017, 05:07 PM)volcano63 Wrote: The result is indeed list of tuples
 [('one', '\x01\x01\x02'), ('two', '\x03'), ('three', '\x04\x05'), ('four', '')]
So itertools.chain flattens the result
[v for v in itertools.chain(('one', '\x01\x01\x02'), ('two', '\x03'), ('three', '\x04\x05'), ('four', ''))if v]
['one', '\x01\x01\x02', 'two', '\x03', 'three', '\x04\x05', 'four']
I don't count myself as re expert, but I know a trick or two...

With your modification it seems to give results similar to

re.split('([{}]*)'.format(cs), text)

while being much more complicated.

But note that

Output:
['one', '\x01\x01\x02', 'two', '\x03', 'three', '\x04\x05', 'four']

is not

Output:
 ['one', '\x01\x01', '\x02', 'two', '\x03', 'three', '\x04', '\x05', 'four']

So, according to post #1, I don't consider either of your suggestions, or a simple re.split, a suitable solution.

volcano63 · May-26-2017, 06:11 PM

(May-26-2017, 05:26 PM)zivoni Wrote: ..........
is not
Output:
 ['one', '\x01\x01', '\x02', 'two', '\x03', 'three', '\x04', '\x05', 'four']
So, according to post #1, I don't consider either of your suggestions, or a simple re.split, a suitable solution.

Well, I was not aware of the split option with a capturing group (and I missed that post, obviously), and I did not really read the OP, but in that case

splits = re.split(r'([\x01-\x20]+)', text)
splits
['one', '\x01\x01\x02', 'two', '\x03', 'three', '\x04\x05', 'four']

So far, you are on the right track with split - it just needs a more convoluted little twist with my favorite itertools - not for the faint of heart

list(itertools.chain(*([s] if i % 2 == 0 else 
                      [''.join(g) for _, g in itertools.groupby(s)] 
                      for i, s in enumerate(splits))))

['one', '\x01\x01', '\x02', 'two', '\x03', 'three', '\x04', '\x05', 'four']
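What groupby contributes here may be easier to see on a single separator piece: it yields one group per stretch of identical consecutive characters. A small sketch, with an illustrative variable name:

```python
import itertools

run = "\x01\x01\x02"  # one piece returned by re.split for mixed separators

# groupby yields (key, group) pairs; each group holds consecutive equal
# characters, which are joined back into a string.
groups = ["".join(g) for _, g in itertools.groupby(run)]
print(groups)  # ['\x01\x01', '\x02']
```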

***zivoni*** · (This post was last modified: May-26-2017, 06:49 PM by zivoni.)

You should take care of possible empty strings when processing text starting or ending with separator(s), for example the string from post #1.

Besides that, I think we can agree that both approaches work: re.split() with a simple pattern and additional processing, or re.findall() with a more complicated pattern and less additional processing (or even no processing at all with the "big" pattern).

Personally, I consider the code with re.findall() from post #4 slightly more readable than additional processing after split, but that's just a subjective opinion ...

volcano63 · (This post was last modified: May-26-2017, 09:07 PM by volcano63.)

(May-26-2017, 06:46 PM)zivoni Wrote: You should take care of possible empty strings when processing text starting or ending with separator(s), for example the string from post #1.

Funny, I was thinking about it - just a slight adjustment

list(itertools.chain(*([s] if re.match('[^\x01-\x20]', s)
                     else [''.join(g) for _, g in itertools.groupby(s)] 
                     for s in splits)))

***zivoni*** · May-26-2017, 07:27 PM

(May-26-2017, 06:59 PM)volcano63 Wrote:

list(itertools.chain(*([s] if re.match('[\x01-\x20]', s)
                      [''.join(g) for _, g in itertools.groupby(s)] 
                      for i, s in enumerate(splits))))

It's probably a bit of nitpicking, but I think this adjustment could use some adjusting: adding the forgotten else, switching the "ternary" arguments or negating the condition, adding "if s", and perhaps removing the now obsolete enumerate.
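Putting those adjustments together, one possible version looks like the sketch below. The function name split_keep_runs and the \x01-\x20 separator range are assumptions carried over from the snippets in this thread, not a definitive implementation.

```python
import itertools
import re

SEPARATORS = r"[\x01-\x20]"  # assumed separator class, as in the posts above


def split_keep_runs(text):
    """Split text on separators, keeping each run of identical separators."""
    result = []
    for s in re.split("({}+)".format(SEPARATORS), text):
        if not s:
            # Empty strings appear when text starts or ends with a separator.
            continue
        if re.match(SEPARATORS, s):
            # Separator piece: break it into runs of the same character.
            result.extend("".join(g) for _, g in itertools.groupby(s))
        else:
            result.append(s)
    return result


print(split_keep_runs("\x01\x01one\x01\x02two\x03"))
# ['\x01\x01', 'one', '\x01', '\x02', 'two', '\x03']
```

Writing it as an explicit loop instead of a nested generator keeps the "if s" check and the separator test on separate lines, at the cost of a few more lines.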
