Python Forum
splitting or parsing control characters
#11
If we're using regexes, why not just use the whitespace escape character, \s?  That seems much easier than matching only newlines and tabs or spaces.

>>> import re
>>> original = "foo\nbar\n\txyzzy"
>>> split = re.findall(r"(\S+)(\s*)", original)
>>> split
[('foo', '\n'), ('bar', '\n\t'), ('xyzzy', '')]
>>> flattened = [sub for pair in split for sub in pair if sub]
>>> flattened
['foo', '\n', 'bar', '\n\t', 'xyzzy']
#12
Newlines and tabs or backspaces were used only as examples.

I suggest reading the first and third posts, where Skaperen explains what he wants - to split a string on different control characters while keeping the control characters, and even to split the control characters into groups of identical ones.
Moreover, when all you need is to split on some group of characters (without additional constraints), re.split() is better than re.findall() - there is no need to flatten the result or to worry about what happens when the string starts with a separator.
Output:
>>> original = "foo\nbar\n\txyzzy"
>>> re.split(r"(\s+)", original)
['foo', '\n', 'bar', '\n\t', 'xyzzy']
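To make the point about a leading separator concrete, here is a minimal sketch. The input string is a hypothetical variation of the example above with a newline added at the front; it is not taken from the thread.

```python
import re

text = "\nfoo\nbar\n\txyzzy"  # hypothetical input that starts with a separator

# findall only starts matching at the first non-whitespace run,
# so the leading "\n" never appears in the result.
pairs = re.findall(r"(\S+)(\s*)", text)
flattened = [part for pair in pairs for part in pair if part]

# split with a capturing group keeps every separator; the leading
# separator shows up after an empty first field.
pieces = re.split(r"(\s+)", text)

print(flattened)
print(pieces)
```

Joining `pieces` gives back the original string, while joining `flattened` does not, because the leading newline was skipped.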
Besides using re.findall() with some "smart" pattern, it's possible to do it directly with a monstrous pattern
cs = ''.join([chr(c) for c in range(32)])
pat = r"[^{}]+|".format(cs) + "+|".join(cs) + "+"
re.findall(pat, text)
But it seems quite dirty - it has 33 different patterns, one main and one for each separator.
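For reference, a runnable sketch of the "one alternative per separator" idea above. The test string is borrowed from a later post in this thread; the variable name `result` is only for illustration.

```python
import re

# The 32 ASCII control characters, chr(0) through chr(31).
cs = "".join(chr(c) for c in range(32))

# First alternative: a run of non-control characters.
# Then one alternative per control character: a run of that character.
pat = r"[^{}]+|".format(cs) + "+|".join(cs) + "+"

text = "one\x01\x01\x02two\x03three\x04\x05four"
result = re.findall(pat, text)
print(result)
```

Because each control character gets its own alternative, a run such as `\x01\x01` stays together while a following `\x02` becomes a separate item.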
#13
(May-26-2017, 03:07 PM)zivoni Wrote: is only slightly worse than
re.split('([{}]*)'.format(cs), 'Split   this string and   store    splits')
posted before that gives list instead of list of tuples.

Sorry, professor, but split does not give you "run" of the same separators - I believe OP asked to preserve each separator with its length

(May-26-2017, 03:07 PM)zivoni Wrote: Neither one of them does what is wanted (split different separators while keeping "runs" of same separators) without some additional processing.
list(itertools.chain(*re.findall(r'([^{0}]+)({0}*)'.format(cs), 'Split   this   string and   store splits')))
Mission accomplished. Yalla, next question?

Oh, and space was used just as an example
Test everything in a Python shell (iPython, Azure Notebook, etc.)
  • Someone gave you an advice you liked? Test it - maybe the advice was actually bad.
  • Someone gave you an advice you think is bad? Test it before arguing - maybe it was good.
  • You posted a claim that something you did not test works? Be prepared to eat your hat.
#14
(May-26-2017, 03:57 PM)volcano63 Wrote: Mission accomplished. Yalla, next question?
Yes, please. If I understood Skaperen's examples in his posts #1 and #3 correctly, for input such as
Output:
>>> text = "one\x01\x01\x02two\x03three\x04\x05four"
he wants output in the form
Output:
>>> [[item, sep][item==''] for item, sep, _ in re.findall(pat, text)]
['one', '\x01\x01', '\x02', 'two', '\x03', 'three', '\x04', '\x05', 'four']
Somehow I can't see how it can be done with your solution. Indeed, I am not an re expert and it's possible that I am overlooking something. Or I could have misinterpreted Skaperen's intention. Either way I would like some explanation - either how your solution splits the test text I have just posted or how I misinterpreted posts #1 and #3.
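One possible way to get the output shown above with a single compact pattern is a backreference. This is only a sketch: the pattern below is an assumption for illustration and is not the `pat` from post #4, which is not reproduced in this part of the thread.

```python
import re

text = "one\x01\x01\x02two\x03three\x04\x05four"

# Either a run of non-control characters, or one control character
# followed by repeats of that same character (backreference \1).
token = re.compile(r"[^\x00-\x1f]+|([\x00-\x1f])\1*")

# finditer is used instead of findall because findall would return
# only the captured group, not the whole match.
pieces = [m.group(0) for m in token.finditer(text)]
print(pieces)
```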

(May-26-2017, 03:57 PM)volcano63 Wrote: Sorry, professor, but split does not give you "run" of the same separators - I believe OP asked to preserve each separator with its length
Nor do I see how your pattern can give it. Actually, I would like to see an example where
re.findall(r'([^{0}]+)({0}*)'.format(cs), text)
performs "better" than
re.split(r'([{}]*)'.format(cs), text)
in this context (splitting while keeping separators). I believe that neither of these two gives the wanted result, but I might be wrong. Moreover, I think that your findall pattern does not behave too well with multiple different separators.
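A side note on the quantifier in the split pattern: `*` can match an empty string, and since Python 3.7 re.split() also splits on empty matches, so on current versions a `[...]*` pattern is likely to produce extra empty fields. A sketch with `+` instead, assuming `cs` is a single space as in the quoted example string:

```python
import re

text = "Split   this string and   store    splits"
cs = " "  # assumption: a single space is the only separator

# "+" requires at least one separator character, so the text is
# only split where a separator actually occurs.
pieces = re.split("([{}]+)".format(re.escape(cs)), text)
print(pieces)
```

The re.escape() call is an addition here, used as a precaution in case `cs` contains characters that are special inside a character class.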
#15
(May-26-2017, 04:26 PM)zivoni Wrote: Somehow I can't see how it can be done with your solution. Indeed, I am not a big re expert and it's possible that I am overlooking something. Or I could have misinterpreted Skaperen's intention. Either way I would like some explanation - either how your solution splits the test text I have just posted or how I misinterpreted posts #1 and #3.
Here is an adjusted solution - it alternately looks for groups of characters that fall outside and inside the control character range
text = "one\x01\x01\x02two\x03three\x04\x05four"
sep_range = '\x01-\x20'  # hex 20 is decimal 32 (space); hex 32 would be the character '2'
re.findall(r'([^{0}]+)([{0}]*)'.format(sep_range), text)
A range expression is better than
cs = "".join(chr(c) for c in range(32))
And format - well, I am too lazy to write the same string twice.

The result is indeed a list of tuples
 [('one', '\x01\x01\x02'), ('two', '\x03'), ('three', '\x04\x05'), ('four', '')]
So itertools.chain flattens the result
[v for v in itertools.chain(('one', '\x01\x01\x02'), ('two', '\x03'), ('three', '\x04\x05'), ('four', ''))if v]
['one', '\x01\x01\x02', 'two', '\x03', 'three', '\x04\x05', 'four']
I don't count myself as an re expert, but I know a trick or two...
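The two steps above (findall, then flatten) can be put together as a short runnable sketch. It assumes the separator range was meant to end at `\x20` (decimal 32, the space), as in the later posts of this thread; the names `pairs` and `flat` are only for illustration.

```python
import re
from itertools import chain

text = "one\x01\x01\x02two\x03three\x04\x05four"
sep_range = "\x01-\x20"  # assumption: control characters up to and including space

# Each match is a (text, separators) tuple.
pairs = re.findall("([^{0}]+)([{0}]*)".format(sep_range), text)

# chain.from_iterable flattens the tuples; the filter drops the
# empty separator that follows the last word.
flat = [part for part in chain.from_iterable(pairs) if part]
print(flat)
```

Note that adjacent but different separators, such as `\x01\x01\x02`, still come out as one item with this pattern.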
#16
(May-26-2017, 05:07 PM)volcano63 Wrote: The result is indeed a list of tuples
 [('one', '\x01\x01\x02'), ('two', '\x03'), ('three', '\x04\x05'), ('four', '')]
So itertools.chain flattens the result
[v for v in itertools.chain(('one', '\x01\x01\x02'), ('two', '\x03'), ('three', '\x04\x05'), ('four', ''))if v]
['one', '\x01\x01\x02', 'two', '\x03', 'three', '\x04\x05', 'four']
I don't count myself as an re expert, but I know a trick or two...

With your modification it seems to give similar results to
 re.split('([{}]*)'.format(cs), text)
while being much more complicated.

But note that
Output:
['one', '\x01\x01\x02', 'two', '\x03', 'three', '\x04\x05', 'four']
is not
Output:
['one', '\x01\x01', '\x02', 'two', '\x03', 'three', '\x04', '\x05', 'four']
So according to post #1 I don't consider either of your suggestions or simple re.split a suitable solution.
#17
(May-26-2017, 05:26 PM)zivoni Wrote: ..........
is not
Output:
['one', '\x01\x01', '\x02', 'two', '\x03', 'three', '\x04', '\x05', 'four']
So according to post #1 I don't consider either of your suggestions or simple re.split a suitable solution.
Well, I was not aware of the split option with a capturing group (and I missed that post, obviously), and I did not really read the OP, but in that case

splits = re.split(r'([\x01-\x20]+)', text)
splits
['one', '\x01\x01\x02', 'two', '\x03', 'three', '\x04\x05', 'four']
So far, you are on the right track with split - it just needs a slightly more convoluted twist with my favorite itertools - not for the faint of heart
list(itertools.chain(*([s] if i % 2 == 0 else 
                      [''.join(g) for _, g in itertools.groupby(s)] 
                      for i, s in enumerate(splits))))

['one', '\x01\x01', '\x02', 'two', '\x03', 'three', '\x04', '\x05', 'four']
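The one-liner above is dense, so here is the same idea unrolled into a loop as a sketch. The function name `split_runs` is hypothetical and not from the thread.

```python
import re
from itertools import groupby


def split_runs(text):
    """Split text on characters \\x01-\\x20, keeping each run of
    identical separator characters as its own item."""
    result = []
    for i, piece in enumerate(re.split(r"([\x01-\x20]+)", text)):
        if i % 2 == 0:
            # Even positions hold the text between separator blocks.
            result.append(piece)
        else:
            # Odd positions hold separator blocks; groupby yields one
            # group per run of identical consecutive characters.
            result.extend("".join(run) for _, run in groupby(piece))
    return result


print(split_runs("one\x01\x01\x02two\x03three\x04\x05four"))
```

The even/odd test relies on re.split() with a capturing group always alternating between text and separator, starting with text.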
#18
You should take care of possible empty strings when processing text starting/ending with separator(s), for example the string from post #1.

Besides that, I think we can agree that both re.split() with a simple pattern and additional processing and re.findall() with a more complicated pattern and less additional processing (or even without any processing with a "big" pattern) work.

Personally I consider the code with re.findall() from post #4 slightly more readable than additional processing after split, but that's just a subjective opinion ...
#19
(May-26-2017, 06:46 PM)zivoni Wrote: You should take care of possible empty strings when processing text starting/ending with separator(s), for example the string from post #1.
Funny, I was thinking about it - just a slight adjustment
list(itertools.chain(*([s] if re.match('[^\x01-\x20]', s)
                     else [''.join(g) for _, g in itertools.groupby(s)] 
                     for s in splits)))
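A small sketch comparing the index-based version with the adjusted one on input that starts and ends with a separator. The test string is hypothetical, chosen only to trigger the empty strings; `by_index` and `by_content` are illustrative names.

```python
import re
from itertools import chain, groupby

text = "\x01one\x02\x02\x03two\x04"  # hypothetical: separators at both ends
splits = re.split(r"([\x01-\x20]+)", text)
# splits starts and ends with "" because the text starts and ends
# with a separator.

# Index-based version: the empty strings are passed through.
by_index = list(chain.from_iterable(
    [s] if i % 2 == 0 else ["".join(g) for _, g in groupby(s)]
    for i, s in enumerate(splits)))

# Content-based version: "" does not match, so it goes to groupby,
# which yields nothing for an empty string and so drops it.
by_content = list(chain.from_iterable(
    [s] if re.match(r"[^\x01-\x20]", s) else ["".join(g) for _, g in groupby(s)]
    for s in splits))

print(by_index)
print(by_content)
```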
#20
(May-26-2017, 06:59 PM)volcano63 Wrote:
list(itertools.chain(*([s] if re.match('[\x01-\x20]', s)
                      [''.join(g) for _, g in itertools.groupby(s)] 
                      for i, s in enumerate(splits))))
It's probably a little nitpicking, but I think that this adjustment could use some adjusting - like adding the forgotten else, switching the "ternary" arguments or negating the condition, adding if s, and perhaps removing the now obsolete enumerate.