Python Forum
splitting or parsing control characters
#1
i would like to do a split on a string (including bytes) such that it splits control characters as classes.  for example if a string is 'foo\nbar' the split result would be ['foo','\n','bar'].  note that i want to keep the splitting character.  but 2 or more of the same splitting character should be left together so b'\x00\x01\x01\x02' would result in [b'\x00',b'\x01\x01',b'\x02'].  any ideas?
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
Reply
#2
You can use the string method partition().
>>> s = 'foo\nbar'
>>> s.partition('\n')
('foo', '\n', 'bar')
>>> b = b'\x00\x01\x01\x02'
>>> b.partition(b'\x01\x01')
(b'\x00', b'\x01\x01', b'\x02')
Reply
#3
not happening as expected in some more involved cases:

Output:
lt1/forums /home/forums 9> py3
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 'foo\nbar'.partition('\n')
('foo', '\n', 'bar')
>>> 'foo\nbar\nxyzzy'.partition('\n')
('foo', '\n', 'bar\nxyzzy')
>>> 'foo\nbar'.partition('\a\b\t\n\v\f\r')
('foo\nbar', '', '')
>>>
lt1/forums /home/forums 10>
tuples are OK

in the 2nd case i am expecting ('foo', '\n', 'bar', '\n', 'xyzzy')

in the 3rd case i am expecting ('foo', '\n', 'bar')

basically i need something that works like .split() but can handle any of many characters where splitting happens.  i want to be splitting on any character from '\x00' to '\x1f' inclusive.

cs = ''.join([chr(c) for c in range(32)])
bcs = bytes(range(32))
would define the set of separators, which will usually be the one above.  then splitting on that basis:

    b = 'one\ttwo\nthree\tfour\ffive'
    cntrlsplit(b,cs)
-> ['one','\t','two','\n','three','\t','four','\f','five']

does this make sense yet?
Reply
#4
You can do it with re. re.split can be used with a capture group to keep the separators, but unfortunately I don't think it can do the more advanced matching on its own:
Output:
In [11]: import re
    ...: cs = "".join(chr(c) for c in range(32))
    ...: text = "foo\n\n\tbar\nboo\r\r\tfive"
    ...: re.split("([" + cs + "]+)", text)
    ...:
Out[11]: ['foo', '\n\n\t', 'bar', '\n', 'boo', '\r\r\t', 'five']
It would be necessary to do additional processing to split the separators into groups (or you could drop the "+", split on only one control char at a time and then join consecutive identical separators ...).
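A minimal sketch of that additional processing, assuming str input and the 0x00-0x1f separator set from earlier in the thread. The name split_keep_runs is made up for illustration, and itertools.groupby is just one way to regroup each captured separator run by character:

```python
import re
from itertools import groupby

cs = "".join(chr(c) for c in range(32))
# re.escape keeps the characters literal inside the character class
pattern = re.compile("([" + re.escape(cs) + "]+)")

def split_keep_runs(text):
    """Hypothetical helper: re.split first, then regroup each separator run."""
    parts = []
    for i, piece in enumerate(pattern.split(text)):
        if i % 2 == 0:
            # even indexes hold the text between separators (may be empty)
            if piece:
                parts.append(piece)
        else:
            # odd indexes hold a captured separator run; split it into
            # runs of one repeated character
            parts.extend("".join(group) for _, group in groupby(piece))
    return parts

print(split_keep_runs("foo\n\n\tbar\nboo\r\r\tfive"))
# ['foo', '\n\n', '\t', 'bar', '\n', 'boo', '\r\r', '\t', 'five']
```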

But you can do it directly with re.findall() by searching for either a run of non-separator chars or a run of the same separator char:
Output:
In [15]: pat = r'([^{cs}]+)|(([{cs}])\3*)'.format(cs=cs)

In [16]: re.findall(pat, text)
Out[16]:
[('foo', '', ''),
 ('', '\n\n', '\n'),
 ('', '\t', '\t'),
 ('bar', '', ''),
 ('', '\n', '\n'),
 ('boo', '', ''),
 ('', '\r\r', '\r'),
 ('', '\t', '\t'),
 ('five', '', '')]
The only thing left to do is to take the non-empty string from the first two items of each tuple.
Output:
In [20]: text
Out[20]: 'foo\n\n\tbar\nboo\r\r\tfive'

In [21]: [[item, sep][item==''] for item, sep, _ in re.findall(pat, text)]
Out[21]: ['foo', '\n\n', '\t', 'bar', '\n', 'boo', '\r\r', '\t', 'five']
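The same idea could be wrapped in a function with the cntrlsplit(b, cs) call shape asked for in post #3. This is only a sketch: it swaps re.findall for re.finditer so the whole match can be read with group(0) instead of picking through tuples, and it assumes seps is non-empty and has the same type (str or bytes) as data.

```python
import re

def cntrlsplit(data, seps):
    """Sketch: split data into runs of non-separators and runs of ONE repeated separator."""
    cls = re.escape(seps)  # keep every separator literal inside [...]
    if isinstance(data, bytes):
        pat = b"[^" + cls + b"]+|([" + cls + b"])\\1*"
    else:
        pat = "[^" + cls + "]+|([" + cls + "])\\1*"
    # group(0) is the whole match, whichever alternative produced it
    return [m.group(0) for m in re.finditer(pat, data)]

cs = "".join(chr(c) for c in range(32))
bcs = bytes(range(32))

print(cntrlsplit("one\ttwo\nthree\tfour\ffive", cs))
# ['one', '\t', 'two', '\n', 'three', '\t', 'four', '\x0c', 'five']
print(cntrlsplit(b"\x00\x01\x01\x02", bcs))
# [b'\x00', b'\x01\x01', b'\x02']
```

Only one capturing group is left, so the back-reference is \1 here rather than \3.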
Reply
#5
Maybe subclass str and rewrite the split() method. I don't see how it can be done with less than that. Or write a function which does the same.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#6
(May-25-2017, 11:56 AM)wavic Wrote: Maybe subclass str and rewrite the split() method. I don't see how it can be done with less than that. Or write a function which does the same.
right now the only way i can see to do this is making a function that has a loop to step through the string one character at a time, building the desired list, and returning the list when done.  very non-pythonic and more like i would do in C.  i will look into re.findall as that seems like it might do what i can use.
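For comparison, the character-by-character loop can be handed to itertools.groupby, which is a different technique from the re.findall approach. A rough sketch that works on both str and bytes; the name split_control_runs and the limit parameter are made up for illustration:

```python
from itertools import groupby

def split_control_runs(data, limit=32):
    """Sketch: group consecutive items; every code below `limit` is its own class."""
    is_bytes = isinstance(data, (bytes, bytearray))

    def key(ch):
        # iterating bytes yields ints, iterating str yields 1-char strings
        code = ch if is_bytes else ord(ch)
        # all ordinary characters share the key -1; each control code keys itself
        return code if code < limit else -1

    out = []
    for _, group in groupby(data, key):
        chunk = list(group)
        out.append(bytes(chunk) if is_bytes else "".join(chunk))
    return out

print(split_control_runs("foo\nbar"))            # ['foo', '\n', 'bar']
print(split_control_runs(b"\x00\x01\x01\x02"))   # [b'\x00', b'\x01\x01', b'\x02']
```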
Reply
#7
i don't understand the pattern used in the re.findall() example
Reply
#8
For simpler notation let's consider just the separators \n and \t. Then the pattern used has the form
 pat = r"([^\n\t]+)|(([\n\t])\3*)"
First part - pattern ([^\n\t]+) matches a sequence of one or more non-separator characters.

Second part - pattern (([\n\t])\2*) matches a run of one repeated separator. Group \2 in this pattern is either \n or \t, so the entire pattern (group \1) matches either \n or \t followed by zero or more occurrences of the same character (separator).

When these two patterns are combined (using \3 instead of \2, as the first pattern adds a capturing group in front), the result matches either consecutive non-separator characters or a run of one separator. As this pattern is in some sense "exhaustive" (any character in the processed string belongs to some match), it works as needed.
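A small sketch to see which alternative fires for each match, using re.finditer so that group(1) is None whenever the separator branch matched (re.findall would report '' instead). The sample text is made up:

```python
import re

pat = re.compile(r"([^\n\t]+)|(([\n\t])\3*)")
text = "a\n\n\tb"

for m in pat.finditer(text):
    # group 1 is set only when the non-separator alternative matched
    kind = "text" if m.group(1) is not None else "separator run"
    print(repr(m.group(0)), kind)
# 'a' text
# '\n\n' separator run
# '\t' separator run
# 'b' text
```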
Reply
#9
cs = ' '
re.findall(r'([^{0}]+)({0}*)'.format(cs), 'Split  this string  and   store   splits')
And the winner is
[('Split', '  '),
 ('this', ' '),
 ('string', '  '),
 ('and', '   '),
 ('store', '   '),
 ('splits', '')]
Test everything in a Python shell (iPython, Azure Notebook, etc.)
  • Someone gave you an advice you liked? Test it - maybe the advice was actually bad.
  • Someone gave you an advice you think is bad? Test it before arguing - maybe it was good.
  • You posted a claim that something you did not test works? Be prepared to eat your hat.
Reply
#10
(May-26-2017, 02:00 PM)volcano63 Wrote:
re.findall(r'([^{0}]+)({0}*)'.format(cs), 'Split  this string  and   store   splits')
is only slightly worse than
re.split('([{}]+)'.format(cs), 'Split   this string and   store    splits')
posted before, which gives a list instead of a list of tuples.

Neither one of them does what is wanted (split on different separators while keeping "runs" of the same separator) without some additional processing.
Reply

