Posts: 4,654
Threads: 1,497
Joined: Sep 2016
i would like to split a string (or bytes) on control characters, treating each distinct control character as its own class. for example, if the string is 'foo\nbar' the split result would be ['foo','\n','bar'] . note that i want to keep the splitting character. but 2 or more of the same splitting character in a row should be kept together, so b'\x00\x01\x01\x02' would result in [b'\x00',b'\x01\x01',b'\x02'] . any ideas?
Tradition is peer pressure from dead people
What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
Posts: 7,324
Threads: 123
Joined: Sep 2016
You can use the string method partition() .
>>> s = 'foo\nbar'
>>> s.partition('\n')
('foo', '\n', 'bar')
>>> b = b'\x00\x01\x01\x02'
>>> b.partition(b'\x01\x01')
('\x00', '\x01\x01', '\x02')
Posts: 4,654
Threads: 1,497
Joined: Sep 2016
May-25-2017, 07:11 AM
(This post was last modified: May-25-2017, 07:11 AM by Skaperen.)
not happening as expected in some more involved cases:
Output: lt1/forums /home/forums 9> py3
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 'foo\nbar'.partition('\n')
('foo', '\n', 'bar')
>>> 'foo\nbar\nxyzzy'.partition('\n')
('foo', '\n', 'bar\nxyzzy')
>>> 'foo\nbar'.partition('\a\b\t\n\v\f\r')
('foo\nbar', '', '')
>>>
lt1/forums /home/forums 10>
tuples are OK
in the 2nd case i am expecting ('foo', '\n', 'bar', '\n', 'xyzzy')
in the 3rd case i am expecting ('foo', '\n', 'bar')
basically i need something that works like .split() but can split on any of many characters. i want to split on any character from '\x00' to '\x1f' inclusive.
cs = ''.join([chr(c) for c in range(32)])
bcs = bytes(range(32))
would define the set, which will usually be the above. then splitting on that basis:
b = 'one\ttwo\nthree\tfour\ffive'
cntrlsplit(b,cs) -> ['one','\t','two','\n','three','\t','four','\f','five']
does this make sense yet?
Posts: 331
Threads: 2
Joined: Feb 2017
May-25-2017, 11:55 AM
(This post was last modified: May-25-2017, 11:55 AM by zivoni.)
You can do it with re. There is re.split, which can be used with a capture group; unfortunately I don't think it can do more advanced matching:
Output: In [11]: import re
...: cs = "".join(chr(c) for c in range(32))
...: text = "foo\n\n\tbar\nboo\r\r\tfive"
...: re.split("([" + cs + "]+)", text)
...:
Out[11]: ['foo', '\n\n\t', 'bar', '\n', 'boo', '\r\r\t', 'five']
It would be necessary to do additional processing to split the separators into groups (or you could drop the "+", split on any single control char, and then join consecutive identical separators ...).
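The additional processing mentioned above could look roughly like the sketch below. It is only one possible approach: the function name `split_then_regroup` is made up for illustration, the separator set is assumed to be the 32 control characters, and `itertools.groupby` is used to break each captured separator run into groups of identical characters.

```python
import re
from itertools import groupby

# Assumption: separators are the control characters '\x00' through '\x1f'.
SEP_RUN = re.compile(r"([\x00-\x1f]+)")

def split_then_regroup(text):
    """Split on runs of control characters, then break each run
    into groups of identical characters (hypothetical helper)."""
    result = []
    for index, piece in enumerate(SEP_RUN.split(text)):
        if index % 2:
            # With one capture group, odd positions hold the separator runs.
            result.extend("".join(group) for _, group in groupby(piece))
        elif piece:
            # Even positions hold ordinary text; skip empty edge strings.
            result.append(piece)
    return result

print(split_then_regroup("foo\n\n\tbar\nboo\r\r\tfive"))
```

For the sample text this should print the same list that the re.findall() approach produces further down.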
But you can do it directly with re.findall(), searching for either runs of non-separator chars or runs of the same separator char:
Output: In [15]: pat = r'([^{cs}]+)|(([{cs}])\3*)'.format(cs=cs)
In [16]: re.findall(pat, text)
Out[16]:
[('foo', '', ''),
('', '\n\n', '\n'),
('', '\t', '\t'),
('bar', '', ''),
('', '\n', '\n'),
('boo', '', ''),
('', '\r\r', '\r'),
('', '\t', '\t'),
('five', '', '')]
The only thing left is to take the non-empty string from the first two items of each tuple.
Output: In [20]: text
Out[20]: 'foo\n\n\tbar\nboo\r\r\tfive'
In [21]: [[item, sep][item==''] for item, sep, _ in re.findall(pat, text)]
Out[21]: ['foo', '\n\n', '\t', 'bar', '\n', 'boo', '\r\r', '\t', 'five']
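A possible way to package this for both str and bytes is sketched below. The name `cntrlsplit` comes from the question, but this version is a simplification that takes only the data and hard-codes the separator set to '\x00' through '\x1f' (an assumption). It uses re.finditer() and the whole match, which avoids picking through tuples.

```python
import re

# Either a run of non-control characters, or one control character
# followed by repeats of that same character (backreference \1).
# Assumption: the separator set is fixed to '\x00'..'\x1f'.
_STR_PAT = re.compile(r"[^\x00-\x1f]+|([\x00-\x1f])\1*")
_BYTES_PAT = re.compile(rb"[^\x00-\x1f]+|([\x00-\x1f])\1*")

def cntrlsplit(data):
    """Hypothetical helper: split str or bytes on control characters,
    keeping each separator and keeping runs of one separator together."""
    pattern = _BYTES_PAT if isinstance(data, bytes) else _STR_PAT
    # group(0) is the whole match, so no tuple post-processing is needed.
    return [m.group(0) for m in pattern.finditer(data)]
```

With this sketch, cntrlsplit('foo\nbar\nxyzzy') should give ['foo', '\n', 'bar', '\n', 'xyzzy'] and cntrlsplit(b'\x00\x01\x01\x02') should give [b'\x00', b'\x01\x01', b'\x02'].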
Posts: 2,953
Threads: 48
Joined: Sep 2016
Maybe subclass str and rewrite the split() method. I don't see how it can be done with less than that. Or write a function which does the same.
Posts: 4,654
Threads: 1,497
Joined: Sep 2016
(May-25-2017, 11:56 AM)wavic Wrote: Maybe subclass str and rewrite the split() method. I don't see how it can be done with less than that. Or write a function which does the same.
right now the only way i can see to do this is making a function that has a loop to step through the string one character at a time, building the desired list, and returning the list when done. very non-pythonic and more like what i would do in C. i will look into re.findall as that seems like it might do what i need.
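For comparison, the character-at-a-time loop described here might look something like the following. This is only a sketch: the name `cntrlsplit_loop` and its two-argument signature are assumptions, and it relies on one-element slices so the same code handles str and bytes.

```python
def cntrlsplit_loop(data, separators):
    """Hypothetical loop-based splitter for str or bytes.

    `separators` is a str (or bytes) holding every separator character.
    """
    pieces = []
    start = 0
    for i in range(1, len(data) + 1):
        if i == len(data):
            # End of input always closes the current piece.
            boundary = True
        else:
            # One-element slices keep the str/bytes type of `data`.
            prev, cur = data[i - 1:i], data[i:i + 1]
            # Cut wherever a separator meets a different character.
            boundary = prev != cur and (prev in separators or cur in separators)
        if boundary:
            pieces.append(data[start:i])
            start = i
    return pieces

cs = ''.join(chr(c) for c in range(32))
print(cntrlsplit_loop('one\ttwo\nthree\tfour\ffive', cs))
```

The same function should also accept bytes when the separator set is given as bytes(range(32)).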
Posts: 4,654
Threads: 1,497
Joined: Sep 2016
i don't understand the pattern used in the re.findall() example.
Posts: 331
Threads: 2
Joined: Feb 2017
For simpler notation let's consider just the separators \n and \t . Then the pattern has the form
pat = r"([^\n\t]+)|(([\n\t])\3*)"
First part: the pattern ([^\n\t]+) matches a sequence of one or more non-separator characters.
Second part: the pattern (([\n\t])\2*), taken on its own, matches any run of consecutive \n or consecutive \t . Group \2 in this pattern is either \n or \t , so the entire pattern (group \1) matches either \n or \t followed by zero or more occurrences of the same character (separator).
When these two patterns are combined (using \3 instead of \2, because the first pattern contributes a capturing group), the result matches either consecutive non-separator characters or a run of one repeated separator. As this pattern is in some sense "exhaustive" (any character in the processed string belongs to some match), it works as needed.
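The same two-separator pattern can be written with re.VERBOSE so that each group carries its own comment. This is just an annotated restatement of the pattern above; the sample string is made up for illustration.

```python
import re

pat = re.compile(r"""
    ([^\n\t]+)      # group 1: one or more non-separator characters
    |               # or
    (               # group 2: a whole run of one separator
      ([\n\t])      # group 3: the first separator character of the run
      \3*           # zero or more repeats of exactly that character
    )
""", re.VERBOSE)

# Illustrative input: text, two newlines, a tab, text.
for m in pat.finditer("a\n\n\tb"):
    # group(0) is the full match; groups() shows which alternative matched.
    print(repr(m.group(0)), m.groups())
```

Groups that did not take part in a match show up as None in groups(), whereas re.findall() reports them as empty strings, which is why the tuples in the earlier output contain ''.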
Posts: 566
Threads: 10
Joined: Apr 2017
cs = ' '
re.findall(r'([^{0}]+)({0}*)'.format(cs), 'Split this string and store splits')
And the winner is
[('Split', ' '),
('this', ' '),
('string', ' '),
('and', ' '),
('store', ' '),
('splits', '')]
Test everything in a Python shell (iPython, Azure Notebook, etc.) - Someone gave you an advice you liked? Test it - maybe the advice was actually bad.
- Someone gave you an advice you think is bad? Test it before arguing - maybe it was good.
- You posted a claim that something you did not test works? Be prepared to eat your hat.
Posts: 331
Threads: 2
Joined: Feb 2017
(May-26-2017, 02:00 PM)volcano63 Wrote: re.findall(r'([^{0}]+)({0}*)'.format(cs), 'Split this string and store splits')
is only slightly worse than
re.split('([{}]+)'.format(cs), 'Split this string and store splits')
posted before, which gives a list instead of a list of tuples.
Neither of them does what is wanted (splitting on different separators while keeping "runs" of the same separator together) without some additional processing.