splitting or parsing control characters
+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: splitting or parsing control characters (/thread-3457.html)
splitting or parsing control characters - Skaperen - May-25-2017
i would like to do a split on a string (including bytes) such that it splits control characters as classes. for example, if a string is 'foo\nbar' the split result would be ['foo','\n','bar']. note that i want to keep the splitting character. but 2 or more of the same splitting character should be left together, so b'\x00\x01\x01\x02' would result in [b'\x00',b'\x01\x01',b'\x02']. any ideas?

RE: splitting or parsing control characters - snippsat - May-25-2017
You can use the string method partition().
>>> s = 'foo\nbar'
>>> s.partition('\n')
('foo', '\n', 'bar')
>>> b = b'\x00\x01\x01\x02'
>>> b.partition(b'\x01\x01')
('\x00', '\x01\x01', '\x02')

RE: splitting or parsing control characters - Skaperen - May-25-2017
not happening as expected in some more involved cases: tuples are OK. in the 2nd case i am expecting ('foo', '\n', 'bar', '\n', 'xyzzy'). in the 3rd case i am expecting ('foo', '\n', 'bar'). basically i need something that works like .split() but can handle any of many characters where splitting happens. i want to be splitting on any character from '\x00' to '\x1f' inclusive.
cs = ''.join([chr(c) for c in range(32)])
bcs = bytes(range(32))
would define the set, which will usually be the above, then splitting on that basis:
b = 'one\ttwo\nthree\tfour\ffive'
cntrlsplit(b,cs) -> ['one','\t','two','\n','three','\t','four','\f','five']
does this make sense yet?

RE: splitting or parsing control characters - zivoni - May-25-2017
You can do it with re. There is re.split, which can be used with a capture group. Unfortunately, I don't think it is possible to do more advanced matching with it, so it would be necessary to do additional processing to split the separators into groups (or you could abandon "+", split on any control char and then join consecutive separators ...).
But you can do it directly with re.findall() by searching for either non-separator chars or consecutive strings of the same separator char. The only thing needed is to take the nonempty string from the first two items of each tuple.
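
The re idea described above could be sketched roughly as follows. This is not zivoni's original code. It is a minimal sketch that borrows the hypothetical name cntrlsplit from Skaperen's post, assumes Python 3, and uses re.finditer() with m.group(0) rather than picking the nonempty item out of re.findall() tuples. The bytes branch is an assumption added so the same function covers both inputs asked about.

```python
import re

def cntrlsplit(data, seps):
    """Split data into runs of non-separators and runs of ONE repeated separator.

    data and seps must both be str or both be bytes (hypothetical helper).
    """
    # re.escape keeps characters such as ']' or '\\' from breaking the class.
    esc = re.escape(seps)
    if isinstance(data, bytes):
        pattern = b"([^" + esc + b"]+)|(([" + esc + b"])\\3*)"
    else:
        pattern = "([^" + esc + "]+)|(([" + esc + "])\\3*)"
    # Every character belongs to some match, so group(0) of each match
    # is either a run of ordinary text or a run of one separator.
    return [m.group(0) for m in re.finditer(pattern, data)]

cs = ''.join(chr(c) for c in range(32))   # str control characters
bcs = bytes(range(32))                    # bytes control characters

print(cntrlsplit('one\ttwo\nthree\tfour\ffive', cs))
print(cntrlsplit(b'\x00\x01\x01\x02', bcs))
```

With the inputs from the posts above, this should give the lists Skaperen asked for: single separators kept as their own items, and repeated identical separators kept together.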

RE: splitting or parsing control characters - wavic - May-25-2017
Maybe subclassing str and rewriting the split() method. I don't see how it can be done with less than that. Or a function which does the same.

RE: splitting or parsing control characters - Skaperen - May-26-2017
(May-25-2017, 11:56 AM) wavic Wrote: Maybe subclassing str and rewriting the split() method. I don't see how it can be done with less than that. Or a function which does the same.
right now the only way i can see to do this is making a function that has a loop to step through the string one character at a time, building the desired list, and returning the list when done. very non-pythonic and more like what i would do in C. i will look into re.findall as that seems like it might do what i can use.

RE: splitting or parsing control characters - Skaperen - May-26-2017
i don't understand the pattern used in the re.findall() example.

RE: splitting or parsing control characters - zivoni - May-26-2017
For simpler notation, let's consider just the separators \n and \t. Then the pattern used has the form
pat = r"([^\n\t]+)|(([\n\t])\3*)"
First part: the pattern ([^\n\t]+) matches a sequence of one or more non-separator characters.
Second part: the pattern (([\n\t])\2*) matches any sequence of consecutive \n or consecutive \t. Group \2 in this pattern is either \n or \t, so the entire pattern (group \1) matches either \n or \t followed by zero or more occurrences of the same character (separator).
When these two patterns are combined (using \3 instead of \2, as there is a capturing group from the first pattern), the result matches either consecutive non-separator characters or consecutive identical separators. As this pattern is in some sense "exhaustive" (any character in the processed string belongs to some match), it works as needed.
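
To make the explanation concrete, here is a small illustration of what re.findall() returns for that two-separator pattern and how the nonempty item is picked from each tuple. The sample string and the variable names are made up for the example.

```python
import re

pat = r"([^\n\t]+)|(([\n\t])\3*)"
text = "ab\n\ncd\t\nef"   # made-up sample input

# With three capturing groups, findall returns 3-tuples:
# (group 1: text run, group 2: separator run, group 3: the separator char).
# Groups that did not take part in a match come back as ''.
matches = re.findall(pat, text)
print(matches)
# [('ab', '', ''), ('', '\n\n', '\n'), ('cd', '', ''),
#  ('', '\t', '\t'), ('', '\n', '\n'), ('ef', '', '')]

# Exactly one of the first two items is nonempty in every tuple.
pieces = [word or run for word, run, _char in matches]
print(pieces)   # ['ab', '\n\n', 'cd', '\t', '\n', 'ef']
```

Note that '\t\n' is split into two items because the backreference \3 only repeats the same separator that was just captured.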

RE: splitting or parsing control characters - volcano63 - May-26-2017
cs = ' '
re.findall(r'([^{0}]+)({0}*)'.format(cs), 'Split this string and store splits')
And the winner is
[('Split', ' '), ('this', ' '), ('string', ' '), ('and', ' '), ('store', ' '), ('splits', '')]

RE: splitting or parsing control characters - zivoni - May-26-2017
(May-26-2017, 02:00 PM) volcano63 Wrote:
re.findall(r'([^{0}]+)({0}*)'.format(cs), 'Split this string and store splits')
is only slightly worse than
re.split('([{}]*)'.format(cs), 'Split this string and store splits')
posted before, which gives a list instead of a list of tuples. Neither of them does what is wanted (split on different separators while keeping "runs" of the same separator) without some additional processing.
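
The "additional processing" could look something like the sketch below. It assumes Python 3 and str input, splits on runs of any separator with re.split() and a capture group, and then breaks each separator run into runs of identical characters with itertools.groupby(). The function name split_keep_runs is hypothetical, and the use of groupby is a choice made for this example rather than something proposed in the thread.

```python
import re
from itertools import groupby

def split_keep_runs(s, seps):
    """Split s on any char in seps, keeping runs of the same separator together."""
    # One capture group, so re.split returns text and separator runs
    # alternately: even indexes are text, odd indexes are separator runs.
    parts = re.split("([" + re.escape(seps) + "]+)", s)
    out = []
    for i, part in enumerate(parts):
        if not part:
            continue                      # empty text at the start or end
        if i % 2:                         # mixed separator run, e.g. '\t\t\n'
            out.extend("".join(g) for _, g in groupby(part))
        else:
            out.append(part)
    return out

print(split_keep_runs('Split this  string', ' '))
print(split_keep_runs('a\t\t\nb', '\t\n'))
```

This takes two passes over the separators instead of one, but it avoids the backreference and returns a flat list directly.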