Python Forum
including the white space parts in str.split()




including the white space parts in str.split() - Skaperen - Jun-18-2019

i need to split a string with white space as the delimiter, but i also need to keep the actual white space string that each delimiter matched. what i need is a list such that ''.join() of that list gives back exactly the original string. is there a simple way to do this?


RE: including the white space parts in str.split() - Gribouillis - Jun-19-2019

re.split() does that if you put the separator pattern in a capturing group:
>>> import re
>>> re.split(r"(\s+)", "The quick brown fox  jumps over \tthe lazy dog")
['The', ' ', 'quick', ' ', 'brown', ' ', 'fox', '  ', 'jumps', ' ', 'over', ' \t', 'the', ' ', 'lazy', ' ', 'dog']
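
As a quick check of the round-trip requirement from the original question, here is a minimal sketch. The helper name split_keep_whitespace and the sample string are made up for illustration. It also shows what happens when the string starts or ends with white space: re.split() then returns an empty string at that end of the list, which still joins back correctly.

```python
import re

def split_keep_whitespace(text):
    """Split on runs of white space, keeping each run as its own list item.

    Hypothetical helper name; the capturing group in the pattern is what
    makes re.split() return the separators as well.
    """
    return re.split(r"(\s+)", text)

original = "  The quick\tbrown  fox\n"
parts = split_keep_whitespace(original)
print(parts)

# Joining the pieces reproduces the original string exactly.
print("".join(parts) == original)

# Non-white-space items sit at even indexes, white space runs at odd indexes.
words = parts[0::2]
gaps = parts[1::2]
print(words)
print(gaps)
```

Because the string above begins and ends with white space, the first and last items of parts are empty strings. Code that only wants the words may need to filter those out.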



RE: including the white space parts in str.split() - Skaperen - Jun-19-2019

thank you! i still cannot comprehend how regular expressions work other than a couple of basic things. but i assume you used a raw string so that \s would be passed to re.split() literally.


RE: including the white space parts in str.split() - Gribouillis - Jun-19-2019

Skaperen Wrote: i assume you used a raw string so that \s would be passed to re.split() literally.
Exactly. Regular expressions are written in their own language, which has special characters such as \ or (. In the case of \s it makes no difference, because the Python compiler doesn't interpret \s in string literals (although newer Python versions can emit a warning about such unrecognized escapes). On the other hand, it does interpret other escape sequences, such as \n. See for example
>>> print("\n")


>>> print(r"\n")
\n
>>> print("\s")
\s
>>> print(r"\s")
\s
My advice is to use raw strings by default when one writes regular expressions. One writes them for the re parser, not the python parser.
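
A case where the raw string really matters is \b, which Python reads as a backspace character in a normal string but which means "word boundary" to the re module. The sketch below is only an illustration; the sample text is made up.

```python
import re

normal = "\bcat\b"   # Python turns each \b into a backspace character
raw = r"\bcat\b"     # the backslashes reach the re module untouched

print(len(normal))   # 5 characters: backspace, c, a, t, backspace
print(len(raw))      # 7 characters: \, b, c, a, t, \, b

text = "cat concat cat"

# With the raw string, \b is a word boundary, so only whole words match.
raw_hits = re.findall(raw, text)
print(raw_hits)

# With the normal string, the pattern looks for literal backspace characters.
normal_hits = re.findall(normal, text)
print(normal_hits)
```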


RE: including the white space parts in str.split() - Skaperen - Jun-19-2019

the logic this code needs keeps getting more complex, to the point that it is complicated. it gets a substring to look for in a larger string, starting at one given position and stopping at another. the substring to look for may contain a special single character meant to match a run of one or more white-space characters in the string being searched. since the code then works with what follows, it needs to know where the match ends, which varies with the runs of white-space that were matched. when it matches white-space it must match the full run, not just part of it. however, the given starting and ending positions in the larger string can cut off runs at each end.

as i work on this and try out all the jagged-corner cases, it seems i need to keep changing this all the time.


RE: including the white space parts in str.split() - Gribouillis - Jun-19-2019

Skaperen Wrote: the substring to look for may have a special single character meant to match up with a run of one or more white-space characters

This is what \s+ does in a regular expression. You could perhaps explain the real problem you're working on and the actual data that you need to match.
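
To make that concrete, here is a rough sketch of searching between a start and an end position with a compiled pattern. The sample text and patterns are invented for illustration. It relies on the optional pos and endpos arguments of a compiled pattern's search() method, and on the match object's end() to find where the match stopped.

```python
import re

text = "alpha   beta \t gamma"

# \s+ is greedy, so it takes the whole run of white space between the words.
pattern = re.compile(r"beta\s+gamma")
match = pattern.search(text, 0, len(text))
print(match.start(), match.end())   # where the match begins and ends
print(repr(text[match.end():]))     # whatever follows the match

# endpos makes the search behave as if the string stopped there,
# so a run of white space can be cut short by the end position.
run = re.compile(r"alpha\s+")
full = run.search(text)
cut = run.search(text, 0, 7)
print(repr(full.group()), full.end())
print(repr(cut.group()), cut.end())
```

In the last two lines the same pattern matches three spaces when it sees the whole string, but only two when the end position falls inside the run.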


RE: including the white space parts in str.split() - Skaperen - Jun-20-2019

it is a sequence of characters given in a command argument, for a command meant to act similar to the cut command, that is used to parse each line for output. right now a _ is meant to match a run of white space, while \_ or a _ in quotes just matches an underscore (in each line of input). i should probably make a version of this that somehow uses regular expressions, though i would have to ask others to test it.
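
One possible way to start on a regular expression version is to translate the argument into a pattern. This is only a sketch under assumptions: the function name spec_to_regex and the sample strings are hypothetical, quoted underscores are not handled, and every other character is treated as a literal.

```python
import re

def spec_to_regex(spec):
    """Translate a spec string into a compiled regular expression.

    Assumed rules (taken from the description above, not a tested design):
      _   matches a run of one or more white space characters
      \\_  matches a literal underscore (backslash makes the next char literal)
    Everything else is matched literally. Quoting is not handled here.
    """
    parts = []
    i = 0
    while i < len(spec):
        ch = spec[i]
        if ch == "\\" and i + 1 < len(spec):
            # Escaped character: take the next character literally.
            parts.append(re.escape(spec[i + 1]))
            i += 2
        elif ch == "_":
            # Unescaped underscore: a full run of white space.
            parts.append(r"\s+")
            i += 1
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))

line = "id  total \t count_x 42"
pattern = spec_to_regex(r"total_count\_x")
match = pattern.search(line)
print(repr(match.group()))
print(match.start(), match.end())
print(repr(line[match.end():]))
```

re.escape() keeps characters such as . or * in the spec from being read as regular expression syntax, so only the unescaped _ gets special treatment.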