Posts: 9
Threads: 2
Joined: Sep 2022
First: I didn't understand the moderator's message about 'labels' regarding my last post at all. I went to the link re: posting instructions, but I'm afraid they are not perspicuous to the uninitiated.
My problem: I have a definition text data entry, e.g.,
"Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], direction [course, route]"
I want to split the definitions into string elements of a list by using comma as delimiter to produce:
[ 'route', 'trend', 'way [route, direction]', 'course [direction]', 'direction [course, route]' ]
However, clearly I need the delimiter to ignore commas within square brackets; it has to be a context-sensitive delimiter. Is there a non-painful way to achieve this?
Thank you
Posts: 8
Threads: 3
Joined: Apr 2023
You could use a different delimiter, say a semicolon inside the square brackets. After you have your list, you can replace the semicolon (or whichever delimiter you chose) with a comma. Something like this is what I'm thinking:
initial_text = "Richtung route, trend, way [route; direction], tendency [political etc.], course [direction], direction [course; route]"
split_text = initial_text.split(",")
final_text = [s.replace(";", ",").strip() for s in split_text]
print(final_text)
If you really need to keep commas inside the square brackets, you can use regular expressions to extract the text in brackets, replace the bracketed text with a placeholder string, split the string with a comma delimiter, and finally, replace the placeholder string with the original bracketed text.
import re
text = "Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], " \
       "direction [course, route]"
# Collect every bracketed group (non-greedy, so each [...] is matched separately)
brackets = re.findall(r'\[.*?\]', text)
for sub in brackets:
    # "***" can be any sufficiently unique placeholder string that you are certain won't otherwise appear in your string
    text = text.replace(sub, "***")
split_text = text.split(",")
final_text = []
iter_brackets = iter(brackets)
for s in split_text:
    temp = s.strip()  # strip every element, not only those holding a placeholder
    if "***" in temp:
        # Restore the next bracketed group in its original order
        temp = temp.replace("***", next(iter_brackets))
    final_text.append(temp)
print(final_text)
But if the commas inside the square brackets are not required, the first solution seems more straightforward to me. Hope this helps.
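The placeholder idea can also be wrapped in a reusable function. What follows is only a minimal sketch under some assumptions: the name split_outside_brackets and the NUL-delimited numbered placeholders are my own choices, brackets are assumed not to be nested, and the text is assumed to contain no NUL characters. Numbering the placeholders lets one element hold several bracketed groups without mixing them up.

```python
import re

def split_outside_brackets(text, sep=","):
    """Split text on sep, ignoring separators inside [...] (no nesting)."""
    saved = []

    def stash(match):
        # Remember the bracketed text and leave a numbered placeholder behind.
        saved.append(match.group(0))
        return f"\x00{len(saved) - 1}\x00"

    masked = re.sub(r"\[.*?\]", stash, text)
    parts = [p.strip() for p in masked.split(sep)]
    # Put each bracketed group back where its numbered placeholder sits.
    return [
        re.sub(r"\x00(\d+)\x00", lambda m: saved[int(m.group(1))], p)
        for p in parts
    ]

if __name__ == "__main__":
    print(split_outside_brackets("a [x, y], b [x, y] [p, q], c"))
```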
Posts: 9
Threads: 2
Joined: Sep 2022
Thank you so much! That second solution is very nice.
Posts: 4,780
Threads: 76
Joined: Jan 2018
Here is a way, using re.sub
from itertools import pairwise
import re

def our_split(data):
    level = 0
    position = [-1]

    def _sub(match):
        nonlocal level
        c = match.group(0)
        level += {'[': 1, ']': -1}.get(c, 0)
        if c == ',' and not level:
            position.append(match.start())

    re.sub(r'[\[\],]', _sub, data)
    position.append(len(data))
    return [data[u+1:v] for u, v in pairwise(position)]

if __name__ == '__main__':
    data = "Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], direction [course, route]"
    result = our_split(data)
    print(result)
Output: ['Richtung route', ' trend', ' way [route, direction]', ' tendency [political etc.]', ' course [direction]', ' direction [course, route]']
Posts: 9
Threads: 2
Joined: Sep 2022
Thank you. This seems to rely upon keeping track of a nesting depth (with the 'level' variable), but at my level, I am struggling to make sense of the code.
Posts: 4,780
Threads: 76
Joined: Jan 2018
May-15-2023, 07:51 PM
(This post was last modified: May-15-2023, 07:57 PM by Gribouillis.)
(May-15-2023, 06:58 PM)ZZTurn Wrote: but at my level, I am struggling to make sense of the code.
I'll try to explain how it works. First, I wrote a better version, which may be easier to understand:
import re

def our_split(data):
    depth = 0
    pos = -1
    substrings = []

    def repl(match):
        nonlocal depth, pos
        match match.group(0):
            case ',':
                if depth == 0:
                    substrings.append(
                        data[pos + 1 : (pos := match.start())])
            case '[':
                depth += 1
            case ']':
                depth -= 1

    re.sub(r'[\[\],]', repl, data)
    substrings.append(data[pos + 1:])
    return substrings

if __name__ == '__main__':
    data = "Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], direction [course, route]"
    result = our_split(data)
    print(result)
- The main statement is line 20, the call re.sub(<regular expression>, <function>, <string>). When it reaches this line, Python searches for every occurrence of one of the characters ',' or '[' or ']' in the string, and for every occurrence it executes the function repl(), passing it a «match» object as argument, which essentially contains the character in question and its position in the string.
- For each match, the repl() function checks the character and updates the depth. If the character is [ or ], it increases or decreases the depth. When the character is a comma and the depth is 0, it appends a new substring to the list of substrings created so far; this substring runs from the position of the last matched significant comma to the current comma.
- At line 21, after the re.sub() call, we append a last substring that goes from the last significant comma met to the end of the string, and we return the list of substrings.
- In this code, I used structural pattern matching, which is new in Python 3.10 (the part with match...case...case).
Output: ['Richtung route', ' trend', ' way [route, direction]', ' tendency [political etc.]', ' course [direction]', ' direction [course, route]']
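For readers on a Python older than 3.10, or who prefer to avoid the callback, the same depth-tracking idea can be sketched as a plain character loop. This is only an illustrative variant, not the code posted above: the name split_top_level and the sep parameter are my own assumptions, and it assumes the brackets in the input are balanced.

```python
def split_top_level(data, sep=","):
    """Split data on sep, but only where the bracket depth is zero."""
    depth = 0      # how many '[' are currently open
    start = 0      # index where the current substring begins
    parts = []
    for i, ch in enumerate(data):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == sep and depth == 0:
            # A significant separator: close the current substring.
            parts.append(data[start:i])
            start = i + 1
    # Whatever follows the last significant separator.
    parts.append(data[start:])
    return parts

if __name__ == "__main__":
    print(split_top_level("a, b [c, [d, e]], f"))
```

Because it counts depth rather than matching a single pair of brackets, this version also copes with nested brackets.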
Posts: 9
Threads: 2
Joined: Sep 2022
Thank you, yes, that version is easier to follow.
The main bits I don't follow are the expressions inside the append() calls (though I think I understand what they are doing), e.g., 'data[pos + 1 : (pos := match.start())]' and 'data[pos + 1:]'. Could you explain these?
Thank you
Posts: 4,780
Threads: 76
Joined: Jan 2018
May-15-2023, 09:50 PM
(This post was last modified: May-15-2023, 09:50 PM by Gribouillis.)
(May-15-2023, 09:38 PM)ZZTurn Wrote: data[pos + 1 : (pos := match.start())]
pos is the position of the last encountered significant comma. I don't want this comma in the result, so I start the substring at position pos + 1. The position of the comma in the current match is match.start(). I use this as the end of the substring because in a string slice data[a:b], the character at position b is not included. At the same time, I update the pos variable by using the walrus operator :=. Python evaluates the slice bounds from left to right, so pos + 1 is computed with the old value of pos; afterwards, the new value of pos is the position of the comma in the current match.
(May-15-2023, 09:38 PM)ZZTurn Wrote: data[pos + 1:]
Again, pos is the position of the last significant comma (-1 if no comma). I append the substring that goes from position pos + 1 to the end of the string. This is the slice syntax data[a:].
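The two slices can be tried in isolation. The snippet below is a small sketch with made-up sample data; it uses str.index() in place of match.start() purely for illustration, and it needs Python 3.8 or later for the walrus operator.

```python
data = "ab,cd,ef"
pos = -1  # "no comma seen yet", so the first slice starts at index 0

# Lower bound (pos + 1) is evaluated first, with the old pos;
# then the walrus assigns the index of the next comma to pos.
first = data[pos + 1 : (pos := data.index(",", pos + 1))]
print(first, pos)   # ab 2

second = data[pos + 1 : (pos := data.index(",", pos + 1))]
print(second, pos)  # cd 5

# Open-ended slice: everything after the last comma.
last = data[pos + 1:]
print(last)         # ef
```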
Posts: 9
Threads: 2
Joined: Sep 2022
I see, that's very clever, thank you. I didn't know about the walrus operator, looks like that's fairly new too.
Posts: 4,780
Threads: 76
Joined: Jan 2018
An alternative to using re.sub() is re.finditer(). This introduces an explicit loop instead of a callback function, but the advantage is that the our_split() function becomes a generator, which is cleaner.
import re

def our_split(data):
    depth = 0
    pos = -1
    for match in re.finditer(r'[\[\],]', data):
        match match.group(0):
            case ',':
                if depth == 0:
                    yield data[pos + 1 : (pos := match.start())]
            case '[':
                depth += 1
            case ']':
                depth -= 1
    yield data[pos + 1:]

if __name__ == '__main__':
    data = "Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], direction [course, route]"
    result = list(our_split(data))
    print(result)
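The substrings produced this way still carry leading spaces, and the first one still starts with the headword. A possible post-processing step is sketched below; it starts from the output list shown earlier in the thread, and it assumes (my assumption, not something stated above) that the headword is always the first whitespace-separated word of the entry.

```python
parts = ['Richtung route', ' trend', ' way [route, direction]',
         ' tendency [political etc.]', ' course [direction]',
         ' direction [course, route]']

# Assumption: the headword is a single word at the start of the entry.
headword, first_definition = parts[0].split(maxsplit=1)

# Strip the whitespace left over from splitting on bare commas.
definitions = [first_definition] + [p.strip() for p in parts[1:]]

print(headword)
print(definitions)
```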