Posts: 9
Threads: 2
Joined: Sep 2022
First: I didn't understand the moderator's message about 'labels' regarding my last post at all. I went to the link re: posting instructions, but I'm afraid they are not perspicuous to the uninitiated.
My problem: I have a definition text data entry, e.g.,
"Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], direction [course, route]"
I want to split the definitions into string elements of a list by using comma as delimiter to produce:
[ 'route', 'trend', 'way [route, direction]', 'tendency [political etc.]', 'course [direction]', 'direction [course, route]' ]
However, clearly I need the delimiter to ignore commas within square brackets; it has to be a context-sensitive delimiter. Is there a non-painful way to achieve this?
Thank you
Posts: 8
Threads: 3
Joined: Apr 2023
You could use a different delimiter, say a semicolon inside the square brackets. After you have your list, you can replace the semicolon (or whichever delimiter you chose) with a comma. Something like this is what I'm thinking:
# The bracketed lists use ";" so that a plain split on "," is safe.
initial_text = "Richtung route, trend, way [route; direction], tendency [political etc.], course [direction], direction [course; route]"
split_text = initial_text.split(",")
# Restore the commas inside the brackets and trim surrounding whitespace.
final_text = [s.replace(";", ",").strip() for s in split_text]
print(final_text)
If you really need to keep commas inside the square brackets, you can use regular expressions to extract the text in brackets, replace the bracketed text with a placeholder string, split the string with a comma delimiter, and finally, replace the placeholder string with the original bracketed text.
import re

text = "Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], " \
       "direction [course, route]"
# Collect every bracketed group, then hide each one behind a placeholder.
brackets = re.findall(r'\[.*?\]', text)
for sub in brackets:
    text = text.replace(sub, "***")
split_text = text.split(",")
final_text = []
iter_brackets = iter(brackets)
for s in split_text:
    temp = s
    if s.count("***") != 0:
        # Put the next saved bracketed group back in place of the placeholder.
        temp = s.replace("***", next(iter_brackets)).strip()
    final_text.append(temp)
print(final_text)
But if the commas inside the square brackets are not required, the first solution seems more straightforward to me. Hope this helps.
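When the brackets are never nested, the placeholder round trip above can probably be collapsed into a single re.split() call. The following is only a sketch under that assumption, not a replacement for the code above: the pattern accepts a comma as a delimiter only if no closing bracket can be reached after it without first passing another bracket, that is, only if the comma sits outside the brackets. The function name split_outside_brackets is made up for this illustration.

```python
import re

def split_outside_brackets(text):
    """Split on commas that are not inside [...] (assumes no nested brackets)."""
    # The negative lookahead rejects a comma when a "]" can be reached
    # without first passing a "[" or "]", i.e. when the comma is inside brackets.
    # The \s* also swallows the whitespace that follows each delimiter.
    return re.split(r',\s*(?![^\[\]]*\])', text)

text = ("Richtung route, trend, way [route, direction], tendency [political etc.], "
        "course [direction], direction [course, route]")
parts = split_outside_brackets(text)
print(parts)
```

With unbalanced or nested brackets this pattern would give wrong results, so the placeholder approach or a depth counter is the safer choice for such input.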
Posts: 9
Threads: 2
Joined: Sep 2022
Thank you so much! That second solution is very nice.
(May-15-2023, 03:30 PM)idratherbecoding Wrote: If you really need to keep commas inside the square brackets, you can use regular expressions to extract the text in brackets, replace the bracketed text with a placeholder string, split the string with a comma delimiter, and finally, replace the placeholder string with the original bracketed text.
Posts: 4,783
Threads: 76
Joined: Jan 2018
Here is a way, using re.sub
from itertools import pairwise  # available from Python 3.10
import re

def our_split(data):
    level = 0        # current bracket nesting depth
    position = [-1]  # indices of the top-level commas, starting with a sentinel
    def _sub(match):
        nonlocal level
        c = match.group(0)
        level += {'[': 1, ']': -1}.get(c, 0)
        if c == ',' and not level:
            position.append(match.start())
    # re.sub() is used only to call _sub on every bracket and comma; its result is discarded.
    re.sub(r'[\[\],]', _sub, data)
    position.append(len(data))
    return [data[u + 1:v] for u, v in pairwise(position)]

if __name__ == '__main__':
    data = "Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], direction [course, route]"
    result = our_split(data)
    print(result)
Output: ['Richtung route', ' trend', ' way [route, direction]', ' tendency [political etc.]', ' course [direction]', ' direction [course, route]']
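For readers who find the callback hard to follow, the same depth-counting idea can be sketched with a plain loop over the characters and no regular expression at all. This is an illustrative variant rather than the code posted above: the name split_top_level is invented here, and unlike our_split() it also strips the whitespace around each piece.

```python
def split_top_level(data):
    """Split data on commas that are outside every pair of square brackets."""
    parts = []
    depth = 0   # how many "[" are currently open
    start = 0   # index where the current piece begins
    for i, ch in enumerate(data):
        if ch == '[':
            depth += 1
        elif ch == ']':
            depth -= 1
        elif ch == ',' and depth == 0:
            # A comma at depth 0 is a real delimiter: close the current piece.
            parts.append(data[start:i].strip())
            start = i + 1
    # Whatever follows the last delimiter is the final piece.
    parts.append(data[start:].strip())
    return parts

print(split_top_level("a, b [c, d], e [f [g, h], i]"))
```

Because the counter goes up and down, nested brackets such as [f [g, h], i] are handled as well.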
Posts: 9
Threads: 2
Joined: Sep 2022
Thank you. This seems to rely upon keeping track of a nesting depth (with the 'level' variable), but at my level, I am struggling to make sense of the code.
(May-15-2023, 05:21 PM)Gribouillis Wrote: Here is a way, using re.sub
Posts: 4,783
Threads: 76
Joined: Jan 2018
May-15-2023, 07:51 PM
(This post was last modified: May-15-2023, 07:57 PM by Gribouillis.)
(May-15-2023, 06:58 PM)ZZTurn Wrote: but at my level, I am struggling to make sense of the code.
I'll try to explain how it works. First, I wrote a better version, which may be more understandable:
import re

def our_split(data):
    depth = 0  # current bracket nesting depth
    pos = -1   # index of the last top-level comma (-1 before the first one)
    substrings = []
    def repl(match):
        nonlocal depth, pos
        match match.group(0):
            case ',':
                if depth == 0:
                    substrings.append(
                        data[pos + 1 : (pos := match.start())])
            case '[':
                depth += 1
            case ']':
                depth -= 1
    re.sub(r'[\[\],]', repl, data)
    substrings.append(data[pos + 1:])
    return substrings

if __name__ == '__main__':
    data = "Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], direction [course, route]"
    result = our_split(data)
    print(result)
- The main statement is the call re.sub(<regular expression>, <function>, <string>). When it reaches this call, Python searches for every occurrence of one of the characters ',' or '[' or ']' in the string, and for every occurrence it calls the function repl(), passing it a «match» object as argument, which essentially contains the character in question and its position in the string.
- For each match, the repl() function checks the character and updates the depth. If the character is [ or ], it increases or decreases the depth. When the character is a comma and the depth is 0, it appends a new substring to the list of substrings created so far; this substring runs from just after the last significant comma matched up to, but not including, the current comma.
- After the re.sub() call, we append a last substring that goes from the last significant comma to the end of the string, and we return the list of substrings.
- In this code, I used structural pattern matching, which is new in Python 3.10 (the part with match...case...case).
Output: ['Richtung route', ' trend', ' way [route, direction]', ' tendency [political etc.]', ' course [direction]', ' direction [course, route]']
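To see the callback mechanism of re.sub() in isolation, here is a small sketch that only records what each match object reports. The names record and seen and the sample string are made up for this illustration; the callback returns the matched text so that the string comes back unchanged.

```python
import re

seen = []  # (character, position) for every match, in order

def record(match):
    # match.group(0) is the matched character, match.start() its index.
    seen.append((match.group(0), match.start()))
    return match.group(0)  # put the same character back

result = re.sub(r'[\[\],]', record, "a, b [c, d]")
print(seen)
print(result)
```

Here seen ends up as [(',', 1), ('[', 5), (',', 7), (']', 10)]: one call per bracket or comma, from left to right, which is exactly the sequence of events that repl() uses to keep its depth counter up to date.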
Posts: 9
Threads: 2
Joined: Sep 2022
Thank you, yes, that version is easier to follow.
The main bits I don't follow are what you have inside the append() calls (though I guess I understand what they are doing), e.g., 'data[pos + 1 : (pos := match.start())]' and 'data[pos + 1:]'. Could you explain these?
Thank you
(May-15-2023, 07:51 PM)Gribouillis Wrote: (May-15-2023, 06:58 PM)ZZTurn Wrote: but at my level, I am struggling to make sense of the code. I'll try to explain how it works. First I wrote a better version which may be more understandable
Posts: 4,783
Threads: 76
Joined: Jan 2018
May-15-2023, 09:50 PM
(This post was last modified: May-15-2023, 09:50 PM by Gribouillis.)
(May-15-2023, 09:38 PM)ZZTurn Wrote: data[pos + 1 : (pos := match.start())]
pos is the position of the last significant comma encountered. I don't want this comma in the result, so I start the substring at position pos + 1. The position of the comma in the current match is match.start(). I use this as the end of the substring because in a string slice data[a:b], the character at position b is not included. At the same time, I update the pos variable with the walrus operator :=, so the new value of pos is the position of the comma in the current match.
(May-15-2023, 09:38 PM)ZZTurn Wrote: data[pos + 1:]
Again, pos is the position of the last significant comma (-1 if there is no comma). I append the substring that goes from position pos + 1 to the end of the string. This is the slice syntax data[a:].
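The two slices can be tried out on their own. In this small sketch the string and the comma positions 2 and 5 are made up for illustration and written as literals in place of match.start(); the walrus operator needs Python 3.8 or later.

```python
data = "ab,cd,ef"   # the commas are at indices 2 and 5
pos = -1            # no comma seen yet

# The lower bound pos + 1 is evaluated first (giving 0), then pos is rebound to 2.
first = data[pos + 1 : (pos := 2)]    # data[0:2]
# Same pattern again: the lower bound is 3, then pos becomes 5.
second = data[pos + 1 : (pos := 5)]   # data[3:5]
# No upper bound: everything after the last comma.
rest = data[pos + 1:]                 # data[6:]

print(first, second, rest)
```

Each slice starts one past the previous comma and stops just before the next one, so neither comma ends up in a piece.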
Posts: 9
Threads: 2
Joined: Sep 2022
I see, that's very clever, thank you. I didn't know about the walrus operator, looks like that's fairly new too.
Posts: 4,783
Threads: 76
Joined: Jan 2018
An alternative to using re.sub() is re.finditer(). This introduces an explicit loop instead of a callback function, but the advantage is that the our_split() function becomes a generator, which is cleaner.
import re

def our_split(data):
    depth = 0  # current bracket nesting depth
    pos = -1   # index of the last top-level comma (-1 before the first one)
    for match in re.finditer(r'[\[\],]', data):
        match match.group(0):
            case ',':
                if depth == 0:
                    yield data[pos + 1 : (pos := match.start())]
            case '[':
                depth += 1
            case ']':
                depth -= 1
    yield data[pos + 1:]

if __name__ == '__main__':
    data = "Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], direction [course, route]"
    result = list(our_split(data))
    print(result)
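The pieces produced this way still carry their leading spaces, and the first one still includes the headword, whereas the list asked for at the start of the thread has neither. One possible clean-up step is sketched below. It assumes that the headword is always the first whitespace-delimited word of the entry, which the thread does not confirm, and the names iter_top_level and parse_entry are invented here. The generator is rewritten with if/elif so that the sketch also runs on Python versions before 3.10.

```python
import re

def iter_top_level(data):
    """Yield the pieces of data separated by commas outside square brackets."""
    depth = 0
    pos = -1
    for m in re.finditer(r'[\[\],]', data):
        ch = m.group(0)
        if ch == '[':
            depth += 1
        elif ch == ']':
            depth -= 1
        elif depth == 0:  # ch is a comma outside all brackets
            yield data[pos + 1:m.start()]
            pos = m.start()
    yield data[pos + 1:]

def parse_entry(entry):
    """Return (headword, definitions); assumes the headword is the first word."""
    headword, _, rest = entry.partition(' ')
    return headword, [piece.strip() for piece in iter_top_level(rest)]

entry = ("Richtung route, trend, way [route, direction], tendency [political etc.], "
         "course [direction], direction [course, route]")
headword, definitions = parse_entry(entry)
print(headword)
print(definitions)
```

Entries whose headword consists of several words would need a different rule for finding where the definitions start.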