Context-sensitive delimiter

ZZTurn · May-15-2023, 08:19 AM

First: I didn't understand the moderator's message about 'labels' regarding my last post at all. I followed the link to the posting instructions, but I'm afraid they are not clear to the uninitiated.

My problem: I have a definition text data entry, e.g.,

"Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], direction [course, route]"

I want to split the definitions into string elements of a list by using comma as delimiter to produce:

[ 'route', 'trend', 'way [route, direction]', 'tendency [political etc.]', 'course [direction]', 'direction [course, route]' ]

However, clearly I need the delimiter to ignore commas within square brackets; it has to be a context-sensitive delimiter. Is there a non-painful way to achieve this?

Thank you

idratherbecoding · May-15-2023, 03:30 PM

You could use a different delimiter, say a semicolon inside the square brackets. After you have your list, you can replace the semicolon (or whichever delimiter you chose) with a comma. Something like this is what I'm thinking:

initial_text = "Richtung route, trend, way [route; direction], tendency [political etc.], course [direction], direction [course; route]"

split_text = initial_text.split(",")

final_text = [s.replace(";", ",").strip() for s in split_text]

print(final_text)

If you really need to keep commas inside the square brackets, you can use regular expressions to extract the text in brackets, replace the bracketed text with a placeholder string, split the string with a comma delimiter, and finally, replace the placeholder string with the original bracketed text.

import re

text = "Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], " \
           "direction [course, route]"

brackets = re.findall(r'\[.*?\]', text)

for sub in brackets:
    text = text.replace(sub, "***") # "***" can be any sufficiently unique placeholder string that you are certain won't otherwise appear in your string

split_text = text.split(",")

final_text = []

iter_brackets = iter(brackets)
    
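for s in split_text:
    temp = s
    if "***" in s:
        # put the original bracketed text back, in order of appearance
        temp = s.replace("***", next(iter_brackets))
    # strip every element, not only the ones that contained brackets
    final_text.append(temp.strip())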

print(final_text)

But if the commas inside the square brackets are not required, the first solution seems more straightforward to me. Hope this helps.
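
A third option, offered only as a sketch: if the brackets are always balanced and never nested (an assumption, not something the original data guarantees), a single re.split with a negative lookahead can skip the commas that sit inside brackets. The pattern splits on a comma only when the text that follows does not reach a ']' before meeting another bracket.

```python
import re

text = ("Richtung route, trend, way [route, direction], tendency [political etc.], "
        "course [direction], direction [course, route]")

# Split on a comma (plus any following whitespace) unless the text after it
# reaches a closing ']' without meeting another bracket first, which would
# mean the comma is inside a bracket pair.
# Assumption: brackets are balanced and not nested.
parts = re.split(r',\s*(?![^\[\]]*\])', text)

print(parts)
```

With nested brackets this pattern is not reliable, and a depth counter is the safer approach.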

ZZTurn · May-15-2023, 05:08 PM

Thank you so much! That second solution is very nice.

**Gribouillis** · May-15-2023, 05:21 PM

Here is a way, using re.sub

from itertools import pairwise
import re

def our_split(data):
    level = 0
    position = [-1]

    def _sub(match):
        nonlocal level
        c = match.group(0)
        level += {'[': 1, ']': -1}.get(c, 0)
        if c == ',' and not level:
            position.append(match.start())

    re.sub(r'[\[\],]', _sub, data)
    position.append(len(data))
    return [data[u+1:v] for u, v in pairwise(position)]

if __name__ == '__main__':
    data = "Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], direction [course, route]"

    result = our_split(data)
    print(result)

Output:
['Richtung route', ' trend', ' way [route, direction]', ' tendency [political etc.]', ' course [direction]', ' direction [course, route]']

ZZTurn · May-15-2023, 06:58 PM

Thank you. This seems to rely upon keeping track of a nesting depth (with the 'level' variable), but at my level, I am struggling to make sense of the code.

**Gribouillis** · (This post was last modified: May-15-2023, 07:57 PM by Gribouillis.)

(May-15-2023, 06:58 PM)ZZTurn Wrote: but at my level, I am struggling to make sense of the code.

I'll try to explain how it works. First I wrote a better version which may be more understandable

import re

def our_split(data):
    depth = 0
    pos = -1
    substrings = []

    def repl(match):
        nonlocal depth, pos
        match match.group(0):
            case ',':
                if depth == 0:
                    substrings.append(
                        data[pos + 1 : (pos := match.start())])
            case '[':
                depth += 1
            case ']':
                depth -= 1

    re.sub(r'[\[\],]', repl, data)
    substrings.append(data[pos + 1:])
    return substrings

if __name__ == '__main__':
    data = "Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], direction [course, route]"

    result = our_split(data)
    print(result)

The main statement is line 20, re.sub(<regular expression>, <function>, <string>). When Python reaches this line, it searches for every occurrence of the characters ',', '[' or ']' in the string, and for each occurrence it calls the function repl(), passing it a «match» object that essentially contains the matched character and its position in the string.
For each match, repl() checks the character and updates the depth. If the character is [ or ], the depth is increased or decreased. If the character is a comma and the depth is 0, repl() appends a new substring to the list of substrings collected so far; that substring runs from just after the last significant comma up to the current comma.
At line 21, we append a final substring that runs from the last significant comma to the end of the string, and we return the list of substrings.
In this code, I used structural pattern matching (the match...case...case part), which is new in Python 3.10.

Output:
['Richtung route', ' trend', ' way [route, direction]', ' tendency [political etc.]', ' course [direction]', ' direction [course, route]']
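
To make the callback mechanism concrete, here is a small sketch that only records what re.sub hands to the function. The names record and seen and the short sample string are made up for this illustration; they are not part of the code above.

```python
import re

seen = []  # (character, position) pairs, in the order re.sub reports them

def record(match):
    # match.group(0) is the matched character, match.start() its index
    seen.append((match.group(0), match.start()))
    return match.group(0)  # return the character unchanged

re.sub(r'[\[\],]', record, "a, b [c, d]")

print(seen)  # [(',', 1), ('[', 5), (',', 7), (']', 10)]
```

The comma at index 7 is reported like any other match; it is the depth counter in our_split() that decides to ignore it.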

ZZTurn · May-15-2023, 09:38 PM

Thank you, yes, that version is easier to follow.

The main bits I don't follow are what you have inside the append() calls (though I guess I understand what they are doing), e.g, 'data[pos + 1 : (pos := match.start())]' and 'data[pos + 1:]'. Could you explain these?

Thank you

**Gribouillis** · (This post was last modified: May-15-2023, 09:50 PM by Gribouillis.)

(May-15-2023, 09:38 PM)ZZTurn Wrote: data[pos + 1 : (pos := match.start())]

pos is the position of the last significant comma encountered. I don't want this comma in the result, so I start the substring at position pos + 1. The position of the comma in the current match is match.start(). I use this as the end of the substring because in a string slice data[a:b], the character at position b is not included. At the same time, I update pos with the walrus operator :=, so its new value is the position of the comma in the current match.

(May-15-2023, 09:38 PM)ZZTurn Wrote: data[pos + 1:]

Again, pos is the position of the last significant comma (-1 if no comma). I append the substring that goes from position pos + 1 to the end of the string. This is the slice syntax data[a:].
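
The two slices can be tried in isolation. This is a minimal sketch with a made-up string and hard-coded comma positions (2 and 5) standing in for match.start(); the walrus operator needs Python 3.8 or later.

```python
data = "ab,cd,ef"   # commas at positions 2 and 5
pos = -1            # "no comma seen yet"

# The start of the slice (pos + 1) is evaluated before the walrus
# expression rebinds pos, so the old value is used for the start
# and the new value for the end.
first = data[pos + 1 : (pos := 2)]   # data[0:2]
second = data[pos + 1 : (pos := 5)]  # data[3:5]
last = data[pos + 1:]                # data[6:], up to the end of the string

print(first, second, last)  # ab cd ef
```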

ZZTurn · May-15-2023, 10:10 PM

I see, that's very clever, thank you. I didn't know about the walrus operator, looks like that's fairly new too.

**Gribouillis** · May-16-2023, 07:31 AM

An alternative to re.sub() is re.finditer(). This introduces an explicit loop instead of a callback function, with the advantage that our_split() becomes a generator, which is cleaner.

import re

def our_split(data):
    depth = 0
    pos = -1

    for match in re.finditer(r'[\[\],]', data):
        match match.group(0):
            case ',':
                if depth == 0:
                    yield data[pos + 1 : (pos := match.start())]
            case '[':
                depth += 1
            case ']':
                depth -= 1

    yield data[pos + 1:]

if __name__ == '__main__':
    data = "Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], direction [course, route]"

    result = list(our_split(data))
    print(result)
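
For readers on a Python older than 3.10, the same depth-counting idea can be sketched without match...case, the walrus operator or a regular expression. The function name split_top_level and its parameters are hypothetical, and the sketch assumes the headword is the first whitespace-delimited token of the entry, which the thread does not state explicitly. It also strips whitespace, so the result matches the list asked for in the first post.

```python
def split_top_level(text, sep=",", opener="[", closer="]"):
    """Split text on sep, ignoring separators nested inside opener/closer."""
    parts = []
    depth = 0   # current bracket nesting depth
    start = 0   # index where the current piece begins
    for i, ch in enumerate(text):
        if ch == opener:
            depth += 1
        elif ch == closer:
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())  # the piece after the last separator
    return parts

entry = ("Richtung route, trend, way [route, direction], tendency [political etc.], "
         "course [direction], direction [course, route]")

# Assumption: the headword is everything before the first space.
headword, _, definitions = entry.partition(" ")
senses = split_top_level(definitions)

print(headword)
print(senses)
```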
