Python Forum
Context-sensitive delimiter
#1
First: I didn't understand the moderator's message about 'labels' regarding my last post at all. I followed the link to the posting instructions, but I'm afraid they are not very clear to the uninitiated.

My problem: I have a definition text data entry, e.g.,

"Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], direction [course, route]"

I want to split the definitions into a list of strings, using the comma as the delimiter, to produce:

[ 'route', 'trend', 'way [route, direction]', 'tendency [political etc.]', 'course [direction]', 'direction [course, route]' ]

However, clearly I need the delimiter to ignore commas within square brackets; it has to be a context-sensitive delimiter. Is there a non-painful way to achieve this?

Thank you
#2
You could use a different delimiter, say a semicolon inside the square brackets. After you have your list, you can replace the semicolon (or whichever delimiter you chose) with a comma. Something like this is what I'm thinking:

initial_text = "Richtung route, trend, way [route; direction], tendency [political etc.], course [direction], direction [course; route]"

split_text = initial_text.split(",")

final_text = [s.replace(";", ",").strip() for s in split_text]

print(final_text)
If you really need to keep commas inside the square brackets, you can use regular expressions to extract the text in brackets, replace the bracketed text with a placeholder string, split the string with a comma delimiter, and finally, replace the placeholder string with the original bracketed text.

import re

text = "Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], " \
           "direction [course, route]"

brackets = re.findall(r'\[.*?\]', text)

for sub in brackets:
    # "***" can be any placeholder string that you are certain won't otherwise appear in your string
    text = text.replace(sub, "***")

split_text = text.split(",")

final_text = []

iter_brackets = iter(brackets)

for s in split_text:
    temp = s.strip()  # strip every piece, not only the ones that contain a placeholder
    while "***" in temp:
        # restore one bracketed text per placeholder, in the original order
        temp = temp.replace("***", next(iter_brackets), 1)
    final_text.append(temp)

print(final_text)
But if the commas inside the square brackets are not required, the first solution seems more straightforward to me. Hope this helps.
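A third option, not taken from the post above, is to let a single regular expression decide which commas count. The sketch below assumes that every '[' has a matching ']' and that brackets are never nested; the function name split_outside_brackets is made up for illustration.

```python
import re

def split_outside_brackets(text):
    # Split on a comma only when the text that follows does NOT reach a ']'
    # before meeting another bracket, i.e. the comma is not inside [...].
    # Assumption: brackets are balanced and never nested.
    parts = re.split(r',(?![^\[\]]*\])', text)
    return [p.strip() for p in parts]

text = ("Richtung route, trend, way [route, direction], tendency [political etc.], "
        "course [direction], direction [course, route]")
print(split_outside_brackets(text))
```

The negative lookahead keeps the placeholder bookkeeping out of the code, but it will not cope with nested brackets; for those, a depth counter is needed.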
#3
Thank you so much! That second solution is very nice.



#4
Here is a way, using re.sub
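from itertools import pairwise  # requires Python 3.10+
import re

def our_split(data):
    level = 0        # current bracket nesting depth
    position = [-1]  # indices of the top-level commas, starting with a sentinel

    def _sub(match):
        nonlocal level
        c = match.group(0)
        level += {'[': 1, ']': -1}.get(c, 0)
        if c == ',' and not level:
            position.append(match.start())

    # re.sub is used only to call _sub on every '[', ']' and ','; its result is discarded
    re.sub(r'[\[\],]', _sub, data)
    position.append(len(data))
    return [data[u+1:v] for u, v in pairwise(position)]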

if __name__ == '__main__':
    data = "Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], direction [course, route]"

    result = our_split(data)
    print(result)
Output:
['Richtung route', ' trend', ' way [route, direction]', ' tendency [political etc.]', ' course [direction]', ' direction [course, route]']
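The same depth-counting idea can also be written as a plain loop over the characters, with no regular expression and no callback, which may be easier to follow. This is only a sketch: the name split_top_level is hypothetical, and stripping the whitespace around each piece is an added assumption about the desired result.

```python
def split_top_level(data):
    depth = 0     # how many '[' are currently open
    start = 0     # index where the current piece begins
    pieces = []
    for i, c in enumerate(data):
        if c == '[':
            depth += 1
        elif c == ']':
            depth -= 1
        elif c == ',' and depth == 0:
            # A comma outside all brackets ends the current piece.
            pieces.append(data[start:i].strip())
            start = i + 1
    pieces.append(data[start:].strip())   # whatever follows the last top-level comma
    return pieces

print(split_top_level("Richtung route, trend, way [route, direction]"))
# ['Richtung route', 'trend', 'way [route, direction]']
```

Because the counter goes up and down, this version also handles nested brackets such as "a [b, [c, d]], e".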
#5
Thank you. This seems to rely upon keeping track of a nesting depth (with the 'level' variable), but at my level, I am struggling to make sense of the code.

#6
(May-15-2023, 06:58 PM)ZZTurn Wrote: but at my level, I am struggling to make sense of the code.
I'll try to explain how it works. First, here is a rewritten version that may be easier to understand:
import re

def our_split(data):
    depth = 0
    pos = -1
    substrings = []

    def repl(match):
        nonlocal depth, pos
        match match.group(0):
            case ',':
                if depth == 0:
                    substrings.append(
                        data[pos + 1 : (pos := match.start())])
            case '[':
                depth += 1
            case ']':
                depth -= 1

    re.sub(r'[\[\],]', repl, data)
    substrings.append(data[pos + 1:])
    return substrings

if __name__ == '__main__':
    data = "Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], direction [course, route]"

    result = our_split(data)
    print(result)
  • The main statement is line 20, the call re.sub(<regular expression>, <function>, <string>). When Python reaches this line, it searches the string for every occurrence of one of the characters ',' or '[' or ']', and for each occurrence it calls repl(), passing it a «match» object that holds, essentially, the character in question and its position in the string.
  • For each match, repl() checks the character. If it is [ or ], the depth is increased or decreased. If it is a comma and the depth is 0, a new substring is appended to the list of substrings built so far; this substring runs from just after the last significant comma up to the current comma.
  • At line 21, we append a last substring that runs from the last significant comma to the end of the string, and then we return the list of substrings.
  • In this code, I used structural pattern matching (the match...case...case part), which is new in Python 3.10.
Output:
['Richtung route', ' trend', ' way [route, direction]', ' tendency [political etc.]', ' course [direction]', ' direction [course, route]']
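To see what re.sub() hands to the callback, here is a small sketch that records each match instead of splitting anything. The names seen and show are made up for illustration. Unlike the code above, which uses re.sub() only for its side effects and ignores what it returns, this callback returns the matched character so that the resulting string is unchanged.

```python
import re

seen = []

def show(match):
    # Record the matched character and the index where it was found.
    seen.append((match.group(0), match.start()))
    return match.group(0)   # put the character back, so the text is unchanged

result = re.sub(r'[\[\],]', show, "a, b [c, d]")
print(seen)     # [(',', 1), ('[', 5), (',', 7), (']', 10)]
print(result)   # a, b [c, d]
```

The callback runs once per match, from left to right, which is why a running depth counter can be kept up to date inside it.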
#7
Thank you, yes, that version is easier to follow.

The main bits I don't follow are the expressions inside the append() calls (though I think I understand what they are doing), e.g., 'data[pos + 1 : (pos := match.start())]' and 'data[pos + 1:]'. Could you explain these?

Thank you






#8
(May-15-2023, 09:38 PM)ZZTurn Wrote: data[pos + 1 : (pos := match.start())]
pos is the position of the last encountered significant comma. I don't want this comma in the result, so I start the substring at position pos + 1. The position of the comma in the current match is match.start(). I use this as the end of the substring because in a string slice data[a:b], the character at position b is not included. In the meantime, I update the pos variable by using the walrus operator :=. So the new value of pos is the position of the comma in the current match.
(May-15-2023, 09:38 PM)ZZTurn Wrote: data[pos + 1:]
Again, pos is the position of the last significant comma (-1 if no comma). I append the substring that goes from position pos + 1 to the end of the string. This is the slice syntax data[a:].
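The two slices can be tried on a tiny string to watch pos change. This is a sketch only: it uses str.index() in place of match.start() to find each comma, and the variable names first, second and last are made up for illustration. The walrus operator requires Python 3.8 or later.

```python
data = "ab,cd,ef"
pos = -1   # no comma seen yet

# The start of the slice (pos + 1 == 0) is computed first; only then does
# the walrus assignment move pos to the index of the first comma (2).
first = data[pos + 1 : (pos := data.index(',', pos + 1))]
print(first, pos)    # ab 2

# Same expression again: start is now 3, and pos moves to the second comma (5).
second = data[pos + 1 : (pos := data.index(',', pos + 1))]
print(second, pos)   # cd 5

# No comma is left, so take everything after the last one: data[6:].
last = data[pos + 1:]
print(last)          # ef
```

The slice works because Python evaluates the start expression before the end expression, so pos + 1 still uses the old value of pos.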
#9
I see, that's very clever, thank you. I didn't know about the walrus operator, looks like that's fairly new too.


#10
An alternative to re.sub() is re.finditer(). This replaces the callback function with an explicit loop, and the advantage is that our_split() becomes a generator, which is cleaner.
import re

def our_split(data):
    depth = 0
    pos = -1

    for match in re.finditer(r'[\[\],]', data):
        match match.group(0):
            case ',':
                if depth == 0:
                    yield data[pos + 1 : (pos := match.start())]
            case '[':
                depth += 1
            case ']':
                depth -= 1

    yield data[pos + 1:]

if __name__ == '__main__':
    data = "Richtung route, trend, way [route, direction], tendency [political etc.], course [direction], direction [course, route]"

    result = list(our_split(data))
    print(result)
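For Python versions older than 3.10, which lack the match statement, the generator might be sketched with if/elif instead. The sketch also shows that a generator yields its pieces one at a time. The name split_lazily is hypothetical, and treating the first space-separated word as the headword to be cut off is an assumption about the data format.

```python
import re

def split_lazily(data):
    depth = 0    # current bracket nesting depth
    pos = -1     # index of the last top-level comma
    for m in re.finditer(r'[\[\],]', data):
        c = m.group(0)
        if c == '[':
            depth += 1
        elif c == ']':
            depth -= 1
        elif depth == 0:
            # c is a comma outside all brackets: emit the piece before it
            yield data[pos + 1 : m.start()]
            pos = m.start()
    yield data[pos + 1:]

gen = split_lazily("Richtung route, trend, way [route, direction]")
first = next(gen)                    # runs only up to the first top-level comma
rest = [s.strip() for s in gen]      # consumes the remaining pieces
headword, _, first_def = first.partition(' ')   # assumed: headword is the first word

print(headword)             # Richtung
print([first_def] + rest)   # ['route', 'trend', 'way [route, direction]']
```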