Regex: Remove all match plus one char before all

***Ofnuts*** · (This post was last modified: Feb-21-2017, 09:22 PM by Ofnuts.)

Warning, aspirin required.

This is quite tricky because you can have anything before |BS|, including another |BS|. Covering your rear with something such as [\|]|BS| isn't general enough, because it prevents backspacing over a |. And in a regexp you can't express something like "not this string"...

So you have to attack from the other end: use a regexp that matches any character followed by a whole sequence of consecutive |BS|. Because matching is greedy, the match always includes the whole sequence of consecutive |BS|, so your initial character cannot itself be part of a |BS|.

Then look at the fine print in the specs of re.sub(): it looks for non-overlapping occurrences of the pattern, so the search for the next match starts after the end of the current match... which is after the end of the sequence of |BS|. So in a sequence of |BS| you will only process one per call to sub().

So in practice, we look for a character followed by a |BS| followed by zero or more other |BS| (captured in a group) and replace that by just that captured group:

import re

pattern=re.compile(r'.\|BS\|((\|BS\|)*)')

def noBS(s):
    print '------------'
    previous=''
    while s!=previous:
        previous=s
        s=re.sub(pattern,r'\1',s)
        print s # this shows that the two sequences of |BS| are processed in parallel 
    return s

print noBS("it |BS||BS||BS|this is one|BS||BS||BS|an example")
print noBS("it |BS||BS||BS| |BS|this is one|BS||BS||BS|an example")
print noBS("it |BS||BS||BS| |BS|this is o n  e|BS||BS||BS||BS||BS||BS|an example")
# The first 'BS|' gets backspaced over due to missing leading '|'... 
print noBS("it BS||BS||BS||BS||BS||BS||BS|this is o n  e|BS||BS||BS||BS||BS||BS|an example")

Output for the last one:

it BS|BS||BS||BS||BS||BS|this is o n  |BS||BS||BS||BS||BS|an example
it B|BS||BS||BS||BS|this is o n |BS||BS||BS||BS|an example
it |BS||BS||BS|this is o n|BS||BS||BS|an example
it|BS||BS|this is o |BS||BS|an example
i|BS|this is o|BS|an example
this is an example

Unfortunately, I don't think you can avoid an explicit iteration.
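
For anyone on Python 3, here is a minimal sketch of the same loop. The names `no_bs` and `trace` are made up for this example, and the optional `trace` list only stands in for the debug print, so you can see that each pass removes one |BS| from every run at once.

```python
import re

# Any character, one |BS|, then the rest of the run captured in group 1.
PATTERN = re.compile(r'.\|BS\|((\|BS\|)*)')

def no_bs(s, trace=None):
    """Apply |BS| markers as backspaces, one per run per pass."""
    previous = None
    while s != previous:
        previous = s
        # Keep only the remaining |BS| of the run; drop the char and one |BS|.
        s = PATTERN.sub(r'\1', s)
        if trace is not None:
            trace.append(s)  # record the string after each pass
    return s

passes = []
print(no_bs("it |BS||BS||BS|this is one|BS||BS||BS|an example", passes))
print(passes[0])  # state after the first pass
```

The last entry in `passes` repeats the final string, because the loop needs one extra pass to notice that nothing changed.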

wavic · (This post was last modified: Feb-21-2017, 10:57 PM by wavic.)

I have to learn regular expressions at last

>>> import re
>>> s = 'it BS|BS||BS||BS||BS||BS|this is o n  |BS||BS||BS||BS||BS|an example'
>>> new_s = s.replace('|BS|', '\b')
>>> new_s
'it BS\x08\x08\x08\x08\x08this is o n  \x08\x08\x08\x08\x08an example'
>>> while '\x08' in new_s:
...     new_s = re.sub('[^\x08]\x08', '', new_s)
...     
>>> new_s
'this is an example'
>>>

Thanks to this
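
One caveat with the loop above: if the string starts with a backspace, `[^\x08]\x08` can never match it, so the `while` never ends. A minimal sketch of a guarded variant, assuming a leading |BS| should simply be dropped (the name `apply_backspaces` is made up for this example):

```python
import re

# One non-backspace character followed by one backspace.
CHAR_THEN_BS = re.compile(r'[^\x08]\x08')

def apply_backspaces(s, token='|BS|'):
    """Turn each token into a real backspace, then erase char/backspace pairs."""
    s = s.replace(token, '\x08')
    while True:
        s, count = CHAR_THEN_BS.subn('', s)
        if count == 0:
            break  # nothing left to erase, even if backspaces remain
    # Backspaces still present have nothing before them: drop them.
    return s.replace('\x08', '')

print(apply_backspaces('|BS|abc'))
print(apply_backspaces('ab|BS||BS||BS|c'))
```

The loop stops as soon as a pass makes no substitution, rather than testing for leftover backspaces, so it terminates on any input.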

Alfalfa · Feb-22-2017, 12:45 AM

Thanks for all your replies. After testing them all, I ended up using buran's code, which is the fastest and works as expected.

buran: 2.6e-06s
wavic: 2.1e-05s (infinite loop if the string begins with |BS|)
ofnuts: 6.3e-05s

pattern = re.compile(r'[\w ]?\|BS\|')
buffer = "it BS|BS||BS||BS||BS||BS|this is o n  |BS||BS||BS||BS||BS|an example"
while True:
   after_sub = pattern.sub('', buffer, count=1)
   if buffer == after_sub:
       break
   else:
       buffer = after_sub
print(buffer)

wavic · Feb-22-2017, 12:54 AM

See the link in my prev. post. Actually this one. As I said, I don't know regular expressions

**buran** · Feb-22-2017, 02:45 AM

Check this one. I think it should speed things up, because each |BS| group and its preceding chars are replaced in one re.sub call.

#!/usr/bin/python3
import re

strings = ['it |BS||BS||BS|this is one|BS||BS||BS|an example',
           'it |BS|this is an example',
           'it |BS||BS|this is an example',
           'it |BS||BS||BS|this is an example',
           'it |BS||BS||BS||BS|this is an example',
           'this one|BS||BS||BS||BS|it |BS||BS||BS||BS|']
ptrn = re.compile(r'(\|BS\|)+')
for string in strings:
   print(string)
   while True:
       match = re.search(ptrn, string)
       if match:
           num_chars = min(match.start(), int(len(match.group())/4))
           sub_pattern = re.compile(r'[\w ]{{{}}}(\|BS\|)+'.format(num_chars))
           string = sub_pattern.sub('', string, count=1)
       else:
           break
   print(string)
   print('\n')

Also note the last test string: it's a border case where a later |BS| group deletes chars preceding an earlier |BS| group.

Alfalfa · (This post was last modified: Feb-22-2017, 04:29 AM by Alfalfa.)

This one is slightly slower; it takes about 4.1e-06s to execute. With both versions I noticed that backspacing stops when it encounters a special char:

input: "it BS|BS||BS||BS||BS||BS|this is one|BS||BS|an example"
outpt: "this is an example"

input: "it BS|BS||BS||BS||BS||BS|this is on.e|BS||BS||BS|an example"
outpt: "this is on.an example"

Actually, I timed the original code, and I am quite amazed to realize it is the fastest, with an average of 1.7e-06s!

**buran** · (This post was last modified: Feb-22-2017, 08:01 AM by buran.)

I don't know how you time it but here is what I get:

import re
import timeit

def alfalfa(input_str=None, n=1000):
   if not input_str:
       string = 'it |BS||BS||BS|this is one|BS||BS||BS|an example'*n
   else:
       string = input_str
   while re.search(r"\|BS\|", string):
       array = list(string)
       for m in re.finditer(r"\|BS\|", string):
           del array[m.start():m.end()]
           if m.start()-1 >= 0:
               del array[m.start()-1]
           string = ''.join(array)
           break
   return string

def buran1(input_str = None, n=1000):
   ptrn = re.compile(r'[\w ]?\|BS\|')
   if not input_str:
       string = 'it |BS||BS||BS|this is one|BS||BS||BS|an example'*n
   else:
       string = input_str
   while True:
       after_sub = ptrn.sub('', string, count=1)
       if string == after_sub:
           break
       else:
           string = after_sub
   return string
   
def buran2(input_str=None, n=1000):
   ptrn = re.compile(r'(\|BS\|)+')
   if not input_str:
       string = 'it |BS||BS||BS|this is one|BS||BS||BS|an example'*n
   else:
       string = input_str
   while True:
       match = re.search(ptrn, string)
       if match:
           num_chars = min(match.start(), int(len(match.group())/4))
           sub_pattern = re.compile(r'[\w ]{{{}}}(\|BS\|)+'.format(num_chars))
           string = sub_pattern.sub('', string, count=1)
       else:
           break
   return string
   
def noBS(s=None, n=1000):
   if not s:
       s = 'it |BS||BS||BS|this is one|BS||BS||BS|an example'*n
   else:
       s = s*n
   pattern=re.compile(r'.\|BS\|((\|BS\|)*)')
   previous=''
   while s!=previous:
       previous=s
       s=re.sub(pattern,r'\1',s) 
   return s

if __name__ == '__main__':
   print 'repeat 1000, short string:\n'
   print 'alfalfa --> {}'.format(timeit.timeit("alfalfa(n=1)", number=1000, setup="from __main__ import alfalfa"))
   print 'buran1 --> {}'.format(timeit.timeit("buran1(n=1)", number=1000, setup="from __main__ import buran1"))
   print 'buran2 --> {}'.format(timeit.timeit("buran2(n=1)", number=1000, setup="from __main__ import buran2"))
   print 'ofnut --> {}'.format(timeit.timeit("noBS(n=1)", number=1000, setup="from __main__ import noBS"))
   print '\nrepeat 1, long string\n'
   print 'alfalfa --> {}'.format(timeit.timeit("alfalfa()", number=1, setup="from __main__ import alfalfa"))
   print 'buran1 --> {}'.format(timeit.timeit("buran1()", number=1, setup="from __main__ import buran1"))
   print 'buran2 --> {}'.format(timeit.timeit("buran2()", number=1, setup="from __main__ import buran2"))
   print 'ofnut --> {}'.format(timeit.timeit("noBS()", number=1, setup="from __main__ import noBS"))

and the result of two consecutive runs:

repeat 1000, short string:

alfalfa --> 0.0432239843385
buran1 --> 0.0112259009714
buran2 --> 0.0158689890339
ofnut --> 0.0273017555023

repeat 1, long string

alfalfa --> 3.50733362241
buran1 --> 1.34837528801
buran2 --> 1.86298544437
ofnut --> 0.0084199068111

repeat 1000, short string:

alfalfa --> 0.0284217156815
buran1 --> 0.00996738901746
buran2 --> 0.0157894500521
ofnut --> 0.0273982927342

repeat 1, long string

alfalfa --> 3.52313333556
buran1 --> 1.35965603239
buran2 --> 1.82195551718
ofnut --> 0.00834742370672

Alfalfa · Feb-22-2017, 02:57 PM

That is strange, I simply used a for loop and took an average, like so:

#!/usr/bin/python3
import re
import time

pattern = re.compile(r'[\w ]?\|BS\|')
buffer = "it |BS||BS||BS|this is one|BS||BS||BS|an example" #|BS| as in Backspace
test=time.time()

for x in range(0,100000):
   while re.search(r"\|BS\|", buffer):
       array = list(buffer)
       for m in re.finditer(r"\|BS\|", buffer):
           del array[m.start():m.end()]
           if m.start()-1 >= 0:
               del array[m.start()-1]
           buffer = ''.join(array)
           break
print(buffer)
print((time.time()-test)/100000)

I thought it might be Python 3 vs 2, although with the example you provided I get results similar to what you just showed.
Anyhow, do you know how to fix the pattern so that it accepts non-alphanumeric chars?
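
On the timing question, one thing that may explain the gap: in the snippet above, `buffer` is reassigned inside the loop, so after the first of the 100000 iterations it no longer contains any |BS| and the remaining iterations only run one failed `re.search`. A minimal sketch that feeds the same untouched input to every run; the names `strip_backspaces` and `average_time` are made up for this example, and no particular timings are implied:

```python
import re
import time

BS = re.compile(r'[\w ]?\|BS\|')  # same pattern as in the snippet above
SAMPLE = "it |BS||BS||BS|this is one|BS||BS||BS|an example"

def strip_backspaces(text):
    """Remove the leftmost |BS| and the char before it until none is left."""
    while True:
        after = BS.sub('', text, count=1)
        if after == text:
            return text
        text = after

def average_time(func, text, runs=100000):
    """Average seconds per call; `text` is never reassigned between runs."""
    start = time.perf_counter()
    for _ in range(runs):
        func(text)
    return (time.perf_counter() - start) / runs

print(strip_backspaces(SAMPLE))
print(average_time(strip_backspaces, SAMPLE, runs=1000))
```

Because `func` gets the original string on every call, each run does the full amount of work, which is also what the `timeit` version does.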

**buran** · Feb-22-2017, 03:30 PM

I think r'[\S\s]?\|BS\|' should work
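
A small sketch comparing the two patterns on a string with a dot in it. The helper `strip_bs` and the sample string are made up for this example; the loop is the same `count=1` loop as earlier in the thread.

```python
import re

def strip_bs(text, pattern):
    """Remove the leftmost |BS| (plus one char before it) until none is left."""
    while True:
        after = pattern.sub('', text, count=1)
        if after == text:
            return text
        text = after

word_only = re.compile(r'[\w ]?\|BS\|')  # letters, digits, underscore, space
any_char = re.compile(r'[\S\s]?\|BS\|')  # any character at all

sample = 'a.b|BS||BS|c'
print(strip_bs(sample, word_only))  # a.c  (the dot survives)
print(strip_bs(sample, any_char))   # ac
```

`[\S\s]` matches any character including a newline, so it behaves like `.` compiled with `re.DOTALL`.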

Alfalfa · Feb-22-2017, 04:01 PM

It seems to work great. Thank you for the extended support.
