Python Forum
Regex: Remove all match plus one char before all
#11
Warning, aspirin required.

This is quite tricky because you can have anything before |BS|, including another |BS|. Covering your rear with something such as [^\|]\|BS\| isn't general enough, because it prevents backspacing over a |, and in a regexp you can't express something like "not this string"...

So, you have to attack at the other end: use a regexp that matches any character followed by a whole sequence of consecutive |BS|. Because the quantifier is greedy, the match always includes the whole sequence of consecutive |BS|, so your initial character cannot itself be part of a |BS|.

Then look at the fine print in the specs of re.sub(): it looks for non-overlapping occurrences of the pattern, so the search for the next match starts after the end of the current match... which is after the end of the sequence of |BS|. So in a sequence of |BS| you will only process one per call to sub().

So in practice, we look for a character followed by a |BS| followed by zero or more other |BS| (captured in a group) and replace that by just that captured group:

import re

pattern = re.compile(r'.\|BS\|((\|BS\|)*)')

def noBS(s):
    print('------------')
    previous = ''
    # Repeat until a pass leaves the string unchanged
    while s != previous:
        previous = s
        s = re.sub(pattern, r'\1', s)
        print(s)  # this shows that the two sequences of |BS| are processed in parallel
    return s

print(noBS("it |BS||BS||BS|this is one|BS||BS||BS|an example"))
print(noBS("it |BS||BS||BS| |BS|this is one|BS||BS||BS|an example"))
print(noBS("it |BS||BS||BS| |BS|this is o n  e|BS||BS||BS||BS||BS||BS|an example"))
# The first 'BS|' gets backspaced over due to missing leading '|'...
print(noBS("it BS||BS||BS||BS||BS||BS||BS|this is o n  e|BS||BS||BS||BS||BS||BS|an example"))
Output for the last one:
Output:
it BS|BS||BS||BS||BS||BS|this is o n  |BS||BS||BS||BS||BS|an example
it B|BS||BS||BS||BS|this is o n |BS||BS||BS||BS|an example
it |BS||BS||BS|this is o n|BS||BS||BS|an example
it|BS||BS|this is o |BS||BS|an example
i|BS|this is o|BS|an example
this is an example
Unfortunately, I don't think you can avoid an explicit iteration.
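To see why one call to re.sub() is not enough, here is a minimal sketch using the same pattern as above. The helper name one_pass and the sample string are made up for illustration. Each pass consumes only the first |BS| of a run and puts the rest back through the captured group.

```python
import re

# Same pattern as in the post: one char, one |BS|, then the rest of the run (captured)
pattern = re.compile(r'.\|BS\|((\|BS\|)*)')

def one_pass(s):
    """Apply the substitution once (hypothetical helper for this sketch)."""
    return pattern.sub(r'\1', s)

text = "abc|BS||BS|"
first = one_pass(text)    # 'c' and one |BS| are gone, one |BS| is left
second = one_pass(first)  # 'b' and the last |BS| are gone

print(first)   # ab|BS|
print(second)  # a
```

A run of n consecutive |BS| therefore needs n passes, which is what the while loop in noBS() takes care of.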
Unless noted otherwise, code in my posts should be understood as "coding suggestions", and its use may require more neurones than the two necessary for Ctrl-C/Ctrl-V.
Your one-stop place for all your GIMP needs: gimp-forum.net
Reply
#12
I have to learn regular expressions at last.

>>> import re
>>> s = 'it BS|BS||BS||BS||BS||BS|this is o n  |BS||BS||BS||BS||BS|an example'
>>> new_s = s.replace('|BS|', '\b')
>>> new_s
'it BS\x08\x08\x08\x08\x08this is o n  \x08\x08\x08\x08\x08an example'
>>> while '\x08' in new_s:
...     new_s = re.sub('[^\x08]\x08', '', new_s)
...     
>>> new_s
'this is an example'
>>> 
Thanks to this
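One caveat with the loop above: if the text starts with a backspace, '[^\x08]\x08' never matches that leading character, so the string stops shrinking and the while never ends. A minimal sketch of one possible guard follows. The function name apply_backspaces and the marker parameter are made up here, and it assumes the input contains no real \x08 characters of its own.

```python
import re

def apply_backspaces(text, marker='|BS|'):
    """Hypothetical helper: turn each marker into \\x08, then delete char+backspace pairs."""
    s = text.replace(marker, '\x08')
    while '\x08' in s:
        # A backspace at the very start has nothing to delete, so drop it
        s = s.lstrip('\x08')
        # Remove every "ordinary char followed by one backspace" pair in this pass
        s = re.sub('[^\x08]\x08', '', s)
    return s

print(apply_backspaces('|BS|abc'))  # abc
print(apply_backspaces('it BS|BS||BS||BS||BS||BS|this is o n  |BS||BS||BS||BS||BS|an example'))
```

After the lstrip, any remaining backspace is preceded by an ordinary character somewhere in its run, so every pass removes at least one pair and the loop terminates.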
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#13
Thanks for all your replies. After testing them all, I ended up using buran's code, which is the fastest and works as expected.

buran: 2.6e-06s
wavic: 2.1e-05s (infinite loop if the string begins with |BS|)
ofnuts: 6.3e-05s

import re

pattern = re.compile(r'[\w ]?\|BS\|')
buffer = "it BS|BS||BS||BS||BS||BS|this is o n  |BS||BS||BS||BS||BS|an example"
while True:
   after_sub = pattern.sub('', buffer, count=1)
   if buffer == after_sub:
       break
   else:
       buffer = after_sub
print(buffer)
Reply
#14
See the link in my previous post. Actually, this one. As I said, I don't know regular expressions.
#15
Check this one. I think it should speed things up, because each |BS| group and its preceding chars are replaced in a single re.sub call.
#!/usr/bin/python3
import re

strings = ['it |BS||BS||BS|this is one|BS||BS||BS|an example',
           'it |BS|this is an example',
           'it |BS||BS|this is an example',
           'it |BS||BS||BS|this is an example',
           'it |BS||BS||BS||BS|this is an example',
           'this one|BS||BS||BS||BS|it |BS||BS||BS||BS|']
ptrn = re.compile(r'(\|BS\|)+')
for string in strings:
   print(string)
   while True:
       match = re.search(ptrn, string)
       if match:
           num_chars = min(match.start(), int(len(match.group())/4))
           sub_pattern = re.compile(r'[\w ]{{{}}}(\|BS\|)+'.format(num_chars))
           string = sub_pattern.sub('', string, count=1)
       else:
           break
   print(string)
   print('\n')
Also note the last test string: it's a border case where a later |BS| group deletes chars that precede the previous |BS| group.
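For cross-checking border cases like that one, a plain character-by-character scan can serve as a reference. This is not the regex approach from the code above, just a stack-style sketch to compare results against. The function name and the marker parameter are made up for illustration.

```python
def apply_backspaces_reference(text, marker='|BS|'):
    """Reference sketch: scan left to right, popping one kept char per marker."""
    kept = []
    i = 0
    while i < len(text):
        if text.startswith(marker, i):
            if kept:                 # nothing to delete at the start of the string
                kept.pop()
            i += len(marker)
        else:
            kept.append(text[i])
            i += 1
    return ''.join(kept)

# The border case: the second group eats "it " and then one char of "this"
print(apply_backspaces_reference('this one|BS||BS||BS||BS|it |BS||BS||BS||BS|'))  # thi
```

Because it makes a single pass over the input, it can also be handy as a baseline when timing the regex versions.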
#16
This one is slightly slower; it takes about 4.1e-06s to execute. With both versions I noticed that it blocks when it encounters a special char:

input:  "it BS|BS||BS||BS||BS||BS|this is one|BS||BS|an example"
output: "this is an example"

input:  "it BS|BS||BS||BS||BS||BS|this is on.e|BS||BS||BS|an example"
output: "this is on.an example"


Actually, I timed the original code, and I am quite amazed to realize it is the fastest, with an average of 1.7e-06s!
Reply
#17
I don't know how you time it but here is what I get:

import re
import timeit

def alfalfa(input_str=None, n=1000):
   if not input_str:
       string = 'it |BS||BS||BS|this is one|BS||BS||BS|an example'*n
   else:
       string = input_str
   while re.search(r"\|BS\|", string):
       array = list(string)
       for m in re.finditer(r"\|BS\|", string):
           del array[m.start():m.end()]
           if m.start()-1 >= 0:
               del array[m.start()-1]
           string = ''.join(array)
           break
   return string

def buran1(input_str = None, n=1000):
   ptrn = re.compile(r'[\w ]?\|BS\|')
   if not input_str:
       string = 'it |BS||BS||BS|this is one|BS||BS||BS|an example'*n
   else:
       string = input_str
   while True:
       after_sub = ptrn.sub('', string, count=1)
       if string == after_sub:
           break
       else:
           string = after_sub
   return string
   
def buran2(input_str=None, n=1000):
   ptrn = re.compile(r'(\|BS\|)+')
   if not input_str:
       string = 'it |BS||BS||BS|this is one|BS||BS||BS|an example'*n
   else:
       string = input_str
   while True:
       match = re.search(ptrn, string)
       if match:
           num_chars = min(match.start(), int(len(match.group())/4))
           sub_pattern = re.compile(r'[\w ]{{{}}}(\|BS\|)+'.format(num_chars))
           string = sub_pattern.sub('', string, count=1)
       else:
           break
   return string
   
def noBS(s=None, n=1000):
   if not s:
       s = 'it |BS||BS||BS|this is one|BS||BS||BS|an example'*n
   else:
       s = s*n
   pattern=re.compile(r'.\|BS\|((\|BS\|)*)')
   previous=''
   while s!=previous:
       previous=s
       s=re.sub(pattern,r'\1',s) 
   return s

if __name__ == '__main__':
   print('repeat 1000, short string:\n')
   print('alfalfa --> {}'.format(timeit.timeit("alfalfa(n=1)", number=1000, setup="from __main__ import alfalfa")))
   print('buran1 --> {}'.format(timeit.timeit("buran1(n=1)", number=1000, setup="from __main__ import buran1")))
   print('buran2 --> {}'.format(timeit.timeit("buran2(n=1)", number=1000, setup="from __main__ import buran2")))
   print('ofnut --> {}'.format(timeit.timeit("noBS(n=1)", number=1000, setup="from __main__ import noBS")))
   print('\nrepeat 1, long string\n')
   print('alfalfa --> {}'.format(timeit.timeit("alfalfa()", number=1, setup="from __main__ import alfalfa")))
   print('buran1 --> {}'.format(timeit.timeit("buran1()", number=1, setup="from __main__ import buran1")))
   print('buran2 --> {}'.format(timeit.timeit("buran2()", number=1, setup="from __main__ import buran2")))
   print('ofnut --> {}'.format(timeit.timeit("noBS()", number=1, setup="from __main__ import noBS")))
and the result of two consecutive runs:

Output:
repeat 1000, short string:

alfalfa --> 0.0432239843385
buran1 --> 0.0112259009714
buran2 --> 0.0158689890339
ofnut --> 0.0273017555023

repeat 1, long string

alfalfa --> 3.50733362241
buran1 --> 1.34837528801
buran2 --> 1.86298544437
ofnut --> 0.0084199068111

repeat 1000, short string:

alfalfa --> 0.0284217156815
buran1 --> 0.00996738901746
buran2 --> 0.0157894500521
ofnut --> 0.0273982927342

repeat 1, long string

alfalfa --> 3.52313333556
buran1 --> 1.35965603239
buran2 --> 1.82195551718
ofnut --> 0.00834742370672
Reply
#18
That is strange. I simply used a for loop and took an average, like so:
#!/usr/bin/python3
import re
import time

pattern = re.compile(r'[\w ]?\|BS\|')
buffer = "it |BS||BS||BS|this is one|BS||BS||BS|an example" #|BS| as in Backspace
test=time.time()

for x in range(0,100000):
   while re.search(r"\|BS\|", buffer):
       array = list(buffer)
       for m in re.finditer(r"\|BS\|", buffer):
           del array[m.start():m.end()]
           if m.start()-1 >= 0:
               del array[m.start()-1]
           buffer = ''.join(array)
           break
print(buffer)
print((time.time()-test)/100000)
I thought it might be Python 3 vs 2, although with the example you provided I get results similar to what you just showed.
Anyhow, do you know how to fix the pattern in order to accept non-alphanumeric chars?
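One thing worth checking in the timing loop above: buffer is overwritten during the first of the 100000 iterations, so every later iteration only pays for a single failed re.search on an already clean string. That could explain the very low average. A minimal sketch of a measurement that starts from the original string on every call follows. The names ORIGINAL, strip_markers and runs are made up for illustration, and the function is meant to do the same work as the loop above (delete the first |BS| and the char before it, repeat).

```python
import re
import timeit

ORIGINAL = "it |BS||BS||BS|this is one|BS||BS||BS|an example"
BS = re.compile(r'\|BS\|')

def strip_markers(text):
    """Delete the first |BS| and the char before it until no |BS| is left."""
    while True:
        m = BS.search(text)
        if m is None:
            return text
        start = max(m.start() - 1, 0)   # at index 0 there is no char to delete
        text = text[:start] + text[m.end():]

runs = 1000
# The lambda passes ORIGINAL each time, so every run does the full amount of work
total = timeit.timeit(lambda: strip_markers(ORIGINAL), number=runs)
print(strip_markers(ORIGINAL))
print(total / runs)
```

Strings are immutable, so ORIGINAL is untouched between runs and the average reflects the full cleanup each time.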
Reply
#19
I think r'[\S\s]?\|BS\|' should work
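A quick way to check it is to drop that pattern into the count=1 loop used earlier in the thread and feed it the string with the '.' in it. This is only a sketch; the name strip_backspaces is made up for illustration.

```python
import re

# [\S\s] matches any character at all, including punctuation and newlines
BS_PATTERN = re.compile(r'[\S\s]?\|BS\|')

def strip_backspaces(text):
    """Remove one |BS| (and the char before it, if any) per pass until stable."""
    while True:
        reduced = BS_PATTERN.sub('', text, count=1)
        if reduced == text:
            return text
        text = reduced

# Three |BS| after "on.e" delete 'e', '.' and 'n'
print(strip_backspaces("it BS|BS||BS||BS||BS||BS|this is on.e|BS||BS||BS|an example"))
# this is oan example
```

The count=1 matters here: with the optional leading character, a single unrestricted sub() could match a bare |BS| in the middle of a run and drop it without deleting anything in front of it.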
Reply
#20
It seems to work great. Thank you for the extended support