Regex: Remove all match plus one char before all

**buran** · Feb-22-2017, 04:41 PM

actually this is simpler r'.?\|BS\|'
I'm not sure why I decided it doesn't work

***Ofnuts*** · Feb-22-2017, 04:56 PM

Because if you have ...x|BS||BS||BS|, you get ...|BS because the last |BS| eats the last character of the second one. But that can work if you make sure you have only one substitution per call to re.sub().

**buran** · Feb-22-2017, 06:44 PM

@Ofnuts, thanks. It was before I left office and then I decided that I'm wrong. Anyway the presumption is the code from previous post with single substitution per call

***zivoni*** · Feb-23-2017, 05:20 PM

As I am rather ignorant of re and repeated iterations for "linear" problem look ugly, I tried direct naive approach - just traverse through string, copy characters or delete last one if |BS| is found.

def naive(s=None, n=1000):
    if not s:
        s = 'it |BS||BS||BS|this is one|BS||BS||BS|an example' * n
    else:
        s = s * n
    size, ind = len(s), 0
    accumulator = []
    while ind < size:
        if (s[ind] == '|') and (ind+4 <= size) and (s[ind:ind+4] == "|BS|"):
            if accumulator:
                accumulator.pop()
            ind += 4     
        else:
            accumulator.append(s[ind])
            ind += 1
    return "".join(accumulator)

Stolen buran's measuring script gives me:

Output:repeat 1000, short string s*1:

alfalfa --> 0.03940597499968135
buran1  --> 0.015949830999488768
buran2  --> 0.02403012300055707
ofnut   --> 0.04103327599932527
naive   --> 0.010694067999793333

repeat 1, long string s*1000

alfalfa --> 4.176700068000173
buran1  --> 2.2705873200002316
buran2  --> 3.068299023000691
ofnut   --> 0.014837658000033116
naive   --> 0.009339336000266485

repeat 1, very long string s*3000

alfalfa --> 40.76510821000011
buran1  --> 21.95231995500035
buran2  --> 29.43507453300026
ofnut   --> 0.04528477199983172
naive   --> 0.03035950000048615

So it seems that:

buran got faster PC
naive is fastest, but it doesnt improve for longer strings as much as I would guess - using re and repeatly search entire string should be quadratic (and from timing it is), while traversing should be linear - perhaps accumulator "growing" and char/substring checking is rather expensive compared to optimalized re

It should be possible to combine both approaches - find first occurence of |BS| with .find(), copy first parst of string except last char, start search from previous |BS| and so on. Or preallocate accumulator with [None] * size and keep second index to mark end of copied part to avoid growing and deleting chars from end. But gains probably wold be marginal and I am lazy.

**buran** · (This post was last modified: Feb-23-2017, 05:48 PM by buran.)

That is interesting observation. I'm curious if using collections.deque for the accumulator could speed it even further for long strings.
As per the docs:

Quote:Deques support thread-safe, memory efficient appends and pops from either side of the deque with approximately the same O(1) performance in either direction.

Though list objects support similar operations, they are optimized for fast fixed-length operations and incur O(n) memory movement costs for pop(0) and insert(0, v) operations which change both the size and position of the underlying data representation.

It looks like the benefit is when pop or insert on/from the left side of the deque.

***zivoni*** · Feb-23-2017, 07:53 PM

I spent some time with line profiler, s repeated 3000 times -> single string with 90000 chars.

Naive approach:

It seems that the growing of accumulator is not so bad. And as most of time is spent with simple operations, its hard to improve this while iterating over all characters. Perhaps processing it in chunks (find next |BS|, copy except last one, find next |BS|, copy/delete) could speed it significantly, but that would need some effort to keep parts of a string in list, eventually deleting them and going to previous ones if tons of consecutive |BS|'s is met.

Output:In [5]: %lprun -f naive naive(n=3000)
Timer unit: 1e-06 s

Total time: 0.344973 s
File: /tmp/bs.py
Function: naive at line 66

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    66                                           def naive(s=None, n=1000):
    67         1            4      4.0      0.0      if not s:
    68         1           46     46.0      0.0          s = 'it |BS||BS||BS|this is one|BS||BS||BS|an example' * n
    69                                               else:
    70                                                   s = s * n
    71         1            4      4.0      0.0      size, ind = len(s), 0
    72         1            3      3.0      0.0      accumulator = []
    73     90001        69329      0.8     20.1      while ind < size:
    74     90000        75776      0.8     22.0          if ((s[ind] == '|') 
    75     18000        15236      0.8      4.4              and (ind+4 <= size) 
    76     18000        18577      1.0      5.4              and (s[ind:ind+4] == "|BS|")):
    77     18000        13047      0.7      3.8              if accumulator:
    78     18000        15157      0.8      4.4                  accumulator.pop()
    79     18000        14372      0.8      4.2              ind += 4     
    80                                                   else:
    81     72000        65994      0.9     19.1              accumulator.append(s[ind])
    82     72000        56921      0.8     16.5              ind += 1
    83         1          507    507.0      0.1      return "".join(accumulator)

Ofnuts' function:

First time I just skimmed through it, so I didnt notice that it removes one "level" of |BS| each pass . There are at most three consecutive |BS| in the example string, after fourth iteration string is same as after third iteration and it stops.

I checked it more carefully and noticed that it behaves differently than a naive approach. Naive traversing "uses" only |BS|'s that were in the initial string. There is no returning except deleting, so no new |BS|. All other functions check repeatedly entire string, so newly generated |BS| are processed. I am not sure what is right here - it depends on exact definition of problem.

Example of string generating new |BS|:

Output:In [11]: s = "|Bu|Bu|BS|S|S|"
    ...: print('%-10s ->  "%s"' %('naive', naive(s, n=1)))
    ...: print('%-10s ->  "%s"' %('Ofnuts', noBS(s, n=1)))
    ...: print('%-10s ->  "%s"' %('alfalfa', alfalfa(s, n=1)))
    ...: print('%-10s ->  "%s"' %('buran1', buran1(s, n=1)))
    ...: print('%-10s ->  "%s"' %('buran2', buran2(s, n=1)))
    ...: 
naive      ->  "|Bu|BS|S|"     # there was only one |BS| in initial string, so one deletion
Ofnuts     ->  "|BS|"          # deletes repeatedly, but last one is not catched - doesn't match .\|BS\|
alfalfa    ->  ""              # last three iterate till there is nothing left ...
buran1     ->  ""
buran2     ->  ""

Almost all time is spent in four re.sub passes:

Output:In [6]: %lprun -f noBS noBS(n=3000)
Timer unit: 1e-06 s

Total time: 0.109533 s
File: /tmp/bs.py
Function: noBS at line 53

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    53                                           def noBS(s=None, n=1000):
    54         1            3      3.0      0.0      if not s:
    55         1           35     35.0      0.0          s = 'it |BS||BS||BS|this is one|BS||BS||BS|an example' * n
    56                                               else:
    57                                                   s = s * n
    58         1           15     15.0      0.0      pattern = re.compile(r'.\|BS\|((\|BS\|)*)')
    59         1            2      2.0      0.0      previous = ''
    60         5           13      2.6      0.0      while s != previous:
    61         4            6      1.5      0.0          previous = s
    62         4       109458  27364.5     99.9          s = re.sub(pattern, r'\1', s)
    63         1            1      1.0      0.0      return s

alfalfa's function:

Hide/Show

Output:In [7]: %lprun -f alfalfa alfalfa(n=3000)
Timer unit: 1e-06 s

Total time: 41.3309 s
File: /tmp/bs.py
Function: alfalfa at line 5

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     5                                           def alfalfa(input_str=None, n=1000):
     6         1            4      4.0      0.0      if not input_str:
     7         1           50     50.0      0.0          string = 'it |BS||BS||BS|this is one|BS||BS||BS|an example' * n
     8                                               else:
     9                                                   string = input_str
    10     18001       429426     23.9      1.0      while re.search("\|BS\|", string):
    11     18000     20840472   1157.8     50.4          array = list(string)
    12     18000       637159     35.4      1.5          for m in re.finditer("\|BS\|", string):
    13     18000      1150019     63.9      2.8              del array[m.start():m.end()]
    14     18000        36054      2.0      0.1              if m.start() - 1 >= 0:
    15     18000       567248     31.5      1.4                  del array[m.start() - 1]
    16     18000     17625415    979.2     42.6              string = ''.join(array)
    17     18000        45094      2.5      0.1              break
    18         1            1      1.0      0.0      return string

buran's first:

Hide/Show

Output:In [8]: %lprun -f buran1 buran1(n=3000)
Timer unit: 1e-06 s

Total time: 21.5379 s
File: /tmp/bs.py
Function: buran1 at line 21

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    21                                           def buran1(input_str=None, n=1000):
    22         1           19     19.0      0.0      ptrn = re.compile(r'[\w ]?\|BS\|')
    23         1            2      2.0      0.0      if not input_str:
    24         1           43     43.0      0.0          string = 'it |BS||BS||BS|this is one|BS||BS||BS|an example' * n
    25                                               else:
    26                                                   string = input_str
    27         1            2      2.0      0.0      while True:
    28     18001     21489449   1193.8     99.8          after_sub = ptrn.sub('', string, count=1)
    29     18001        24983      1.4      0.1          if string == after_sub:
    30         1            2      2.0      0.0              break
    31                                                   else:
    32     18000        23385      1.3      0.1              string = after_sub
    33         1            2      2.0      0.0      return string

buran's second:

Hide/Show

Output:In [9]: %lprun -f buran2 buran2(n=3000)
Timer unit: 1e-06 s

Total time: 28.4692 s
File: /tmp/bs.py
Function: buran2 at line 36

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    36                                           def buran2(input_str=None, n=1000):
    37         1           18     18.0      0.0      ptrn = re.compile(r'(\|BS\|)+')
    38         1            3      3.0      0.0      if not input_str:
    39         1           44     44.0      0.0          string = 'it |BS||BS||BS|this is one|BS||BS||BS|an example' * n
    40                                               else:
    41                                                   string = input_str
    42         1            2      2.0      0.0      while True:
    43      6001      8488591   1414.5     29.8          match = re.search(ptrn, string)
    44      6001        10278      1.7      0.0          if match:
    45      6000        36081      6.0      0.1              num_chars = min(match.start(), int(len(match.group()) / 4))
    46      6000        59462      9.9      0.2              sub_pattern = re.compile(r'[\w ]{{{}}}(\|BS\|)+'.format(num_chars))
    47      6000     19874706   3312.5     69.8              string = sub_pattern.sub('', string, count=1)
    48                                                   else:
    49         1            1      1.0      0.0              break
    50         1            1      1.0      0.0      return string

**buran** · Feb-23-2017, 08:57 PM

I think Alfalfa didn't expect that his question would start such comprehensive and interesting discussion on the subject :-)

***snippsat*** · Feb-23-2017, 09:37 PM

The lazy way to optimize with PyPy Cool

So a test with python27, 36 and PyPy.
PyPy smash all in short string test,something strange happens with alfalfa in long string test.
So look like this:

# Python27                           
λ py -2 re_test_27.py                
repeat 100000, short string:         
                                     
alfalfa --> 3.5496031518             
buran1 --> 1.52491091255             
buran2 --> 2.57949606539             
ofnut --> 4.2571897546               
naive --> 1.13338769417              
                                     
repeat 10, long string               
                                     
alfalfa --> 38.5974750643            
buran1 --> 14.4288213841             
buran2 --> 20.0439010109             
ofnut --> 0.106811476992             
naive --> 0.0962549700541            
-------------------------------                                     

# Python36                       
λ python re_test_36.py               
repeat 100000, short string:         
                                     
alfalfa --> 3.2602623716768764       
buran1 --> 1.0042334305819614        
buran2 --> 2.1202766716410055        
ofnut --> 3.5325847427125536         
naive --> 1.222133670789903          
                                     
repeat 10, long string               
                                     
alfalfa --> 50.20940404113115        
buran1 --> 18.295298191151325        
buran2 --> 25.17509017627549         
ofnut --> 0.1061772336586273         
naive --> 0.11889537011441575        
------------------------------------                                     

# PyPy 5.4                     
λ pypy re_test_36.py                 
repeat 100000, short string:         
                                     
alfalfa --> 0.80075443804            
buran1 --> 0.276574856138            
buran2 --> 0.383053332696            
ofnut --> 0.46037204103              
naive --> 0.0783222464252            
                                     
repeat 10, long string               
                                     
alfalfa --> 109.683794765            
buran1 --> 14.4731815182             
buran2 --> 15.8616106454             
ofnut --> 0.0287024597269            
naive --> 0.010870755303

***Ofnuts*** · (This post was last modified: Feb-23-2017, 10:46 PM by Ofnuts.)

(Feb-23-2017, 07:53 PM)zivoni Wrote: Ofnuts -> "|BS|" # deletes repeatedly, but last one is not catched - doesn't match .\|BS\|

Well... of course, because the mission is to delete a character in front a |BS|. The specs are a bit vague about what should happen when there is no character in front of the BS. This behavior is coherent with not erasing characters from another |BS|.

A difference between your naive function and the regexp one is that the regexp execution is directed by the longest sequence of consecutive |BS|, while the naive one is just directed by the total length. So the regexp is faster if there are few '|BS|' separated by long strings. But the overall winner of my tests is a less-naive function, that doesn't really look at each character:

def lessnaive(s):    
    copied = []
    previous=0
    while True:
        current=s.find('|BS|',previous)
        if current<0:
            break
        copied.extend(s[previous:current])
        if copied:
            copied.pop()
        previous=current+4
    
    copied.extend(s[previous:])
    return ''.join(copied)

Its worst case scenario (all |BS|) is only 2.5 slower than naive(), but as the ratio of characters to |BS| increases, it becomes much faster.

***zivoni*** · (This post was last modified: Feb-23-2017, 11:15 PM by zivoni.)

(Feb-23-2017, 10:46 PM)Ofnuts Wrote: Well... of course, because the mission is to delete a character in front a |BS|. The specs are a bit vague about what should happen when there is no character in front of the BS. This behavior is coherent with not erasing characters from another |BS|.

I think that specs are quite vague. But if the task was to emulate the backspace on a keyboard, then even with an empty buffer the backspace is consumed (usually). And as its processed sequentially, there is never previous |BS|.

Your implementation of less_naive is exactly the improvement I mentioned before - find first |BS|, copy chunk of chars without last one, find next one and so on. For some reason (now I see that groundless one) I thought that it would need little more effort to keep right, and as naive was fast enough, I didnt bother to try it.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Facing issue in python regex newline match	Shr	6	1,295	Oct-25-2023, 09:42 AM Last Post: Shr
	How to split a String from Text Input into 40 char chunks?	lastyle	7	1,133	Aug-01-2023, 09:36 AM Last Post: Pedroski55
	Failing regex, space before and after the "match"	tester_V	6	1,186	Mar-06-2023, 03:03 PM Last Post: deanhystad
	Regex pattern match	WJSwan	2	1,255	Feb-07-2023, 04:52 AM Last Post: WJSwan
	Match substring using regex	Pavel_47	6	1,429	Jul-18-2022, 07:46 AM Last Post: Pavel_47
	Match key-value json,Regex	saam	5	5,419	Dec-07-2021, 03:06 PM Last Post: saam
	How to replace on char with another in a string?	korenron	3	2,356	Dec-03-2020, 07:37 AM Last Post: korenron
	How to remove char from string??	ridgerunnersjw	2	2,555	Sep-30-2020, 03:49 PM Last Post: ridgerunnersjw
	regex.findall that won't match anything	xiaobai97	1	2,016	Sep-24-2020, 02:02 PM Last Post: DeaD_EyE
	Creating new list based on exact regex match in original list	interjectdirector	1	2,283	Mar-08-2020, 09:30 PM Last Post: deanhystad

Regex: Remove all match plus one char before all

User Panel Messages

Announcements