Python Forum
Regex: Remove all match plus one char before all
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Regex: Remove all match plus one char before all
#21
actually this is simpler r'.?\|BS\|'
I'm not sure why I decided it doesn't work
Reply
#22
Because if you have ...x|BS||BS||BS|, you get ...|BS because the last |BS| eats the last character of the second one. But that can work if you make sure you have only one substitution per call to re.sub().
Unless noted otherwise, code in my posts should be understood as "coding suggestions", and its use may require more neurones than the two necessary for Ctrl-C/Ctrl-V.
Your one-stop place for all your GIMP needs: gimp-forum.net
Reply
#23
@Ofnuts, thanks. It was before I left office and then I decided that I'm wrong. Anyway the presumption is the code from previous post with single substitution per call
Reply
#24
As I am rather ignorant of re and repeated iterations for "linear" problem look ugly, I tried direct naive approach - just traverse through string, copy characters or delete last one if |BS| is found.

def naive(s=None, n=1000):
    if not s:
        s = 'it |BS||BS||BS|this is one|BS||BS||BS|an example' * n
    else:
        s = s * n
    size, ind = len(s), 0
    accumulator = []
    while ind < size:
        if (s[ind] == '|') and (ind+4 <= size) and (s[ind:ind+4] == "|BS|"):
            if accumulator:
                accumulator.pop()
            ind += 4     
        else:
            accumulator.append(s[ind])
            ind += 1
    return "".join(accumulator)
Stolen buran's measuring script gives me:

Output:
repeat 1000, short string s*1: alfalfa --> 0.03940597499968135 buran1  --> 0.015949830999488768 buran2  --> 0.02403012300055707 ofnut   --> 0.04103327599932527 naive   --> 0.010694067999793333 repeat 1, long string s*1000 alfalfa --> 4.176700068000173 buran1  --> 2.2705873200002316 buran2  --> 3.068299023000691 ofnut   --> 0.014837658000033116 naive   --> 0.009339336000266485 repeat 1, very long string s*3000 alfalfa --> 40.76510821000011 buran1  --> 21.95231995500035 buran2  --> 29.43507453300026 ofnut   --> 0.04528477199983172 naive   --> 0.03035950000048615
So it seems that:
  • buran got faster PC
  • naive is fastest, but it doesnt improve for longer strings as much as I would guess - using re and repeatly search entire string should be quadratic (and from timing it is), while traversing should be linear - perhaps accumulator "growing" and char/substring checking is rather expensive compared to optimalized re
It should be possible to combine both approaches - find first occurence of |BS| with .find(), copy first parst of string except  last char, start search from previous |BS| and so on. Or preallocate accumulator with [None] * size and keep second index to mark end of copied part to avoid growing and deleting chars from end. But gains probably wold be marginal and I am lazy.
Reply
#25
That is interesting observation. I'm curious if using collections.deque for the accumulator could speed it even further for long strings.
As per the docs:

Quote:Deques support thread-safe, memory efficient appends and pops from either side of the deque with approximately the same O(1) performance in either direction.

Though list objects support similar operations, they are optimized for fast fixed-length operations and incur O(n) memory movement costs for pop(0) and insert(0, v) operations which change both the size and position of the underlying data representation.

It looks like the benefit is when pop or insert on/from the left side of the deque.
Reply
#26
I spent some time with line profiler, s repeated 3000 times -> single string with 90000 chars.

Naive approach:

It seems that the growing of accumulator is not so bad. And as most of time is spent with simple operations, its hard to improve this while iterating over all characters. Perhaps processing it in chunks (find next |BS|, copy except last one, find next |BS|, copy/delete) could speed it significantly, but that would need some effort to keep parts of a string in list, eventually deleting them and going to previous ones if tons of consecutive |BS|'s  is met.

Output:
In [5]: %lprun -f naive naive(n=3000) Timer unit: 1e-06 s Total time: 0.344973 s File: /tmp/bs.py Function: naive at line 66 Line #      Hits         Time  Per Hit   % Time  Line Contents ==============================================================     66                                           def naive(s=None, n=1000):     67         1            4      4.0      0.0      if not s:     68         1           46     46.0      0.0          s = 'it |BS||BS||BS|this is one|BS||BS||BS|an example' * n     69                                               else:     70                                                   s = s * n     71         1            4      4.0      0.0      size, ind = len(s), 0     72         1            3      3.0      0.0      accumulator = []     73     90001        69329      0.8     20.1      while ind < size:     74     90000        75776      0.8     22.0          if ((s[ind] == '|')     75     18000        15236      0.8      4.4              and (ind+4 <= size)     76     18000        18577      1.0      5.4              and (s[ind:ind+4] == "|BS|")):     77     18000        13047      0.7      3.8              if accumulator:     78     18000        15157      0.8      4.4                  accumulator.pop()     79     18000        14372      0.8      4.2              ind += 4          80                                                   else:     81     72000        65994      0.9     19.1              accumulator.append(s[ind])     82     72000        56921      0.8     16.5              ind += 1     83         1          507    507.0      0.1      return "".join(accumulator)
Ofnuts' function:

First time I just skimmed through it, so I didnt notice that it removes one "level" of |BS| each pass . There are at most three consecutive |BS| in the example string, after fourth iteration string is same as after third iteration and it stops.

I checked it more carefully and noticed that it behaves differently than a naive approach. Naive traversing "uses" only |BS|'s that were in the initial string. There is no returning except deleting, so no new |BS|. All other functions check repeatedly entire string, so newly generated |BS| are processed. I am not sure what is right here - it depends on exact definition of problem.

Example of string generating new |BS|:
Output:
In [11]: s = "|Bu|Bu|BS|S|S|"     ...: print('%-10s ->  "%s"' %('naive', naive(s, n=1)))     ...: print('%-10s ->  "%s"' %('Ofnuts', noBS(s, n=1)))     ...: print('%-10s ->  "%s"' %('alfalfa', alfalfa(s, n=1)))     ...: print('%-10s ->  "%s"' %('buran1', buran1(s, n=1)))     ...: print('%-10s ->  "%s"' %('buran2', buran2(s, n=1)))     ...: naive      ->  "|Bu|BS|S|"     # there was only one |BS| in initial string, so one deletion Ofnuts     ->  "|BS|"          # deletes repeatedly, but last one is not catched - doesn't match .\|BS\| alfalfa    ->  ""              # last three iterate till there is nothing left ... buran1     ->  "" buran2     ->  ""
Almost all time is spent in four re.sub passes:
Output:
In [6]: %lprun -f noBS noBS(n=3000) Timer unit: 1e-06 s Total time: 0.109533 s File: /tmp/bs.py Function: noBS at line 53 Line #      Hits         Time  Per Hit   % Time  Line Contents ==============================================================     53                                           def noBS(s=None, n=1000):     54         1            3      3.0      0.0      if not s:     55         1           35     35.0      0.0          s = 'it |BS||BS||BS|this is one|BS||BS||BS|an example' * n     56                                               else:     57                                                   s = s * n     58         1           15     15.0      0.0      pattern = re.compile(r'.\|BS\|((\|BS\|)*)')     59         1            2      2.0      0.0      previous = ''     60         5           13      2.6      0.0      while s != previous:     61         4            6      1.5      0.0          previous = s     62         4       109458  27364.5     99.9          s = re.sub(pattern, r'\1', s)     63         1            1      1.0      0.0      return s
alfalfa's function:
buran's first:
buran's second:
Reply
#27
I think Alfalfa didn't expect that his question would start such comprehensive and interesting discussion on the subject :-)
Reply
#28
The lazy way to optimize with PyPy Cool
So a test with python27, 36 and PyPy.
PyPy smash all in short string test,something strange happens with alfalfa in long string test.
So look like this:
# Python27                           
λ py -2 re_test_27.py                
repeat 100000, short string:         
                                     
alfalfa --> 3.5496031518             
buran1 --> 1.52491091255             
buran2 --> 2.57949606539             
ofnut --> 4.2571897546               
naive --> 1.13338769417              
                                     
repeat 10, long string               
                                     
alfalfa --> 38.5974750643            
buran1 --> 14.4288213841             
buran2 --> 20.0439010109             
ofnut --> 0.106811476992             
naive --> 0.0962549700541            
-------------------------------                                     

# Python36                       
λ python re_test_36.py               
repeat 100000, short string:         
                                     
alfalfa --> 3.2602623716768764       
buran1 --> 1.0042334305819614        
buran2 --> 2.1202766716410055        
ofnut --> 3.5325847427125536         
naive --> 1.222133670789903          
                                     
repeat 10, long string               
                                     
alfalfa --> 50.20940404113115        
buran1 --> 18.295298191151325        
buran2 --> 25.17509017627549         
ofnut --> 0.1061772336586273         
naive --> 0.11889537011441575        
------------------------------------                                     

# PyPy 5.4                     
λ pypy re_test_36.py                 
repeat 100000, short string:         
                                     
alfalfa --> 0.80075443804            
buran1 --> 0.276574856138            
buran2 --> 0.383053332696            
ofnut --> 0.46037204103              
naive --> 0.0783222464252            
                                     
repeat 10, long string               
                                     
alfalfa --> 109.683794765            
buran1 --> 14.4731815182             
buran2 --> 15.8616106454             
ofnut --> 0.0287024597269            
naive --> 0.010870755303
Reply
#29
(Feb-23-2017, 07:53 PM)zivoni Wrote: Ofnuts     ->  "|BS|"          # deletes repeatedly, but last one is not catched - doesn't match .\|BS\|
Well... of course, because the mission is to delete a character in front a |BS|. The specs are a bit vague about what should happen when there is no character in front of the BS. This behavior is coherent with not erasing characters from another |BS|.

A difference between your naive function and the regexp one is that the regexp execution is directed by the longest sequence of consecutive |BS|, while the naive one is just directed by the total length. So the regexp is faster if there are few '|BS|' separated by long strings. But the overall winner of my tests is a less-naive function, that doesn't really look at each character:

def lessnaive(s):    
    copied = []
    previous=0
    while True:
        current=s.find('|BS|',previous)
        if current<0:
            break
        copied.extend(s[previous:current])
        if copied:
            copied.pop()
        previous=current+4
    
    copied.extend(s[previous:])
    return ''.join(copied)
Its worst case scenario (all |BS|) is only 2.5 slower than naive(), but as the ratio of characters to |BS| increases, it becomes much faster.
Unless noted otherwise, code in my posts should be understood as "coding suggestions", and its use may require more neurones than the two necessary for Ctrl-C/Ctrl-V.
Your one-stop place for all your GIMP needs: gimp-forum.net
Reply
#30
(Feb-23-2017, 10:46 PM)Ofnuts Wrote: Well... of course, because the mission is to delete a character in front a |BS|. The specs are a bit vague about what should happen when there is no character in front of the BS. This behavior is coherent with not erasing characters from another |BS|.

I think that specs are quite vague. But if the task was to emulate the backspace on a keyboard, then even with an empty buffer the backspace is consumed (usually). And as its processed sequentially, there is never previous |BS|.

Your implementation of less_naive is exactly the improvement I mentioned before - find first |BS|, copy chunk of chars without last one, find next one and so on. For some reason (now I see that groundless one) I thought that it would need little more effort to keep right,  and as naive was fast enough, I didnt bother to try it.
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Facing issue in python regex newline match Shr 6 1,295 Oct-25-2023, 09:42 AM
Last Post: Shr
Sad How to split a String from Text Input into 40 char chunks? lastyle 7 1,133 Aug-01-2023, 09:36 AM
Last Post: Pedroski55
  Failing regex, space before and after the "match" tester_V 6 1,186 Mar-06-2023, 03:03 PM
Last Post: deanhystad
  Regex pattern match WJSwan 2 1,255 Feb-07-2023, 04:52 AM
Last Post: WJSwan
  Match substring using regex Pavel_47 6 1,429 Jul-18-2022, 07:46 AM
Last Post: Pavel_47
  Match key-value json,Regex saam 5 5,419 Dec-07-2021, 03:06 PM
Last Post: saam
  How to replace on char with another in a string? korenron 3 2,356 Dec-03-2020, 07:37 AM
Last Post: korenron
  How to remove char from string?? ridgerunnersjw 2 2,555 Sep-30-2020, 03:49 PM
Last Post: ridgerunnersjw
  regex.findall that won't match anything xiaobai97 1 2,016 Sep-24-2020, 02:02 PM
Last Post: DeaD_EyE
  Creating new list based on exact regex match in original list interjectdirector 1 2,283 Mar-08-2020, 09:30 PM
Last Post: deanhystad

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020