Python Forum
Regex: Remove all match plus one char before all
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Regex: Remove all match plus one char before all
#26
I spent some time with line profiler, s repeated 3000 times -> single string with 90000 chars.

Naive approach:

It seems that the growing of accumulator is not so bad. And as most of time is spent with simple operations, its hard to improve this while iterating over all characters. Perhaps processing it in chunks (find next |BS|, copy except last one, find next |BS|, copy/delete) could speed it significantly, but that would need some effort to keep parts of a string in list, eventually deleting them and going to previous ones if tons of consecutive |BS|'s  is met.

Output:
In [5]: %lprun -f naive naive(n=3000) Timer unit: 1e-06 s Total time: 0.344973 s File: /tmp/bs.py Function: naive at line 66 Line #      Hits         Time  Per Hit   % Time  Line Contents ==============================================================     66                                           def naive(s=None, n=1000):     67         1            4      4.0      0.0      if not s:     68         1           46     46.0      0.0          s = 'it |BS||BS||BS|this is one|BS||BS||BS|an example' * n     69                                               else:     70                                                   s = s * n     71         1            4      4.0      0.0      size, ind = len(s), 0     72         1            3      3.0      0.0      accumulator = []     73     90001        69329      0.8     20.1      while ind < size:     74     90000        75776      0.8     22.0          if ((s[ind] == '|')     75     18000        15236      0.8      4.4              and (ind+4 <= size)     76     18000        18577      1.0      5.4              and (s[ind:ind+4] == "|BS|")):     77     18000        13047      0.7      3.8              if accumulator:     78     18000        15157      0.8      4.4                  accumulator.pop()     79     18000        14372      0.8      4.2              ind += 4          80                                                   else:     81     72000        65994      0.9     19.1              accumulator.append(s[ind])     82     72000        56921      0.8     16.5              ind += 1     83         1          507    507.0      0.1      return "".join(accumulator)
Ofnuts' function:

First time I just skimmed through it, so I didnt notice that it removes one "level" of |BS| each pass . There are at most three consecutive |BS| in the example string, after fourth iteration string is same as after third iteration and it stops.

I checked it more carefully and noticed that it behaves differently than a naive approach. Naive traversing "uses" only |BS|'s that were in the initial string. There is no returning except deleting, so no new |BS|. All other functions check repeatedly entire string, so newly generated |BS| are processed. I am not sure what is right here - it depends on exact definition of problem.

Example of string generating new |BS|:
Output:
In [11]: s = "|Bu|Bu|BS|S|S|"     ...: print('%-10s ->  "%s"' %('naive', naive(s, n=1)))     ...: print('%-10s ->  "%s"' %('Ofnuts', noBS(s, n=1)))     ...: print('%-10s ->  "%s"' %('alfalfa', alfalfa(s, n=1)))     ...: print('%-10s ->  "%s"' %('buran1', buran1(s, n=1)))     ...: print('%-10s ->  "%s"' %('buran2', buran2(s, n=1)))     ...: naive      ->  "|Bu|BS|S|"     # there was only one |BS| in initial string, so one deletion Ofnuts     ->  "|BS|"          # deletes repeatedly, but last one is not catched - doesn't match .\|BS\| alfalfa    ->  ""              # last three iterate till there is nothing left ... buran1     ->  "" buran2     ->  ""
Almost all time is spent in four re.sub passes:
Output:
In [6]: %lprun -f noBS noBS(n=3000) Timer unit: 1e-06 s Total time: 0.109533 s File: /tmp/bs.py Function: noBS at line 53 Line #      Hits         Time  Per Hit   % Time  Line Contents ==============================================================     53                                           def noBS(s=None, n=1000):     54         1            3      3.0      0.0      if not s:     55         1           35     35.0      0.0          s = 'it |BS||BS||BS|this is one|BS||BS||BS|an example' * n     56                                               else:     57                                                   s = s * n     58         1           15     15.0      0.0      pattern = re.compile(r'.\|BS\|((\|BS\|)*)')     59         1            2      2.0      0.0      previous = ''     60         5           13      2.6      0.0      while s != previous:     61         4            6      1.5      0.0          previous = s     62         4       109458  27364.5     99.9          s = re.sub(pattern, r'\1', s)     63         1            1      1.0      0.0      return s
alfalfa's function:
buran's first:
buran's second:
Reply


Messages In This Thread
RE: Regex: Remove all match plus one char before all - by zivoni - Feb-23-2017, 07:53 PM

Possibly Related Threads…
Thread Author Replies Views Last Post
  Facing issue in python regex newline match Shr 6 1,596 Oct-25-2023, 09:42 AM
Last Post: Shr
Sad How to split a String from Text Input into 40 char chunks? lastyle 7 1,334 Aug-01-2023, 09:36 AM
Last Post: Pedroski55
  Failing regex, space before and after the "match" tester_V 6 1,345 Mar-06-2023, 03:03 PM
Last Post: deanhystad
  Regex pattern match WJSwan 2 1,409 Feb-07-2023, 04:52 AM
Last Post: WJSwan
  Match substring using regex Pavel_47 6 1,572 Jul-18-2022, 07:46 AM
Last Post: Pavel_47
  Match key-value json,Regex saam 5 5,589 Dec-07-2021, 03:06 PM
Last Post: saam
  How to replace on char with another in a string? korenron 3 2,455 Dec-03-2020, 07:37 AM
Last Post: korenron
  How to remove char from string?? ridgerunnersjw 2 2,643 Sep-30-2020, 03:49 PM
Last Post: ridgerunnersjw
  regex.findall that won't match anything xiaobai97 1 2,121 Sep-24-2020, 02:02 PM
Last Post: DeaD_EyE
  Creating new list based on exact regex match in original list interjectdirector 1 2,392 Mar-08-2020, 09:30 PM
Last Post: deanhystad

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020