Regex: Remove all match plus one char before all

Feb-22-2017, 04:41 PM
actually this is simpler
r'.?\|BS\|'
I'm not sure why I decided it doesn't work
Feb-22-2017, 04:56 PM
Because if you have
...x|BS||BS||BS|, you get ...|BS, because the last |BS| eats the last character of the second one. But it can work if you make sure you do only one substitution per call to re.sub().
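That single-substitution idea can be sketched as a loop (the function name and the loop are my own, not code from the thread):

```python
import re

def sub_one_at_a_time(s):
    # Apply the substitution once per call: with count=1 each pass removes
    # exactly one '|BS|' (plus the character before it, if any), so a later
    # '|BS|' can never eat the tail of an earlier match.
    while True:
        new = re.sub(r'.?\|BS\|', '', s, count=1)
        if new == s:
            return s
        s = new
```

The loop stops as soon as a pass changes nothing, so strings without `|BS|` are returned unchanged.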
Unless noted otherwise, code in my posts should be understood as "coding suggestions", and its use may require more neurones than the two necessary for Ctrl-C/Ctrl-V.
Your one-stop place for all your GIMP needs: gimp-forum.net
Feb-22-2017, 06:44 PM
@Ofnuts, thanks. It was before I left the office, and then I decided that I was wrong. Anyway, the presumption is the code from the previous post, with a single substitution per call.
Feb-23-2017, 05:20 PM
As I am rather ignorant of re, and repeated iterations for a "linear" problem look ugly, I tried a direct naive approach: just traverse the string, copying characters, or deleting the last copied one when |BS| is found.
def naive(s=None, n=1000):
    if not s:
        s = 'it |BS||BS||BS|this is one|BS||BS||BS|an example' * n
    else:
        s = s * n
    size, ind = len(s), 0
    accumulator = []
    while ind < size:
        if (s[ind] == '|') and (ind + 4 <= size) and (s[ind:ind+4] == "|BS|"):
            if accumulator:
                accumulator.pop()
            ind += 4
        else:
            accumulator.append(s[ind])
            ind += 1
    return "".join(accumulator)

Buran's measuring script (stolen) gives me:

So it seems that:
That is an interesting observation. I'm curious whether using collections.deque for the accumulator could speed it up even further for long strings.
As per the docs: Quote: "Deques support thread-safe, memory efficient appends and pops from either side of the deque with approximately the same O(1) performance in either direction." It looks like the benefit comes when you pop from or insert on the left side of the deque.
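To test that guess, here is a deque-backed variant of the naive loop (my own sketch; since all appends and pops happen on the right, where list is already amortized O(1), no large gain should be expected):

```python
from collections import deque

def naive_deque(s):
    # Same traversal as naive(), but with a deque accumulator.
    # The s[ind:ind+4] slice is safe near the end of the string:
    # a short tail like '|BS' simply fails the comparison.
    acc = deque()
    size, ind = len(s), 0
    while ind < size:
        if s[ind] == '|' and s[ind:ind+4] == '|BS|':
            if acc:
                acc.pop()   # right-side pop: O(1) for both list and deque
            ind += 4
        else:
            acc.append(s[ind])
            ind += 1
    return ''.join(acc)
```

This keeps the one-pass, sequential semantics of naive(): only `|BS|` markers present in the original string are consumed.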
Feb-23-2017, 07:53 PM
I spent some time with the line profiler; s repeated 3000 times -> a single string with 90000 chars.
Naive approach: It seems that the growing of the accumulator is not so bad. And as most of the time is spent in simple operations, it's hard to improve this while iterating over all characters. Perhaps processing it in chunks (find the next |BS|, copy everything except the last character, find the next |BS|, copy/delete) could speed it up significantly, but that would need some effort to keep parts of the string in a list, eventually deleting them and going back to previous ones if tons of consecutive |BS|'s are met.

Ofnuts' function: The first time I just skimmed through it, so I didn't notice that it removes one "level" of |BS| each pass. There are at most three consecutive |BS| in the example string, so after the fourth iteration the string is the same as after the third and it stops. I checked it more carefully and noticed that it behaves differently from the naive approach. Naive traversing "uses" only the |BS|'s that were in the initial string; there is no going back except for deleting, so no new |BS| appears. All the other functions check the entire string repeatedly, so newly generated |BS| are processed as well. I am not sure which is right here - it depends on the exact definition of the problem. Example of a string generating a new |BS|:

Almost all the time is spent in the four re.sub passes.

alfalfa's function:
buran's first:
buran's second:
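One string with this property (the string '|BX|BS|S|' is my own illustration, not necessarily the one originally posted): a single sequential pass deletes only the 'X', and the leftover pieces splice together into a brand-new '|BS|' that repeated whole-string passes would go on to consume.

```python
s = '|BX|BS|S|'
# The only '|BS|' in s spans indices 3..6; a sequential pass deletes
# the character just before it (the 'X' at index 2) and moves on.
one_pass = s[:2] + s[7:]
print(one_pass)   # -> '|BS|': the deletion spliced a new marker together
# A repeated whole-string approach would now match this new '|BS|' too
# and eventually reduce the string to '' -- hence the different results.
```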
Feb-23-2017, 08:57 PM
I think Alfalfa didn't expect that his question would start such a comprehensive and interesting discussion on the subject :-)
Feb-23-2017, 09:37 PM
The lazy way to optimize with PyPy
So a test with Python 2.7, 3.6 and PyPy. PyPy smashes all in the short string test; something strange happens with alfalfa in the long string test. It looks like this:

# Python27
λ py -2 re_test_27.py
repeat 100000, short string:
alfalfa --> 3.5496031518
buran1 --> 1.52491091255
buran2 --> 2.57949606539
ofnut --> 4.2571897546
naive --> 1.13338769417
repeat 10, long string
alfalfa --> 38.5974750643
buran1 --> 14.4288213841
buran2 --> 20.0439010109
ofnut --> 0.106811476992
naive --> 0.0962549700541
-------------------------------
# Python36
λ python re_test_36.py
repeat 100000, short string:
alfalfa --> 3.2602623716768764
buran1 --> 1.0042334305819614
buran2 --> 2.1202766716410055
ofnut --> 3.5325847427125536
naive --> 1.222133670789903
repeat 10, long string
alfalfa --> 50.20940404113115
buran1 --> 18.295298191151325
buran2 --> 25.17509017627549
ofnut --> 0.1061772336586273
naive --> 0.11889537011441575
------------------------------------
# PyPy 5.4
λ pypy re_test_36.py
repeat 100000, short string:
alfalfa --> 0.80075443804
buran1 --> 0.276574856138
buran2 --> 0.383053332696
ofnut --> 0.46037204103
naive --> 0.0783222464252
repeat 10, long string
alfalfa --> 109.683794765
buran1 --> 14.4731815182
buran2 --> 15.8616106454
ofnut --> 0.0287024597269
naive --> 0.010870755303

(Feb-23-2017, 07:53 PM)zivoni Wrote: Ofnuts -> "|BS|" # deletes repeatedly, but last one is not catched - doesn't match .\|BS\|

Well... of course, because the mission is to delete a character in front of a |BS|. The specs are a bit vague about what should happen when there is no character in front of the BS. This behavior is coherent with not erasing characters from another |BS|. A difference between your naive function and the regexp one is that the regexp execution is driven by the longest sequence of consecutive |BS|, while the naive one is driven only by the total length. So the regexp is faster if there are few |BS| separated by long stretches of text.
But the overall winner of my tests is a less-naive function that doesn't really look at each character:

def lessnaive(s):
    copied = []
    previous = 0
    while True:
        current = s.find('|BS|', previous)
        if current < 0:
            break
        copied.extend(s[previous:current])
        if copied:
            copied.pop()
        previous = current + 4
    copied.extend(s[previous:])
    return ''.join(copied)

Its worst-case scenario (all |BS|) is only about 2.5× slower than naive(), but as the ratio of characters to |BS| increases, it becomes much faster.
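A quick standalone sanity check of the chunk-skipping approach (the function body is repeated here so the snippet runs on its own; the worst-case string of nothing but |BS| markers is my own choice):

```python
def lessnaive(s):
    # Skip from one '|BS|' to the next with str.find() instead of
    # examining every character.
    copied = []
    previous = 0
    while True:
        current = s.find('|BS|', previous)
        if current < 0:
            break
        copied.extend(s[previous:current])   # chunk before the match
        if copied:
            copied.pop()                     # char just before '|BS|'
        previous = current + 4               # jump past '|BS|'
    copied.extend(s[previous:])
    return ''.join(copied)

print(lessnaive('it |BS||BS||BS|this is one|BS||BS||BS|an example'))
# -> 'this is an example', the same result as naive()
print(lessnaive('|BS|' * 1000))
# -> '': the worst case, where every find() hits immediately
```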
(Feb-23-2017, 10:46 PM)Ofnuts Wrote: Well... of course, because the mission is to delete a character in front of a |BS|. The specs are a bit vague about what should happen when there is no character in front of the BS. This behavior is coherent with not erasing characters from another |BS|.

I think that the specs are quite vague. But if the task was to emulate the backspace on a keyboard, then even with an empty buffer the backspace is consumed (usually). And as it's processed sequentially, there is never a previous |BS|. Your implementation of less_naive is exactly the improvement I mentioned before - find the first |BS|, copy a chunk of chars without the last one, find the next one, and so on. For some reason (now I see a groundless one) I thought that it would need a little more effort to get right, and as naive was fast enough, I didn't bother to try it.