Python Forum
regex question
#1
Hi,
I'm not a regex hero, but I read everywhere that it's fast and that I should look into it.
I need a particular substitution that looks very much like my example below.
So I put together a primitive test: regex.sub(...) against string.replace(...).
It's a no-brainer: string.replace() wins, hands in pockets. Unless, of course, I'm missing something.
Any ideas?

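import regex as re  # third-party "regex" package, imported under the name re
import datetime

text = 'enristoda#' * 50  # renamed from "str" so the built-in type is not shadowed
lst = []
lst2 = []
for _ in range(1_000_000):
    lst.append(text)
    lst2.append(text)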
print('start: ', datetime.datetime.now())
for idx, item in enumerate(lst):
    item = re.sub('e','%',item)
    item = re.sub('n','-',item)
    item = re.sub('a','$',item)
    lst[idx] = item   
print(lst[0])
print('end regex: ', datetime.datetime.now())

print('start: ', datetime.datetime.now())
for idx, item in enumerate(lst2):
    x = item.replace('e','%')
    y = x.replace('n','-')
    z = y.replace('a','$')
    lst2[idx] = z   
print(lst2[0])
print('end replace: ', datetime.datetime.now())
(Edit: made 2 different lists for replace and regex, no change though)
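For single-character substitutions like the three above, a translation table may be worth timing as well, since it handles all of them in one pass over the string. A minimal sketch using only the standard library; the names `table` and `sample` are made up for illustration, and no timing result is claimed here:

```python
# One table maps every source character to its replacement.
table = str.maketrans({'e': '%', 'n': '-', 'a': '$'})

sample = 'enristoda#' * 3
translated = sample.translate(table)

# Cross-check against the chained str.replace() calls from the test above.
chained = sample.replace('e', '%').replace('n', '-').replace('a', '$')
print(translated)
print(translated == chained)
```

Whether this beats the chained replace() calls on a given data set is something to measure rather than assume.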
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
#2
(Jun-18-2022, 09:20 AM)DPaul Wrote: It's a no brainer, string.replace() wins, hands in pockets. Unless of course I'm missing something
Any ideas ?
Yes, string.replace() is faster than re.sub(), so the rule is: if you can use string.replace(), do that.
Try to do this task with string.replace() 😵
import re

target_str = "Jessa  Knows Testing    And Machine     Learning \t \n"
res_str = re.sub(r"\s+", " ", target_str)
print(res_str)
Output:
Jessa Knows Testing And Machine Learning
So regex has a lot more power when it comes to more complex tasks.

Python has its own timeit module, so you don't need to write your own loop and use datetime.
Also, using re.compile() makes regex faster, but it is still slower than replace.
import timeit

re_test = '''\
import re

s = 'hello 123'
pattern = re.compile(r'\d+')
result_re = pattern.sub('789', s)'''

str_test = '''\
s = 'hello 123'
result_str = s.replace('123', '789')'''

print(timeit.Timer(stmt=re_test).timeit(number=1000000))
print(timeit.Timer(stmt=str_test).timeit(number=1000000))
Output:
1.5085931999783497
0.13342030000058003
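Note that in the test above the import and the re.compile() call sit inside the timed statement. A sketch of a variant that compiles the pattern once, outside the timed code, so that only the substitution itself is measured. The function names are made up for illustration, the call overhead of the wrappers is included in both measurements, and the numbers depend on the machine, so none are shown:

```python
import re
import timeit

s = 'hello 123'
pattern = re.compile(r'\d+')  # compiled once, outside the timed code

def with_regex():
    return pattern.sub('789', s)

def with_replace():
    return s.replace('123', '789')

# Both approaches must agree before their timings mean anything.
assert with_regex() == with_replace() == 'hello 789'

n = 100_000
print('regex:  ', timeit.timeit(with_regex, number=n))
print('replace:', timeit.timeit(with_replace, number=n))
```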
#3
OK, snippsat, thanks.
String.replace() is not "just" faster, it is 45 times faster than regex.

Are you challenging me to beat regex in doing these replacements?
target_str = "Jessa  Knows Testing    And Machine     Learning \t \n"

Looks like a Sunday morning job.
Paul
#4
(Jun-18-2022, 02:24 PM)DPaul Wrote: String.replace() is not "just" faster, it is 45 times faster than regex.
In most cases it does not matter at all; if you run it once, you will never feel the difference.
(Jun-18-2022, 02:24 PM)DPaul Wrote: Are you challenging me to beat regex in doing these replacements?
You can try. I did a test just now and it did not look good with replace (it needs many calls).
Then a surprise: you need to do this with more strings.
As you can see, regex has no problem, as the pattern will work for many different scenarios;
with replace you have to start from scratch on the new string.
import re

#target_str = "Jessa  Knows Testing    And Machine     Learning \t \n"
target_str = "Paul  Knows Testing  And   Machine Learning\n\t  \n"
res_str = re.sub(r"\s+", " ", target_str)
print(res_str)
Output:
Paul Knows Testing And Machine Learning
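To illustrate the point about one pattern covering many strings, here is a small sketch that compiles the pattern once and applies it to both sample strings from this thread. The names `whitespace_run` and `samples` are made up for illustration; repr() is used so the trailing space in each result is visible:

```python
import re

# Compile once; the same pattern object is reused for every string.
whitespace_run = re.compile(r"\s+")

samples = [
    "Jessa  Knows Testing    And Machine     Learning \t \n",
    "Paul  Knows Testing  And   Machine Learning\n\t  \n",
]

# Every run of whitespace, whatever its mix of characters, becomes one space.
cleaned = [whitespace_run.sub(" ", text) for text in samples]
for line in cleaned:
    print(repr(line))
```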
#5
Dear snippsat,

I need to do this for hundreds of thousands of records. Time is money !

Sunday morning, as I said.
Paul
#6
@snippsat: surely nobody in his right mind would do this with string.replace.
Below is your example, which I made a little larger to make it a bit more of a challenge.
My solution is still a good 5 times faster than regex, and it is also one line of simple code.
I must admit I have learned a couple of things from these proceedings (timeit, for one).
It would seem that:
- Not only is regex slow, I have also noticed that it gets even slower as the string gets longer.
- The image of a museum dinosaur comes to mind. All complicated bones, no meat.
- As for your argument that in small samples you don't notice the difference: what is its target audience then?
Paul
import datetime
import re

target_str = "Paul  Knows Testing  And   Machine Learning\n\t  \n" * 10
targets = []
targets2 = []

for _ in range(1_000_000):
    targets.append(target_str)
    targets2.append(target_str)
    
print('*' * 25)
print('start: ', datetime.datetime.now())
for idx, t in enumerate(targets):
    targets[idx] = re.sub(r"\s+", " ", t)
print(targets[0])
print('end regex: ', datetime.datetime.now())

print('*' * 25)  
print('start: ', datetime.datetime.now())
for idx, t in enumerate(targets2):
    targets2[idx] = ' '.join(x for x in t.strip().split())

print(targets2[0])
print('End: ', datetime.datetime.now())
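One detail to keep in mind when comparing the two approaches: they do not return exactly the same string. A short sketch on a smaller input (the variable names are made up for illustration) shows the difference:

```python
import re

text = "Paul  Knows Testing  And   Machine Learning\n\t  \n" * 2

via_regex = re.sub(r"\s+", " ", text)
via_split = " ".join(text.split())

print(repr(via_regex))
print(repr(via_split))
# re.sub turns a leading or trailing run of whitespace into a single space,
# while split()/join drops it entirely.
print(via_regex.strip() == via_split)
```

If the trailing space matters for the records being processed, the regex result needs an extra strip() before the outputs can be treated as interchangeable.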
#7
Epilog.
It is very important for me to reduce processing time.
Even 10 or 20% would save hours.
Some time ago I invested in a hardcopy regex book, but the effort/performance ratio leaves something to be desired.
(I can understand that these make ideal exam questions.)
In all fairness, there is one exception, a statement I use a lot: replacing everything except uppercase letters and digits in a string.
I cannot beat:
newstr = re.sub("[^A-Z0-9]", " ", oldstr)
On a million transactions it is marginally faster than any other replace I tried.
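For anyone who wants to repeat that comparison, here is a sketch of the statement with a precompiled pattern, next to a plain-Python version that can serve as a correctness check. The function names and the sample string are made up for illustration, and no claim is made about which one is faster:

```python
import re

# Matches anything that is not an uppercase ASCII letter or a digit.
not_upper_or_digit = re.compile(r"[^A-Z0-9]")

def keep_upper_and_digits(oldstr):
    """Replace every character outside A-Z and 0-9 with a space."""
    return not_upper_or_digit.sub(" ", oldstr)

KEEP = frozenset("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")

def keep_upper_and_digits_plain(oldstr):
    """Same result without regex, character by character."""
    return "".join(ch if ch in KEEP else " " for ch in oldstr)

sample = "Order-42b: ACME Corp."
print(repr(keep_upper_and_digits(sample)))
print(keep_upper_and_digits(sample) == keep_upper_and_digits_plain(sample))
```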
The jury is still out.
Paul