Python Forum
RE output clarity
#1
Hi

I'd like to know why the output is only ['a'] instead of ['a', 'b', 'a'].
I thought that "|" would match either a or b.
"+" means one or more, so I thought the pattern would match
all of "aba" as a result.

>>> import re
>>> re.findall(r'(a|b)+', 'aba')
['a']
I am using Python 3.6.5.
#2
(a|b)+ is a repeated capturing group: the + repeats the group, and a repeated group only keeps its last iteration.
So findall returns the last a.
You can remove the +, and the () as well.
>>> import re
>>> 
>>> re.findall(r'(a|b)+', 'aba')
['a']
>>> re.findall(r'(a|b)', 'aba')
['a', 'b', 'a']
>>> re.findall(r'a|b', 'aba')
['a', 'b', 'a']
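If you want a and b each in their own group, you can nest the groups
and inspect them with re.search(). Each group keeps only the last text it matched.
Note that when the pattern contains groups, re.findall() returns the captured groups rather than the whole match.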
>>> import re
>>> 
>>> r =  re.search(r'((a)|(b))+', 'aba')
>>> r.group(1)
'a'
>>> r.group(2)
'a'
>>> r.group(3)
'b'
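To see the whole match next to what each group kept, something like the following might help. It is only a sketch; the variable names are made up for illustration.

```python
import re

m = re.search(r'((a)|(b))+', 'aba')

# group(0) is always the entire match, regardless of capturing groups.
whole = m.group(0)

# groups() holds the last text each numbered group captured.
# Group 3 still holds 'b' from the second repetition, even though
# the final repetition went through group 2.
groups = m.groups()

print(whole)   # aba
print(groups)  # ('a', 'a', 'b')
```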
#3
I think the trick is in the findall definition:
Output:
Return a list of all non-overlapping matches in the string.
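findall scans left to right and the + is greedy, so the RE '(a|b)+' applied to the string 'aba' finds a single match:
- Starting at char 0, the group matches 'a', then 'b', then 'a', so the whole match is 'aba'.
- Each repetition overwrites what the group captured, so after the last repetition the group holds 'a'.
- The match has consumed the whole string, so there is nothing left to search.
When the pattern contains a capturing group, findall returns the group's contents instead of the whole match. So it is not returning the first 'a' but the last. You can check this with re.findall(r'(a|b|c)+', 'abc'), which returns ['c'].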
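One way to check how many matches the engine actually finds is re.finditer, which yields match objects instead of strings. A minimal sketch (the variable names are just for illustration):

```python
import re

matches = list(re.finditer(r'(a|b)+', 'aba'))

# Where each match starts and ends in the string.
spans = [m.span() for m in matches]

# The full text of each match (group 0).
whole = [m.group(0) for m in matches]

# What the capturing group held when each match finished.
last = [m.group(1) for m in matches]

print(spans)  # [(0, 3)]
print(whole)  # ['aba']
print(last)   # ['a']
```

There is a single match covering the whole string, and findall reports only the group from it.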

One way to obtain what you expect is to make the quantifier non-greedy, so each match is a single character, or to match single characters directly:
>>> re.findall(r'(a|b)+?', 'aba')
['a', 'b', 'a']
>>> re.findall(r'[ab]', 'aba')
['a', 'b', 'a']
>>> re.findall(r'(a|b)', 'aba')
['a', 'b', 'a']
If what you really want is to match the full string, then you need an expression with no capturing groups, so that findall returns the whole match, which is as long as possible:
>>> re.findall(r'(?:a|b)+', 'aba')
['aba']
>>> re.findall(r'[ab]+', 'aba')
['aba']
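As a rough summary, the shape of what findall returns depends on how many capturing groups the pattern has. The patterns below are only illustrative variations on the ones in this thread:

```python
import re

text = 'aba'

# No capturing group: findall returns the whole matches.
no_group = re.findall(r'(?:a|b)+', text)

# One capturing group: findall returns that group's contents,
# which for a repeated group is its last iteration.
one_group = re.findall(r'(a|b)+', text)

# Two capturing groups: findall returns a tuple per match.
# The outer group holds the whole run, the inner one the last iteration.
two_groups = re.findall(r'((a|b)+)', text)

print(no_group)    # ['aba']
print(one_group)   # ['a']
print(two_groups)  # [('aba', 'a')]
```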
#4
Thanks for taking the time to explain. Much appreciated.