RE output clarity - Python Forum (https://python-forum.io), Forum: Homework, Thread: /thread-10843.html
RE output clarity - bluefrog - Jun-09-2018

Hi, I'd like to know why the output is only ['a'] instead of ['a', 'b', 'a']. I thought that "|" would match either a or b. Since "+" means one or more, I thought it would attempt to match all of "aba" as a result.

>>> import re
>>> re.findall(r'(a|b)+', 'aba')
['a']

I am using 3.6.5.

RE: RE output clarity - snippsat - Jun-09-2018

(a|b)+ is a repeated capturing group, and a repeated capturing group only captures its last iteration. So it returns the last a. You can remove the + and also the ().

>>> import re
>>>
>>> re.findall(r'(a|b)+', 'aba')
['a']
>>> re.findall(r'(a|b)', 'aba')
['a', 'b', 'a']
>>> re.findall(r'a|b', 'aba')
['a', 'b', 'a']

If you are thinking of putting a and b each in their own group, then do it like this, with re.search() and capture groups. When a pattern contains groups, re.findall() just returns the captured groups.

>>> import re
>>>
>>> r = re.search(r'((a)|(b))+', 'aba')
>>> r.group(1)
'a'
>>> r.group(2)
'a'
>>> r.group(3)
'b'

RE: RE output clarity - killerrex - Jun-09-2018

I think the trick is in the findall definition: when the pattern contains a group, findall returns the group instead of the whole match. The RE '(a|b)+' applied over the string 'aba' matches 'aba' as a whole, and the group is overwritten on each repetition:
- At char 0: the group captures 'a'
- At char 1: the group captures 'b', so the previous capture is discarded
- At char 2: the group captures 'a', and the previous capture is discarded

So it is not returning the first 'a' but the last. You can check it with re.findall(r'(a|b|c)+', 'abc'), which returns ['c'].

One way to obtain what you expect is to change the RE from greedy to non-greedy, or to use single-char matches:

>>> re.findall(r'(a|b)+?', 'aba')
['a', 'b', 'a']
>>> re.findall(r'[ab]', 'aba')
['a', 'b', 'a']
>>> re.findall(r'(a|b)', 'aba')
['a', 'b', 'a']

If what you really want is to match the full string, then you need an expression that does not create capturing groups, so findall returns the whole match, which is the longest possible here:

>>> re.findall(r'(?:a|b)+', 'aba')
['aba']
>>> re.findall(r'[ab]+', 'aba')
['aba']

RE: RE output clarity - bluefrog - Jun-10-2018

Thanks for taking time to explain. Much appreciated.
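
To see both the whole match and the group side by side, a small sketch with re.finditer() may help. It is only an illustration of the behaviour discussed above; the names pattern and matches are made up for this example and do not appear in the thread.

```python
import re

# Same pattern as in the question: one capturing group, repeated.
pattern = re.compile(r'(a|b)+')

# finditer() yields match objects, so the whole match (group 0) and the
# capturing group (group 1) can be inspected together.
matches = [(m.group(0), m.group(1), m.span()) for m in pattern.finditer('aba')]

for whole, last_capture, span in matches:
    print('whole match:', whole)          # the text the full pattern matched
    print('group 1:', last_capture)       # only the last repetition of the group
    print('span:', span)

# findall() returns group 1 because the pattern has exactly one capturing group.
with_group = re.findall(r'(a|b)+', 'aba')

# With a non-capturing group there is nothing to return but the whole match.
without_group = re.findall(r'(?:a|b)+', 'aba')
```

There is a single match covering the whole string, and findall() reports only what is left in the group after its final repetition.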