Python Forum
Help understanding RegEx logic/output
#1
Hi All,

I'm (brand) new to Python (self-learning using books/websites), so please forgive me if I'm asking real dumb/silly questions.

I'm currently at the stage where I'm learning Regular Expressions, and whilst I've understood a lot of what the book covers, I am stumped with how the following works (to output the kind of results it does).

import re
reo_agent_names = re.compile(r'Agent (\w)\w*')
mo = reo_agent_names.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
print(mo)
Output:
A**** told C**** that E**** knew B**** was a double agent.
So...what I don't understand from the code and output above is how the first letter of each agent's name is retained, with four asterisks appended to that first character.

In other words, I can't quite figure out how the regex definition of .compile(r'Agent (\w)\w*') works with the .sub(r'\1****', '....') to give the output shown above.

My comprehension would say that the result/output should contain Alice****, Carol**** etc....and not A****, C**** etc.

What am I missing/not understanding? I believe my confusion/lack of understanding lies in how the .sub(r'\1****') works

In addition to the above, another thing I don't quite understand is the difference between the outputs of the .search() method and the .findall() method.

import re
reo_agent_names = re.compile(r'Agent (\w)\w*')
print(reo_agent_names.search('Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.').group())
print(reo_agent_names.findall('Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.'))
Using the code above, the output for the .search() method is 'Agent Alice', whereas the output for the .findall() method is ['A', 'C', 'E', 'B']. How/why did .search() not return just 'A', like .findall() did for the first agent's name?

If someone could explain these two mysteries to me as succinctly as possible, that would be super awesome.

Thanks.
#2
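\w matches a single word character (a letter, digit, or underscore), not a whole word, so the parenthesized (\w) captures only one character.

Any remaining characters of the name are matched by \w*, where the asterisk means "zero or more times". Because \w* sits outside the parentheses, those characters are matched but not captured. sub() then replaces the entire match (for example 'Agent Alice') with the replacement string, in which \1 stands for the captured first letter, giving 'A****'.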
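
A minimal sketch of the point above, assuming the same pattern as the original post and a shortened sentence. The second pattern (full_name) is a hypothetical variant added only for contrast; it shows what the output would look like if the whole name were captured.

```python
import re

text = 'Agent Alice told Agent Carol.'

# Group 1 is only the first letter; \w* matches the rest without capturing it.
first_letter = re.compile(r'Agent (\w)\w*')
# sub() replaces the WHOLE match ("Agent Alice"), not just the group.
print(first_letter.sub(r'\1****', text))  # A**** told C****.

# Moving the repetition inside the parentheses captures the full name.
full_name = re.compile(r'Agent (\w+)')
print(full_name.sub(r'\1****', text))  # Alice**** told Carol****.
```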
#3
search vs findall.

search returns a Match object (or None if nothing matches). One of the methods of that object is group(). With no arguments (like you're using), it's the same as group(0) and returns the entire match. To pull just the text from the first pair of parentheses, you need group(1).
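
As a small illustration, assuming the pattern from the original post and a shortened sentence, the difference between the group() calls looks roughly like this:

```python
import re

pattern = re.compile(r'Agent (\w)\w*')
m = pattern.search('Agent Alice told Agent Carol.')

print(m.group())    # Agent Alice  (same as group(0): the entire match)
print(m.group(1))   # A            (first capture group only)
print(m.groups())   # ('A',)       (all capture groups as a tuple)
```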

findall is different. If the pattern has no capture groups, it returns a list of the full matches (the empty strings are zero-length matches, since \w* can also match nothing):
>>> re.findall(r"\w*", "The quick brown")
['The', '', 'quick', '', 'brown', '']

If the pattern has one capture group, it returns a list of what that group captured (this is your example):
>>> re.findall(r"(\w)\w*", "The quick brown")
['T', 'q', 'b']
If the pattern has several capture groups, it returns a list of tuples, one tuple per match:
>>> re.findall(r"(\w)(\w)\w*", "The quick brown")
[('T', 'h'), ('q', 'u'), ('b', 'r')]
If you'd prefer to process the match object directly (like you do with search), use finditer instead.
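
A rough sketch of the finditer approach, assuming the pattern from the original post and a shortened sentence. Each item it yields is a Match object, so group(0), group(1), and span() are all available per match.

```python
import re

pattern = re.compile(r'Agent (\w)\w*')
text = 'Agent Alice told Agent Carol.'

# finditer yields one Match object per non-overlapping match, left to right.
matches = list(pattern.finditer(text))
for m in matches:
    print(m.group(0), m.group(1), m.span())
# Agent Alice A (0, 11)
# Agent Carol C (17, 28)
```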
#4
(Nov-14-2020, 07:00 PM)bowlofred Wrote: \w matches a single character (only), so the parentheses (\w) captures a single character.

Ah, okay...now that makes things a lot clearer!

I think the book that I'm going through mentions that the \w matches a whole word...and that's what threw me off.

Thanks a bunch, much appreciated!
#5
(Nov-14-2020, 07:16 PM)bowlofred Wrote: search vs findall.

Oh wow, excellent explanation and examples!

So clear and easy to understand.

Thanks a million