Help understanding RegEx logic/output

pyNewbee · Nov-14-2020, 06:45 PM

Hi All,

I'm (brand) new to Python (self-learning using books/websites), so please forgive me if I'm asking real dumb/silly questions.

I'm currently at the stage where I'm learning Regular Expressions, and whilst I've understood a lot of what the book covers, I am stumped with how the following works (to output the kind of results it does).

import re
reo_agent_names = re.compile(r'Agent (\w)\w*')
mo = reo_agent_names.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
print(mo)

Output:
A**** told C**** that E**** knew B**** was a double agent.

So... what I don't understand from the code and output above is how the first letter of each agent's name is retained, and four asterisks are appended to that first character.

In other words, I can't quite figure out how the regex definition of .compile(r'Agent (\w)\w*') works with the .sub(r'\1****', '....') to give the output shown above.

My comprehension would say that the result/output should contain Alice****, Carol**** etc....and not A****, C**** etc.

What am I missing/not understanding? I believe my confusion/lack of understanding lies in how the .sub(r'\1****') works.

In addition to the above, another thing I don't quite understand is the difference between the outputs of the .search() method and the .findall() method.

import re
reo_agent_names = re.compile(r'Agent (\w)\w*')
print(reo_agent_names.search('Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.').group())
print(reo_agent_names.findall('Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.'))

Using the code above, the output for the .search() method is 'Agent Alice', whereas the output for the .findall() method is ['A', 'C', 'E', 'B']. Huh?

How/why did .search() not return just 'A', like .findall() did for the first agent's name?

If someone could explain these two mysteries to me as succinctly as possible, that would be super awesome.

Thanks.

bowlofred · (This post was last modified: Nov-14-2020, 07:00 PM by bowlofred.)

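\w matches a single word character (a letter, digit, or underscore), not a whole word, so the parenthesized (\w) captures just one character.

Any remaining characters are matched by the \w* expression. The asterisk allows \w to match zero or more times. Because that part is outside the parentheses, those characters are matched (and so replaced by sub along with the rest of the match) but not captured, which is why \1 puts back only the first letter.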
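
As a rough sketch of the above (same sentence as in your post; the variable names are just mine for illustration), compare your pattern with one where the parentheses wrap the whole name:

```python
import re

text = ('Agent Alice told Agent Carol that Agent Eve knew '
        'Agent Bob was a double agent.')

# (\w) captures one character; \w* matches the rest of the name without
# capturing it. sub() replaces the whole match ("Agent Alice") and \1 puts
# back only the captured character ("A").
first_letter_only = re.sub(r'Agent (\w)\w*', r'\1****', text)
print(first_letter_only)

# With the parentheses around \w+ the entire name is captured, which gives
# the result you were expecting.
whole_name = re.sub(r'Agent (\w+)', r'\1****', text)
print(whole_name)
```

The lowercase "agent" at the end of the sentence is left alone in both cases, since the pattern is case-sensitive and requires a space after "Agent".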

bowlofred · Nov-14-2020, 07:16 PM

search vs findall.

search returns a Match object. One of the methods of that object is group(). With no arguments (like you're using), it's the same as group(0), and means the entire match. To pull just the text from the first parenthesized group, you need group(1).
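
A small sketch of the difference, using the pattern from your post (the variable names are only for illustration):

```python
import re

pattern = re.compile(r'Agent (\w)\w*')
text = ('Agent Alice told Agent Carol that Agent Eve knew '
        'Agent Bob was a double agent.')

m = pattern.search(text)   # Match object for the first match only

full = m.group()           # no argument: the entire match
same = m.group(0)          # identical to m.group()
cap = m.group(1)           # only what the first (...) captured
all_caps = m.groups()      # tuple with every captured group

print(full, same, cap, all_caps)
```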

findall is different: it returns a list of the full matches if the pattern has no capture group,

>>> re.findall(r"\w*", "The quick brown")
['The', '', 'quick', '', 'brown', '']

a list of the captured strings if the pattern has one capture group (this is your example),

>>> re.findall(r"(\w)\w*", "The quick brown")
['T', 'q', 'b']

or a list of tuples of captures if the pattern has multiple capture groups.

>>> re.findall(r"(\w)(\w)\w*", "The quick brown")
[('T', 'h'), ('q', 'u'), ('b', 'r')]

If you'd prefer to process the match object directly (like you do with search), use finditer instead.
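
For instance, something along these lines (a sketch with your pattern; the names are just illustrative) gives you both the full match and the capture for every agent:

```python
import re

pattern = re.compile(r'Agent (\w)\w*')
text = ('Agent Alice told Agent Carol that Agent Eve knew '
        'Agent Bob was a double agent.')

# finditer yields one Match object per match, so group(0) and group(1)
# are both available for each agent, not just the first one.
pairs = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]

for whole, letter in pairs:
    print(whole, '->', letter)
```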

pyNewbee · Nov-15-2020, 02:01 AM

(Nov-14-2020, 07:00 PM)bowlofred Wrote: \w matches a single character (only), so the parentheses (\w) captures a single character.

Ah, okay...now that makes things a lot clearer!

I think the book that I'm going through mentions that the \w matches a whole word...and that's what threw me off.

Thanks a bunch, much appreciated!

pyNewbee · Nov-15-2020, 02:21 AM

(Nov-14-2020, 07:16 PM)bowlofred Wrote: search vs findall.

Oh wow, excellent explanation and examples!

So clear and easy to understand.

Thanks a million
