Python Forum

regex findall() returning weird result
EDIT: You can likely just skip to the "update" at the bottom of this post.

What I am trying to do is test out regular expression usage to find a phone number in a string (accounting for multiple written formats). Here is my program:
import re

phoneNumRegex = re.compile(r'(\+\d{1,3}( )?)?(\d\d\d|\(\d\d\d\))(-| )\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('my numbers are +1 (515) 444-4446, 333-234-8655.')
moo = phoneNumRegex.findall('my numbers are +1 (515) 444-4446, 333-234-8655.')
print(mo.group())
print(moo)
Output:
+1 (515) 444-4446
[('+1 ', ' ', '(515)', ' '), ('', '', '333', '-')]
So search() is working perfectly. It finds and returns the phone number. But findall() is far from the desired result. Why is it doing this? I would expect it to have the same behaviour, but this time return both phone numbers in a list.

When I greatly simplify the expression passed to re.compile(), findall() seems to work:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
print(phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000'))
Output:
[('415', '555', '9999'), ('212', '555', '0000')]
So what is happening here? What is the difference? I must be misunderstanding how findall() actually works.
____________________________________________________

UPDATE: I've been looking through some threads of people having a similar issue on this forum, and one of the fixes was to put brackets around the parts you actually want to keep. So here is my updated compile line:
phoneNumRegex = re.compile(r'(\+\d{1,3} ?)?(\d\d\d|\(\d\d\d\))(-| )(\d\d\d)-(\d\d\d\d)')
Output:
[('+1 ', '(515)', ' ', '444', '4446'), ('', '333', '-', '234', '8655')]
So it does capture the ending section of the phone number now, and I've eliminated a few of the returned dashes and spaces. But I still have some extra junk being returned, and I'm not sure how to alter my expression to be able to handle all those exceptions and variability in what it might encounter without putting those sections into brackets...

The sections I'm still trying to eliminate from the output are:
- the (-| ) section
- the brackets around the phone number's area code
- the space after +1
Quote:So search() is working perfectly. It finds and returns the phone number. But findall() is far from the desired result. Why is it doing this?
When you use (), you create a capturing group, and findall() then returns only the captured groups instead of the whole match.
>>> s = 'Red car 99' 
>>> re.findall(r'\w.*\s\d{2}', s)
['Red car 99']
# Add a group
>>> re.findall(r'\w.*\s(\d{2})', s)
['99']
So when you add a group (), findall() returns only the text of that group. It is the same text you get as group 1 with re.search().
>>> r = re.search(r'\w.*\s(\d{2})', s)
>>> r.group(0)
'Red car 99'
>>> r.group(1)
'99'
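If you want to keep the grouping for the optional and alternative parts but still get whole matches back from findall(), one option is to make every group non-capturing with (?:...). A minimal sketch based on the pattern from the first post; the name phone_re and the [- ] character class (standing in for (-| )) are my own choices, not something from the original code:

```python
import re

# Same structure as the pattern in the first post, but every group is
# non-capturing, so findall() returns the full matched text.
phone_re = re.compile(r'(?:\+\d{1,3} ?)?(?:\d{3}|\(\d{3}\))[- ]\d{3}-\d{4}')

text = 'my numbers are +1 (515) 444-4446, 333-234-8655.'
numbers = phone_re.findall(text)
print(numbers)  # ['+1 (515) 444-4446', '333-234-8655']
```

With no capturing groups in the pattern, findall() returns a flat list of strings instead of a list of tuples.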
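I would write it like this if you need to match the whole phone number text. Note that this pattern returns the comma-separated pair of numbers on each line as a single match.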
import re

phone_numbers = [
    "123 numbers are +47 (515) 444-4446, 333-234-8655",
    "my 1277 numbers are +1 (515) 444-4446, 987-654-3210",
    "my numbers are +452 +8 (22) 444-4446, 555-888-7777"
]

pattern = r'\+[\d\s()-]+,\s?[\d-]+'
for phone_number in phone_numbers:
    matches = re.findall(pattern, phone_number)
    for match in matches:
        print(match)
Output:
+47 (515) 444-4446, 333-234-8655
+1 (515) 444-4446, 987-654-3210
+8 (22) 444-4446, 555-888-7777
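For the leftovers listed in the first post (the separator, the brackets around the area code, and the space after +1), another option is to capture only the digits and leave everything else in non-capturing groups. This is just a sketch under my own assumptions: the group names (country, area_paren, area_plain, prefix, line) and the helper extract_numbers() are made up for illustration, and it uses finditer() rather than findall() so the named groups can be read from each match object.

```python
import re

phone_re = re.compile(
    r'(?:\+(?P<country>\d{1,3}) ?)?'                     # optional country code; "+" and space not captured
    r'(?:\((?P<area_paren>\d{3})\)|(?P<area_plain>\d{3}))'  # area code with or without brackets
    r'[- ]'                                              # separator, not captured
    r'(?P<prefix>\d{3})-(?P<line>\d{4})'
)

def extract_numbers(text):
    """Return (country, area, prefix, line) tuples of digits only (hypothetical helper)."""
    results = []
    for m in phone_re.finditer(text):
        # Only one of the two area-code groups takes part in a given match.
        area = m.group('area_paren') or m.group('area_plain')
        # A group that did not take part in the match is None.
        country = m.group('country') or ''
        results.append((country, area, m.group('prefix'), m.group('line')))
    return results

parts = extract_numbers('my numbers are +1 (515) 444-4446, 333-234-8655.')
print(parts)  # [('1', '515', '444', '4446'), ('', '333', '234', '8655')]
```

If you also need the text exactly as written, m.group(0) inside the loop gives the full match.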