Different Output of findall and search in re module - shiva - Mar-12-2018

import re

link = '<a href="http://www.google.com">Google</a>'
re.search('<a[^>]+href=["\'](.*?)["\']', link, re.IGNORECASE).group()
This code gives the output '<a href="http://www.google.com"'

re.findall('<a[^>]+href=["\'](.*?)["\']',link,re.IGNORECASE)
But this code gives the output ['http://www.google.com']

Why do the two outputs differ? I expected findall() to work like search(), except that findall() returns a list of all matches while search() returns only the first one.


RE: Different Output of findall and search in re module - snippsat - Mar-12-2018

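When the pattern contains exactly one capturing group, re.findall returns a list of what that group captured, in this case what's inside group(1) --> (.*?).
re.search returns a match object for the first match only. Calling .group() with no argument returns the whole match (group 0), which is why you got the full '<a href=...' text; call .group(1) to get only what's inside (.*?).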
import re

link = '''\
 <a href="http://www.google.com">Google</a>
 <a href="https://www.microsoft.com">Microsoft</a>'''

print(re.search('<a[^>]+href=["\'](.*?)["\']',link,re.IGNORECASE).group(1))
print('--------------')
print(re.findall('<a[^>]+href=["\'](.*?)["\']',link,re.IGNORECASE))
Output:
http://www.google.com
--------------
['http://www.google.com', 'https://www.microsoft.com']
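To make the difference concrete, here is a minimal sketch using the same pattern. The variable names (pattern, whole_first, non_capturing and so on) are illustrative and not from the original posts. It shows that group() is the whole match, group(1) is the captured group, and that findall returns the group contents when the pattern has exactly one capturing group. If you want the whole match for every hit, one option is to make the group non-capturing with (?:...), another is to use re.finditer and call group(0) on each match object.

```python
import re

link = '''\
<a href="http://www.google.com">Google</a>
<a href="https://www.microsoft.com">Microsoft</a>'''

# Same pattern as in the thread, written as a raw string.
pattern = r'<a[^>]+href=["\'](.*?)["\']'

m = re.search(pattern, link, re.IGNORECASE)
whole_first = m.group()   # same as m.group(0): the entire matched text
url_first = m.group(1)    # only what the first capturing group matched

# One capturing group: findall returns the group contents, not the whole match.
urls = re.findall(pattern, link, re.IGNORECASE)

# No capturing group: findall returns the whole match for each hit.
non_capturing = r'<a[^>]+href=["\'](?:.*?)["\']'
whole_all = re.findall(non_capturing, link, re.IGNORECASE)

# finditer yields match objects, so both views are available per hit.
whole_iter = [mo.group(0) for mo in re.finditer(pattern, link, re.IGNORECASE)]

print(whole_first)
print(url_first)
print(urls)
print(whole_all)
```

So search() and findall() do find the same matches here; they only differ in which part of each match they hand back.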
Both solutions can be seen as the wrong approach, because HTML should not be parsed with regex; read "You can't parse [X]HTML with regex". Use an HTML parser instead:
from bs4 import BeautifulSoup

link = '''\
 <a href="http://www.google.com">Google</a>
 <a href="https://www.microsoft.com">Microsoft</a>'''

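# The 'lxml' parser needs the lxml package installed; 'html.parser' is the built-in alternative.
soup = BeautifulSoup(link, 'lxml')
for a in soup.find_all('a'):  # separate name so the loop does not overwrite link
    print(a.get('href'))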
Output:
http://www.google.com
https://www.microsoft.com
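If installing BeautifulSoup and lxml is not an option, the same idea can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not what the post above uses, and the class name HrefCollector is made up for this example. It collects the href attribute of every <a> start tag.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: gathers href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # tag is lowercased by the parser; attrs is a list of (name, value) pairs.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


html = '''\
<a href="http://www.google.com">Google</a>
<a href="https://www.microsoft.com">Microsoft</a>'''

collector = HrefCollector()
collector.feed(html)
collector.close()
hrefs = collector.hrefs

for href in hrefs:
    print(href)
```

For anything beyond simple attribute extraction, BeautifulSoup as shown above is usually the more convenient choice.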