Python Forum

Creating new list based on exact regex match in original list
I've searched pretty well for a preexisting topic but I can't find anything. I know the answer must be out there, I just don't think I perfectly understand what it is I'm trying to achieve. Learning regex has been a rocky road for me, and although I have the basics mostly down, I can't figure out how to achieve this particular goal.

I am trying to use a list comprehension to find a regex match for each item in firstList. For each item, the exact text matched by the regex should be written to secondList. If an item has no match, it should not be written to secondList. I also want the list comprehension to strip the path following the domain name before writing to secondList (e.g. "https://gmail.com/test123" at firstList[1] should be written to secondList as "https://gmail.com/").

import re

regex = re.compile(r'^http[s]?:\/?\/?([^:\/\s]+)/')
firstList = ['http:google.com/test', 'https://gmail.com/test123', 'http://youtube.com/watch', 'notaurl', '/home/images']
secondList = [i for i in firstList if regex.match(i)]
print(firstList)
print(secondList)
Output:
['http:google.com/test', 'https://gmail.com/test123', 'http://youtube.com/watch', 'notaurl', '/home/images']
['http:google.com/test', 'https://gmail.com/test123', 'http://youtube.com/watch']
As desired, my list comprehension is eliminating the items that do not have URL components, but it is still including the path following the domain. Why is this? If I use
print(re.match(regex, firstList[1]))
the output shows that the match is only https://gmail.com/:
Output:
<re.Match object; span=(0, 18), match='https://gmail.com/'>
I understand that my list comprehension adds an item to secondList if there is any regex match at all, but how do I get it to write the matched text, as seen in the re.match output, to secondList instead of the entire item that has a match?

I don't need anyone to post the solution code, I just need help wrapping my head around this concept of regex I'm clearly misunderstanding. Thanks for any help.

EDIT: I should add that the regex I have in my code above matches components of a URL up to the / that defines the path following a domain.
re.match returns a Match object or None. Your comprehension only uses that result as a filter; what it actually adds to the list is i, the original string (the same value the Match object exposes as its "string" attribute, i.e. the string passed to match). Take a look at the Match object's methods and see how it can provide the string you really want.
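To make the difference concrete, here is a small sketch that inspects a single match using the regex from the question. The variable name m is just for illustration.

```python
import re

regex = re.compile(r'^http[s]?:\/?\/?([^:\/\s]+)/')
m = regex.match('https://gmail.com/test123')

print(m.string)    # https://gmail.com/test123  (the whole string passed to match)
print(m.group(0))  # https://gmail.com/         (only the text the pattern matched)
print(m.group(1))  # gmail.com                  (the first capture group)
```

So the matched text lives on the Match object, not on the original string.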

This may do what you want:
# Match each item once, then keep the matched text of the items that matched
secondList = [m.group(0) for m in map(regex.match, firstList) if m]
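For a full run against the data from the question, here is a self-contained sketch. It assumes Python 3.8 or newer, because it uses an assignment expression (:=) to keep the Match object around inside the comprehension; on older versions the match would need to be computed in a separate step.

```python
import re

regex = re.compile(r'^http[s]?:\/?\/?([^:\/\s]+)/')
firstList = ['http:google.com/test', 'https://gmail.com/test123',
             'http://youtube.com/watch', 'notaurl', '/home/images']

# Bind the Match object to m while filtering, then emit only the matched text.
secondList = [m.group(0) for i in firstList if (m := regex.match(i))]

print(secondList)
# ['http:google.com/', 'https://gmail.com/', 'http://youtube.com/']
```

Items with no match are skipped because regex.match returns None for them, and None is falsy.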