Python Forum
Creating new list based on exact regex match in original list
#1
I've searched pretty well for a preexisting topic but I can't find anything. I know the answer must be out there, I just don't think I perfectly understand what it is I'm trying to achieve. Learning regex has been a rocky road for me, and although I have the basics mostly down, I can't figure out how to achieve this particular goal.

I am trying to use a list comprehension to test each element of firstList against a regex. For each element that matches, the exact text matched by the regex should be written to secondList; elements with no match should not be written to secondList at all. In other words, I want the list comprehension to strip the path following the domain name (e.g. "https://gmail.com/test123" at firstList[1] should be written to secondList as "https://gmail.com/").

import re

regex = re.compile(r'^http[s]?:\/?\/?([^:\/\s]+)/')
firstList = ['http:google.com/test', 'https://gmail.com/test123', 'http://youtube.com/watch', 'notaurl', '/home/images']
secondList = [i for i in firstList if regex.match(i)]
print(firstList)
print(secondList)
Output:
['http:google.com/test', 'https://gmail.com/test123', 'http://youtube.com/watch', 'notaurl', '/home/images']
['http:google.com/test', 'https://gmail.com/test123', 'http://youtube.com/watch']
As desired, my list comprehension is eliminating the elements that do not have URL components, but the elements it keeps still include the path following the domain. Why is this? If I use
print(re.match(regex, firstList[1]))
the output shows that the match covers only https://gmail.com/:
Output:
<re.Match object; span=(0, 18), match='https://gmail.com/'>
I understand that my list comprehension adds an element to secondList if there is any regex match at all, but how do I get it to write the matched text, as seen in the re.match output, to secondList instead of the entire element that matched?

I don't need anyone to post the solution code, I just need help wrapping my head around this concept of regex I'm clearly misunderstanding. Thanks for any help.

EDIT: I should add that the regex I have in my code above matches components of a URL up to the / that defines the path following a domain.
#2
re.match returns a match object (re.Match) or None. Your comprehension only uses the match as a filter; the value it adds to the list is i, the original string, which is also what the match object's string attribute holds. Take a look at the match object's methods and see how it can provide the string you really want.
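
To make the difference concrete, here is a small sketch using the pattern and one URL from the first post. It compares the string that was passed in with what the match object reports; the variable names url and m are just for illustration.

```python
import re

regex = re.compile(r'^http[s]?:\/?\/?([^:\/\s]+)/')
url = 'https://gmail.com/test123'

m = regex.match(url)   # a re.Match object, or None if nothing matched

print(m.string)        # the whole string that was passed to match()
print(m.group(0))      # only the text the pattern actually matched
print(m.group(1))      # the first capture group (the domain name)
print(m.span())        # start and end offsets of the match
```

group(0) is the piece that shows up as match='https://gmail.com/' when the match object is printed.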

This may do what you want:
secondList = [m.group(0) for m in map(regex.match, firstList) if m]
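
A self-contained sketch of the same idea, assuming Python 3.8 or later for the := operator. The key point is that group(0) has to be called on the match object rather than on the list element, since a plain str has no group method. Binding the match to a name inside the comprehension (m here, chosen for illustration) also means each element is matched only once.

```python
import re

regex = re.compile(r'^http[s]?:\/?\/?([^:\/\s]+)/')
firstList = ['http:google.com/test', 'https://gmail.com/test123',
             'http://youtube.com/watch', 'notaurl', '/home/images']

# (m := regex.match(i)) stores the match object (or None) in m;
# None is falsy, so non-matching elements are skipped.
secondList = [m.group(0) for i in firstList if (m := regex.match(i))]

print(secondList)
```

With the list from the first post this should leave only the scheme and domain of the three URL-like entries.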