Who enjoys Py RegEx? re.sub() isn't working

goodsignal · Jun-07-2020, 10:11 PM

From all I've read, these two functions should produce the same results, given a regex pattern with two groups. I'd like to know if I'm using re.sub() incorrectly or if I've found some bug.

match = re.search(pattern, input)
result1 = match.group(1) + match.group(2)
result2 = re.sub(pattern, r"\g<1>\g<2>", input)  # replace with groups 1 & 2

Can you think of any reason re.sub() would pull in a bunch of garbage that isn't in either of the groups, given a statement like this?

import re
re.sub(regexpattern, r"\g<1>\g<2>", SourceText)

For instance, this is a line of source text:

Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf.md

where Bodywork and .md (shown in red in the original post) are identified as groups 1 and 2. re.sub() should put them together as Bodywork.md, but it doesn't! I've used match.groups() from the same library as a sanity check.

I've put together some sample code with some text to search, based on a conversion I'm trying to do for a small project.

Here's the output first. Thanks for looking!

Output:
index: 1
Source : Projects bf587944624a417c83475fdb67c176ba.md
Groups : ('Projects', '.md')
Result1: Projects.md
Result2: Projects.md

index: 3
Source : Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf.md
Groups : ('Bodywork', '.md')
Result1: Bodywork.md
Result2: Projects bf587944624a417c83475fdb67c176ba/Bodywork.md

index: 5
Source : Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf/Home Exercise 4871ab1851074a1cb7aebe0851669345.csv
Groups : ('Home Exercise', '.csv')
Result1: Home Exercise.csv
Result2: Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf/Home Exercise.csv

import re

paths = ['Projects bf587944624a417c83475fdb67c176ba/',
 'Projects bf587944624a417c83475fdb67c176ba.md',
 'Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf/',
 'Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf.md',
 'Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf/Home Exercise 4871ab1851074a1cb7aebe0851669345/',
 'Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf/Home Exercise 4871ab1851074a1cb7aebe0851669345.csv',
 'Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf/Home Exercise 4871ab1851074a1cb7aebe0851669345/Abs da0050d8459345419d1a16062273cfac.md',
 'Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf/Home Exercise 4871ab1851074a1cb7aebe0851669345/Core 82039eb85d5d46bc99e8504427d203c4.md',
 'Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf/Micronutrient Smoothie 21e2b0c0922d46f387c8b353a17ff734.md',
 'Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf/Self Bodywork 0045821b69f445678e07d49b5c80b9d0/',
 'Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf/Self Bodywork 0045821b69f445678e07d49b5c80b9d0.csv',
 'Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf/Self Bodywork 0045821b69f445678e07d49b5c80b9d0/Cuboid physical therapy ff8d7937722a4af6aa2ce1ce8c45672b.md',
 'Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf/Self Bodywork 0045821b69f445678e07d49b5c80b9d0/Extending hamstrings faeba9f5302340f1945b898c6291aa86.md',
 'Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf/Self Bodywork 0045821b69f445678e07d49b5c80b9d0/Knee care 8265d491502a49b0abf2922d9e7764e3.md',
 'Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf/Self Bodywork 0045821b69f445678e07d49b5c80b9d0/Shoulder therapy massage motion 49e3a56cbbfc4733a0ddda272c504912.md',
 'Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf/Self care weekly splits 861d60286a1e48dbb7ed7556d4214622/',
 'Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf/Self care weekly splits 861d60286a1e48dbb7ed7556d4214622.md',
 'Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf/Self care weekly splits 861d60286a1e48dbb7ed7556d4214622/Self Bodywork 5c1104d3456a4106872eee9dc531e182.csv',
 'Projects bf587944624a417c83475fdb67c176ba/Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf/Workout weekly splits 4b9d808f79544f5489ef063f1048109a.md',
 'Projects bf587944624a417c83475fdb67c176ba/PROJECTS TEMPLATE a6292c48f0d343c9a2913c0adf97bbf2.md',
 'Routine 60ee969daa894c4d9abdb0d58166f5d4/',
 'Routine 60ee969daa894c4d9abdb0d58166f5d4.csv',
 'Routine 60ee969daa894c4d9abdb0d58166f5d4/Evening Routine 29e2c5282db04c76a59d0053eb9e85ee.md',
 'Routine 60ee969daa894c4d9abdb0d58166f5d4/Morning Routine 4570adb138b7412a8bbe948746585924.md',
 'Routine 60ee969daa894c4d9abdb0d58166f5d4/Physical Activity 8b8ba3700a194ba7ad6330802ecccdf5.md']

filenamepattern = r"([\w\s]+)\s\w{32}(\.md|\.csv)$"  # regex capture groups 1 & 2
# Create an indexed list of new filenames
index = []
fname1 = []
fname2 = []

for i, path in enumerate(paths):
    match = re.search(filenamepattern, path)  # search only
    if match:
        index.append(i)  # save index for paths changes

        # Replace 1: build the name from the match object's groups
        fname1.append(match.group(1) + match.group(2))
        # Replace 2: search & replace using re.sub()
        fname2.append(re.sub(filenamepattern, r"\g<1>\g<2>", path))

        if len(index) <= 3:  # print a few for comparison
            print("index:", index[-1])
            print("Source : " + path)
            print("Groups :", match.groups())
            print("Result1: " + fname1[-1])
            print("Result2: " + fname2[-1])
            print()

I've put up the regex with the same sample data at regexr dot com. I don't think I'm allowed to add links here as a new member but if you want to modify it and see results right away just add /568jc to the end of the URL. I'm not at all affiliated. Just a cool website!

goodsignal · Jun-07-2020, 11:26 PM

I pasted the wrong regexr suffix. This one actually has the interactive settings to match this thread: /568np

bowlofred · Jun-07-2020, 11:43 PM

Result1 and Result2 are the same only if the match is the whole string. If the match is a substring, then the portion of the string that wasn't matched isn't touched.

For index 1, "Projects....md" is the whole string. So when the sub happens, the string returned is just the matching groups.

For index 3, "Bodywork....md" is the second half of the string. That part is removed and replaced with the groups, and the initial part of the string (everything before "Bodywork") is left in place.

It's possible that you could extend your match to the entire string by adding a "^.*" to the start of your match. But I haven't looked to see if that would cause any other problems.

>>> re.sub("Bar", "-", "FooBarBaz")
'Foo-Baz'
>>> re.sub(".*Bar.*", "-", "FooBarBaz")
'-'
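
As a rough sketch of what that "^.*" suggestion does with the pattern from the first post: a greedy prefix does extend the match to the whole string, but it also swallows most of group 1, because [\w\s]+ is satisfied by a single character. A lazy prefix ("^.*?") seems to avoid that for this sample path. The variable names below are made up for illustration, and only one of the sample paths is checked here.

```python
import re

# One of the sample paths from the first post.
path = ("Projects bf587944624a417c83475fdb67c176ba/"
        "Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf.md")

original = r"([\w\s]+)\s\w{32}(\.md|\.csv)$"
greedy_prefix = r"^.*" + original   # hypothetical variant: greedy prefix
lazy_prefix = r"^.*?" + original    # hypothetical variant: lazy prefix

unanchored = re.sub(original, r"\g<1>\g<2>", path)
greedy = re.sub(greedy_prefix, r"\g<1>\g<2>", path)
lazy = re.sub(lazy_prefix, r"\g<1>\g<2>", path)

print(unanchored)  # the leading directory is left in place
print(greedy)      # group 1 shrinks to its last character: k.md
print(lazy)        # Bodywork.md
```

If the goal is only the final path component, stripping the directory first (for example with os.path.basename) and then applying the original pattern would be another option.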

goodsignal · Jun-08-2020, 05:49 AM

This perspective brings everything together! Thank you! The behavior makes complete sense now that I understand the nuance between re.sub() and the combination of re.search() and match.group().

Just to summarize in my own words:

re.search() will return the extent of the pattern match, which could be a sub-string of the input.

re.sub() will return the entire input, only replacing the portion that matches the regex pattern.

When working line-by-line, as is often done in Python, it's possible to make re.search() behave like re.sub() by matching the entirety of every line and then modifying the line through strategic use of the captured groups.

And it's possible to make re.sub() and re.search() provide the same results with the same regex pattern, but the two approaches might require handling the captured groups differently to get there.

The two can accomplish the same thing, but their approaches are different. Forcing one to act like the other requires more work. Understanding the difference will lead to choosing the tool with a more efficient approach.

Thanks again @bowlofred !

bowlofred · (This post was last modified: Jun-08-2020, 06:12 AM by bowlofred.)

(Jun-08-2020, 05:49 AM) goodsignal wrote: re.search() will return the extent of the pattern match, which could be a sub-string of the input.

Basically, yes. Technically, it returns a match object, which includes the start/end/span and any capturing groups (among other things).
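
A small sketch of what the match object holds, using the pattern and one sample path from the first post. The variable names are illustrative only. Splicing the groups back in between the unmatched parts of the string gives the same result as re.sub() for this single-match case.

```python
import re

pattern = r"([\w\s]+)\s\w{32}(\.md|\.csv)$"
path = ("Projects bf587944624a417c83475fdb67c176ba/"
        "Bodywork 731fe478ea6048e1ac0df8c7f7ed95bf.md")

match = re.search(pattern, path)
start, end = match.span()      # where the match sits inside path

prefix = path[:start]          # text before the match, untouched by re.sub()
matched = match.group(0)       # the full matched substring
groups = match.groups()        # the captured groups as a tuple

# Rebuild what re.sub() returns: unmatched text + replacement + unmatched text.
rebuilt = path[:start] + match.group(1) + match.group(2) + path[end:]

print(prefix)
print(matched)
print(groups)
print(rebuilt)
```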

Quote: re.sub() will return the entire input, only replacing the portion that matches the regex pattern.

Yup. (And by default, the replacement can happen multiple times, while re.search() will only match the leftmost position.)
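
To illustrate that last point with a toy string (not from the thread's data): re.sub() replaces every non-overlapping match unless its count argument limits it, while re.search() stops at the leftmost match.

```python
import re

text = "a1b22c333"  # made-up sample string

first = re.search(r"\d+", text).group()             # leftmost match only
all_replaced = re.sub(r"\d+", "#", text)            # every non-overlapping match
one_replaced = re.sub(r"\d+", "#", text, count=1)   # only the first match

print(first)         # 1
print(all_replaced)  # a#b#c#
print(one_replaced)  # a#b22c333
```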
