Python Forum
Python re.sub text manipulation on matched contents before substituting



Python re.sub text manipulation on matched contents before substituting - xilex - May-16-2020

Hi, I have the following code:

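import re

t = '<img data src="some-thing">'
# Groups: 1 = '<img ', 2 = 'src="', 3 = the attribute value, 4 = '">'
pattern = r'(<img ).*?(src=")(.*?)(">)'
# Re-emit the four groups, dropping whatever matched between groups 1 and 2
u = re.sub(pattern, r'\1\2\3\4', t)
print(u)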
The output is <img src="some-thing">. Is there a way to do some text manipulation on group 3 so that it ends up being <img src="something">? I can't think of a better approach right now. This is just a basic example of what I am trying to do; the content I am replacing is more complex than just removing a dash. Thanks.

Found a solution: pass a function as the replacement argument to re.sub. See https://docs.python.org/3/library/re.html#text-munging
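For anyone landing here later, a minimal sketch of that approach applied to the example above. When the replacement argument of re.sub is callable, it is called with each match object and its return value is used as the replacement text. The helper name strip_dashes is made up for illustration, and the dash removal stands in for whatever more complex manipulation is actually needed.

```python
import re

def strip_dashes(match):
    # Hypothetical helper: rebuild the tag from the captured groups,
    # modifying only group 3 (the src value) before it is substituted back.
    cleaned = match.group(3).replace("-", "")
    return match.group(1) + match.group(2) + cleaned + match.group(4)

t = '<img data src="some-thing">'
pattern = r'(<img ).*?(src=")(.*?)(">)'

# re.sub calls strip_dashes once per match and inserts its return value
u = re.sub(pattern, strip_dashes, t)
print(u)  # <img src="something">
```

Because the function receives the whole match object, any Python logic can go inside it, not just a single str.replace call.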


RE: Python re.sub text manipulation on matched contents before substituting - bowlofred - May-16-2020

If you just want to replace the dash in the string after you've captured it, use string replace.

>>> "some-thing".replace("-","")
'something'
But it sounds like you want to modify data inside an HTML document while keeping the rest of the document intact. Doing that with regular expressions is tedious and likely to break as soon as the HTML gets a bit wonky. I'd use an HTML parser instead. It's a bit more overhead than a tiny regex, but it's much more reliable and flexible. Here I used beautifulsoup4.

import bs4

t = '<img data src="some-thing">'
soup = bs4.BeautifulSoup(t, features="html.parser")
# Look the tag up once, then rewrite its src attribute in place
img = soup.find('img')
img['src'] = img['src'].replace('-', '')
print(soup)
Output:
Before -> <img data src="some-thing">
After -> <img data="" src="something"/>



RE: Python re.sub text manipulation on matched contents before substituting - xilex - May-19-2020

(May-16-2020, 05:04 AM)bowlofred Wrote: But it sounds like you want to modify data inside an HTML document, retaining the document. Trying to do that with regular expressions is tedious, and likely to break as soon as the html gets a bit wonky. I'd use an HTML parser instead. It's a bit more overhead than doing a teeny regex, but it's much more reliable and flexible. Here I used beautifulsoup4.

Thanks. I have run across Beautiful Soup when learning Python. I'm currently using Selenium with XPath selectors. Maybe I'll look into it in the future. But right now the substitution I'm doing is a bit more complex than just finding img src attributes and changing them. I think Beautiful Soup can do it, but I'd have to spend time rewriting everything.