[SOLVED] Why does regex fail cleaning line?

[SOLVED] Why does regex fail cleaning line? - Winfried - Aug-22-2021

Hello,

I can't figure out why this regex fails to edit an XML header:

with open(INPUTFILE) as reader:
	content = reader.read()
#I need to just get <gpx>
#OK content = re.sub('<gpx', '<BLAH', content)
#BAD content = re.sub('<gpx.+?>', '<gpx>', content,re.MULTILINE)
#BAD content = re.sub('<gpx.+?>', '<gpx>', content,re.MULTILINE|re.DOTALL)
#BAD content = re.sub('<gpx.+?>', '<gpx>', content,re.MULTILINE|re.DOTALL|re.IGNORECASE)

Any idea what I could try?

Thank you.

RE: Why does regex fail cleaning line? - Gribouillis - Aug-22-2021

Could you provide example input together with expected output, the output you get and how you get it?

RE: Why does regex fail cleaning line? - Winfried - Aug-22-2021

Here goes:

<?xml version="1.0" encoding="UTF-8"?>
<gpx 
 xmlns="http://www.topografix.com/GPX/1/1" 
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
 xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd" 
 version="1.1">
 <trk>
  <trkseg>
   <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
      <trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
   <trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
  </trkseg>
 </trk>
</gpx>

I need to rewrite the "<gpx…>" into a simple "<gpx>".

I guess the CRLFs are messing things up.

RE: Why does regex fail cleaning line? - Gribouillis - Aug-22-2021

This works as long as there is no > in the part that you want to remove:

import re

src = """\
<?xml version="1.0" encoding="UTF-8"?>
<gpx 
 xmlns="http://www.topografix.com/GPX/1/1" 
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
 xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd" 
 version="1.1">
 <trk>
  <trkseg>
   <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
      <trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
   <trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
  </trkseg>
 </trk>
</gpx>
"""

res = re.sub(r'<gpx\b[^>]*>', '<gpx>', src)
print(res)

Output:
<?xml version="1.0" encoding="UTF-8"?>
<gpx>
 <trk>
  <trkseg>
   <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
      <trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
   <trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
  </trkseg>
 </trk>
</gpx>
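
A minimal sketch of both sides of that caveat. The sample tags below are made up for illustration and are not from the poster's file. The negated class [^>] already matches newlines, so no flag is needed, and \b keeps the pattern away from tags whose name merely starts with "gpx". A literal > inside an attribute value, however, cuts the match short.

```python
import re

PATTERN = r'<gpx\b[^>]*>'

# Multi-line header: [^>] matches newlines too, so re.DOTALL is not needed.
multiline = '<gpx\n xmlns="http://www.topografix.com/GPX/1/1"\n version="1.1">\n<trk/>'
clean = re.sub(PATTERN, '<gpx>', multiline)
print(clean)

# Hypothetical tag whose name only starts with "gpx": \b prevents a match.
other_tag = '<gpxdata:hr>120</gpxdata:hr>'
untouched = re.sub(PATTERN, '<gpx>', other_tag)
print(untouched)

# Hypothetical attribute value containing ">": the match stops at the first ">".
tricky = '<gpx creator="a>b" version="1.1">'
broken = re.sub(PATTERN, '<gpx>', tricky)
print(broken)
```

The last case leaves part of the original tag behind, which is the situation the caveat above warns about.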

RE: Why does regex fail cleaning line? - snippsat - Aug-22-2021

Since it's XML, a parser may be better suited.
You usually don't need to delete anything unless you have to restructure the XML.
All the useful info can easily be parsed:

from bs4 import BeautifulSoup

xml = '''\
<?xml version="1.0" encoding="UTF-8"?>
<gpx
 xmlns="http://www.topografix.com/GPX/1/1"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd"
 version="1.1">
 <trk>
  <trkseg>
   <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
      <trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
   <trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
  </trkseg>
 </trk>
</gpx>'''

soup = BeautifulSoup(xml, 'xml')

>>> trk_seg = soup.find('trkseg')
>>> trk_seg
<trkseg>
<trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
<trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
<trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
</trkseg>
>>> 
>>> tr = trk_seg.find('trkpt')
>>> tr
<trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
>>> tr.text
'101.25'
>>> tr.attrs['lat']
'45.649872'
>>> tr.attrs['lon']
'0.156119'

RE: Why does regex fail cleaning line? - Winfried - Aug-22-2021

Thanks for the tip.

import re

INPUTFILE = "input.gpx"
with open(INPUTFILE) as reader:
	content = reader.read()

#get rid of NS
#BAD content = re.sub(r'<gpx.+?>', '<gpx>', content,re.DOTALL)
#[^>] already matches newlines, so no flag is needed here
content = re.sub(r'<gpx\b[^>]*', '<gpx', content)
#print the first 10 lines to check the result
for line in content.splitlines()[:10]:
	print(line)

I'm still curious as to why the regex isn't working as expected :-/

As for using an XML parser: I do later in the script, but I first need to remove the namespace stuff in the header, which is why I first run it through a regex.
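
As an aside, here is a hedged sketch of an alternative that leaves the header alone. It assumes Python 3.8 or later, where the standard library's xml.etree.ElementTree accepts the {*} wildcard to match a tag in any namespace. This is not what the script in this thread does, and the sample data is a shortened copy of the file shown earlier.

```python
import xml.etree.ElementTree as ET

# Shortened sample, with the default GPX namespace left in place.
gpx = '''<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
 <trk>
  <trkseg>
   <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
   <trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
  </trkseg>
 </trk>
</gpx>'''

root = ET.fromstring(gpx)

# "{*}trkpt" matches trkpt elements regardless of their namespace.
points = root.findall('.//{*}trkpt')
for pt in points:
    print(pt.get('lat'), pt.get('lon'), pt.find('{*}ele').text)
```

Whether this fits depends on what the rest of the script expects. If later code relies on un-namespaced tag names, stripping the header as above may still be the simpler route.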

--
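Edit: re.sub() takes five parameters, not four! The fourth one is count: "the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced." (Source: Python re documentation) So my flags were being read as count, and the pattern ran with no flags at all.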

This works as expected:

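import re

INPUTFILE = "input.gpx"

with open(INPUTFILE) as reader:
	content = reader.read()

#count=0 replaces every occurrence; passing flags by keyword keeps them out of the count slot
content = re.sub('<gpx.+?>', '<gpx>', content, count=0, flags=re.DOTALL)
#print the first 10 lines to check the result
for line in content.splitlines()[:10]:
	print(line)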
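
To make the cause concrete, here is a small sketch on a made-up two-line header (not the real file). re.DOTALL is an integer-like flag with value 16, so in the fourth position it simply becomes count=16. The pattern then runs without DOTALL, "." refuses to cross the newline after "<gpx", and nothing is replaced.

```python
import re

src = '<gpx\n version="1.1">\n<trk/>\n</gpx>'

# Equivalent to re.sub(pattern, repl, src, re.DOTALL): the flag lands in count.
as_count = re.sub('<gpx.+?>', '<gpx>', src, count=re.DOTALL)

# Passing the flag by keyword lets "." match the newline inside the tag.
as_flag = re.sub('<gpx.+?>', '<gpx>', src, flags=re.DOTALL)

print(int(re.DOTALL))
print(as_count == src)
print(as_flag)
```

Passing count and flags by keyword avoids this mix-up altogether.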

Thank you.