[SOLVED] Why does regex fail cleaning line?

Winfried · (This post was last modified: Aug-22-2021, 07:04 PM by Winfried.)

Hello,

I can't figure out why this regex fails editing an XML header:

with open(INPUTFILE) as reader:
	content = reader.read()
#I need to just get <gpx>
#OK content = re.sub('<gpx', '<BLAH', content)
#BAD content = re.sub('<gpx.+?>', '<gpx>', content,re.MULTILINE)
#BAD content = re.sub('<gpx.+?>', '<gpx>', content,re.MULTILINE|re.DOTALL)
#BAD content = re.sub('<gpx.+?>', '<gpx>', content,re.MULTILINE|re.DOTALL|re.IGNORECASE)

Any idea what I could try?

Thank you.

**Gribouillis** · Aug-22-2021, 03:19 PM

Could you provide example input together with expected output, the output you get and how you get it?

Winfried · Aug-22-2021, 03:22 PM

Here goes:

<?xml version="1.0" encoding="UTF-8"?>
<gpx 
 xmlns="http://www.topografix.com/GPX/1/1" 
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
 xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd" 
 version="1.1">
 <trk>
  <trkseg>
   <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
      <trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
   <trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
  </trkseg>
 </trk>
</gpx>

I need to rewrite the "<gpx…>" into a simple "<gpx>".

I guess the CRLFs are messing things up.

**Gribouillis** · (This post was last modified: Aug-22-2021, 05:29 PM by Gribouillis.)

This works as long as there is no > in the part that you want to remove:

import re

src = """\
<?xml version="1.0" encoding="UTF-8"?>
<gpx 
 xmlns="http://www.topografix.com/GPX/1/1" 
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
 xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd" 
 version="1.1">
 <trk>
  <trkseg>
   <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
      <trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
   <trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
  </trkseg>
 </trk>
</gpx>
"""

res = re.sub(r'<gpx\b[^>]*>', '<gpx>', src)
print(res)

Output:
<?xml version="1.0" encoding="UTF-8"?>
<gpx>
 <trk>
  <trkseg>
   <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
      <trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
   <trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
  </trkseg>
 </trk>
</gpx>
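
To illustrate the caveat about > above: XML allows a literal > inside a quoted attribute value, and in that case the simple pattern stops too early. The sketch below uses a made-up header (the creator attribute and its value are assumptions for illustration only) and compares the pattern from this post with a quote-aware variant that skips over whole quoted values. Treat it as a rough sketch, not a full XML tokenizer.

```python
import re

# Hypothetical header: the creator attribute value contains a literal '>'.
header = '<gpx creator="a>b" version="1.1"><trk/></gpx>'

# Pattern from the post: stops at the first '>' it sees, even inside quotes.
naive = re.sub(r'<gpx\b[^>]*>', '<gpx>', header)

# Quote-aware variant: consume whole "..." or '...' values before looking for '>'.
quoted = re.sub(r'''<gpx\b(?:[^>"']|"[^"]*"|'[^']*')*>''', '<gpx>', header)

print(naive)   # <gpx>b" version="1.1"><trk/></gpx>
print(quoted)  # <gpx><trk/></gpx>
```

For the GPX header shown in this thread, both patterns give the same result, since none of its attribute values contains a >.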

***snippsat*** · (This post was last modified: Aug-22-2021, 06:00 PM by snippsat.)

As it's XML, a parser may be better suited.
You usually don't need to delete anything unless you have to restructure the XML.
All the useful info can easily be parsed:

from bs4 import BeautifulSoup

xml = '''\
<?xml version="1.0" encoding="UTF-8"?>
<gpx
 xmlns="http://www.topografix.com/GPX/1/1"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd"
 version="1.1">
 <trk>
  <trkseg>
   <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
      <trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
   <trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
  </trkseg>
 </trk>
</gpx>'''

soup = BeautifulSoup(xml, 'xml')

>>> trk_seg = soup.find('trkseg')
>>> trk_seg
<trkseg>
<trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
<trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
<trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
</trkseg>
>>> 
>>> tr = trk_seg.find('trkpt')
>>> tr
<trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
>>> tr.text
'101.25'
>>> tr.attrs['lat']
'45.649872'
>>> tr.attrs['lon']
'0.156119'

Winfried · (This post was last modified: Aug-22-2021, 06:59 PM by Winfried.)

Thanks for the tip.

INPUTFILE = "input.gpx"
with open(INPUTFILE) as reader:
	content = reader.read()

#get rid of NS
#BAD content = re.sub(r'<gpx.+?>', '<gpx>', content,re.DOTALL)
content= re.sub(r'<gpx\b[^>]*', '<gpx', content,re.DOTALL)
index=1
for line in content.splitlines():
	print(line)
	if index == 10:
		break
	else:
		index += 1

I'm still curious as to why the regex isn't working as expected :-/

As for using an XML parser: I do later in the script, but I first need to remove the namespace stuff in the header, which is why I first run it through a regex.
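
For comparison, here is a minimal sketch of reading the track points without stripping the namespace first. It assumes the standard-library xml.etree.ElementTree is an acceptable parser (the thread does not say which parser is used later in the script), and the gpx prefix in the ns mapping is an arbitrary name chosen for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample file from this thread.
GPX = """<?xml version="1.0" encoding="UTF-8"?>
<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
 <trk>
  <trkseg>
   <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
   <trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
  </trkseg>
 </trk>
</gpx>
"""

# Map a prefix of our choosing to the default namespace declared on <gpx>.
ns = {"gpx": "http://www.topografix.com/GPX/1/1"}

# Parse from bytes so the encoding declaration is handled by the parser.
root = ET.fromstring(GPX.encode("utf-8"))

# Collect (lat, lon, elevation) for every track point.
points = [
    (float(pt.get("lat")), float(pt.get("lon")), float(pt.find("gpx:ele", ns).text))
    for pt in root.iterfind(".//gpx:trkpt", ns)
]

for lat, lon, ele in points:
    print(lat, lon, ele)
```

Whether this is preferable to the regex depends on what the rest of the script does with the tree; if the output has to be written back without namespaces, removing them up front may still be the simpler route.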

--
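Edit: re.sub() takes five parameters, not four! The fourth one is count, not flags: "the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced." (Source: the Python re module documentation.) So my flags were being read as the count, and no flag was actually applied to the pattern.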
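
A small sketch of the pitfall, using a shortened stand-in for the header (the text and pattern variables are just for this example). Passing the flag as the fourth positional argument puts its integer value in the count slot, so the dot still refuses to match a newline. Recent Python versions warn about positional count and flags, which is why the warning is silenced here.

```python
import re
import warnings

# Shortened stand-in for the multi-line <gpx ...> header.
text = '<gpx\n version="1.1">\n<trk/>'
pattern = r"<gpx.+?>"

with warnings.catch_warnings():
    # Newer Python versions deprecate positional count/flags; hide the warning.
    warnings.simplefilter("ignore", DeprecationWarning)
    # re.DOTALL lands in the count slot here: no flag is applied.
    wrong = re.sub(pattern, "<gpx>", text, re.DOTALL)

# Passing the flag by keyword applies it to the pattern.
right = re.sub(pattern, "<gpx>", text, flags=re.DOTALL)

print(int(re.DOTALL))  # 16, i.e. "replace at most 16 occurrences"
print(wrong == text)   # True: nothing was replaced
print(repr(right))     # '<gpx>\n<trk/>'
```

This is also why the [^>]* version worked even with the misplaced flag: a negated character class matches newlines on its own and does not need re.DOTALL.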

This works as expected:

INPUTFILE = "input.gpx"

with open(INPUTFILE) as reader:
	content = reader.read()

content = re.sub('<gpx.+?>', '<gpx>', content, count=0, flags=re.DOTALL)
index=1
for line in content.splitlines():
	print(line)
	if index == 10:
		break
	else:
		index += 1

Thank you.
