Hello,
I can't figure out why this regex fails editing an XML header:
with open(INPUTFILE) as reader:
    content = reader.read()
#I need to just get <gpx>
#OK content = re.sub('<gpx', '<BLAH', content)
#BAD content = re.sub('<gpx.+?>', '<gpx>', content,re.MULTILINE)
#BAD content = re.sub('<gpx.+?>', '<gpx>', content,re.MULTILINE|re.DOTALL)
#BAD content = re.sub('<gpx.+?>', '<gpx>', content,re.MULTILINE|re.DOTALL|re.IGNORECASE)
Any idea what I could try?
Thank you.
Could you provide example input together with expected output, the output you get and how you get it?
Here goes:
<?xml version="1.0" encoding="UTF-8"?>
<gpx
xmlns="http://www.topografix.com/GPX/1/1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd"
version="1.1">
<trk>
<trkseg>
<trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
<trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
<trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
</trkseg>
</trk>
</gpx>
I need to rewrite the "<gpx…>" into a simple "<gpx>".
I guess the CRLFs are messing things up.
This works as long as there is no > in the part that you want to remove:
import re
src = """\
<?xml version="1.0" encoding="UTF-8"?>
<gpx
xmlns="http://www.topografix.com/GPX/1/1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd"
version="1.1">
<trk>
<trkseg>
<trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
<trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
<trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
</trkseg>
</trk>
</gpx>
"""
res = re.sub(r'<gpx\b[^>]*>', '<gpx>', src)
print(res)
Output:
<?xml version="1.0" encoding="UTF-8"?>
<gpx>
<trk>
<trkseg>
<trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
<trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
<trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
</trkseg>
</trk>
</gpx>
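To illustrate the caveat above: > is legal inside an XML attribute value, and in that case [^>]* stops too early. A minimal sketch, where the creator attribute and its value are made up for the demonstration:

```python
import re

# Hypothetical header: the creator value contains a literal '>'.
src = '<gpx creator="a>b" version="1.1"><trk/></gpx>'

# [^>]* stops at the first '>', which here sits inside the attribute value,
# so the rest of the tag is left behind in the output.
res = re.sub(r'<gpx\b[^>]*>', '<gpx>', src)
print(res)  # <gpx>b" version="1.1"><trk/></gpx>
```

For the GPX header posted above this cannot happen, since none of its attribute values contains a >.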
Since this is XML, a parser may be better suited.
You usually don't need to delete anything unless you have to restructure the XML.
All the useful info can easily be parsed:
from bs4 import BeautifulSoup
xml = '''\
<?xml version="1.0" encoding="UTF-8"?>
<gpx
xmlns="http://www.topografix.com/GPX/1/1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd"
version="1.1">
<trk>
<trkseg>
<trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
<trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
<trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
</trkseg>
</trk>
</gpx>'''
soup = BeautifulSoup(xml, 'xml')
>>> trk_seg = soup.find('trkseg')
>>> trk_seg
<trkseg>
<trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
<trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
<trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
</trkseg>
>>>
>>> tr = trk_seg.find('trkpt')
>>> tr
<trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
>>> tr.text
'101.25'
>>> tr.attrs['lat']
'45.649872'
>>> tr.attrs['lon']
'0.156119'
Thanks for the tip.
import re

INPUTFILE = "input.gpx"
with open(INPUTFILE) as reader:
    content = reader.read()
#get rid of NS
#BAD content = re.sub(r'<gpx.+?>', '<gpx>', content,re.DOTALL)
#no flags needed here: [^>] already matches newlines
content = re.sub(r'<gpx\b[^>]*', '<gpx', content)
#print the first 10 lines to check the result
index=1
for line in content.splitlines():
    print(line)
    if index == 10:
        break
    else:
        index += 1
I'm still curious as to why the regex isn't working as expected :-/
As for using an XML parser: I do later in the script, but I first need to remove the namespace stuff in the header, which is why I first run it through a regex.
--
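Edit: .sub() takes five parameters, not four! The fourth one is count, "the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced." (Source: the Python documentation for re.sub.) So the flags I passed in fourth position were taken as count, and no flags were applied at all.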
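A small sketch of the mix-up, using a shortened header as sample text (the variable names are only for illustration). The flag constants are integers, so re.sub() silently accepts them as count:

```python
import re

# Shortened sample: the <gpx ...> tag spans two lines.
text = "<gpx\nversion='1.1'>\n<trk/>"

# The flag constants are plain integers under the hood.
print(int(re.MULTILINE), int(re.DOTALL))  # 8 16

# Fourth positional argument is count: this call means count=16, flags=0,
# so '.' still stops at newlines and nothing is replaced.
wrong = re.sub(r'<gpx.+?>', '<gpx>', text, re.DOTALL)

# Passing the flag by keyword applies DOTALL as intended.
right = re.sub(r'<gpx.+?>', '<gpx>', text, flags=re.DOTALL)

print(wrong == text)  # True
print(repr(right))    # '<gpx>\n<trk/>'
```

Naming the argument (flags=...) avoids the problem entirely. Python 3.13 also emits a DeprecationWarning when count or flags is passed positionally to re.sub().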
This works as expected:
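import re

INPUTFILE = "input.gpx"
with open(INPUTFILE) as reader:
    content = reader.read()
#DOTALL lets '.' match the newlines inside the <gpx ...> tag
content = re.sub(r'<gpx.+?>', '<gpx>', content, count=0, flags=re.DOTALL)
#print the first 10 lines to check the result
index=1
for line in content.splitlines():
    print(line)
    if index == 10:
        break
    else:
        index += 1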
Thank you.
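For the workflow mentioned above (strip the namespace header first, then hand the text to an XML parser), a minimal sketch could look like this. It assumes xml.etree.ElementTree as the parser, since the thread doesn't say which one the script uses, and it uses a shortened inline sample instead of reading input.gpx:

```python
import re
import xml.etree.ElementTree as ET

# Shortened sample standing in for the contents of input.gpx (assumption).
content = """<?xml version="1.0" encoding="UTF-8"?>
<gpx
xmlns="http://www.topografix.com/GPX/1/1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd"
version="1.1">
<trk>
<trkseg>
<trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
<trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
</trkseg>
</trk>
</gpx>
"""

# Drop every attribute of the root tag, namespace declarations included.
content = re.sub(r'<gpx\b[^>]*>', '<gpx>', content, count=1)

# Pass bytes so the parser handles the declared encoding itself.
root = ET.fromstring(content.encode("utf-8"))

# Without the default namespace, plain tag names work in searches.
points = [(pt.get("lat"), pt.get("lon"), pt.findtext("ele"))
          for pt in root.iter("trkpt")]
print(points)
# [('45.649872', '0.156119', '101.25'), ('43.929379', '2.147619', '178')]
```

If the namespace is left in place, ElementTree reports tags as {http://www.topografix.com/GPX/1/1}trkpt and every search has to include that prefix, which is the usual reason for stripping it first.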