Python Forum
[SOLVED] Why does regex fail cleaning line?
#1
Hello,

I can't figure out why this regex fails to edit an XML header:

with open(INPUTFILE) as reader:
	content = reader.read()
#I need to just get <gpx>
#OK content = re.sub('<gpx', '<BLAH', content)
#BAD content = re.sub('<gpx.+?>', '<gpx>', content,re.MULTILINE)
#BAD content = re.sub('<gpx.+?>', '<gpx>', content,re.MULTILINE|re.DOTALL)
#BAD content = re.sub('<gpx.+?>', '<gpx>', content,re.MULTILINE|re.DOTALL|re.IGNORECASE)
Any idea what I could try?

Thank you.
#2
Could you provide example input together with expected output, the output you get and how you get it?
#3
Here goes:

<?xml version="1.0" encoding="UTF-8"?>
<gpx 
 xmlns="http://www.topografix.com/GPX/1/1" 
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
 xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd" 
 version="1.1">
 <trk>
  <trkseg>
   <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
      <trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
   <trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
  </trkseg>
 </trk>
</gpx>
I need to rewrite the "<gpx…>" into a simple "<gpx>".

I guess the CRLFs are messing things up.
#4
This works as long as there is no > in the part that you want to remove:
import re

src = """\
<?xml version="1.0" encoding="UTF-8"?>
<gpx 
 xmlns="http://www.topografix.com/GPX/1/1" 
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
 xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd" 
 version="1.1">
 <trk>
  <trkseg>
   <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
      <trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
   <trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
  </trkseg>
 </trk>
</gpx>
"""

res = re.sub(r'<gpx\b[^>]*>', '<gpx>', src)
print(res)
Output:
<?xml version="1.0" encoding="UTF-8"?>
<gpx>
 <trk>
  <trkseg>
   <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
      <trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
   <trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
  </trkseg>
 </trk>
</gpx>
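To make the caveat about > concrete, here is a minimal sketch. The two sample headers are made up for illustration (they are not from the original file); the second one puts a > inside an attribute value, which XML allows. The negated class [^>] matches newlines without any flag, but it stops at the first > it sees, wherever that is.

```python
import re

# Hypothetical sample headers, shortened for illustration.
clean = '<gpx\n version="1.1">\n<trk/>'
tricky = '<gpx\n creator="a>b"\n version="1.1">\n<trk/>'

pattern = re.compile(r'<gpx\b[^>]*>')

# [^>] also matches '\n', so the multi-line tag is replaced as a whole.
clean_result = pattern.sub('<gpx>', clean)

# Here the match ends at the '>' inside the attribute value,
# leaving the rest of the tag behind.
tricky_result = pattern.sub('<gpx>', tricky)

print(repr(clean_result))
print(repr(tricky_result))
```

For GPX headers like the one in this thread the first case is what you would normally see, so the pattern is fine in practice; the second case is just the situation the warning refers to.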
#5
Since it's XML, a parser may be better suited.
You usually don't need to delete anything unless you have to restructure the XML.
All the useful info can easily be parsed:
from bs4 import BeautifulSoup

xml = '''\
<?xml version="1.0" encoding="UTF-8"?>
<gpx
 xmlns="http://www.topografix.com/GPX/1/1"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd"
 version="1.1">
 <trk>
  <trkseg>
   <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
      <trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
   <trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
  </trkseg>
 </trk>
</gpx>'''

soup = BeautifulSoup(xml, 'xml')
>>> trk_seg = soup.find('trkseg')
>>> trk_seg
<trkseg>
<trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
<trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
<trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
</trkseg>
>>> 
>>> tr = trk_seg.find('trkpt')
>>> tr
<trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
>>> tr.text
'101.25'
>>> tr.attrs['lat']
'45.649872'
>>> tr.attrs['lon']
'0.156119'
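If bs4 is not available, something similar can probably be done with the standard library alone. The sketch below is an assumption-laden example, not the approach used in this thread: it parses a shortened, made-up version of the header with xml.etree.ElementTree and strips the '{namespace}' prefix from every tag after parsing, so the text of the file never has to be edited with a regex.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample based on the file in this thread.
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
 <trk><trkseg>
  <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
 </trkseg></trk>
</gpx>'''

# Parse from bytes so the encoding declaration is honoured.
root = ET.fromstring(xml.encode('utf-8'))

# ElementTree reports tags as '{namespace}name'; drop the prefix in place.
for elem in root.iter():
    if elem.tag.startswith('{'):
        elem.tag = elem.tag.split('}', 1)[1]

# Plain tag names now work in find().
first = root.find('trk/trkseg/trkpt')
lat, lon, ele = first.get('lat'), first.get('lon'), first.find('ele').text
print(root.tag, lat, lon, ele)
```

Note that this only renames element tags. Namespaced attributes such as xsi:schemaLocation (left out of the sample above) would keep their qualified names.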
#6
Thanks for the tip.

INPUTFILE = "input.gpx"
with open(INPUTFILE) as reader:
	content = reader.read()

#get rid of NS
#BAD content = re.sub(r'<gpx.+?>', '<gpx>', content,re.DOTALL)
content= re.sub(r'<gpx\b[^>]*', '<gpx', content,re.DOTALL)
index=1
for line in content.splitlines():
	print(line)
	if index == 10:
		break
	else:
		index += 1
I'm still curious as to why the regex isn't working as expected :-/

As for using an XML parser: I do later in the script, but I first need to remove the namespace stuff in the header, which is why I first run it through a regex.

--
Edit: .sub() takes five parameters, not four! The fourth one is count: "the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced." (Source) So the flags I passed as the fourth argument were being read as count and were never applied as flags.

This works as expected:

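import re

INPUTFILE = "input.gpx"

with open(INPUTFILE) as reader:
	content = reader.read()

# Fourth argument is count (0 = replace all); the flags go fifth.
# re.DOTALL lets '.' match the line breaks inside the <gpx ...> tag.
content = re.sub('<gpx.+?>', '<gpx>', content, 0, re.DOTALL)

# Print only the first 10 lines to check the result
index = 1
for line in content.splitlines():
	print(line)
	if index == 10:
		break
	else:
		index += 1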
Thank you.
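For anyone landing here later, a small sketch of the count/flags mix-up, using a shortened, made-up header. It passes re.DOTALL as count by keyword, which should behave the same as passing it as the fourth positional argument, and compares that with passing it as flags. Since re.DOTALL is an integer flag with value 16, it is silently accepted as a replacement limit. (re.MULTILINE would not have helped either way: it only changes how ^ and $ match.)

```python
import re

# Hypothetical, shortened header for illustration.
text = '<gpx\n version="1.1">\n<trk/>'

# The flag ends up in the count slot: at most 16 replacements, no DOTALL.
# '.' cannot match the '\n' right after '<gpx', so nothing is replaced.
misplaced = re.sub('<gpx.+?>', '<gpx>', text, count=re.DOTALL)

# Naming the argument makes the intent explicit and applies the flag.
with_flag = re.sub('<gpx.+?>', '<gpx>', text, flags=re.DOTALL)

print(int(re.DOTALL))
print(repr(misplaced))
print(repr(with_flag))
```

Writing flags=... by keyword every time is a cheap way to avoid this, because the positional slots for count and flags are easy to confuse.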