[SOLVED] Why does regex fail cleaning line?
Python Forum (https://python-forum.io), General Coding Help, /thread-34696.html

[SOLVED] Why does regex fail cleaning line? - Winfried - Aug-22-2021

Hello,

I can't figure out why this regex fails editing an XML header:

    with open(INPUTFILE) as reader:
        content = reader.read()

    #I need to just get <gpx>
    #OK content = re.sub('<gpx', '<BLAH', content)
    #BAD content = re.sub('<gpx.+?>', '<gpx>', content,re.MULTILINE)
    #BAD content = re.sub('<gpx.+?>', '<gpx>', content,re.MULTILINE|re.DOTALL)
    #BAD content = re.sub('<gpx.+?>', '<gpx>', content,re.MULTILINE|re.DOTALL|re.IGNORECASE)

Any idea what I could try?

Thank you.

RE: Why does regex fail cleaning line? - Gribouillis - Aug-22-2021

Could you provide example input together with the expected output, the output you get, and how you get it?

RE: Why does regex fail cleaning line? - Winfried - Aug-22-2021

Here goes:

    <?xml version="1.0" encoding="UTF-8"?>
    <gpx xmlns="http://www.topografix.com/GPX/1/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd" version="1.1">
    <trk>
    <trkseg>
    <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
    <trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
    <trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
    </trkseg>
    </trk>
    </gpx>

I need to rewrite the "<gpx…>" into a simple "<gpx>". I guess the CRLFs are messing things up.

RE: Why does regex fail cleaning line? - Gribouillis - Aug-22-2021

This works as long as there is no > in the part that you want to remove:

    import re

    src = """\
    <?xml version="1.0" encoding="UTF-8"?>
    <gpx xmlns="http://www.topografix.com/GPX/1/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd" version="1.1">
    <trk>
    <trkseg>
    <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
    <trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
    <trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
    </trkseg>
    </trk>
    </gpx>
    """

    res = re.sub(r'<gpx\b[^>]*>', '<gpx>', src)
    print(res)
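
To illustrate the caveat about > in the removed part, here is a minimal sketch. The header below is made up for this example and is not from the original file; it relies on the fact that XML allows an unescaped > inside an attribute value. In that case `[^>]*` stops too early and part of the old header is left behind.

```python
import re

# Hypothetical header: the attribute value contains a literal '>'.
header = '<gpx creator="a>b" version="1.1"><trk></trk></gpx>'

# [^>]* consumes everything up to the first '>', which here sits
# inside the quoted attribute value rather than at the end of the tag.
result = re.sub(r'<gpx\b[^>]*>', '<gpx>', header)

print(result)
```

For headers like the one posted in this thread, where no attribute value contains >, the pattern removes the whole opening tag as intended.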

RE: Why does regex fail cleaning line? - snippsat - Aug-22-2021

Since it's XML, a parser may be better suited. You usually don't need to delete anything unless you have to restructure the XML. All the useful info can easily be parsed:

    from bs4 import BeautifulSoup

    xml = '''\
    <?xml version="1.0" encoding="UTF-8"?>
    <gpx xmlns="http://www.topografix.com/GPX/1/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd" version="1.1">
    <trk>
    <trkseg>
    <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
    <trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
    <trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
    </trkseg>
    </trk>
    </gpx>'''

    soup = BeautifulSoup(xml, 'xml')

    >>> trk_seg = soup.find('trkseg')
    >>> trk_seg
    <trkseg>
    <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
    <trkpt lat="43.929379" lon="2.147619"><ele>178</ele></trkpt>
    <trkpt lat="43.929388" lon="2.147699"><ele>177.75</ele></trkpt>
    </trkseg>
    >>>
    >>> tr = trk_seg.find('trkpt')
    >>> tr
    <trkpt lat="45.649872" lon="0.156119"><ele>101.25</ele></trkpt>
    >>> tr.text
    '101.25'
    >>> tr.attrs['lat']
    '45.649872'
    >>> tr.attrs['lon']
    '0.156119'

RE: Why does regex fail cleaning line? - Winfried - Aug-22-2021

Thanks for the tip.

    import re

    INPUTFILE = "input.gpx"
    with open(INPUTFILE) as reader:
        content = reader.read()

    #get rid of NS
    #BAD content = re.sub(r'<gpx.+?>', '<gpx>', content,re.DOTALL)
    content= re.sub(r'<gpx\b[^>]*', '<gpx', content,re.DOTALL)

    index=1
    for line in content.splitlines():
        print(line)
        if index == 10:
            break
        else:
            index += 1

I'm still curious as to why the regex isn't working as expected :-/

As for using an XML parser: I do later in the script, but I first need to remove the namespace stuff in the header, which is why I first run it through a regex.

Edit: .sub() takes five parameters, not four! The fourth one is count: "the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced." (Source: the Python re documentation)

This works as expected:

    import re

    INPUTFILE = "input.gpx"
    with open(INPUTFILE) as reader:
        content = reader.read()

    # The fourth argument is count (0 = replace all); flags come fifth.
    content= re.sub('<gpx.+?>', '<gpx>', content,0, re.DOTALL)

    index=1
    for line in content.splitlines():
        print(line)
        if index == 10:
            break
        else:
            index += 1

Thank you.
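
A minimal sketch of why the earlier calls silently did nothing, assuming a shortened, made-up header that spans two lines (the attribute names `a` and `b` are placeholders). The signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a flag passed in fourth position is read as `count` and no flag is actually set. Passing `flags=` by keyword sidesteps the problem.

```python
import re

# Made-up two-line header standing in for the real GPX file.
text = '<gpx a="1"\n b="2">\n<trk/>\n</gpx>'

# Flag in the fourth slot: interpreted as count=16, flags stay at 0.
# Without DOTALL, '.' cannot cross the newline inside the tag,
# so the pattern never matches and the text comes back unchanged.
wrong = re.sub(r'<gpx.+?>', '<gpx>', text, re.DOTALL)

# Keyword argument: DOTALL is really applied, '.' matches newlines too.
right = re.sub(r'<gpx.+?>', '<gpx>', text, flags=re.DOTALL)

print(int(re.DOTALL))   # the integer that ended up in the count slot
print(wrong == text)    # whether the first call left the text untouched
print(right)
```

With the keyword form there is no need to remember that `0` has to be passed for `count` first, which is the same fix as in the edit above written slightly more defensively.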