Python Forum
[Regex] Findall returns wrong number of hits
#1
Hello,

It's probably something obvious to seasoned Python programmers, but I can't figure out why re.findall() returns the wrong number of hits with the following code:
import re,sys,locale

# Open file
f = open('input.gpx', 'r')

#OK
#strings = re.findall(r'<trk>', f.read())
#BAD!
strings = re.findall(r'<trk>.+?</trk>', f.read())

f.close()

if strings:
	#27 instead of 348!
	print "Number of items : ", len(strings)
The file contains some French characters. Could it be that accented characters in the input file are preventing Python from reading the whole file?

FWIW, I'm using Python 2.7.14 on Windows.

Thank you.
Reply
#2
Please post input.gpx (e.g. via JustBeamIt) or a sample of the file content, along with the output you want.
(Aug-21-2018, 03:28 PM)Winfried Wrote: FWIW, I'm using Python 2.7.14 on Windows
You should be using Python 3.6/3.7 and pip installation under Windows
Unicode support is better there, e.g. for French characters.
Reply
#3
Thanks.

So I removed 2.7.14, installed 3.7.0*, re-ran the code, and… same error: Wrong number of hits:

import re,sys,locale

# Open file
f = open('input.gpx', 'r')
#GOOD strings = re.findall(r'<trk>', f.read())

#Only 27!
#BAD strings = re.findall(r'<trk>.+?</trk>', f.read())

#Only 27!
p = re.compile(r'(<trk>.+?</trk>)',re.MULTILINE)
strings = p.findall(f.read())

f.close()

if strings:
	#27 instead of 348!
	print("Number of items : " + str(len(strings)))
* Checked that "python -V" returns "Python 3.7.0"
Reply
#4
It's the same problem for us: we don't know what input.gpx looks like or what output you want from it.
A sensible default is to always open text files with an explicit encoding, usually UTF-8:
with open('input.gpx', encoding='utf-8') as f:
    print(f.read())
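As a minimal sketch of the same idea on a throwaway file (the file name is made up for illustration, and the sample line is not taken from the real input.gpx layout), writing and reading with an explicit encoding round-trips accented characters without depending on the platform default:

```python
import os
import tempfile

# Hypothetical sample text with an accented character; not the real input.gpx.
sample = "<name>Viroflay - Vélizy - Arcueil</name>\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.gpx")

    # Write the sample to a temporary file as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Read it back with the same explicit encoding instead of relying on
    # the platform default (for example cp1252 on many Windows setups).
    with open(path, encoding="utf-8") as f:
        text = f.read()

print(text == sample)
```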
Reply
#5
The code above does display the whole file, as expected.

Maybe it's some odd character stopping the regex module dead in its tracks.

If someone wants to give it a shot, here's the input file.

As a work-around, the gpxpy module works OK with my file.

https://ocefpaf.github.io/python4oceanog...08/18/gpx/

Thanks for your help.
Reply
#6
GPX files use XML namespaces, so you can parse them with lxml and pass a namespace map:
from lxml import etree

NSMAP = {"gpx": "http://www.topografix.com/GPX/1/1"}
tree = etree.parse("input.gpx")
for elem in tree.findall("gpx:trk", namespaces=NSMAP):
    print(elem)
Output:
<Element {http://www.topografix.com/GPX/1/1}trk at 0x4e469e0>
<Element {http://www.topografix.com/GPX/1/1}trk at 0x4e469b8>
<Element {http://www.topografix.com/GPX/1/1}trk at 0x4e46990>
<Element {http://www.topografix.com/GPX/1/1}trk at 0x4e46968>
....
Then try to pull something out of an element:
>>> elem
<Element {http://www.topografix.com/GPX/1/1}trk at 0x4e73760>
>>> elem[0].tag
'{http://www.topografix.com/GPX/1/1}name'
>>> elem[0].text
'Viroflay - Vélizy - Arcueil'
You can also search PyPI for a GPX parser.
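If lxml isn't installed, the standard library's xml.etree.ElementTree accepts the same kind of namespace mapping. The sketch below swaps it in for lxml and runs on a tiny inline document invented for illustration; only the namespace URI and the first track name come from this thread.

```python
import xml.etree.ElementTree as ET

NSMAP = {"gpx": "http://www.topografix.com/GPX/1/1"}

# Made-up two-track document; the real input.gpx is not reproduced here.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="example">
  <trk>
    <name>Viroflay - Vélizy - Arcueil</name>
  </trk>
  <trk><name>Second track</name></trk>
</gpx>"""

# Parse from bytes so the encoding declaration in the prolog is honoured.
root = ET.fromstring(SAMPLE.encode("utf-8"))

# The "gpx:" prefix is resolved through NSMAP, as in the lxml version.
tracks = root.findall("gpx:trk", NSMAP)
names = [trk.findtext("gpx:name", namespaces=NSMAP) for trk in tracks]

print(len(tracks), names)
```

Because the parser works on the element tree rather than on raw text, line breaks inside a track make no difference to the count.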
Reply
#7
Thanks, good to know.

I guess the file wasn't clean enough for the regex module to handle.
Reply
#8
I couldn't find out why code using the gpxpy module has two "import" lines:

import gpxpy
import gpxpy.gpx
What's the difference?

===
Edit: Found it. gpxpy.gpx is a sub-module of the gpxpy package.

To learn about a module, launch Python, and do something like this:

>>> import gpxpy
>>> help(gpxpy)

>>> import gpxpy.gpx
>>> help(gpxpy.gpx)
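To see which sub-modules a package ships without reading through help(), the standard library's pkgutil can list them. This is only a sketch: it uses the built-in json package as a stand-in because gpxpy may not be installed, and list_submodules is a hypothetical helper name.

```python
import importlib
import pkgutil


def list_submodules(package_name):
    """Return the sorted names of the sub-modules found directly inside a package."""
    package = importlib.import_module(package_name)
    # Only packages have __path__; a plain module has no sub-modules to list.
    paths = getattr(package, "__path__", None)
    if paths is None:
        return []
    return sorted(info.name for info in pkgutil.iter_modules(paths))


print(list_submodules("json"))
```

Calling list_submodules("gpxpy") should work the same way once gpxpy is installed.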
Reply
#9
In case another newbie stumbles on the same issue…

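Gpxpy probably offers useful features anyway, but the reason the regex above didn't work is simply that the "re.DOTALL" switch is needed. By default "." doesn't match a newline, so only the <trk> elements written on a single line were found. ("re.MULTILINE" only changes how "^" and "$" behave, so it has no effect on this pattern.)

# inputfile is assumed to hold the file contents as a string, e.g. from f.read()
p = re.compile(r'<trk>.+?</trk>', re.DOTALL)
m = p.findall(inputfile)
for item in m:
    print(item)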
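A self-contained sketch of the difference, run on a small made-up string rather than the real file: without re.DOTALL the dot stops at every newline, so only tracks written on a single line are matched.

```python
import re

# Made-up sample: one single-line track and two tracks spanning several lines.
sample = (
    "<trk><name>A</name></trk>\n"
    "<trk>\n<name>B</name>\n</trk>\n"
    "<trk>\n<name>C</name>\n</trk>\n"
)

pattern = r"<trk>.+?</trk>"

# No flags: "." cannot cross a newline, so multi-line tracks are skipped.
no_flags = re.findall(pattern, sample)

# re.MULTILINE only affects "^" and "$", so the result is unchanged.
multiline_only = re.findall(pattern, sample, re.MULTILINE)

# re.DOTALL lets "." match newlines, so every track is found.
dotall = re.findall(pattern, sample, re.DOTALL)

print(len(no_flags), len(multiline_only), len(dotall))
```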
Reply

