Python Forum
[regex] Good way to parse variable number of items?


[regex] Good way to parse variable number of items? - Winfried - May-13-2020

Hello,

Objects in OpenStreetMap can have a variable number of key+value items, depending on what users recorded. As a result, after running a query through OverpassTurbo and exporting the data to a GPX file, I end up with heterogeneous data in the <desc>…</desc> section.

Is there a smarter way than going through different regexes?

Thank you.

import re

import gpxpy

# INPUT is the path to the exported GPX file (defined elsewhere).
OSM_phone = re.compile("phone=(.+)")
OSM_www = re.compile("website=(.+)")
OSM_email = re.compile("email=(.+)")

# Parse the file; the with-block closes it once parsing is done.
with open(INPUT, mode='rt', encoding='utf-8') as gpx_file:
	gpx = gpxpy.parse(gpx_file)

for waypoint in gpx.waypoints:
	data = {}
	data["latitude"] = waypoint.latitude
	data["longitude"] = waypoint.longitude
	data["name"] = waypoint.name

	# A waypoint without <desc> has no description; search an empty string instead.
	description = waypoint.description or ""

	m = OSM_phone.search(description)
	if m:
		data["phone"] = m.group(1)
	m = OSM_www.search(description)
	if m:
		data["www"] = m.group(1)
	m = OSM_email.search(description)
	if m:
		data["email"] = m.group(1)

	print(data)
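
For comparison, the three separate searches above could probably be folded into a single pattern. This is only a sketch: the helper name parse_description is made up for illustration, the key names are copied from the three regexes, and it assumes each item sits on its own line of the description.

```python
import re

# One pattern for every key of interest (names taken from the regexes above).
# re.MULTILINE makes ^ and $ match at the start and end of each line.
OSM_ITEM = re.compile(r"^(phone|website|email)=(.+)$", re.MULTILINE)


def parse_description(description):
    """Return a dict of the recognised key=value lines (hypothetical helper)."""
    if not description:  # assumption: a waypoint may have no description at all
        return {}
    # findall returns (key, value) tuples because the pattern has two groups.
    return {key: value.strip() for key, value in OSM_ITEM.findall(description)}


print(parse_description("phone=123456\nwebsite=http://www.acme.com"))
```

Inside the loop this would become data.update(parse_description(waypoint.description)). Note that the key stays "website" rather than "www" unless it is renamed afterwards.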



RE: [regex] Good way to parse variable number of items? - bowlofred - May-13-2020

Can you show an example of the data?


RE: [regex] Good way to parse variable number of items? - Winfried - May-14-2020

Here's what it looks like:

<?xml version="1.0" encoding="UTF-8"?>
<gpx xmlns="http://www.topografix.com/GPX/1/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd" version="1.1" creator="overpass-ide"><name>My GPX file</name>
<wpt lat="48.123" lon="2.456">
<name>blah</name>
<desc>phone=123456</desc>
</wpt>
<wpt lat="48.123" lon="2.456">
<name>blah</name>
<desc>phone=123456
mail=[email protected]</desc>
</wpt>
<wpt lat="48.123" lon="2.456">
<name>blah</name>
<desc>phone=123456
www=http://www.acme.com</desc>
</wpt>
</gpx>


In other words, some waypoints have no relevant data in their "description", some have only some of the key/value pairs I'm interested in (email, phone, or www), and some have all of them.

The code above works, but I was wondering if there were a better way.


RE: [regex] Good way to parse variable number of items? - snippsat - May-14-2020

gpxpy is itself a parser, so it may be worth looking more closely at what it can do without regex; it also has lxml in its stack.
Since the data is XML, you can also use a parser such as BeautifulSoup (BS).
from bs4 import BeautifulSoup

xml = '''\
<?xml version="1.0" encoding="UTF-8"?>
<gpx xmlns="http://www.topografix.com/GPX/1/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd" version="1.1" creator="overpass-ide"><name>My GPX file</name>
<wpt lat="48.123" lon="2.456">
<name>blah</name>
<desc>phone=123456</desc>
</wpt>
<wpt lat="48.123" lon="2.456">
<name>blah</name>
<desc>phone=123456
mail=[email protected]</desc>
</wpt>
<wpt lat="48.123" lon="2.456">
<name>blah</name>
<desc>phone=123456
www=http://www.acme.com</desc>
</wpt>
</gpx>'''

soup = BeautifulSoup(xml, 'lxml')
Test usage:
>>> info = soup.find_all('wpt')[1].text.strip()
>>> print(info)
blah
phone=123456
mail=[email protected]
Convert it to a dictionary for easier access.
>>> info = info.replace('blah', 'name=blah', 1)
>>> record = {}
>>> for i in info.splitlines():
...     key, value = i.split('=', 1)  # split at the first '=' only, so a value may contain '='
...     record[key] = value
...
>>> 
>>> record['name']
'blah'
>>> record['mail']
'[email protected]'
>>> record['phone']
'123456' 
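
The same idea can be applied to every waypoint at once. The sketch below swaps BeautifulSoup for the standard library's xml.etree.ElementTree so that it runs without extra packages; the function names desc_to_dict and read_waypoints are made up for illustration, and the query string in the sample URL is invented to show a value that itself contains '='.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <gpx> element of the sample file.
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}


def desc_to_dict(text):
    """Split 'key=value' lines into a dict; lines without '=' are skipped."""
    record = {}
    for line in (text or "").splitlines():
        key, sep, value = line.partition("=")  # split at the first '=' only
        if sep:
            record[key.strip()] = value.strip()
    return record


def read_waypoints(gpx_text):
    """Return one dict per <wpt>: coordinates, name, and the parsed <desc>."""
    root = ET.fromstring(gpx_text)
    records = []
    for wpt in root.findall("gpx:wpt", GPX_NS):
        record = {
            "latitude": float(wpt.get("lat")),
            "longitude": float(wpt.get("lon")),
            "name": wpt.findtext("gpx:name", default="", namespaces=GPX_NS),
        }
        desc = wpt.findtext("gpx:desc", default="", namespaces=GPX_NS)
        record.update(desc_to_dict(desc))
        records.append(record)
    return records


# Shortened sample; "?id=1" is an invented addition to the URL.
sample = """<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
<wpt lat="48.123" lon="2.456">
<name>blah</name>
<desc>phone=123456
www=http://www.acme.com/?id=1</desc>
</wpt>
</gpx>"""

for record in read_waypoints(sample):
    print(record)
```

Using partition instead of a bare split('=') keeps everything after the first '=' in the value, and a waypoint with no <desc> simply yields no extra keys.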



RE: [regex] Good way to parse variable number of items? - Winfried - May-15-2020

I suspected there was a better way.

Thank you!