Python Forum
[regex] Good way to parse variable number of items?
#1
Hello,

Objects in OpenStreetMap can have a variable number of key+value items, depending on what users recorded. As a result, after running a query through OverpassTurbo and exporting the data to a GPX file, I end up with heterogeneous data in the <desc>…</desc> section.

Is there a smarter way to parse it than running a separate regex for each key?

Thank you.

import re

import gpxpy

# INPUT is the path to the GPX file exported from OverpassTurbo (defined elsewhere).

# One regex per key of interest.
OSM_phone = re.compile(r"phone=(.+)")
OSM_www = re.compile(r"website=(.+)")
OSM_email = re.compile(r"email=(.+)")

with open(INPUT, mode='rt', encoding='utf-8') as gpx_file:
	gpx = gpxpy.parse(gpx_file)

for waypoint in gpx.waypoints:
	data = {}
	data["latitude"] = waypoint.latitude
	data["longitude"] = waypoint.longitude
	data["name"] = waypoint.name

	# The description may be None when a waypoint has no <desc>, so fall back to "".
	desc = waypoint.description or ""

	m = OSM_phone.search(desc)
	if m:
		data["phone"] = m.group(1)
	m = OSM_www.search(desc)
	if m:
		data["www"] = m.group(1)
	m = OSM_email.search(desc)
	if m:
		data["email"] = m.group(1)

	print(data)
#2
Can you show an example of the data?
#3
Here's what it looks like:

<?xml version="1.0" encoding="UTF-8"?>
<gpx xmlns="http://www.topografix.com/GPX/1/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd" version="1.1" creator="overpass-ide"><name>My GPX file</name>
<wpt lat="48.123" lon="2.456">
<name>blah</name>
<desc>phone=123456</desc>
</wpt>
<wpt lat="48.123" lon="2.456">
<name>blah</name>
<desc>phone=123456
[email protected]</desc>
</wpt>
<wpt lat="48.123" lon="2.456">
<name>blah</name>
<desc>phone=123456
www=http://www.acme.com</desc>
</wpt>
</gpx>


IOW, some waypoints have no relevant data in their "description", some have only some of the key=value pairs I'm interested in (email, phone, or www), and some have all of them.

The code above works, but I was wondering whether there was a better way.
#4
gpxpy is already a parser, so it may be worth looking more closely at what it can do without regex, as it also has lxml in its stack.
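As a minimal sketch of the no-regex idea: each line of the description is a key=value pair, so str.partition can split it without one pattern per key. The helper name parse_tags is hypothetical, and the sample strings are made-up stand-ins for waypoint.description values, which are assumed to be None when a waypoint has no <desc>.

```python
def parse_tags(description):
    """Turn 'key=value' lines into a dict; lines without '=' are ignored."""
    tags = {}
    # Assumption: a missing description arrives as None, so treat it as "".
    for line in (description or "").splitlines():
        # partition splits at the first '=' only, so '=' inside a URL survives.
        key, sep, value = line.partition("=")
        if sep and key.strip():
            tags[key.strip()] = value.strip()
    return tags


# Made-up sample descriptions, not real OSM data.
print(parse_tags("phone=123456"))
print(parse_tags("phone=123456\nwww=http://www.acme.com/?a=b"))
print(parse_tags(None))
```

With gpxpy, the loop from the first post could then shrink to data.update(parse_tags(waypoint.description)), and any key users recorded is picked up, not only phone, website and email.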
Since it's XML, you can also use a parser such as BeautifulSoup (BS).
from bs4 import BeautifulSoup

xml = '''\
<?xml version="1.0" encoding="UTF-8"?>
<gpx xmlns="http://www.topografix.com/GPX/1/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd" version="1.1" creator="overpass-ide"><name>My GPX file</name>
<wpt lat="48.123" lon="2.456">
<name>blah</name>
<desc>phone=123456</desc>
</wpt>
<wpt lat="48.123" lon="2.456">
<name>blah</name>
<desc>phone=123456
[email protected]</desc>
</wpt>
<wpt lat="48.123" lon="2.456">
<name>blah</name>
<desc>phone=123456
www=http://www.acme.com</desc>
</wpt>
</gpx>'''

soup = BeautifulSoup(xml, 'lxml')
Test usage:
>>> info = soup.find_all('wpt')[1].text.strip()
>>> print(info)
blah
phone=123456
[email protected]
Convert it to a dictionary for easier access.
>>> info = info.replace('blah', 'name=blah', 1)
>>> record = {}
>>> for i in info.splitlines():
...     # Split at the first '=' only, so values containing '=' stay intact.
...     key, value = i.split('=', 1)
...     record[key] = value
...
>>> record['name']
'blah'
>>> record['mail']
'[email protected]'
>>> record['phone']
'123456' 
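If installing bs4 is not an option, the same idea can be sketched with the standard library only. This is an assumption-laden example, not the gpxpy or BS way: waypoint_records is a hypothetical name, SAMPLE is a shortened made-up file in the same shape as the one above, and only the GPX 1.1 namespace from that file is handled.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the sample GPX file.
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

# Shortened, made-up sample in the same shape as the exported file.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="overpass-ide">
<wpt lat="48.123" lon="2.456">
<name>blah</name>
<desc>phone=123456
www=http://www.acme.com</desc>
</wpt>
<wpt lat="48.123" lon="2.456">
<name>blah</name>
</wpt>
</gpx>"""


def waypoint_records(gpx_text):
    """Return one dict per <wpt>, merging in every key=value line of <desc>."""
    # Encode to bytes so the XML declaration's encoding is honoured.
    root = ET.fromstring(gpx_text.encode("utf-8"))
    records = []
    for wpt in root.findall("gpx:wpt", GPX_NS):
        record = {
            "latitude": float(wpt.get("lat")),
            "longitude": float(wpt.get("lon")),
            "name": wpt.findtext("gpx:name", default="", namespaces=GPX_NS),
        }
        # A waypoint without <desc> yields the default empty string.
        desc = wpt.findtext("gpx:desc", default="", namespaces=GPX_NS)
        for line in desc.splitlines():
            key, sep, value = line.partition("=")  # first '=' only
            if sep:
                record[key.strip()] = value.strip()
        records.append(record)
    return records


for record in waypoint_records(SAMPLE):
    print(record)
```

Note that a key=value line in <desc> whose key is also name, latitude or longitude would overwrite the waypoint field in this sketch; prefixing the tag keys would avoid that.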
#5
I suspected there was a better way.

Thank you!