Python Forum
Parse XML line by line
#1
Hi, I want to retrieve information from an XML line that looks like this:

<pdce:ExploratoryDrilling contextRef="FD2016Q4YTD" decimals="-3" id="Fact-FA88F003169A4B0FBD05B0E8D5017E3E" unitRef="usd">180000</pdce:ExploratoryDrilling>

The value I need is 180000.

This line has a unique identifier: "pdce:ExploratoryDrilling contextRef="FD2013Q4YTD" (using only "pdce:ExploratoryDrilling" will not work, since there are other lines with this text). At the same time I cannot use "id="Fact-0AD7AA10634C504BB2614E0B821523C8"" as the identifier, because later I want to iterate through XML files and this parameter changes from file to file. So the only identifier that remains constant is "pdce:ExploratoryDrilling contextRef="FD2013Q4YTD".

I used to copy the XML file to a .txt file and parse it line by line until Python encountered the identifier, then apply a regex to retrieve the content between the > < symbols.

But when I apply the same concept to the XML directly it does not work, since urllib.request returns bytes, which I cannot use for this purpose.

Can you please advise a workaround for this task? I guess lxml.etree could be an alternative, but I cannot figure out how to find the required line.

So far this is what I have

import re
import urllib.request

url = 'https://www.sec.gov/Archives/edgar/data/77877/000007787714000013/pdce-20131231.xml'
tag = '<pdce:ExploratoryDrilling contextRef="FD2013Q4YTD'

source = urllib.request.urlopen(url).readlines()

for line in source:
    if tag in line:  # fails here: line is bytes, tag is str (TypeError)
        print(re.findall(r'>(.*?)<', line))

UPDATE:
I actually managed to parse line by line with str(line, 'utf-8'), which converts the bytes to str. But I'm still interested in other, more pythonic solutions.
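For reference, here's the loop with that decoding step applied (same URL and tag as above):

import re
import urllib.request

url = 'https://www.sec.gov/Archives/edgar/data/77877/000007787714000013/pdce-20131231.xml'
tag = '<pdce:ExploratoryDrilling contextRef="FD2013Q4YTD'

for line in urllib.request.urlopen(url):
    line = str(line, 'utf-8')  # decode the bytes that urllib returns
    if tag in line:
        print(re.findall(r'>(.*?)<', line))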
#2
There are two ways to parse XML. The first, and probably more common, method is DOM, where the whole document is parsed and then handed to you as a tree you can navigate. This is how HTML is handled, and it's fine for XML documents that are small enough to fit into RAM.

The other method is SAX: instead of handing you the document once it's parsed, it fires off events as it parses the XML. This is the method that's pretty much required if the document is too big to fit in RAM.

The main differences are that DOM is much easier to use, but SAX is faster and uses fewer resources.
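For reference, a minimal DOM-style sketch for this thread's document, using xml.etree.ElementTree from the standard library; it matches on the element's local name, so the pdce namespace URI (which I haven't looked up) doesn't have to be hard-coded:

import urllib.request
import xml.etree.ElementTree as ET

url = 'https://www.sec.gov/Archives/edgar/data/77877/000007787714000013/pdce-20131231.xml'

# parse the whole document into a tree (DOM-style)
tree = ET.parse(urllib.request.urlopen(url))
for elem in tree.getroot().iter():
    # namespaced tags look like '{namespace-uri}ExploratoryDrilling'
    if elem.tag.endswith('}ExploratoryDrilling') and elem.get('contextRef') == 'FD2013Q4YTD':
        print(elem.text)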

What you did is neither; you just treated it like a plain text file instead of as XML. Which is fine since it works, but if there's any extra whitespace, or the attributes are in a different order, it'll stop working (or you'll only get some of the results).

If you're looking for something faster, the easiest thing to do would be to compile the regex you're using before you start looping over the document. Though, I think in recent versions of re, the last few regexes you used are cached by the module, so you probably wouldn't notice much improvement there.
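A tiny self-contained illustration of that, with stand-in data:

import re

pattern = re.compile(r'>(.*?)<')  # compiled once, reused every iteration

lines = ['<a>180000</a>', 'no angle brackets here']  # hypothetical stand-in data
for line in lines:
    print(pattern.findall(line))  # ['180000'], then []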

The other thing you could try is using SAX. Since you're looking for a more pythonic way of doing things, I'll share how I'd probably do it...
import re
import urllib.request
import xml.sax

url = 'https://www.sec.gov/Archives/edgar/data/77877/000007787714000013/pdce-20131231.xml'


def original():
    tag = '<pdce:ExploratoryDrilling contextRef="FD2013Q4YTD'
    source = urllib.request.urlopen(url).readlines()

    #regex = re.compile(r">(.*?)<")
    results = []
    for line in source:
        line = line.decode()
        if tag in line:
            results.append(re.findall(r">(.*?)<", line))
    return results


class SecEdgarHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.relevant_element = False
        self.matches = []

    def startElement(self, name, attrs):
        if name == "pdce:ExploratoryDrilling":
            context = attrs["contextRef"]
            if context == "FD2013Q4YTD":
                self.relevant_element = True

    def endElement(self, name):
        # doesn't matter what the element is called, if it's ending, we're
        # done caring
        self.relevant_element = False

    def characters(self, data):
        if self.relevant_element:
            self.matches.append(data)


def nilamo():
    handler = SecEdgarHandler()
    xml.sax.parse(urllib.request.urlopen(url), handler)
    return handler.matches


if __name__ == "__main__":
    import timeit

    for method in ["nilamo", "original"]:
        time = timeit.timeit(
            "{0}()".format(method), globals=globals(), number=10)
        print("Time: {0} - Method: {1}".format(time, method))
10 iterations each isn't really enough, but it's enough to wear my patience thin :p
In any event, they both perform perfectly fine, with yours using about 4 MB more RAM and slightly more CPU. The timing differences can be placed squarely on the network; a single run of either is under a second, including network latency, which is fine.
Output:
D:\Projects\playground>python xml-test.py
Time: 9.730379776340019 - Method: nilamo
Time: 8.586807403289992 - Method: original

D:\Projects\playground>python xml-test.py
Time: 8.480345109714438 - Method: nilamo
Time: 8.929470898404862 - Method: original

D:\Projects\playground>python xml-test.py
Time: 8.74076554180874 - Method: nilamo
Time: 9.251647701750871 - Method: original
#3
I never use urllib or the DOM/SAX parsers from the standard library; there are better solutions that have become almost a standard in the Python world, e.g. Requests, BeautifulSoup, lxml.
Example:
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/77877/000007787714000013/pdce-20131231.xml'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml-xml')
drill = soup.find('ExploratoryDrilling', attrs={'contextRef':"FD2013Q4YTD"})
print(drill.text) #--> 58988000
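Note that find('ExploratoryDrilling') matches without the pdce: prefix because BeautifulSoup's XML mode stores the namespace prefix separately from the tag name. Since lxml.etree was mentioned earlier in the thread, here's a minimal sketch of the same lookup done directly with it (assuming requests and lxml are installed); the XPath matches on local-name(), so the namespace URI doesn't have to be hard-coded:

import requests
from lxml import etree

url = 'https://www.sec.gov/Archives/edgar/data/77877/000007787714000013/pdce-20131231.xml'
root = etree.fromstring(requests.get(url).content)

# match the element by local name plus the contextRef attribute
for node in root.xpath('//*[local-name()="ExploratoryDrilling"][@contextRef="FD2013Q4YTD"]'):
    print(node.text)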
#4
Regexes have their place, but parsing XML, HTML, and similar text-based formats is NOT one of them.
It is problems like this that gave rise to the adage: "if you have a problem and think that regex is the answer, then you have two problems".
The thing is that you either write a regex so tight that you avoid false positives but suffer false negatives, or you write something so loose that you get false positives. You are in trouble either way.
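To make that concrete with the line from the first post: a substring or regex match keyed on the opening tag plus contextRef silently fails as soon as the writer emits the attributes in a different order, which is perfectly legal XML:

<pdce:ExploratoryDrilling decimals="-3" contextRef="FD2016Q4YTD" id="Fact-FA88F003169A4B0FBD05B0E8D5017E3E" unitRef="usd">180000</pdce:ExploratoryDrilling>

A real parser treats both orderings identically.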
Susan (a REGEX campaigner of old)
#5
I also use BeautifulSoup and requests, and find them easy to use.
One thing I like to do with XML is to use the CSS select option in
BeautifulSoup.
Here's an example that scrapes an XML index and creates a JSON file
from the data using the CSS select method (for the first big list).
(Set use_local_data=False for the first run, so the index gets downloaded.)
# Copyright 2017 Larz60+
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of
# this software and associated documentation files (the "Software"), to deal in the
# Software without restriction, including without limitation the rights to use,
# copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the
# Software, and to permit persons to whom the Software is furnished to do so,
# subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
# INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
# PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
# HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
# SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
#
# Credits: All RFC data comes from: https://www.rfc-editor.org
#
from bs4 import BeautifulSoup
import requests
import socket
import sys
import json


class ExtractRFC:
    def __init__(self, use_local_data=True):
        self.use_local_data = use_local_data
        self.xurl = 'https://www.rfc-editor.org/rfc-index.xml'
        self.soup, self.rfc_entries = self.make_soup()
        self.tdict = {}
        self.parse()

    def make_soup(self):
        if self.use_local_data or socket.gethostbyname(socket.gethostname()) != '127.0.0.1':
            try:
                if self.use_local_data:
                    # read the previously downloaded index
                    with open('data\\rfc-index.xml') as f:
                        x = f.read()
                else:
                    # download the index and cache it locally
                    with open('data\\rfc-index.xml', 'wb') as f:
                        x = requests.get(self.xurl, stream=True).text
                        f.write(x.encode(sys.stdout.encoding, errors='replace'))
                soup = BeautifulSoup(x, 'lxml')
                rfc_entries = soup.select('rfc-entry')
            except Exception:
                print("Unexpected error:", sys.exc_info()[0])
                raise
        return soup, rfc_entries

    def parse(self):
        for entry in self.rfc_entries:
            tag_in_dict = []
            doc_id = entry.find('doc-id').text
            self.tdict[doc_id] = {}
            for tag in entry.find_all():
                # only look at direct children of the rfc-entry element
                if tag.parent.name != 'rfc-entry':
                    continue
                same_tag = len(entry.find_all(tag.name))
                slen = len(tag.find_all())
                if tag.name not in tag_in_dict:
                    tag_in_dict.append(tag.name)
                    # repeated tags collect into a list, unique ones into a dict
                    if same_tag > 1:
                        self.tdict[doc_id][tag.name] = []
                    else:
                        self.tdict[doc_id][tag.name] = {}
                if slen:
                    for stag in tag.find_all():
                        if stag.parent.name != tag.name:
                            continue
                        if same_tag > 1:
                            self.tdict[doc_id][tag.name].append(stag.text)
                        else:
                            self.tdict[doc_id][tag.name][stag.name] = stag.text
                else:
                    self.tdict[doc_id][tag.name] = tag.text
        if not self.use_local_data:
            with open('data\\RFCindex.json', 'w') as jout:
                json.dump(self.tdict, jout)


if __name__ == '__main__':
    ExtractRFC(use_local_data=True)
#6
Thanks for your advice.

I love BeautifulSoup too and use it for HTML parsing, but I couldn't figure out how to find a line with an additional attribute such as contextRef. Now I've got it. Thanks!

