Jun-06-2017, 09:20 PM
(This post was last modified: Jun-06-2017, 09:40 PM by rakhmadiev.)
Hi, I want to retrieve information from an XML line that looks like this:
<pdce:ExploratoryDrilling contextRef="FD2016Q4YTD" decimals="-3" id="Fact-FA88F003169A4B0FBD05B0E8D5017E3E" unitRef="usd">180000</pdce:ExploratoryDrilling>
The value I need is 180000.
This line has a unique identifier: 'pdce:ExploratoryDrilling contextRef="FD2013Q4YTD"'. Using only 'pdce:ExploratoryDrilling' will not work, since other lines contain that text too. I also cannot use 'id="Fact-0AD7AA10634C504BB2614E0B821523C8"' as the identifier, because later I want to iterate through many XML files and this parameter changes from file to file. So the only identifier that stays constant is 'pdce:ExploratoryDrilling contextRef="FD2013Q4YTD"'.
Until now I copied the XML file into a .txt file and parsed it line by line until Python encountered the identifier, then applied a regex to retrieve the content between the > and < symbols.
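Roughly, the text-file approach looks like the sketch below. The two sample lines are stand-ins I typed in for illustration (the 170000 value and the "Fact-A"/"Fact-B" ids are made up), and extract_value is just a name I picked for the helper.

```python
import re

# The part of the line that stays the same from file to file
TAG = '<pdce:ExploratoryDrilling contextRef="FD2013Q4YTD"'

def extract_value(lines, tag=TAG):
    """Return the text between > and < on the first line containing tag."""
    for line in lines:
        if tag in line:
            match = re.search(r'>(.*?)<', line)
            if match:
                return match.group(1)
    return None

# Made-up sample standing in for the lines of the .txt copy
sample_lines = [
    '<pdce:ExploratoryDrilling contextRef="FD2012Q4YTD" decimals="-3" id="Fact-A" unitRef="usd">170000</pdce:ExploratoryDrilling>\n',
    '<pdce:ExploratoryDrilling contextRef="FD2013Q4YTD" decimals="-3" id="Fact-B" unitRef="usd">180000</pdce:ExploratoryDrilling>\n',
]

value = extract_value(sample_lines)
print(value)
```

This only works as long as every line is a str, which is the case when reading a text file opened in text mode.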
But when I apply the same idea directly to the XML from the web it does not work, because the lines returned by urllib.request are bytes objects rather than str, so I cannot search them with a str tag.
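To show what I mean, here is a tiny sketch of the failure without any download. The sample line is shortened from the one above and hard-coded as bytes, which is my assumption of what readlines() hands back; check is just a helper name I made up.

```python
# A str tag, as in my script
tag = '<pdce:ExploratoryDrilling contextRef="FD2013Q4YTD"'

# A hard-coded bytes line standing in for one element of readlines()
raw_line = b'<pdce:ExploratoryDrilling contextRef="FD2013Q4YTD" unitRef="usd">180000</pdce:ExploratoryDrilling>\n'

def check(tag, line):
    """Try the membership test and report the outcome as text."""
    try:
        return str(tag in line)
    except TypeError as exc:
        # Python refuses to look for a str inside a bytes object
        return 'TypeError: {}'.format(exc)

result = check(tag, raw_line)
print(result)
```

The same test against a str line does not raise, so the mismatch between str and bytes seems to be the whole problem.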
Can you please advise what a workaround for this task could be? I guess lxml.etree could be an alternative, but I cannot figure out how to find the required line with it.
So far this is what I have:
import re
import urllib.request

url = 'https://www.sec.gov/Archives/edgar/data/77877/000007787714000013/pdce-20131231.xml'
tag = '<pdce:ExploratoryDrilling contextRef="FD2013Q4YTD'

# readlines() gives back a list of bytes objects, one per line
source = urllib.request.urlopen(url).readlines()
for line in source:
    # This is where it fails: tag is a str but line is bytes
    if tag in line:
        print(re.findall(r'>(.*?)<', line))
UPDATE:
I actually managed to parse line by line with str(line, 'utf-8'), which converts each bytes line to a str. But I am still interested in other, more pythonic solutions.
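One direction that might be more pythonic is to let an XML parser find the element instead of matching text. Below is a rough sketch using the standard-library xml.etree.ElementTree rather than lxml. The sample document is made up: the xbrl root, the namespace URI http://example.com/pdce and the 170000 value are placeholders, not taken from the real filing, and find_fact is a hypothetical helper name. It matches on the local tag name plus the contextRef attribute, so it should not depend on the id or on the real namespace URI.

```python
import xml.etree.ElementTree as ET

# Made-up miniature document; the namespace URI is a placeholder
SAMPLE = b"""<xbrl xmlns:pdce="http://example.com/pdce">
  <pdce:ExploratoryDrilling contextRef="FD2012Q4YTD" decimals="-3" id="Fact-A" unitRef="usd">170000</pdce:ExploratoryDrilling>
  <pdce:ExploratoryDrilling contextRef="FD2013Q4YTD" decimals="-3" id="Fact-B" unitRef="usd">180000</pdce:ExploratoryDrilling>
</xbrl>"""

def find_fact(xml_bytes, local_name, context_ref):
    """Return the text of the first element with this name and contextRef."""
    root = ET.fromstring(xml_bytes)  # accepts bytes, so no manual decoding
    for elem in root.iter():
        # ElementTree stores namespaced tags as "{namespace-uri}LocalName"
        name = elem.tag.rsplit('}', 1)[-1]
        if name == local_name and elem.get('contextRef') == context_ref:
            return elem.text
    return None

value = find_fact(SAMPLE, 'ExploratoryDrilling', 'FD2013Q4YTD')
print(value)
```

For the real file, the bytes from urllib.request.urlopen(url).read() could presumably be passed in place of SAMPLE, though I have not checked this against the actual filing.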