Python Forum

Full Version: extracting XML data into datframe
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
I am querying an API that returns the following, I am trying to extract the name and flow rate data using the below code, but can't work out why its not working?

Quote:b'<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body><GetInstantaneousFlowDataResponse xmlns="http://www.NationalGrid.com/EDP/UI/"><GetInstantaneousFlowDataResult>
<ReportName xmlns="http://www.NationalGrid.com/EDP/BusinessEntities/Public">Instantaneous Flow Report</ReportName>
<PublishedTime xmlns="http://www.NationalGrid.com/EDP/BusinessEntities/Public">2018-07-29T10:47:01</PublishedTime>
<EDPReportPage xmlns="http://www.NationalGrid.com/EDP/BusinessEntities/Public">
<PageName>page1</PageName>
<CurrentGasDay>2018-07-29T00:00:00</CurrentGasDay>
<EDPEnergyGraphTableCollection>
<EDPEnergyGraphTableBE>
<EDPEnergyGraphTableName>table1</EDPEnergyGraphTableName>
<ItemPosition>1</ItemPosition>
<EDPObjectCollection><EDPObjectBE>
<EDPObjectName>ALDBROUGH</EDPObjectName>
<EnergyDataList><EDPEnergyDataBE>
<ApplicableAt>2018-07-29T10:36:00</ApplicableAt>
<FlowRate>0</FlowRate>
<QualityIndicator />
<ScheduleTime>2018-07-29T10:48:00</ScheduleTime>
</EDPEnergyDataBE><EDPEnergyDataBE>
<ApplicableAt>2018-07-29T10:38:00</ApplicableAt>
<FlowRate>0</FlowRate>
<QualityIndicator />
<ScheduleTime>2018-07-29T10:48:00</ScheduleTime>
</EDPEnergyDataBE><EDPEnergyDataBE>
<ApplicableAt>2018-07-29T10:40:00</ApplicableAt>
<FlowRate>0</FlowRate><QualityIndicator

import requests
from lxml import etree
    
fromDate = "2018-07-29"

def getXML():
    url="http://energywatch.natgrid.co.uk/EDP-PublicUI/PublicPI/InstantaneousFlowWebService.asmx"
    headers = {'content-type': 'application/soap+xml; charset=utf-8'}
    body ="""<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
  <soap12:Body>
    <GetInstantaneousFlowData xmlns="http://www.NationalGrid.com/EDP/UI/" />
  </soap12:Body>
</soap12:Envelope>"""

    response = requests.post(url,data=body,headers=headers)
    return response.content

import pandas as pd
df1 = pd.DataFrame(columns=("name","value"))   
for pd_date in pd.date_range(fromDate, periods=1):
    day = pd_date.strftime('%Y-%m-%d')
     
    root = etree.fromstring(getXML())
     
    #map prefix 'd' to the default namespace URI
    ns = {'d': 'http://www.NationalGrid.com/EDP/UI/'}
     
    publication_objects = root.xpath('//d:EDPObjectCollection', namespaces=ns)
     
    for obj in publication_objects:
        name = obj.find('d:EDPObjectName', ns).text
     
        for data in obj.findall('d:EnergyDataList/d:EDPEnergyDataBE', ns):
            value = float(data.find('d:FlowRate', ns).text)
        
            df1.loc[len(df1) +1] = [name, value]
            
(Jul-29-2018, 02:54 PM)hey_arnold Wrote: [ -> ]can't work out why its not working
not working is not clear. not working how?
Do you get an exception? In this case post the full traceback in error tags.
The result is not what you expect? In this case show us what you expect it to be.
This is the returned XML:
Output:
b' <?xml version="1.0" encoding="utf-8"?> <soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"> <soap:Body> <GetInstantaneousFlowDataResponse xmlns="http://www.NationalGrid.com/EDP/UI/"> <GetInstantaneousFlowDataResult> <ReportName xmlns="http://www.NationalGrid.com/EDP/BusinessEntities/Public">Instantaneous Flow Report </ReportName> <PublishedTime xmlns="http://www.NationalGrid.com/EDP/BusinessEntities/Public">2018-07-29T17:11:00 </PublishedTime> <EDPReportPage xmlns="http://www.NationalGrid.com/EDP/BusinessEntities/Public"> <PageName>page1</PageName> <CurrentGasDay>2018-07-29T00:00:00</CurrentGasDay> <EDPEnergyGraphTableCollection> <EDPEnergyGraphTableBE> <EDPEnergyGraphTableName>table1</EDPEnergyGraphTableName> <ItemPosition>1</ItemPosition> <EDPObjectCollection> <EDPObjectBE> <EDPObjectName>ALDBROUGH</EDPObjectName> <EnergyDataList> <EDPEnergyDataBE> <ApplicableAt>2018-07-29T17:00:00</ApplicableAt> <FlowRate>0</FlowRate> <QualityIndicator /> <ScheduleTime>2018-07-29T17:12:00</ScheduleTime> </EDPEnergyDataBE> <EDPEnergyDataBE> <ApplicableAt>2018-07-29T17:02:00</ApplicableAt> <FlowRate>0</FlowRate> <QualityIndicator /> <ScheduleTime>2018-07-29T17:12:00</ScheduleTime> </EDPEnergyDataBE> <EDPEnergyDataBE> <ApplicableAt>2018-07-29T17:04:00</ApplicableAt> <FlowRate>0</FlowRate> <QualityIndicator /> <ScheduleTime>2018-07-29T17:12:00</ScheduleTime> </EDPEnergyDataBE> <EDPEnergyDataBE> <ApplicableAt>2018-07-29T17:06:00</ApplicableAt> <FlowRate>0</FlowRate> <QualityIndicator /> <ScheduleTime>2018-07-29T17:12:00</ScheduleTime> </EDPEnergyDataBE> <EDPEnergyDataBE> <ApplicableAt>2018-07-29T17:08:00</ApplicableAt> <FlowRate>0</FlowRate> <QualityIndicator /> <ScheduleTime>2018-07-29T17:12:00</ScheduleTime> </EDPEnergyDataBE> <EDPEnergyDataBE> <ApplicableAt>2018-07-29T17:10:00</ApplicableAt> <FlowRate>0</FlowRate> <QualityIndicator /> <ScheduleTime>2018-07-29T17:12:00</ScheduleTime> </EDPEnergyDataBE> </EnergyDataList> </EDPObjectBE> </EDPObjectCollection> <Description>Total Supply</Description> </EDPEnergyGraphTableBE> </EDPEnergyGraphTableCollection> </EDPReportPage> </GetInstantaneousFlowDataResult> </GetInstantaneousFlowDataResponse> </soap:Body> </soap:Envelope>'
I would like to extract the following information: EDPObjectName, ApplicableAt, Flow Rate & ScheduleTime. I have tried modifying some code I have found on this forum, but my dataframe does not populate with any data, it just has the column names.

No exceptions are produced when I run the code.

I was hoping that the data would be picked out of the XML and put into a dataframe.

import requests
from lxml import etree
    
fromDate = "2018-07-29"

def getXML():
    url="http://energywatch.natgrid.co.uk/EDP-PublicUI/PublicPI/InstantaneousFlowWebService.asmx"
    headers = {'content-type': 'application/soap+xml; charset=utf-8'}
    body ="""<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
  <soap12:Body>
    <GetInstantaneousFlowData xmlns="http://www.NationalGrid.com/EDP/UI/" />
  </soap12:Body>
</soap12:Envelope>"""

    response = requests.post(url,data=body,headers=headers)
    return response.content

import pandas as pd
df1 = pd.DataFrame(columns=("name","value"))   
for pd_date in pd.date_range(fromDate, periods=1):
    day = pd_date.strftime('%Y-%m-%d')
     
    root = etree.fromstring(getXML())
     
    #map prefix 'd' to the default namespace URI
    ns = {'d': 'http://www.NationalGrid.com/EDP/UI/'}
     
    publication_objects = root.xpath('//d:EDPReportBE', namespaces=ns)
     
    for obj in publication_objects:
        name = obj.find('d:EDPObjectName', ns).text

        for data in obj.findall('d:EDPObjectBE/d:EDPEnergyDataBE', ns):
            value = float(data.find('d:FlowRate', ns).text)
        
            df1.loc[len(df1) +1] = [name, value]