Python Forum
Remove tag several xml files
#1
Dear Python users,
I want to remove the same tag from several XML files in one folder. Here is a sample of one of the files:
<?xml version='1.0' encoding='UTF-8'?>
<compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
  <uid>CRSANR5L15S2017E1N001</uid>
  <metadonnees>
    <day>04 juillet 2017</day>
  </metadonnees>
  <contenu>
    <quantiemes>
      <journee>Séance du mardi 04 juillet 2017</journee>
    </quantiemes>
    <openSession valeur="" id_syceron="981337" sommaire="1" code_parole="" code_style="Présidence" code_grammaire="OUV_SEAN_1_1" id_nomination_op="0" id_nomination_oe="0" id_mandat="PM722798" id_acteur="PA332747" ordre_absolu_seance="1" id_preparation="819540" ordinal_prise="1" valeur_ptsodj="0" nivpoint="1">
       <orateurs/>
       <texte>Présidence de M. François de Rugy</texte>
    </openSession>
  </contenu>
</compteRendu>
Here is my code:
import os
import xml.etree.ElementTree as ET

path = "sourcedirection"  # source folder
dstpath = "whereIwanttosavenewxmlfiles"  # save the new XML files in a different folder

for filename in os.listdir(path):
    if filename.endswith('.xml'):
        tree = ET.parse(path+"/"+filename)  # full path of the XML file, including its name
        roots = tree.findall("contenu")
        for root in roots:
            opensessions = root.findall("openSession")
            for opensession in opensessions:
                tree.remove(opensessions)
        save = dstpath+filename
        tree.write(save, encoding="Latin-1")
Instead of removing the tag, it adds an "ns0" prefix to every element in my new XML file:

<?xml version='1.0' encoding='Latin-1'?>
<ns0:compteRendu xmlns:ns0="http://schemas.assemblee-nationale.fr/referentiel">
  <ns0:uid>CRSANR5L15S2017E1N001</ns0:uid>
  <ns0:metadonnees>
    <ns0:day>04 juillet 2017</ns0:day>
  </ns0:metadonnees>
  <ns0:contenu>
    <ns0:quantiemes>
      <ns0:journee>Séance du mardi 04 juillet 2017</ns0:journee>
    </ns0:quantiemes>
    <ns0:openSession valeur="" id_syceron="981337" sommaire="1" code_parole="" code_style="Présidence" code_grammaire="OUV_SEAN_1_1" id_nomination_op="0" id_nomination_oe="0" id_mandat="PM722798" id_acteur="PA332747" ordre_absolu_seance="1" id_preparation="819540" ordinal_prise="1" valeur_ptsodj="0" nivpoint="1">
       <ns0:orateurs />
       <ns0:texte>Présidence de M. François de Rugy</ns0:texte>
    </ns0:openSession>
  </ns0:contenu>
</ns0:compteRendu>
What am I doing wrong?
#2
This may be of interest: https://stackoverflow.com/a/4681377
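On the original question, a minimal standard-library sketch of what seems to be going wrong. The file declares a default namespace, so ElementTree stores every tag as "{namespace}name" and findall("contenu") matches nothing; the loop body never runs and the file is only re-serialized, with a generated "ns0" prefix. Also, remove() has to be called on the parent element, with the child element as its argument. The function name drop_open_sessions and the shortened sample string are illustrative assumptions, not part of the original code.

```python
import xml.etree.ElementTree as ET

NS = "http://schemas.assemblee-nationale.fr/referentiel"


def drop_open_sessions(xml_text: str) -> str:
    """Return xml_text with every openSession element (and its content) removed."""
    # Registering the URI under the empty prefix keeps it as the default
    # namespace on output instead of a generated "ns0:" prefix.
    ET.register_namespace("", NS)
    root = ET.fromstring(xml_text)
    # Snapshot the elements first so the tree is not modified while it is walked.
    for parent in list(root.iter()):
        # The search has to use the fully qualified "{namespace}name" form.
        for child in parent.findall(f"{{{NS}}}openSession"):
            parent.remove(child)  # remove() is called on the parent element
    return ET.tostring(root, encoding="unicode")


# Shortened, hypothetical version of the sample file from the first post.
sample = """<compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
  <uid>CRSANR5L15S2017E1N001</uid>
  <contenu>
    <quantiemes><journee>Séance du mardi 04 juillet 2017</journee></quantiemes>
    <openSession id_syceron="981337"><orateurs/><texte>Présidence de M. François de Rugy</texte></openSession>
  </contenu>
</compteRendu>"""

cleaned = drop_open_sessions(sample)
print(cleaned)
```

The same function could be applied to each file in the folder loop by reading the file, calling it, and writing the result to the destination folder.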
#3
Thank you for your suggestion. I forgot to mention that I want to remove the tag and the respective text. In the link that you mentioned, they want to remove the tag but not the text.
#4
Have a look at the lxml documentation. You can either remove just the tags or remove the tags together with their content.

https://lxml.de/apidoc/lxml.html.clean.html
#5
Thank you for your suggestions.
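Here is the code that worked:
import os
from lxml import etree

# path and dstpath are the source and destination folders from my first post
for filename in os.listdir(path):
    if filename.endswith('.xml'):
        tree = etree.parse(os.path.join(path, filename))
        # remove every openSession element, in any namespace, together with its content
        etree.strip_elements(tree, "{*}openSession", with_tail=True)
        save = os.path.join(dstpath, filename)
        tree.write(save)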
#6
Hello, ladies and gentlemen!

I am interested in this thread, but re and I are not good friends!

The code below works, but how can I do this without first replacing the newline characters?

I tried using re.MULTILINE but couldn't get it to work.

import re
path2xml = '/home/pedro/myPython/xml/document2.xml'

with open(path2xml) as dd:
    data = dd.read()

# newline characters cause trouble for re.search and re.sub
# easiest is replace them
newdata = re.sub('\n', 'XYZ', data)
p2get = re.compile(r'<openSession(.*?)</openSession>')
removed_stuff = re.sub(p2get, '', newdata)
# put back the newline characters
result = re.sub('XYZ', '\n', removed_stuff)

savepath = '/home/pedro/myPython/xml/'
with open(savepath + 'result.xml', 'w') as r:
    r.write(result)
Output:
<?xml version='1.0' encoding='UTF-8'?>
<compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
  <uid>CRSANR5L15S2017E1N001</uid>
  <metadonnees>
    <day>04 juillet 2017</day>
  </metadonnees>
  <contenu>
    <quantiemes>
      <journee>Séance du mardi 04 juillet 2017</journee>
    </quantiemes>

  </contenu>
</compteRendu>
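On the newline question, one possible approach, sketched under the assumption that the element is never nested inside another openSession: re.MULTILINE only changes where ^ and $ match, whereas re.DOTALL is the flag that lets "." match newline characters, so the XYZ placeholder step should not be needed. The function name remove_open_sessions and the shortened sample text are illustrative.

```python
import re

# Non-greedy match from the opening tag to the first closing tag.
# re.DOTALL makes "." match newline characters too, so the element
# can span several lines without any placeholder substitution.
OPEN_SESSION = re.compile(r"<openSession\b.*?</openSession>", re.DOTALL)


def remove_open_sessions(xml_text: str) -> str:
    """Return xml_text with every <openSession>...</openSession> block removed."""
    return OPEN_SESSION.sub("", xml_text)


# Shortened, hypothetical fragment of the sample file.
sample = (
    "<contenu>\n"
    "    <quantiemes/>\n"
    "    <openSession id_syceron=\"981337\">\n"
    "       <orateurs/>\n"
    "       <texte>Présidence de M. François de Rugy</texte>\n"
    "    </openSession>\n"
    "</contenu>\n"
)

result = remove_open_sessions(sample)
print(result)
```

A regular expression like this is fragile for XML in general (nested or self-closing elements would not be handled), so a parser-based approach is usually the more robust choice.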