Python Forum
Regex findall() - Printable Version



Regex findall() - NewBeie - Jul-10-2020

Hi

I have some re code that uses the findall() function. I understand that findall() is supposed to find matches of a pattern in the given string, but the output I'm getting suggests something else: I'm getting everything except what I'm searching for.

Can anybody explain to me what's happening here?

I have an XML file with this data:

<?xml version="1.0" encoding="UTF-8"?>
From: [email protected]
Sent: 22 November 11:10 AM
To: [email protected]
Good day,
Claim number:   1234567
Policy number:   2468
EA Ref number:   19-24567-R-01

Client details:   Client One

She was rude
Kind regards

Person
<?xml version="1.0">
This is the code:
import re

with open('test_file.xml', 'rb') as f:
    file_content = f.read()

decoded = file_content.decode('iso-8859-1')
found = re.findall(r'encoding="UTF-8"\?>(.*?)^<\?xml version="1.0"', decoded, re.M | re.S)
print(found)
Here is the output. It looks like the code cleaned up the text. Can someone please explain what this code does, and why we get this result from findall()?
Output:
['\nFrom: [email protected]\nSent: 22 November 11:10 AM\nTo: [email protected]\nGood day,\nClaim number: 1234567\nPolicy number: 2468\nEA Ref number: 19-24567-R-01\n\nClient details: Client One\n\nShe was rude\nKind regards\n\nPerson\n']



RE: Regex findall() - ndc85430 - Jul-10-2020

Why isn't that what you expect? Your regular expression captures everything between 'encoding="UTF-8"\?>' and '^<\?xml version="1.0"', which is what you're seeing in the result.
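A minimal sketch of why only the text between the two markers comes back: when a pattern contains exactly one capturing group, re.findall() returns what the group matched, not the whole match. The START/END sample string below is made up for illustration and is not the original file.

```python
import re

# Made-up sample text standing in for the file content
text = "START\nline one\nline two\nEND\n"

# One capturing group: findall() returns only the group's content
captured = re.findall(r"START(.*?)^END", text, re.M | re.S)
print(captured)  # ['\nline one\nline two\n']

# Non-capturing group (?:...) instead: findall() returns the whole match
whole = re.findall(r"START(?:.*?)^END", text, re.M | re.S)
print(whole)  # ['START\nline one\nline two\nEND']
```

So the markers themselves are matched in both cases; the capturing group only changes what findall() hands back.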

Also, a couple of other things:

1. Your file isn't valid XML.
2. If you do have valid XML that you want to parse and extract data from, then use a library like ElementTree (https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree), which is part of Python's standard library, rather than regular expressions. XML is meant to be machine readable, of course!


RE: Regex findall() - DeaD_EyE - Jul-10-2020

You should validate your XML document and try to work with valid XML data.
Parsing this with regex should be possible, but it's error-prone.
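For illustration, here is a rough sketch of what the regex route could look like for the "label: value" lines of the original text. The pattern and the names body, pairs and fields are assumptions chosen for this example, not something from the original post.

```python
import re

# Sample lines copied from the body of the original post
body = """Good day,
Claim number:   1234567
Policy number:   2468
EA Ref number:   19-24567-R-01

Client details:   Client One
"""

# Two capturing groups, so findall() returns (label, value) tuples.
# [^:\n]+ keeps the label on one line; [ \t]* skips padding after the colon.
pairs = re.findall(r"^([^:\n]+):[ \t]*(.+)$", body, re.M)
fields = {label.strip(): value.strip() for label, value in pairs}

print(fields["Claim number"])   # 1234567
print(fields["EA Ref number"])  # 19-24567-R-01
```

This also shows why it is error-prone: any free-text line that happens to contain a colon would be picked up as a field too.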


An XML document should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<entries>
    <person name="Person Surname">
        <from>[email protected]</from>
        <sent>22 November 11:10 AM</sent>
        <to>[email protected]</to>
        <claim_number>1234567</claim_number>
        <policy_number>2468</policy_number>
        <ea_ref_number>19-24567-R-01</ea_ref_number>
    </person>

    <person name="DeaD_EyE">
        <from>[email protected]</from>
        <sent>22 November 11:10 AM</sent>
        <to>[email protected]</to>
        <claim_number>1234567</claim_number>
        <policy_number>2468</policy_number>
        <ea_ref_number>666-666-666-666</ea_ref_number>
    </person>
</entries>
Working with this data:
import xml.etree.ElementTree as ET


xml_str = """<?xml version="1.0" encoding="UTF-8"?>
<entries>
    <person name="Person Surname">
        <from>[email protected]</from>
        <sent>22 November 11:10 AM</sent>
        <to>[email protected]</to>
        <claim_number>1234567</claim_number>
        <policy_number>2468</policy_number>
        <ea_ref_number>19-24567-R-01</ea_ref_number>
    </person>

    <person name="DeaD_EyE">
        <from>[email protected]</from>
        <sent>22 November 11:10 AM</sent>
        <to>[email protected]</to>
        <claim_number>1234567</claim_number>
        <policy_number>2468</policy_number>
        <ea_ref_number>666-666-666-666</ea_ref_number>
    </person>
</entries>
"""

doc = ET.fromstring(xml_str)
for person in doc.findall("person"):
    print("Name:", person.get("name"))
    print("Ref number", person.find("ea_ref_number").text)
    print()
Output:
Name: Person Surname
Ref number 19-24567-R-01

Name: DeaD_EyE
Ref number 666-666-666-666

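One thing to watch with the loop above: person.find(...) returns None when the element is missing, so .text would raise an AttributeError. A small sketch of a more forgiving variant using findtext() with a default follows. The shortened XML is made up for this example (the second person deliberately has no <ea_ref_number>), and the "n/a" placeholder is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# Shortened, made-up variant of the document above
xml_str = """<?xml version="1.0" encoding="UTF-8"?>
<entries>
    <person name="Person Surname">
        <claim_number>1234567</claim_number>
        <ea_ref_number>19-24567-R-01</ea_ref_number>
    </person>
    <person name="DeaD_EyE">
        <claim_number>1234567</claim_number>
    </person>
</entries>
"""

doc = ET.fromstring(xml_str)
records = []
for person in doc.findall("person"):
    records.append({
        "name": person.get("name"),
        # findtext() returns the default instead of failing on a missing element
        "ref": person.findtext("ea_ref_number", default="n/a"),
    })

print(records)
```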