Python Forum
Regex findall()
#1
Hi

I have some code that uses the re.findall() function. My understanding is that findall() is supposed to find a specific match in the given string, but the output I'm getting suggests something else: I get everything except what I'm searching for.

Can anybody explain to me what's happening here?

I have an XML file with this data:

<?xml version="1.0" encoding="UTF-8"?>
From: [email protected]
Sent: 22 November 11:10 AM
To: [email protected]
Good day,
Claim number:   1234567
Policy number:   2468
EA Ref number:   19-24567-R-01

Client details:   Client One

She was rude
Kind regards

Person
<?xml version="1.0">
This is the code:
import re

with open('test_file.xml', 'rb') as f:
    file_content = f.read()

decoded = file_content.decode('iso-8859-1')
found = re.findall(r'encoding="UTF-8"\?>(.*?)^<\?xml version="1.0"', decoded, re.M | re.S)
print(found)
The output is below. It looks like the code cleaned the text. Can someone please explain why findall() returns this?
Output:
['\nFrom: [email protected]\nSent: 22 November 11:10 AM\nTo: [email protected]\nGood day,\nClaim number: 1234567\nPolicy number: 2468\nEA Ref number: 19-24567-R-01\n\nClient details: Client One\n\nShe was rude\nKind regards\n\nPerson\n']
#2
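Why isn't that what you expect? Your regular expression captures everything between 'encoding="UTF-8"\?>' and '^<\?xml version="1.0"', which is what you're seeing in the result. When a pattern contains a single capture group, findall() returns only the text matched by that group, not the whole match.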
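To make the difference concrete, here is a minimal sketch comparing what findall() returns with the full match. The sample string is a shortened, hypothetical stand-in for the file contents, not the original file.

```python
import re

# Hypothetical, shortened stand-in for the decoded file contents.
text = 'encoding="UTF-8"?>\nFrom: someone\nClaim number: 1234567\n<?xml version="1.0">'

# Same pattern as in the question: one capture group between two delimiters.
pattern = r'encoding="UTF-8"\?>(.*?)^<\?xml version="1.0"'

# With exactly one capture group, findall() returns only the group's text.
captured = re.findall(pattern, text, re.M | re.S)

# group(0) of a match object is the whole match, delimiters included.
whole = re.search(pattern, text, re.M | re.S).group(0)

print(captured)
print(repr(whole))
```

So the delimiters are matched, but they are left out of the result because they sit outside the parentheses.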

Also, a couple of other things:

1. Your file isn't valid XML.
2. If you do have valid XML that you want to parse and extract data from, use a library like ElementTree (https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree), which is part of Python's standard library, rather than regular expressions. XML is meant to be machine-readable, after all!
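A quick way to check the first point is to hand the data to ElementTree and see whether it parses. This is only a sketch: the string is a shortened, hypothetical version of the posted file (an XML declaration followed by plain text with no root element).

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the posted file:
# an XML declaration followed by plain text, with no root element.
not_xml = '<?xml version="1.0" encoding="UTF-8"?>\nFrom: someone\nClaim number: 1234567\n'

try:
    ET.fromstring(not_xml)
    is_valid = True
except ET.ParseError as exc:
    # The parser rejects text that is not wrapped in an element.
    is_valid = False
    print("Not well-formed XML:", exc)
```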
#3
You should validate your XML document and try to work with valid XML data.
Parsing this with regex should be possible, but it's error-prone.
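For illustration, a regex approach to the plain-text message might look like the sketch below. The sample text is a hypothetical excerpt of the posted message, and the pattern assumes every field sits on its own line as "label: value", which is exactly the kind of assumption that makes this fragile.

```python
import re

# Hypothetical excerpt of the message body from the original post.
text = """Claim number:   1234567
Policy number:   2468
EA Ref number:   19-24567-R-01

Client details:   Client One
"""

# One "label: value" pair per line; re.M lets ^ and $ match at line boundaries.
# With two capture groups, findall() returns a list of (label, value) tuples.
pairs = re.findall(r'^([^:\n]+):[ \t]*(.+)$', text, re.M)
fields = dict(pairs)

print(fields["Claim number"])
print(fields["EA Ref number"])
```

Lines without a colon are skipped silently, and a value that itself spans several lines would be cut off, so this only works while the layout stays the same.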


An XML document should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<entries>
    <person name="Person Surname">
        <from>[email protected]</from>
        <sent>22 November 11:10 AM</sent>
        <to>[email protected]</to>
        <claim_number>1234567</claim_number>
        <policy_number>2468</policy_number>
        <ea_ref_number>19-24567-R-01</ea_ref_number>
    </person>

    <person name="DeaD_EyE">
        <from>[email protected]</from>
        <sent>22 November 11:10 AM</sent>
        <to>[email protected]</to>
        <claim_number>1234567</claim_number>
        <policy_number>2468</policy_number>
        <ea_ref_number>666-666-666-666</ea_ref_number>
    </person>
</entries>
Working with this data:
import xml.etree.ElementTree as ET


xml_str = """<?xml version="1.0" encoding="UTF-8"?>
<entries>
    <person name="Person Surname">
        <from>[email protected]</from>
        <sent>22 November 11:10 AM</sent>
        <to>[email protected]</to>
        <claim_number>1234567</claim_number>
        <policy_number>2468</policy_number>
        <ea_ref_number>19-24567-R-01</ea_ref_number>
    </person>

    <person name="DeaD_EyE">
        <from>[email protected]</from>
        <sent>22 November 11:10 AM</sent>
        <to>[email protected]</to>
        <claim_number>1234567</claim_number>
        <policy_number>2468</policy_number>
        <ea_ref_number>666-666-666-666</ea_ref_number>
    </person>
</entries>
"""

doc = ET.fromstring(xml_str)
for person in doc.findall("person"):
    print("Name:", person.get("name"))
    print("Ref number", person.find("ea_ref_number").text)
    print()
Output:
Name: Person Surname
Ref number 19-24567-R-01

Name: DeaD_EyE
Ref number 666-666-666-666
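If you want all fields rather than just the reference number, one possible extension is to collect each person's child elements into a dictionary. This is a sketch on a shortened copy of the example data; the names `records` and `record` are only illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example document, with one person only.
xml_str = """<entries>
    <person name="Person Surname">
        <claim_number>1234567</claim_number>
        <policy_number>2468</policy_number>
        <ea_ref_number>19-24567-R-01</ea_ref_number>
    </person>
</entries>
"""

doc = ET.fromstring(xml_str)
records = []
for person in doc.findall("person"):
    # Start with the attribute, then add one entry per child element.
    record = {"name": person.get("name")}
    for child in person:
        record[child.tag] = child.text
    records.append(record)

print(records)
```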