Regex findall() - Python Forum (https://python-forum.io), General Coding Help, thread /thread-28229.html
Regex findall() - NewBeie - Jul-10-2020

Hi, I have some re code that uses the findall() function. I understand that findall() is supposed to find a certain match in the given string, but the output I'm getting suggests something else: I'm getting everything except what I'm searching for. Can anybody explain to me what's happening here?

I have an XML file with this data:

    <?xml version="1.0" encoding="UTF-8"?>
    From: [email protected]
    Sent: 22 November 11:10 AM
    To: [email protected]
    Good day,
    Claim number: 1234567
    Policy number: 2468
    EA Ref number: 19-24567-R-01
    Client details: Client One
    She was rude
    Kind regards
    Person
    <?xml version="1.0">

This is the code:

    import re

    # Read the raw bytes, then decode them to text
    with open('test_file.xml', 'rb') as f:
        file_content = f.read()
    decoded = file_content.decode('iso-8859-1')

    # re.M lets ^ match at the start of each line; re.S lets . match newlines
    found = re.findall(r'encoding="UTF-8"\?>(.*?)^<\?xml version="1.0"', decoded, re.M | re.S)
    print(found)

Judging by the output, it looks like the code cleaned the text. Can someone please explain why we get this with the findall() function?
RE: Regex findall() - ndc85430 - Jul-10-2020

Why isn't that what you expect? Your regular expression captures everything between 'encoding="UTF-8"\?>' and '^<\?xml version="1.0"', which is what you're seeing in the result.

Also, a couple of other things:

1. Your file isn't valid XML.
2. If you do have valid XML that you want to parse and extract data from, then use a library like ElementTree (https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree), which is part of Python's standard library, rather than regular expressions. XML is meant to be machine readable, of course!

RE: Regex findall() - DeaD_EyE - Jul-10-2020

You should validate your XML document. Try to have valid XML data. Parsing this with regex should be possible, but it's error-prone. An XML document should look like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <entries>
      <person name="Person Surname">
        <from>[email protected]</from>
        <sent>22 November 11:10 AM</sent>
        <to>[email protected]</to>
        <claim_number>1234567</claim_number>
        <policy_number>2468</policy_number>
        <ea_ref_number>19-24567-R-01</ea_ref_number>
      </person>
      <person name="DeaD_EyE">
        <from>[email protected]</from>
        <sent>22 November 11:10 AM</sent>
        <to>[email protected]</to>
        <claim_number>1234567</claim_number>
        <policy_number>2468</policy_number>
        <ea_ref_number>666-666-666-666</ea_ref_number>
      </person>
    </entries>

Working with this data:

    import xml.etree.ElementTree as ET

    xml_str = """<?xml version="1.0" encoding="UTF-8"?>
    <entries>
      <person name="Person Surname">
        <from>[email protected]</from>
        <sent>22 November 11:10 AM</sent>
        <to>[email protected]</to>
        <claim_number>1234567</claim_number>
        <policy_number>2468</policy_number>
        <ea_ref_number>19-24567-R-01</ea_ref_number>
      </person>
      <person name="DeaD_EyE">
        <from>[email protected]</from>
        <sent>22 November 11:10 AM</sent>
        <to>[email protected]</to>
        <claim_number>1234567</claim_number>
        <policy_number>2468</policy_number>
        <ea_ref_number>666-666-666-666</ea_ref_number>
      </person>
    </entries>
    """

    # Parse the string; doc is the root <entries> element
    doc = ET.fromstring(xml_str)

    # Iterate over the <person> children of the root
    for person in doc.findall("person"):
        print("Name:", person.get("name"))  # attribute lookup
        print("Ref number:", person.find("ea_ref_number").text)  # child element text
        print()
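To illustrate ndc85430's point about what the pattern captures, here is a minimal sketch. It assumes a shortened stand-in for the file content: the `decoded` string and its example.com address are placeholders, not the original poster's data. When a pattern contains one capture group, re.findall() returns only the text of that group, which is why the delimiters being searched for do not show up in the result.

```python
import re

# Hypothetical stand-in for the decoded file content from the first post
# (shortened, with a placeholder address).
decoded = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    'From: sender@example.com\n'
    'Claim number: 1234567\n'
    'Kind regards\n'
    '<?xml version="1.0">\n'
)

# Same pattern as in the first post.
pattern = r'encoding="UTF-8"\?>(.*?)^<\?xml version="1.0"'

# With exactly one capture group, findall() returns the group's text only,
# not the delimiters on either side of it.
found = re.findall(pattern, decoded, re.M | re.S)
print(found)

# re.search() gives access to both the whole match and the group.
match = re.search(pattern, decoded, re.M | re.S)
print(repr(match.group(0)))  # full match, delimiters included
print(repr(match.group(1)))  # captured text only
```

To get the delimiters back in the findall() result, the group could be made non-capturing with `(?:...)`, in which case findall() returns the whole match.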
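DeaD_EyE notes that parsing the original, non-XML text with regex should be possible but is error-prone. A rough sketch of what that might look like follows, assuming every field sits on its own line as "Label: value". The `message` sample, the `FIELD_RE` pattern, and the `extract_fields` helper are hypothetical names introduced for illustration; they are not from the thread.

```python
import re

# Hypothetical sample modelled on the message in the first post.
message = """Good day,
Claim number: 1234567
Policy number: 2468
EA Ref number: 19-24567-R-01
Client details: Client One
"""

# One "Label: value" pair per line; re.M makes ^ and $ match at line boundaries.
FIELD_RE = re.compile(
    r'^(Claim number|Policy number|EA Ref number):\s*(.+)$', re.M
)


def extract_fields(text):
    """Return a dict mapping each recognised label to its value."""
    # With two capture groups, findall() yields (label, value) tuples.
    return {label: value.strip() for label, value in FIELD_RE.findall(text)}


fields = extract_fields(message)
print(fields)
```

This breaks as soon as a label is spelled differently or a value wraps onto a second line, which is the kind of fragility that makes a real XML document plus ElementTree the safer choice.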