Python Forum
Findall() ReGex
#1
Hi,

I have this regex:

import re

content = """encoding="UTF-8"\?>(.*?)^<\?xml version="1.0"

MYCO Please have a look at this building’s premium. It looks to be a very high rate. The client has a few policies with MYCO as supporting business. 00000 COUTINHO COUTINHO Thanks WilL Sel: 000 000 00 Primary Cooperative Ltd Quanta Primary Ltd i.


encoding="UTF-8"\?>(.*?)^<\?xml version="1.0"


Dear Mr New Co thank you and your team for all the assistance throughout the years. Unfortunately, I have decided to depart from Origin and MYCO for personal reasons. I've attached the cancellation letter for my policy to be implemented Chemical Engineering t: +000 000 4292 c: +27 (0) 00 000 1002 w: 0002 2805 7135 PO Box 1906 bigville 0000 New Way, Bigville, which is the property of the sender. Should you have received this email in error, please delete and destroy it and any attachments thereto immediately. Under no circumstances will the sender of this email be liable to any party for any direct, indirect, special or other consequential damages for any use of this email. For the detailed e-mail disclaimer please refer to. """

found = re.findall(r'encoding="UTF-8"\?>(.*?)^<\?xml version="1.0"', content, re.DOTALL)
print(found)
Output:
[]
My understanding is that "findall" returns a list of all matches in the text you are searching. I do have a match in my text. Can someone explain why I am getting nothing in return?
#2
I don't understand: are you trying to parse a string that already has the regex pattern written in it?
The normal way would be like this, with a header/prolog in the XML.
import re

content = '<?xml version="1.0" encoding="UTF-8"?>'
result = re.findall(r'version="(.*)" encoding="(.*)"', content)
print(result)
Output:
[('1.0', 'UTF-8')]
If it's an XML file, you really should not be using regex but a parser, e.g. BeautifulSoup.
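As a rough illustration of the parser route, here is a minimal sketch. It swaps in the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without installing anything, and the document and its element names (message, body) are made up for the example, not taken from the original data.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element names are invented for illustration.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<message><body>Please have a look at this premium.</body></message>'
)

# The parser consumes the prolog itself, so no regex is needed to skip it.
root = ET.fromstring(xml_bytes)
body_text = root.find("body").text

print(root.tag)
print(body_text)
```

The same idea applies with BeautifulSoup: let the parser deal with the prolog and the tags, then ask for the text you want.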
#3
snippsat tells you why you should not be doing what you are doing. Hopefully this explains why what you tried is not working as you expect.

Your regex pattern does not match your example because the pattern contains several characters with special meanings (metacharacters), while your example text contains those same characters literally. For example, this pattern does not match anything in your example string.
r'encoding="UTF-8"\?>'
The reason it does not match is that "\" is a metacharacter. "\" tells regex that the following character (?) should be treated as a literal, so "\?" matches a single "?" and nothing in the pattern matches the backslash that is in your text. The pattern matches this string instead.
r'encoding="UTF-8"?>'
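A small sketch of that difference, using made-up variable names. It runs the same pattern against a real XML prolog and against the text from the question, then uses re.escape() to build a pattern that does match the question's text literally.

```python
import re

pattern = r'encoding="UTF-8"\?>'

# Text as it appears in a real XML prolog: a plain "?" before ">".
xml_text = '<?xml version="1.0" encoding="UTF-8"?>'
# Text as it appears in the question: a backslash sits before the "?".
question_text = r'encoding="UTF-8"\?>(.*?)^<\?xml version="1.0"'

matches_xml = re.findall(pattern, xml_text)
matches_question = re.findall(pattern, question_text)

# re.escape() escapes every metacharacter, giving a pattern that
# matches the given string character for character.
literal_pattern = re.escape(r'encoding="UTF-8"\?>')
matches_literal = re.findall(literal_pattern, question_text)

print(matches_xml)       # one match
print(matches_question)  # empty list
print(matches_literal)   # one match, backslash included
```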
Other metacharacters used in your pattern are:
. ^ * ? ( )
Read about them here:
https://docs.python.org/3/howto/regex.html
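One of those metacharacters deserves a closer look. By default "^" only matches at the very start of the string, so a pattern with "^" in the middle cannot match unless re.MULTILINE is also passed. The sketch below assumes the real input looks like several XML documents joined together, one prolog per line; that sample text is a guess, not the original data.

```python
import re

pattern = r'encoding="UTF-8"\?>(.*?)^<\?xml version="1.0"'

# Assumed input: two documents, each starting with a prolog on its own line.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    'first message\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    'second message\n'
)

# DOTALL alone: "^" can only match at position 0, so nothing is found.
without_multiline = re.findall(pattern, sample, re.DOTALL)

# DOTALL plus MULTILINE: "^" also matches right after each newline.
with_multiline = re.findall(pattern, sample, re.DOTALL | re.MULTILINE)

print(without_multiline)
print(with_multiline)
```

Only the first message is captured, because the pattern requires another prolog after the captured text and the last message has none following it.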

