Python Forum

unicode within a RE grouping
Hi

I'm getting an error for the following:
import re
pattern = re.compile(r"(?u)\w+")
list = pattern.findall(ur"ñ")
print(list)
Error:
    list = pattern.findall(ur"ñ")
                               ^
SyntaxError: invalid syntax
Can anybody suggest what the problem might be?
If you look at Python's lexical analysis rules, you can see that stringprefix doesn't include ur as a prefix for strings. So ur"some_string" is an illegal construction in Python.
It's illegal in Python 3, as pointed out by @scidam.
In Python 2, the ur prefix was used when you needed to combine a raw string with a Unicode string, for example in a regex pattern.
# Python 2.7
>>> import re
>>>
>>> pattern = re.compile(ur"(ñ)")
>>> uni_char = pattern.search(u'helloñ world')
>>> uni_char.group(1)
u'\xf1'
>>> print(uni_char.group(1))
ñ
One of the biggest changes in the move to Python 3 was Unicode handling.
In Python 3, all strings are sequences of Unicode characters.
So the u prefix is no longer needed (Python 3.3+ still accepts it, but it has no effect), and we will not see u'\xf1' in the output.
The r (raw string) prefix should still always be used for regex patterns, so that backslash escapes reach the regex engine unchanged.
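To show why the r prefix still matters, here is a small sketch (Python 3 assumed; the sample text and the names plain and raw are made up for illustration). Without r, Python itself turns "\b" into a backspace character before the regex engine ever sees it, so the word-boundary match silently fails.

```python
import re

text = "cat concat cat"

# Without the r prefix, "\b" is a backspace character (0x08), not a word boundary,
# so the pattern looks for backspace + "cat" + backspace and finds nothing.
plain = re.findall("\bcat\b", text)

# With the r prefix, the regex engine receives the two characters \ and b,
# which it interprets as a word boundary.
raw = re.findall(r"\bcat\b", text)

print(plain)  # []
print(raw)    # ['cat', 'cat']
```

The "cat" inside "concat" is skipped in the second case because there is no word boundary in front of it.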
# Python 3.6
>>> s = 'ñ'
>>> s
'ñ'
# Python 3.6
>>> import re
>>>
>>> pattern = re.compile(r"(ñ)")
>>> uni_char = pattern.search('helloñ world')
>>> uni_char.group(1)
'ñ'
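Applied to the code from the original question, a likely fix (assuming Python 3) is simply to drop the u from the ur prefix. The sketch below also renames the result to words, which is my own choice to avoid shadowing the built-in list, and omits (?u), since Unicode matching is already the default for str patterns in Python 3. The second sample string is made up for illustration.

```python
import re

# Python 3: str patterns match Unicode word characters by default,
# so the (?u) flag from the original pattern is redundant here.
pattern = re.compile(r"\w+")

# A plain string literal is already Unicode; no u or ur prefix is needed.
words = pattern.findall("ñ")
print(words)  # ['ñ']

# Accented letters count as word characters, so whole words stay intact.
more_words = pattern.findall("año niño")
print(more_words)  # ['año', 'niño']
```

Keeping (?u) at the start of the pattern should also work in Python 3; it just has no effect on str patterns.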