Python Forum

Check for funny characters with a regexp

I want to be able to check if a string contains any of the following "funny" characters:

"āăĄąĆćČčĎďĐđēĕĖėęěĞğġīįİıĽľŁłŃńņňŋŌōőŒœŕřŚśŞşŠšŢţťŧũūŮűųŹźŻżŽžſ"

Let's say the string is "abcdefghijklmnopqrstuvwxyz".

This is just a small element of a large application where, for complicated reasons, the check must be done with a regular expression and sometimes in a case-insensitive way. The check can be just a part of a larger regexp that can contain practically anything else.

I tried the following obvious solution:

import re
re.findall('(?i)[āăĄąĆćČčĎďĐđēĕĖėęěĞğġīįİıĽľŁłŃńņňŋŌōőŒœŕřŚśŞşŠšŢţťŧũūŮűųŹźŻżŽžſ]', 'abcdefghijklmnopqrstuvwxyz')
Result:

['i', 's']
That is a fail. The characters "i" and "s" are not in that regular expression. If I remove "(?i)" it works, but that is not an option.

Is there any way to solve this? (The same thing works without any problem if I use Perl...)

I have tried everything I can think of. Any advice is welcome.

I use Python 3 (of course).
The problem is that I and S belong to the uppercase version of the character sequence: ı (dotless i) uppercases to I, and ſ (long s) uppercases to S.
>>> s = "āăĄąĆćČčĎďĐđēĕĖėęěĞğġīįİıĽľŁłŃńņňŋŌōőŒœŕřŚśŞşŠšŢţťŧũūŮűųŹźŻżŽžſ"
>>> s.upper()
'ĀĂĄĄĆĆČČĎĎĐĐĒĔĖĖĘĚĞĞĠĪĮİIĽĽŁŁŃŃŅŇŊŌŌŐŒŒŔŘŚŚŞŞŠŠŢŢŤŦŨŪŮŰŲŹŹŻŻŽŽS'
>>> 'I' in s.upper()
True
>>> 'S' in s.upper()
True
You could perhaps work without the case insensitive regex flag. It would be a good idea to compare with what Perl tells you about the uppercase version of the sequence.
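One way to keep "(?i)" on the rest of a larger pattern while making only the character class case-sensitive is a scoped inline flag, (?-i:...), available in Python 3.6 and later. The following is only a sketch under assumptions: funny_class and SAMPLE are hypothetical names, SAMPLE is a short excerpt of the list rather than the whole thing, and the characters are assumed to contain no character-class metacharacters such as ] or \.

```python
import re

# Short excerpt of the "funny" list, for illustration only.
SAMPLE = "āĆćİıŁłŚśſ"

def funny_class(chars, ignore_case=False):
    """Build a character class that stays case-sensitive even under (?i).

    With ignore_case=True the other-case variants are added explicitly,
    skipping variants that are plain ASCII (such as 'I' for 'ı' or 'S'
    for 'ſ') or that expand to more than one character.
    """
    wanted = set(chars)
    if ignore_case:
        for ch in chars:
            for variant in (ch.lower(), ch.upper()):
                if len(variant) == 1 and ord(variant) > 127:
                    wanted.add(variant)
    # (?-i:...) switches case-insensitivity off inside this group only.
    return "(?-i:[" + "".join(sorted(wanted)) + "])"

alphabet = "abcdefghijklmnopqrstuvwxyz"
strict = "(?i)" + funny_class(SAMPLE)
loose = "(?i)" + funny_class(SAMPLE, ignore_case=True)

print(re.findall(strict, alphabet))  # no ASCII letters are matched
print(re.findall(loose, "abcĀ"))     # the uppercase variant is matched
```

The class can then be embedded in a larger case-insensitive pattern, for example "(?i)abc" + funny_class(SAMPLE), where "abc" still matches "ABC" but "i" and "s" no longer match the class.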
Quote: ?
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.
(from the Python docs)
>>> re.findall('(i?)[āăĄąĆćČčĎďĐđēĕĖėęěĞğġīįİıĽľŁłŃńņňŋŌōőŒœŕřŚśŞşŠšŢţťŧũūŮűųŹźŻżŽžſ]', 'abcdefghijklmnopqrstuvwxyz')
[]
Quote: The solution chosen by the Perl developers was to use (?...) as the extension syntax. ? immediately after a parenthesis was a syntax error because the ? would have nothing to repeat, so this didn’t introduce any compatibility problems. The characters immediately after the ? indicate what extension is being used, so (?=foo) is one thing (a positive lookahead assertion) and (?:foo) is something else (a non-capturing group containing the subexpression foo).
(from the Python Regular Expression HOWTO)
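To make the difference between the two spellings quoted above concrete, here is a small sketch with made-up test strings: (?i) is an inline flag that turns on case-insensitive matching, while (i?) is an ordinary capturing group that matches an optional literal "i".

```python
import re

# (?i) is an inline flag: the pattern "a" now matches both cases.
flag_result = re.findall('(?i)a', 'aA')

# (i?) is a capturing group for an optional literal "i"; matching stays
# case-sensitive, and findall returns the group contents, not the match.
group_result = re.findall('(i?)a', 'aAia')

print(flag_result)   # ['a', 'A']
print(group_result)  # ['', 'i']
```

So replacing "(?i)" with "(i?)" does not fix the original problem; it simply drops the case-insensitive flag.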
(Jan-18-2020, 09:08 PM) bertilow wrote: I want to be able to check if a string contains any of the following "funny" characters

The built-in any() could be used for the check:


>>> checklist = "āăĄąĆćČčĎďĐđēĕĖėęěĞğġīįİıĽľŁłŃńņňŋŌōőŒœŕřŚśŞşŠšŢţťŧũūŮűųŹźŻżŽžſ"
>>> s = "abcdefghijklmnopqrstuvwxyz"
>>> any(char in checklist for char in s)
False
>>> s = 'abcā'
>>> any(char in checklist for char in s)
True
Explained:

for char in s: iterates over every character in the string to be checked
char in checklist: returns a boolean indicating whether the character is in checklist
any(): short-circuits, i.e. it stops and returns True as soon as it finds a character that is in checklist (boolean value True)
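The same check can also be written with a set, which avoids scanning the checklist string for every character. This is a sketch, not part of the original answer: contains_funny and CHECKLIST are hypothetical names, and CHECKLIST holds only a short excerpt of the full list.

```python
# Short excerpt of the "funny" list, for illustration only.
CHECKLIST = frozenset("āĆćİıŁłŚśſ")

def contains_funny(text, chars=CHECKLIST):
    """Return True if text shares at least one character with chars."""
    # isdisjoint accepts any iterable, so the string can be passed directly.
    return not chars.isdisjoint(text)

print(contains_funny("abcdefghijklmnopqrstuvwxyz"))  # False
print(contains_funny("abcā"))                        # True
```

Like the any() version, this comparison is case-sensitive and does not use a regular expression, so it only helps where the regexp requirement can be relaxed.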
(Jan-18-2020, 11:41 PM) Gribouillis wrote: The problem is that I and S belong to the uppercase version of the character sequence

Yes, that's what I figured. Thanks for confirming that. I was hoping for some hidden trick to make Python use the same logic as Perl does for such cases, but I guess there is no such magic trick. Maybe in the future...

Both Perl and Python do the right thing, I now believe, but have made different choices - both logical in their own way. For my purposes the Perl solution is more convenient, but for other, more important reasons, I'm still going ahead with porting the whole application to Python. I can live with this very minor inconvenience since this is a very narrow corner-case in my application. I have now managed to get around it by not allowing case insensitive searching in certain cases, and not bothering with other cases that really don't matter in the context of the application.

Thanks for the help! Much appreciated!