Python Forum
Check for funny characters with a regexp
#1
I want to be able to check if a string contains any of the following "funny" characters:

"āăĄąĆćČčĎďĐđēĕĖėęěĞğġīįİıĽľŁłŃńņňŋŌōőŒœŕřŚśŞşŠšŢţťŧũūŮűųŹźŻżŽžſ"

Let's say the string is "abcdefghijklmnopqrstuvwxyz".

This is just a small element of a large application where, for complicated reasons, the check must be done with a regular expression and sometimes in a case-insensitive way. The check can be just a part of a larger regexp that can contain practically anything else.

I tried the following obvious solution:

re.findall('(?i)[āăĄąĆćČčĎďĐđēĕĖėęěĞğġīįİıĽľŁłŃńņňŋŌōőŒœŕřŚśŞşŠšŢţťŧũūŮűųŹźŻżŽžſ]', 'abcdefghijklmnopqrstuvwxyz')
Result:

['i', 's']
That is a fail. The characters "i" and "s" are not in that regular expression. If I remove "(?i)" it works, but that is not an option.

Is there any way to solve this? (The same thing works without any problem if I use Perl...)

I have tried everything I can think of. Any advice is welcome.

I use Python 3 (of course).
#2
The problem is that I and S belong to the uppercase version of the character sequence:
>>> s = "āăĄąĆćČčĎďĐđēĕĖėęěĞğġīįİıĽľŁłŃńņňŋŌōőŒœŕřŚśŞşŠšŢţťŧũūŮűųŹźŻżŽžſ"
>>> s.upper()
'ĀĂĄĄĆĆČČĎĎĐĐĒĔĖĖĘĚĞĞĠĪĮİIĽĽŁŁŃŃŅŇŊŌŌŐŒŒŔŘŚŚŞŞŠŠŢŢŤŦŨŪŮŰŲŹŹŻŻŽŽS'
>>> 'I' in s.upper()
True
>>> 'S' in s.upper()
True
You could perhaps work without the case insensitive regex flag. It would be a good idea to compare with what Perl tells you about the uppercase version of the sequence.
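If the surrounding pattern has to stay case-insensitive, one possible way to "work without the flag" for just this part is a scoped inline flag. Since Python 3.6, (?-i:...) switches IGNORECASE off inside that group only. A minimal sketch, assuming the character list above is the intended one; the names FUNNY and pattern are illustrative and not from the thread:

```python
import re

# Character list from the question (assumed to be the intended set).
FUNNY = "āăĄąĆćČčĎďĐđēĕĖėęěĞğġīįİıĽľŁłŃńņňŋŌōőŒœŕřŚśŞşŠšŢţťŧũūŮűųŹźŻżŽžſ"

# (?i) applies to the whole pattern; (?-i:...) turns it off again,
# but only for the character class wrapped in the group.
pattern = re.compile("(?i)(?-i:[" + FUNNY + "])")

ascii_hits = pattern.findall("abcdefghijklmnopqrstuvwxyz")
funny_hits = pattern.findall("abcā")
print(ascii_hits)  # no ASCII letter is reported any more
print(funny_hits)  # the listed character is still found
```

The trade-off is that the class is now matched literally, so an uppercase form that is not in the list (for example "Ā") is not reported.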
#3
Quote:?
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.
from the Python docs
>>> re.findall('(i?)[āăĄąĆćČčĎďĐđēĕĖėęěĞğġīįİıĽľŁłŃńņňŋŌōőŒœŕřŚśŞşŠšŢţťŧũūŮűųŹźŻżŽžſ]', 'abcdefghijklmnopqrstuvwxyz')
[]
Quote:The solution chosen by the Perl developers was to use (?...) as the extension syntax. ? immediately after a parenthesis was a syntax error because the ? would have nothing to repeat, so this didn’t introduce any compatibility problems. The characters immediately after the ? indicate what extension is being used, so (?=foo) is one thing (a positive lookahead assertion) and (?:foo) is something else (a non-capturing group containing the subexpression foo).
from the Regular Expression HOWTO
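The two quotes describe different constructs, which a short comparison may make clearer: (?i) is an inline flag that makes the pattern case-insensitive, while (i?) is an ordinary capturing group that matches an optional literal "i". A small sketch with a made-up test string (the variable names are illustrative):

```python
import re

text = "aA ia"

# (?i) is extension syntax: an inline flag, it matches nothing by itself.
flag_hits = re.findall("(?i)a", text)

# (i?) is a capturing group around an optional literal "i";
# findall returns what the group captured for each match.
group_hits = re.findall("(i?)a", text)

print(flag_hits)   # ['a', 'A', 'a']
print(group_hits)  # ['', 'i']
```

So replacing (?i) with (i?) in the original pattern returns an empty list only because the search is no longer case-insensitive, which is the option the question rules out.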
#4
(Jan-18-2020, 09:08 PM)bertilow Wrote: I want to be able to check if a string contains any of the following "funny" characters

There is the built-in any(), which could be used for checking:

>>> checklist = "āăĄąĆćČčĎďĐđēĕĖėęěĞğġīįİıĽľŁłŃńņňŋŌōőŒœŕřŚśŞşŠšŢţťŧũūŮűųŹźŻżŽžſ"
>>> s = "abcdefghijklmnopqrstuvwxyz"
>>> any(char in checklist for char in s)
False
>>> s = 'abcā'
>>> any(char in checklist for char in s)
True
Explained:

for char in s: iterates over every character in the string to be checked
char in checklist: evaluates to a boolean telling whether the character is in checklist
any(): short-circuits, i.e. it stops and returns True as soon as the first character that is in checklist is found
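The same check can also be written with a set, which avoids rescanning the checklist string for every character. A minimal sketch; has_funny and FUNNY are hypothetical names used only for illustration:

```python
# Character list from the question, stored as a set for fast lookups.
FUNNY = frozenset("āăĄąĆćČčĎďĐđēĕĖėęěĞğġīįİıĽľŁłŃńņňŋŌōőŒœŕřŚśŞşŠšŢţťŧũūŮűųŹźŻżŽžſ")

def has_funny(text):
    """Return True if text contains at least one character from FUNNY."""
    # isdisjoint() accepts any iterable and stops at the first common element.
    return not FUNNY.isdisjoint(text)

print(has_funny("abcdefghijklmnopqrstuvwxyz"))  # False
print(has_funny("abcā"))                        # True
```

Like the any() version, this is always case-sensitive and does not use a regular expression, so it only helps where the regexp requirement can be relaxed.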
#5
(Jan-18-2020, 11:41 PM)Gribouillis Wrote: The problem is that I and S belong to the uppercase version of the character sequence

Yes, that's what I figured. Thanks for confirming that. I was hoping for some hidden trick to make Python use the same logic as Perl does for such cases, but I guess there is no such magic trick. Maybe in the future...

Both Perl and Python do the right thing, I now believe, but have made different choices - both logical in their own way. For my purposes the Perl solution is more convenient, but for other, more important reasons, I'm still going ahead with porting the whole application to Python. I can live with this very minor inconvenience since this is a very narrow corner-case in my application. I have now managed to get around it by not allowing case insensitive searching in certain cases, and not bothering with other cases that really don't matter in the context of the application.
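For the cases where case-insensitive matching of the listed characters still matters, one possible approach is to spell out both cases in the character class and match that class literally with (?-i:...) (Python 3.6+), dropping any case variant that lands in ASCII. This is only a sketch under those assumptions; expand_case, FUNNY and pattern are hypothetical names:

```python
import re

FUNNY = "āăĄąĆćČčĎďĐđēĕĖėęěĞğġīįİıĽľŁłŃńņňŋŌōőŒœŕřŚśŞşŠšŢţťŧũūŮűųŹźŻżŽžſ"

def expand_case(chars):
    """Return chars plus their single-character, non-ASCII case variants."""
    variants = set()
    for ch in chars:
        for form in (ch, ch.lower(), ch.upper()):
            # Skip multi-character mappings (such as 'İ'.lower()) and
            # ASCII results (such as 'ı'.upper() == 'I', 'ſ'.upper() == 'S').
            if len(form) == 1 and ord(form) > 127:
                variants.add(form)
    return "".join(sorted(variants))

# None of the expanded characters is special inside [...], so no escaping
# is needed. The class itself is matched case-sensitively; the rest of the
# pattern stays under (?i).
pattern = re.compile("(?i)(?-i:[" + expand_case(FUNNY) + "])")

print(pattern.findall("abcdefghijklmnopqrstuvwxyz"))  # []
print(pattern.findall("ĀBC"))                         # ['Ā']
```

Here "Ā" is found even though only "ā" is in the list, while plain "i" and "s" are left alone.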

Thanks for the help! Much appreciated!

