Python Forum
Regular Expression (re module)
#1
Regular expressions are an excellent tool that all programmers should learn. They provide a compact syntax for expressing complicated text searches. This is a short tutorial on how to use regular expressions in Python.

For a starting example, let's look at searching descriptions of injuries for fall-related injuries. With normal Python, you might do it like this:

if 'fall' in narrative:
    do_something()
To do it with regular expressions, you would use the re module:

import re

fall_re = re.compile('fall')
if fall_re.search(narrative):
    do_something()
So the first thing you do is compile the regular expression. Then you can use that regular expression to search the narrative for a match. The search method returns a match object, which we will get into later. For now, you just need to know that if the search fails to find anything, it returns None instead, which evaluates as false.
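Here is a minimal sketch of what search hands back in each case. The two narratives are made up for illustration and are not real data.

```python
import re

fall_re = re.compile('fall')

# Two made-up narratives, used only for illustration.
hit = fall_re.search('patient had a fall from a ladder')
miss = fall_re.search('patient cut finger on a knife')

print(hit)   # a match object, which is truthy
print(miss)  # None, which is falsy

found = bool(hit)
not_found = miss is None
```

Because the failed search is None, a plain `if fall_re.search(narrative):` is all you need to test for a match.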

In this simple case, that's some extra typing for nothing extra. But note that the narrative might be written in the past tense, so you have to search for 'fell' as well. With normal Python, that would be:

if 'fall' in narrative or 'fell' in narrative:
    do_something()
With regular expressions, it would be:

import re

fall_re = re.compile('f(a|e)ll')
if fall_re.search(narrative):
    do_something()
Here we used a pipe character (|) in the regular expression to indicate 'or'. So the regular expression searches for an 'f', followed by an 'a' or an 'e', followed by 'll'. Note the parentheses. Without them, the or condition goes to the ends of the regular expression. So 'fa|ell' would search for 'fa' or 'ell'. That would still match what we are looking for, but it would also match a lot of stuff we aren't looking for. The parentheses also make a group, which is something we will make more use of later.
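To see the difference the parentheses make, here is a small sketch that runs both versions over a made-up word list (the words are my own picks, chosen to show the false positives).

```python
import re

grouped = re.compile('f(a|e)ll')   # 'f', then 'a' or 'e', then 'll'
ungrouped = re.compile('fa|ell')   # 'fa' or 'ell'

# Made-up sample words for illustration.
words = ['fall', 'fell', 'fabric', 'bell']

grouped_hits = [w for w in words if grouped.search(w)]
ungrouped_hits = [w for w in words if ungrouped.search(w)]

print(grouped_hits)    # ['fall', 'fell']
print(ungrouped_hits)  # ['fall', 'fell', 'fabric', 'bell']
```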

There's another way we could catch fall or fell: 'f.ll'. A period in a regular expression matches any one character (except a newline). So 'f.ll' would match fall or fell, but it would also match 'fill', 'full', and part of 'of llamas'. So not the best choice here.
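A quick sketch of the period over-matching, using made-up sample strings. It calls group() on the match object to show the text that was matched; that method is covered later in the tutorial.

```python
import re

dot_re = re.compile('f.ll')

# Made-up samples; every one of them contains a match for 'f.ll'.
samples = ['fall', 'fell', 'fill', 'full', 'a herd of llamas']

# group() with no argument returns the text the expression matched.
matched = [dot_re.search(s).group() for s in samples]

print(matched)  # ['fall', 'fell', 'fill', 'full', 'f ll']
```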

Of course, the narrative might just mention that the person tripped. With the 'in' operator, you have to add another 'or' clause:

if 'fall' in narrative or 'fell' in narrative or 'trip' in narrative:
    do_something()
With regular expressions, we can just do another |, again being careful to use parentheses to indicate exactly what we want to be 'or'ed.

import re

fall_re = re.compile('(f(a|e)ll)|(trip)')
if fall_re.search(narrative):
    do_something()
Craig "Ichabod" O'Brien - xenomind.com
I wish you happiness.
Recommended Tutorials: BBCode, functions, classes, text adventures
#2
At this point, despite its clumsiness, the version with the 'in' operator is still shorter. Let's use groups to take a look at a situation that is much better suited for regular expressions. Let's say that these injury report narratives also contain information about the victim, such as '26 year-old male.' But since this is done repeatedly, it is abbreviated as '26 yom.' For younger children, the age may be given in months (6 mof) or even days (12 dom). Some of the people writing the narratives drop the 'o' from the abbreviation, or the space after the digits, and some of them add a/b/h/w for the victim's race (such as '46 ywm' for '46 year-old white male'). How does a regular expression handle this?

import re

victim_re = re.compile('([0-9]+) ?([dmy])o?([abhw])?([fm])')
match = victim_re.search(narrative)
if match:
    print('Age:', int(match.group(1)))
    print('Age Unit:', match.group(2))
    if match.group(3):
        print('Race:', match.group(3))
    print('Sex:', match.group(4))
The first group ([0-9]+), makes use of a character set. The character set is the part in brackets ([]). The character set is any character from 0 to 9. That is, any numeric digit. You can do other ranges, such as [a-z] for lowercase letters. You can do two ranges, such as [a-zA-Z] for any letter. You can even do exclusionary character sets: [^0-9] is anything BUT a numeric digit.

After the character set is a plus (+), and after the first group is a space and a question mark (?). The + means 'match one or more of the previous expression.' So it gets us one or more digits. The ? means 'match zero or one of the previous expression.' So it matches a space if there is one, but still matches if there is no space. The asterisk (*) is a modifier similar to + and ? that we don't use in the expression. It means 'match zero or more of the previous expression.'
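The three modifiers are easiest to compare side by side. This is a sketch on made-up strings with zero, one, and two spaces after the digits; the pattern names and the hits helper are just for this example.

```python
import re

# Made-up samples with zero, one, and two spaces after the age.
samples = ['26yom', '26 yom', '26  yom']

plus_re = re.compile('[0-9]+ +yom')  # one or more spaces
opt_re = re.compile('[0-9]+ ?yom')   # zero or one space
star_re = re.compile('[0-9]+ *yom')  # zero or more spaces

def hits(pattern):
    """Return the samples that the compiled pattern matches."""
    return [s for s in samples if pattern.search(s)]

plus_hits = hits(plus_re)
opt_hits = hits(opt_re)
star_hits = hits(star_re)

print(plus_hits)  # ['26 yom', '26  yom']
print(opt_hits)   # ['26yom', '26 yom']
print(star_hits)  # ['26yom', '26 yom', '26  yom']

# An exclusionary character set: the first character that is not a digit.
first_non_digit = re.search('[^0-9]', '26yom').group()
print(first_non_digit)  # y
```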

The rest of the regular expression is just more character sets (without ranges) and more ? modifiers. The key part is using the group method of the match object that is returned by the regular expression search. This gets the value for each numbered group. Note that the groups are numbered from 1, unlike the normal Python 0 indexing, because group 0 is reserved for the full match. A group that did not take part in the match, such as the optional race group here, returns None. This allows us to pull out the exact text for each part of the search.

To pull out the full text that was matched, use match.group(0), or just match.group() with no argument. The match object also has a string attribute, which is the string that was searched, and two methods, start and end, that give the indexes for the slice of the full match. So:

match.string[match.start():match.end()]
gives the same text as match.group(0). Since group 0 is always available, there is no need to encase the whole regular expression in a group just to extract the full match, although you sometimes see programmers do that.
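As a minimal sketch of the group method on two made-up narratives: group(0) (or group() with no argument) is the full match, the numbered groups start at 1, and groups() returns all of the numbered groups as a tuple.

```python
import re

victim_re = re.compile('([0-9]+) ?([dmy])o?([abhw])?([fm])')

# Made-up narratives for illustration.
match = victim_re.search('46 ywm fell off a ladder')

full = match.group(0)    # the full match
age = match.group(1)     # first numbered group
parts = match.groups()   # all numbered groups as a tuple
sliced = match.string[match.start():match.end()]

print(full)    # 46 ywm
print(age)     # 46
print(parts)   # ('46', 'y', 'w', 'm')
print(sliced)  # 46 ywm

# An optional group that matched nothing comes back as None.
no_race = victim_re.search('26 yom tripped on a rug').groups()
print(no_race)  # ('26', 'y', None, 'm')
```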

One thing to note about the regular expression we're using is that the age could be any number of numeric digits. It would match '52317 yom'. Now, that shouldn't be a problem, but we can fix the regular expression to only catch one- to three-digit ages. There's also a shortcut we can make use of.

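victim_re = re.compile(r'(\d{1,3}) ?([dmy])o?([abhw])?([fm])')
The braces ({}) used in the above regular expression allow for a more specific way to get multiple characters than the ?, *, and + modifiers allow. They let us specify a minimum (1) and maximum (3) number of repetitions. The pattern is written as a raw string (the r prefix) so that Python passes backslashes such as \d through to the re module unchanged. Note that because search scans the whole string, this pattern still finds '317 yom' inside '52317 yom'; putting \b (described below) at the front of the expression prevents that.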

The \d is one of the special sequences defined for regular expressions. It matches any numeric digit. There are several other special sequences:
  • \A: matches the start of the string
  • \b: matches a word boundary (the empty position between a word character and a non-word character, or at the start or end of the string next to a word character)
  • \B: matches anywhere that is not a word boundary
  • \D: any non-digit character
  • \s: any white space character
  • \S: any non-white space character
  • \w: any word character (letters, digits, and the underscore)
  • \W: any non-word character
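Here is a sketch of a couple of these sequences in action. The strict_re variant with a leading \b is my own addition, not part of the expression used in the rest of this tutorial; it shows how a word boundary stops the search from picking up the tail end of a longer number. The sample strings are made up.

```python
import re

# Raw strings (r'...') keep Python from treating \b as a backspace.
loose_re = re.compile(r'(\d{1,3}) ?([dmy])o?([abhw])?([fm])')
strict_re = re.compile(r'\b(\d{1,3}) ?([dmy])o?([abhw])?([fm])')

text = '52317 yom'
loose = loose_re.search(text)    # finds the last three digits
strict = strict_re.search(text)  # no word boundary inside '52317'

print(loose.group(0))  # 317 yom
print(strict)          # None

# A normal age still matches with the boundary in place.
ok = strict_re.search('seen in ER, 26 yom fell')
print(ok.group(0))  # 26 yom

# \w matches letters, digits, and the underscore.
words = re.findall(r'\w+', 'pt fell, hit_head')
print(words)  # ['pt', 'fell', 'hit_head']
```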
There is a more serious problem with our regular expression. It won't match '18 YOWF'. Yup, regular expressions are case-sensitive. This is an easy fix, however.

victim_re = re.compile(r'(\d{1,3}) ?([dmy])o?([abhw])?([fm])', re.IGNORECASE)
This is a flag that gives options to how the regular expression works. There are several available flags, although IGNORECASE is the only one I've ever used. Each one also has a way to use it within the expression, such as (?i) for IGNORECASE:

victim_re = re.compile(r'(?i)(\d{1,3}) ?([dmy])o?([abhw])?([fm])')
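  • I or IGNORECASE or (?i): Case insensitive search.
  • L or LOCALE or (?L): Changes special sequences based on locale. In Python 3 it works only with bytes patterns.
  • M or MULTILINE or (?m): Makes ^ and $ match at the start and end of each line, not just the start and end of the whole text.
  • S or DOTALL or (?s): The period (.) matches all characters, including newlines.
  • U or UNICODE or (?u): Changes special sequences based on unicode. This is already the default for str patterns in Python 3.
  • X or VERBOSE or (?x): Ignores white space in the pattern, except in character sets or when escaped, and allows # comments.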
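Flags can be combined with the | operator. The sketch below rewrites the victim expression with VERBOSE so each piece can carry a comment, and combines it with IGNORECASE. The layout and the sample strings are my own, and the MULTILINE part uses ^, which matches at the start of the text (or of each line, with that flag).

```python
import re

verbose_re = re.compile(r"""
    (\d{1,3})   # age: one to three digits
    \ ?         # optional space, escaped because VERBOSE skips bare spaces
    ([dmy])     # unit: days, months, or years
    o?          # optional o
    ([abhw])?   # optional race
    ([fm])      # sex
""", re.IGNORECASE | re.VERBOSE)

# Made-up narrative for illustration.
match = verbose_re.search('18 YOWF tripped on stairs')
parts = match.groups()
print(parts)  # ('18', 'Y', 'W', 'F')

# MULTILINE lets ^ match at the start of every line.
two_lines = '12 yom\n30 yof'
first_only = re.findall(r'^\d+', two_lines)
each_line = re.findall(r'^\d+', two_lines, re.MULTILINE)
print(first_only)  # ['12']
print(each_line)   # ['12', '30']
```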
#3
There are a few odds and ends I would like to cover before I wrap up this tutorial. One is all the symbols we're using to state the regular expressions. Say you're going through a section of text, and you want to find all of the questions that were asked. You would probably want to search for sentences ending with a question mark (?). However, that's part of the regular expression syntax. To search for an actual question mark, you need to escape it with a backslash, so '\?' would search for a question mark.
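A rough sketch of the question search, on a made-up piece of text. The expression here is only one simple way to do it (it assumes sentences end in '.', '?', or '!'), and re.escape is a helper in the re module that adds the backslashes for you.

```python
import re

# Made-up text for illustration.
text = 'Did the patient fall? Yes. Was a ladder involved? Unknown.'

# Inside a character set, ? is an ordinary character; outside it must be escaped.
question_re = re.compile(r'[^.?!]+\?')
questions = [q.strip() for q in question_re.findall(text)]
print(questions)  # ['Did the patient fall?', 'Was a ladder involved?']

# re.escape backslash-escapes the special characters in a literal string.
escaped = re.escape('what?')
print(escaped)  # what\?
```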

Next is the idea of greedy vs. non-greedy. The ?, *, {}, and + operators are what are called greedy operators: they will match as much as they possibly can. Consider the following example:

import re

em_re = re.compile('<em>(.+)</em>')
match = em_re.search('The quick brown <em>fox</em> jumps over the lazy <em>dog</em>')
In this case, match.group(1) will be 'fox</em> jumps over the lazy <em>dog'. It matched from the first <em> it could find to the last </em> it could find. Not what we really wanted. We need to turn the + operator in em_re into a non-greedy operator. We do that with a question mark (?):

import re

em_re = re.compile('<em>(.+?)</em>')
match = em_re.search('The quick brown <em>fox</em> jumps over the lazy <em>dog</em>')
Now match.group(1) will return 'fox', which is what we are after.

From the last example, you may be thinking that regular expressions would be a good way to read through HTML or XML files. Don't think that. Regular expressions are a general text searching tool with broad application. HTML and XML are very specific cases, which have plenty of parsers optimized for dealing with them. Using the specialized parsers for the markup languages is the way to go.

The second thing you might be thinking about the previous example is that we weren't looking for 'fox', we were looking for 'fox' and 'dog'. How do you do that? This is a good time to introduce some of the other methods of the regular expression object:
  • findall: Returns all matches to the regular expression, or the group(s) defined in the regular expression.
  • finditer: Like findall, but returns an iterator.
  • match: Looks for matches at the start of the text. The start may be defined as an index in the text.
  • search: Looks for the leftmost match.
  • split: Returns a list of the sub-strings separated by matches with the regular expression.
  • sub: Replaces matches with a provided string.
  • subn: As sub, but also returns the number of substitutions made.

All of these methods take into account the groups in the regular expression, and the output can get very complicated if you have lots of groups in your regular expression. To answer the question of how to get 'fox' and 'dog', you would generally use findall or finditer, depending on the size of the text you are searching. You could also call search repeatedly, passing the previous match's end() as the pos argument of the next search call.
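Here is a sketch of the three ways to get both 'fox' and 'dog', plus sub for comparison. The r'\1' in the sub call is a backreference to group 1, which is one of the extra group features not covered in this tutorial.

```python
import re

em_re = re.compile('<em>(.+?)</em>')
text = 'The quick brown <em>fox</em> jumps over the lazy <em>dog</em>'

# findall returns the group text when the expression has one group.
all_at_once = em_re.findall(text)
print(all_at_once)  # ['fox', 'dog']

# finditer yields match objects one at a time.
one_by_one = [m.group(1) for m in em_re.finditer(text)]
print(one_by_one)  # ['fox', 'dog']

# Repeated search, restarting where the previous match ended.
by_hand = []
pos = 0
while True:
    m = em_re.search(text, pos)
    if m is None:
        break
    by_hand.append(m.group(1))
    pos = m.end()
print(by_hand)  # ['fox', 'dog']

# subn replaces each match with its group 1 text and counts the replacements.
plain, count = em_re.subn(r'\1', text)
print(plain)  # The quick brown fox jumps over the lazy dog
print(count)  # 2
```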

Throughout this tutorial, I have been using the paradigm of compiling a regular expression, and then using methods of the regular expression object to do the searches. There is another way to do it:

import re

text = 'The quick brown <em>fox</em> jumps over the lazy <em>dog</em>'
match = re.search('<em>(.+?)</em>', text)
Every method of the regular expression object is also a function in the re module that takes an uncompiled regular expression as its first parameter. So if you are just doing one search, it might be simpler to just use the function. However, if you are searching on the same regular expression multiple times, it will be more efficient to compile the regular expression and use the compiled object's methods. The re module does cache recently compiled expressions, so the difference is usually small.

I haven't covered everything in the re module here, but the documentation is online. It not only has full details on the methods and functions discussed here, it has more funky things you can do with groups, and several examples to give you an even better idea of what you can do with regular expressions. If you are ever stumped working on a regular expression, doing a web search is a good idea. Regular expressions have a long history before Python, and are implemented in a wide variety of computer languages. There's a good chance someone out on the Interwebs has wrestled with a problem similar to yours.