Regular expressions are very useful and, while they may appear complicated at first, are actually quite simple. They can become quite long and appear daunting, but by understanding the syntax, any expression can be broken down into small components that are easy to understand.
So, what is a regular expression? Very simply, it is a pattern used to match strings of text. Everything else is just syntax. For example, we may have a pattern such as:
^[A-J]{2,3}\s?\d+$
This will match “AG 2010”, but will not match “AND Me”.
Stop staring at the regular expression above. If you don’t understand it, that’s OK, you’re not supposed to, otherwise you wouldn’t be reading this. If you do understand it – and REALLY understand, no guessing – stop reading, you’re wasting your time.
What Can A Regular Expression Do For Me?
One primary use of regular expressions is in computer programming. Most modern programming languages have implemented some degree of support for regular expressions. Notable among these is Perl, well known for its extensive integration of and extensions to regular expressions. Depending on the language, there may be some syntactic differences or additions. However, most implementations support a basic set of standard syntax rules. All of the syntax discussed here should be applicable to nearly every implementation.
Not a programmer? That’s OK, you may still find some use for regular expressions. An increasingly visible application of regular expressions is in support of Find-and-Replace functionality in text and document editors. Static strings can be great when looking for static text, but what if you want to locate something with a bit of variance, say, all phone numbers or mail codes in a document? A static string just won’t work in such cases, but a regular expression can very easily be employed for such a task. If your current text editor does not support regular expressions, you may want to try one that does after reading this, to test out your regular expression skills and see how useful they can be. Even though I do a fair bit of programming, I probably get the most use out of regular expressions in my favorite text editor, Notepad++. Besides having long supported regular expressions in its Find and Replace functionality, it has an incredibly comprehensive set of additional features that have made it indispensable for me, and it’s entirely free.
Anatomy of a Regular Expression
There are two parts to a regular expression – what we want to match, and the quantity to match. Let’s start with what we want to match.
1. What To Match
What to match can be broken down into 3 categories: the actual character(s) to match, grouping of characters, and positional matches.
In its simplest and most literal form, we can explicitly specify literal characters to match. For example:
And
is a regular expression. This matches a capital A, followed by a lowercase ‘n’, followed by a lowercase ‘d’. It’s not all that useful, but it is a valid regular expression. It is important to note that, as this example demonstrates, regular expressions are case sensitive.
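To see the literal expression in action, here is a minimal sketch using Python's standard `re` module. The choice of Python and the helper name `contains_match` are assumptions made for illustration; the expression itself is the one from the text.

```python
import re

PATTERN = "And"  # the literal expression from the text


def contains_match(text):
    """Return True if PATTERN matches anywhere in text (hypothetical helper)."""
    return re.search(PATTERN, text) is not None


print(contains_match("Andrew"))   # capital A, then n, then d: a match
print(contains_match("android"))  # lowercase 'a' is not capital 'A': no match
```

Most implementations offer a switch to ignore case (in Python, the `re.IGNORECASE` flag), but unless it is turned on, matching stays case sensitive.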
1.1 Grouping
More likely though, we will want to match one of several characters, which is where grouping comes in. The most common form of grouping is the character class, a series of characters within square brackets. For example:
[ABCDEnr]
means ‘match a capital A, capital B, capital C, capital D, capital E, lowercase n OR lowercase r’. But what if I want to match any uppercase letter? I’d rather not have to type 26 letters. Not to worry, as multiple consecutive characters can be expressed as a range, specified as two characters separated by a dash, like so:
[A-Enr]
This is equivalent to the previous expression, and can be read as ‘match any capital letter from A to E, lowercase n OR lowercase r’. Note that the use of the dash to express a range of characters is only valid in a character class; outside of a character class, a dash simply matches a dash. Also, if the dash is the first character in the character class, it will match a literal dash character.
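The equivalence of the two character classes can be checked with a short sketch in Python's `re` module. The sample string and the helper `matched_chars` are made up for illustration.

```python
import re

explicit = re.compile(r"[ABCDEnr]")  # every character listed individually
ranged = re.compile(r"[A-Enr]")      # the same set, using a range
dash_first = re.compile(r"[-A-E]")   # leading dash is a literal dash


def matched_chars(pattern, text):
    """Return every single character in text that the class matches."""
    return pattern.findall(text)


sample = "Corner-BAR"
print(matched_chars(explicit, sample))    # ['C', 'r', 'n', 'r', 'B', 'A']
print(matched_chars(ranged, sample))      # same result as the explicit class
print(matched_chars(dash_first, sample))  # ['C', '-', 'B', 'A']
```

Note that the lowercase 'e' and the capital 'R' in the sample are skipped: the class lists capital A through E and lowercase n and r only.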
In some cases, it may be simpler to specify those characters you don’t want to match. Suppose you want to match any character except an underscore or dollar sign. This can be expressed as:
[^_$]
The caret, when it is the first character within a character class, negates the character class, changing the meaning of the character class to ‘match any character EXCEPT those listed in this character class’. Note that, to negate a character class, the caret must be the first character within the square brackets; anywhere else in the character class, the caret matches a literal caret character.
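The difference the position of the `^` makes can be sketched in Python as follows. The test string is an assumption chosen so that each special character appears once.

```python
import re

negated = re.compile(r"[^_$]")        # ^ first: anything EXCEPT _ or $
literal_caret = re.compile(r"[_^$]")  # ^ not first: a literal ^ character

text = "a_b$c^"
print(negated.findall(text))        # ['a', 'b', 'c', '^']
print(literal_caret.findall(text))  # ['_', '$', '^']
```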
1.2 Character Shorthand
To simplify character matching even further, there are a number of shorthand expressions that can be used to match certain types of characters. The most common of these are defined in the following table.
| Expression | Matches… |
|---|---|
| \s | Any white-space character. Equivalent to [ \f\n\r\t\v]. |
| \S | Any non-white-space character. Equivalent to [^ \f\n\r\t\v]. |
| \d | Any decimal digit character. Equivalent to [0-9]. |
| \D | Any non-decimal digit character. Equivalent to [^0-9]. |
| \w | Any word character. Equivalent to [a-zA-Z_0-9]. |
| \W | Any non-word character. Equivalent to [^a-zA-Z_0-9]. |
| . | Any single character except new line (\n). |
Table 1.2 character shorthand expressions
The shorthand expressions listed above are effective both inside and outside a character class, with the exception of the dot (.). Within a character class, the dot matches a literal dot character.
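Here is a small sketch of the shorthand expressions using Python's `re` module. One assumption to flag: by default Python's `\d`, `\w` and `\s` also match non-ASCII digits, letters and spaces, so the `re.ASCII` flag is passed to get the ASCII-only behaviour described in the table. The sample text is made up.

```python
import re

text = "Room 42, wing_B.\n"

digits = re.findall(r"\d", text, re.ASCII)           # ['4', '2']
words = re.findall(r"\w+", text, re.ASCII)           # ['Room', '42', 'wing_B']
spaces = re.findall(r"\s", text, re.ASCII)           # [' ', ' ', '\n']
digit_or_us = re.findall(r"[\d_]", text, re.ASCII)   # \d still works in a class

dot_any = re.findall(r".", "a.\n")        # outside a class: any char but \n
dot_literal = re.findall(r"[.]", "a.\n")  # inside a class: a literal dot only

print(digits, words, spaces, digit_or_us)
print(dot_any, dot_literal)  # ['a', '.'] ['.']
```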
1.3 Positional Matches
Suppose we want to find the first page of a document by matching the page number, where the page numbers are in the format ‘Page N’. We might use the following regular expression:
Page 1
Nothing fancy, we are matching a literal string, and this may seem to work well enough at first. However, what happens when we get to Page 10? In most cases, this will also be a match for our expression above. Why is this so? A regular expression does a character-by-character match, so the above expression could be read as ‘Match a capital P, followed by a lowercase a, followed by a lowercase g, followed by a lowercase e, followed by a space, followed by a numeral 1’. Notice this does not account for anything following the match, so this will match ‘Page 1’, ‘Page 10’, or ‘Page 1983467231894’.
To address such cases, positional match characters are provided. The dollar sign ($) matches the position at the end of a string. Adding this to our expression above, we have:
Page 1$
This can be read as ‘Match a capital P, followed by a lowercase a, followed by a lowercase g, followed by a lowercase e, followed by a space, followed by a numeral 1, with no further characters in the string’.
Similarly, we can match the position at the beginning of a string using the caret (^). Recall that the caret negates a character class if it is the first character after the opening square bracket. To match the beginning of a string, the caret must be outside of any character class.
Note that some regular expression implementations provide configuration options to force the expression to match the given string in its entirety. This is equivalent to automatically prepending the caret and appending the dollar sign to the specified expression.
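The effect of the positional characters on the page-number example can be sketched in Python. The list of sample strings is an assumption; `re.fullmatch` is Python's version of the 'match the whole string' option mentioned above.

```python
import re

pages = ["Page 1", "Page 10", "Page 1983467231894", "See Page 1"]

# No positional characters: matches wherever 'Page 1' appears.
unanchored = [p for p in pages if re.search(r"Page 1", p)]

# $ only: nothing may follow the 1, but text may still come before 'Page'.
# (In Python, $ also matches just before a single trailing newline.)
end_anchored = [p for p in pages if re.search(r"Page 1$", p)]

# ^ and $ together: the string must be exactly 'Page 1'.
both_anchored = [p for p in pages if re.search(r"^Page 1$", p)]

# fullmatch forces the pattern to cover the entire string.
whole_string = [p for p in pages if re.fullmatch(r"Page 1", p)]

print(unanchored)     # all four strings
print(end_anchored)   # ['Page 1', 'See Page 1']
print(both_anchored)  # ['Page 1']
print(whole_string)   # ['Page 1']
```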
2. Quantifiers
OK, so we’ve looked at character matching, but except for our very first literal expression, we haven’t matched more than a single character. One way we can match multiple characters is to string together two or more expressions like those we’ve seen so far. For example:
[A-Z][A-Z][A-Z][A-Z][A-Z]
This will match any five consecutive capital alpha characters. So, this would find a match in ‘SAMPLE’ (its first five letters) but not in ‘Sample’. However, our expression is going to become very large very quickly if we have to define each character individually.
We can specify the number of times we want to match a given character by placing the quantity within curly braces following the character expression. Using this method, an equivalent to the previous expression would be:
[A-Z]{5}
Suppose, in addition to matching 5 capital letters, we’d also like to match if there are only 3 or 4 capital letters. Again, we specify the quantity in curly braces following the character, but this time we specify two numbers separated by a comma: the first is the minimum number of instances of the previous character to match, the second is the maximum. So, to match 3, 4 or 5 capital characters, we write:
[A-Z]{3,5}
Using this notation, we can also specify no minimum or no maximum number of characters to match by leaving the relevant value blank. So:
[A-Z]{3,}
matches 3 or more capital letters, while
[A-Z]{,5}
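matches 5 or fewer capital letters. Note that not every implementation accepts an omitted minimum; where it is not supported, write {0,5} instead.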
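One way to see what each brace quantifier allows is to test strings of increasing length. The sketch below uses Python's `re` module, which does accept the `{,5}` form; the helper `matching_lengths` is hypothetical and uses `re.fullmatch` so that the whole string has to be covered by the expression.

```python
import re


def matching_lengths(pattern, max_len=7):
    """Return each n in 0..max_len for which 'A' * n matches pattern in full."""
    return [n for n in range(max_len + 1) if re.fullmatch(pattern, "A" * n)]


print(matching_lengths(r"[A-Z]{5}"))    # [5]
print(matching_lengths(r"[A-Z]{3,5}"))  # [3, 4, 5]
print(matching_lengths(r"[A-Z]{3,}"))   # [3, 4, 5, 6, 7]
print(matching_lengths(r"[A-Z]{,5}"))   # [0, 1, 2, 3, 4, 5]
```

The last line shows that 'five or fewer' includes zero: an expression with no minimum happily matches nothing at all.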
2.1 Quantifier Shorthand
Similar to the character shorthand expressions discussed previously, there are a number of quantifier shorthand expressions that make common quantities simple to express. These are defined in the following table.
| Expression | Matches the previous expression/character… |
|---|---|
| * | 0 or more times; equivalent to {0,}. |
| + | 1 or more times; equivalent to {1,}. |
| ? | 0 or 1 times; equivalent to {0,1}. |
Table 2.1 quantifier shorthand expressions
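As a quick check of the quantifier shorthand, the following Python sketch counts how many repetitions of a character each one accepts: `*` accepts zero or more, `+` one or more, and `?` zero or one. The helper `matching_counts` is hypothetical, and repetitions are only tried up to three.

```python
import re


def matching_counts(pattern, max_len=3):
    """Return each n in 0..max_len for which 'a' * n matches pattern in full."""
    return [n for n in range(max_len + 1) if re.fullmatch(pattern, "a" * n)]


star = matching_counts(r"a*")      # [0, 1, 2, 3]
plus = matching_counts(r"a+")      # [1, 2, 3]
optional = matching_counts(r"a?")  # [0, 1]
print(star, plus, optional)

# Each shorthand behaves the same as its curly-brace equivalent.
print(star == matching_counts(r"a{0,}"))       # True
print(plus == matching_counts(r"a{1,}"))       # True
print(optional == matching_counts(r"a{0,1}"))  # True
```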
3. How A Regular Expression is Applied
By default, regular expressions are said to be ‘greedy’, meaning any given element in an expression will match as much of the string to which it is applied as possible. Take for instance the very common expression:
.*
This means ‘match any given character 0 or more times’, which essentially matches anything and everything, with the exception of new line characters. Matched characters are said to be ‘consumed’, meaning that, once matched, they are discarded and are not subject to any further processing.
If this is the case, what happens with an expression such as:
.*e
Suppose we apply this to the string ‘Page’. Given what we know so far, the ‘.*’ portion of our regular expression should match ‘Page’, consuming the entire string and leaving nothing for the ‘e’ to match. If this is the case, we might suppose that nothing but a new line character could ever be matched after ‘.*’ in a regular expression, as it will consume all other characters.
However, this is not the case, due to a process called ‘backtracking’. Backtracking essentially allows a regular expression to ‘back up’ and look for a match among the characters already consumed. So, when applying our expression of ‘.*e’ to the string ‘Page’, the process is as follows:
- .* is applied to ‘Page’, matching ‘Page’.
- ‘e’ now must be matched. The regular expression engine backs up in the string, character by character, until a match is found. As the last letter in our string is a match, the ‘e’ in our expression matches the ‘e’ in ‘Page’, and subsequently, our first part of the expression, .*, matches ‘Pag’.
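The steps above can be observed by wrapping each part of the expression in parentheses, which lets us ask which text each part ended up matching. This is a sketch in Python's `re` module; capturing parentheses have not been covered here and are used only as a way to inspect the result. The second sample string, ‘referee’, is an assumption added to show greediness when several candidates exist.

```python
import re

# Group 1 is the greedy '.*', group 2 is the literal 'e'.
m = re.search(r"(.*)(e)", "Page")
print(m.group(1))  # 'Pag': '.*' gave back one character
print(m.group(2))  # 'e'

# With several e's available, '.*' gives back as little as it can,
# so the literal 'e' ends up matching the LAST e in the string.
m2 = re.search(r"(.*)(e)", "referee")
print(m2.group(1))  # 'refere'
```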
That’s all for now, we’ll cover more in Part 2. However, with what you now know, you should be able to start writing & understanding most regular expressions.
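As a closing exercise, the expression from the very start can now be read piece by piece. The sketch below uses Python's `re.VERBOSE` flag, which allows white space and comments inside a pattern, to annotate each part. The helper name `matches` and the two extra test strings are assumptions added for illustration.

```python
import re

OPENING = re.compile(
    r"""
    ^           # the beginning of the string
    [A-J]{2,3}  # two or three capital letters in the range A to J
    \s?         # an optional white-space character
    \d+         # one or more digits
    $           # the end of the string
    """,
    re.VERBOSE | re.ASCII,
)


def matches(text):
    """Return True if the opening expression matches text (hypothetical helper)."""
    return OPENING.search(text) is not None


print(matches("AG 2010"))  # True
print(matches("AND Me"))   # False: 'N' is outside the range A to J
print(matches("AG2010"))   # True: the white space is optional
print(matches("ag 2010"))  # False: matching is case sensitive
```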