Ignorance is Bliss - non-memorizing parentheses

in Perl Tips, Regular Expressions
by William Ward on April 20, 2006 2:07 pm

One of regular expressions’ most useful features is memorization (more commonly called capturing). To use it, just put parentheses around part of your expression, and the text that part matches will be memorized:

my($name) = /hello, (\w+)/;

In this example, we look in $_ for the word “hello” followed by a comma, space, and a word. Since the word, \w+, has parentheses around it, the part of the string that it matches gets memorized. In this example, we are assigning the return value of the regular expression match to $name. So if $_ contains “hello, world” then $name gets “world” - very convenient.
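Here is a minimal sketch of that example as a complete script, in case you want to try it yourself. The value assigned to $_ is just sample input.

```perl
use strict;
use warnings;

# Sample input; the match operator looks at $_ by default.
$_ = "hello, world";

# In list context the match returns the memorized text.
my ($name) = /hello, (\w+)/;
print "$name\n";    # prints "world"

# The same text is also available in $1 after a successful match.
print "$1\n" if /hello, (\w+)/;    # prints "world"
```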

But parentheses also do other things besides memorize their contents, and this feature can become annoying. Here’s an example.

In a regular expression the | symbol indicates “or” - either the stuff to the left of it or the stuff to the right of it will match. For example, /hello|hi/ will match either “hello” or “hi” in the string. You can even have more than one of these: /hello|hi|howdy|greetings/ will match any of those four words.
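As a quick illustration, this sketch runs the four-way alternation over a few strings. The strings in @texts are made up for the demonstration.

```perl
use strict;
use warnings;

# Made-up sample strings for the demonstration.
my @texts = ("hi there", "well, howdy", "good morning");

# Keep only the strings in which any one of the four words appears.
my @hits = grep { /hello|hi|howdy|greetings/ } @texts;

print "$_\n" for @hits;    # prints "hi there" and "well, howdy"
```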

The trouble is, what if you want the “or” to apply to only part of the string? That’s where parentheses come in. Let’s combine the previous two examples to show what I mean:

my($name) = /(hello|hi|howdy|greetings), (\w+)/;

In this example, we want any of “hello,” “hi,” “howdy,” or “greetings”, followed by a comma, space, and a word which is memorized. The problem is, the greeting word is also memorized, and so $name gets that word instead of the name that we want it to get!
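To see the problem in action, here is a small sketch. The input string is sample data, and @captured is a name I made up to show everything the match returns.

```perl
use strict;
use warnings;

$_ = "howdy, partner";    # sample input

# Two sets of memorizing parentheses, so the match returns two values.
my @captured = /(hello|hi|howdy|greetings), (\w+)/;
print "@captured\n";      # prints "howdy partner"

# With only one variable on the left, the first memorized value wins.
my ($name) = /(hello|hi|howdy|greetings), (\w+)/;
print "$name\n";          # prints "howdy", not "partner"
```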

The easy solution is to allocate a variable for that word:

my($x, $name) = /(hello|hi|howdy|greetings), (\w+)/;

But here, we don’t care about the value in $x, so why bother allocating a variable for it? Can’t we get just this one benefit of parentheses without having them memorize anything? For years, the answer was no. But then a few years back the Perl regex guys came up with a syntax to do it - just add ?: to the beginning of the parenthesized block, making it:

my($name) = /(?:hello|hi|howdy|greetings), (\w+)/;

Gee, that was awfully obvious, wasn’t it? NOT! Why do they have to make these things so unintelligible? I hear you cry.

The answer is backward compatibility. Think about it - all the obvious characters already mean something, or if they don’t, chances are someone’s used them in a regular expression already to search for that character. So the only way to introduce a new feature into regular expressions is to use something that previously was a syntax error. Since the “?” character in a regex means “the previous thing zero or one times” and the thing before the “?” in this syntax is “(” (which if you recall means “start memorizing here”), it didn’t make sense to say “start memorizing here, zero or one times” so it was a syntax error. Since it was an error, nobody would have used it in an existing Perl script. So by giving (? a meaning that wasn’t a syntax error, backward compatibility is preserved.

But why “(?:” and not just “(?”? I wasn’t there, but I would assume they wanted to add more features to the parenthesized syntax and were running out of previously-bad syntax that they could give meaning to. For example, you may know that you can make a regex case-insensitive by adding /i to the end. Well, you can also insert the “i” between the “?” and “:” to make only part of the regex be case-insensitive: /(?i:hello|hi), world/ would allow “hello” or “HELLO” but “world” would have to be all lowercase.
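A short sketch of the partial case-insensitivity described above. The variable $re and the test strings are my own, chosen just to exercise the pattern.

```perl
use strict;
use warnings;

# Only the greeting is case-insensitive; ", world" must match exactly.
my $re = qr/(?i:hello|hi), world/;

# Test strings made up for the demonstration.
for my $text ("HELLO, world", "Hi, world", "hello, WORLD") {
    my $verdict = $text =~ $re ? "matches" : "does not match";
    print "'$text' $verdict\n";
}
# The first two strings match; "hello, WORLD" does not.
```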

So, the bottom line: if you find yourself wanting to use parentheses in your regex for reasons other than memorizing, and memorizing gets in your way (or you want to save a little on performance, since memorizing can slow things down a little), then just remember to insert ?: at the start of the parenthesized part of your pattern.

my($name) = /(?:hello|hi|howdy|greetings), (\w+)/;
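And here is a sketch confirming that the ?: version memorizes only the name. As before, the input string and the @captured array are just for illustration.

```perl
use strict;
use warnings;

$_ = "howdy, partner";    # sample input

# The greeting group only groups; the name group is the sole memorizer.
my @captured = /(?:hello|hi|howdy|greetings), (\w+)/;
print scalar(@captured), "\n";    # prints "1"

my ($name) = @captured;
print "$name\n";                  # prints "partner"
```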
