Non-Greedy Regular Expressions

in Perl Tips, Regular Expressions
by William Ward on February 12, 2003 6:31 pm

Regular expressions in Perl are "greedy." That means that if you use a * or + quantifier in a regular expression, it grabs as much of the string as it can while still letting the overall pattern match. This can be frustrating at times, but it’s useful in other respects. Consider this:

  my $ip_addr = "192.168.1.2";
  my ($network, $host) = ($ip_addr =~ /(.+)\.(.+)/);
  print "network=$network host=$host\n";

Perl needs a rule to decide whether $network gets "192.168.1" and $host gets "2", or whether the split is "192" vs. "168.1.2". The decision was made to have it be "greedy", which means that the first + grabs the lion’s share and the second one gets the leftovers. Put another way, the first one gets as much as possible short of making it impossible to match the string.
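As a rough illustration of the greedy rule, the sketch below wraps the match in a small helper. The name split_greedy and the ^ and $ anchors are assumptions added for the example, not part of the original snippet.

```perl
use strict;
use warnings;

# Hypothetical helper: split a dotted string using a greedy (.+).
# Because the first (.+) takes as much as it can, the literal dot
# that follows ends up matching the LAST dot in the string.
sub split_greedy {
    my ($str) = @_;
    # In list context a successful match returns the captured groups;
    # a failed match returns an empty list.
    return $str =~ /^(.+)\.(.+)$/;
}

my ($network, $host) = split_greedy("192.168.1.2");
print "network=$network host=$host\n";   # network=192.168.1 host=2
```

The first group only gives back characters until the rest of the pattern (a literal dot plus at least one more character) can match, which is why the split lands on the final dot.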

However, sometimes you want to capture just the first part. Consider:

  my $fqdn = "lists.bayview.com";
  my ($host, $domain) = ($fqdn =~ /(.+)\.(.+)/);
  print "host=$host domain=$domain\n";

This won’t work. We want $host to be "lists" and $domain to be "bayview.com", but instead we get "lists.bayview" and "com". What’s the solution?

The old-fashioned way was to make the first .+ more restrictive about which characters it will grab, replacing the "." (which means "any character") with a character class:

  my $fqdn = "lists.bayview.com";
  my ($host, $domain) = ($fqdn =~ /([^.]+)\.(.+)/);
  print "host=$host domain=$domain\n";

The character class [^.] means "any character except ." which makes it stop right after "lists".
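A minimal sketch of the character-class approach follows. The helper name split_first_dot, the anchors, and the second test string "localhost" are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a hostname at its first dot.
sub split_first_dot {
    my ($fqdn) = @_;
    # [^.]+ cannot cross a dot, so the first capture stops at the first dot.
    # Inside a character class the dot is literal; outside it needs a backslash.
    return $fqdn =~ /^([^.]+)\.(.+)$/;
}

for my $name ("lists.bayview.com", "localhost") {
    my ($host, $domain) = split_first_dot($name);
    if (defined $host) {
        print "host=$host domain=$domain\n";
    }
    else {
        print "$name has no dot to split on\n";
    }
}
```

One property of this form is that the first group can never contain a dot, no matter what the rest of the pattern requires.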

But there’s a more intuitive way of doing it now with newer versions of Perl. By adding ? after the + you make it "non-greedy". This means that the first .+? grabs as little as possible while still allowing the overall pattern to match:

  my $fqdn = "lists.bayview.com";
  my ($host, $domain) = ($fqdn =~ /(.+?)\.(.+)/);
  print "host=$host domain=$domain\n";

The same thing works for *: just change * to *? and it is no longer greedy (it will happily match 0 characters).
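To make the difference between * and *? concrete, here is a small sketch. The sample strings and variable names are invented for the example.

```perl
use strict;
use warnings;

# Sample text chosen for illustration only.
my $text = "<b>bold</b> and <i>italic</i>";

# Greedy: .* runs to the end and backs up to the LAST ">".
my ($greedy) = $text =~ /<(.*)>/;

# Non-greedy: .*? stops at the FIRST ">" it can.
my ($lazy) = $text =~ /<(.*?)>/;

# Non-greedy * is content with zero characters.
my ($empty) = "<>" =~ /<(.*?)>/;

print "greedy=[$greedy]\n";   # greedy=[b>bold</b> and <i>italic</i]
print "lazy=[$lazy]\n";       # lazy=[b]
print "empty=[$empty]\n";     # empty=[]
```

Note that $empty is a defined, zero-length string here: the match succeeded, it just captured nothing.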

2 Comments

  1. The line

    my ($host, $domain) = ($fqdn =~ /(.+?).(.+)/);

    won’t do what you want. You need to escape the middle “.”, otherwise it will just match any character. The correct line would be:

    my ($host, $domain) = ($fqdn =~ /(.+?)\.(.+)/);

    Comment by apt — July 10, 2007 @ 2:54 pm

  2. Thanks, I was so busy thinking about the .+’s that I forgot that other dot.

    Comment by William Ward — August 9, 2007 @ 6:23 pm
