Using Look-ahead and Look-behind
Roy Johnson
created: 2005-12-21 16:57:23
If you are familiar with Perl's regular expressions, you are probably already familiar with zero-width assertions: the ^ indicating the beginning of string and the \b indicating a word boundary are examples. They do not match any characters, but "look around" to see what comes before and/or after the current position.

With the look-ahead and look-behind constructs documented in perlre.html#Extended-Patterns, you can "roll your own" zero-width assertions to fit your needs. You can look forward or backward in the string being processed, and you can require that a pattern match succeed (positive assertion) or fail (negative assertion) there.

Syntax

Every extended pattern is written as a parenthetical group with a question mark as the first character. The notation for the look-arounds is fairly mnemonic, but there are some other, experimental patterns that are similar, so it is important to get all the characters in the right order.
(?=pattern)
is a positive look-ahead assertion
(?!pattern)
is a negative look-ahead assertion
(?<=pattern)
is a positive look-behind assertion
(?<!pattern)
is a negative look-behind assertion
Notice that the = or ! is always last. The directional indicator (<) is present only in the look-behind, and comes before the positive/negative indicator.
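To see all four assertions side by side, here is a small sketch. The sample string and the match_offsets helper are my own inventions for illustration, not anything from perlre; the helper simply records where each match starts.

```perl
use strict;
use warnings;

# Hypothetical helper: return the start offset of every match of $re in $str.
sub match_offsets {
    my ($re, $str) = @_;
    my @offsets;
    push @offsets, $-[0] while $str =~ /$re/g;   # $-[0] is the start of the last match
    return @offsets;
}

# Hypothetical sample data, chosen so that each assertion matches exactly once.
my $text = 'foobar foobaz bazbar';

print "foo followed by bar:     @{[ match_offsets(qr/foo(?=bar)/,  $text) ]}\n";  # 0
print "foo not followed by bar: @{[ match_offsets(qr/foo(?!bar)/,  $text) ]}\n";  # 7
print "bar preceded by foo:     @{[ match_offsets(qr/(?<=foo)bar/, $text) ]}\n";  # 3
print "bar not preceded by foo: @{[ match_offsets(qr/(?<!foo)bar/, $text) ]}\n";  # 17
```

In every case the match itself is only three characters long; the look-around decides whether the match is allowed, but contributes nothing to it.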

Common tasks

Finding the last occurrence

There are actually a number of ways to get the last occurrence that don't involve look-around, but if you think of "the last foo" as "foo that isn't followed by a string containing foo", you can express that notion like this:
/foo(?!.*foo)/
Having matched a foo, the regular expression engine will do its best to match .*foo, starting from the position just past that foo. If it is able to match that, then the negative look-ahead fails, which forces the engine to progress through the string and try the next foo.
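A quick way to convince yourself is to ask where the match starts. The sketch below wraps the pattern in a helper, last_foo_offset, which is a made-up name for this example. It assumes a single-line string, since . does not match a newline unless you add /s.

```perl
use strict;
use warnings;

# Hypothetical helper: offset of the last "foo" in $str, or undef if there is none.
sub last_foo_offset {
    my ($str) = @_;
    # The look-ahead rejects any foo that still has another foo somewhere after it.
    return $str =~ /foo(?!.*foo)/ ? $-[0] : undef;
}

my $offset = last_foo_offset('foo bar foo baz foo');
print "last foo starts at offset $offset\n";   # 16
```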

Substituting before, after, or between characters

Many substitutions match a chunk of text and then replace part or all of it. You can often avoid that by using look-arounds. For example, if you want to put a comma after every foo:
s/(?<=foo)/,/g; # Without lookbehind: s/foo/foo,/g or s/(foo)/$1,/g
or to put the hyphen in look-ahead:
s/(?<=look)(?=ahead)/-/g;
This kind of thing is likely to be the bulk of what you use look-arounds for. It is important to remember that look-behind expressions cannot be of variable length. That means you cannot use quantifiers (?, *, +, or {1,5}) or alternation of different-length items inside them.
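Both substitutions can be tried out as follows. The wrapper subs and the sample strings are hypothetical; the patterns are the ones shown above. Each sub works on a copy, so the caller's string is left alone.

```perl
use strict;
use warnings;

# Hypothetical wrapper: insert a comma at every position preceded by "foo".
sub comma_after_foo {
    my ($s) = @_;
    $s =~ s/(?<=foo)/,/g;
    return $s;
}

# Hypothetical wrapper: insert a hyphen between "look" and "ahead".
sub hyphenate_lookahead {
    my ($s) = @_;
    $s =~ s/(?<=look)(?=ahead)/-/g;
    return $s;
}

print comma_after_foo('foo foobar'), "\n";                    # foo, foo,bar
print hyphenate_lookahead('lookahead and lookbehind'), "\n";  # look-ahead and lookbehind
```

Because the match is zero-width, nothing is replaced; the replacement text is simply inserted at each position where the assertions hold.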

Matching a pattern that doesn't include another pattern

You might want to capture everything between foo and bar that doesn't include baz. The technique is to have the regex engine look-ahead at every character to ensure that it isn't the beginning of the undesired pattern:
/foo  # Match starting at foo
 (?:       # Complex expression:
   (?!baz) #   make sure we're not at the beginning of baz 
   .       #   accept any character
 )*        # any number of times
 bar  # and ending at bar
/x;
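Here is a sketch of that pattern at work. I have added capturing parentheses around the repeated group so the text between foo and bar comes back in $1; the sub name and the test strings are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between foo and bar,
# provided it contains no baz; undef if there is no such stretch.
sub between_foo_and_bar {
    my ($str) = @_;
    return $str =~ /foo            # match starting at foo
                    (              # capture into $1:
                      (?:
                        (?!baz)    #   make sure we're not at the beginning of baz
                        .          #   accept any character
                      )*           #   any number of times
                    )
                    bar            # and ending at bar
                   /x ? $1 : undef;
}

print "[", between_foo_and_bar('foo 123 bar'), "]\n";             # [ 123 ]
print "[", between_foo_and_bar('foo baz bar foo ok bar'), "]\n";  # [ ok ]
```

In the second string the first foo is abandoned because a baz sits between it and any bar, so the engine moves on and matches from the second foo.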

Nesting

You can put look-arounds inside of other look-arounds. This has been known to induce a flight response in certain readers (me, for example, the first time I saw it), but it's really not such a hard concept. A look-around sub-expression inherits a starting position from the enclosing expression, and can walk all around relative to that position without affecting the position of the enclosing expression. They all have independent (though initially inherited) bookkeeping for where they are in the string.

The concept is pretty simple, but the notation becomes hairy very quickly, so commented regular expressions are recommended. Let's look at the real example of [id://319742]. The poster wants to put a space after any comma (punctuation, actually, but for simplicity, let's say comma) that is not nestled between two digits. Building up the s/// expression:

s/(?<=,        # after a comma,
    (?!        # but not matching
      (?<=\d,) #   digit-comma before, AND
      (?=\d)   #   digit afterward
    )
  )/ /gx;      # substitute a space
Note that multiple lookarounds can be used to enforce multiple conditions at the same place, like an AND condition that complements the alternation (vertical bar)'s OR. In fact, you can use Boolean algebra ( NOT (a AND b) === (NOT a OR NOT b) ) to convert the expression to use OR:
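s/(?<=,        # after a comma, but either
    (?:
      (?<!\d,) #   not matching digit-comma before, OR
      |
      (?!\d)   #   not matching digit afterward
    )
  )/ /gx;      # substitute a space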
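If the nesting still looks suspicious, it is easy to check the AND form and its Boolean-algebra twin, the OR form, against the same input. The sub names and the sample string below are hypothetical; the OR pattern is my application of the NOT (a AND b) rule to the commented expression, so treat it as a sketch rather than the original poster's code.

```perl
use strict;
use warnings;

# AND form: space after a comma unless it sits between two digits.
sub space_after_comma_and {
    my ($s) = @_;
    $s =~ s/(?<=,        # after a comma,
            (?!          # but not matching
              (?<=\d,)   #   digit-comma before, AND
              (?=\d)     #   digit afterward
            )
          )/ /gx;
    return $s;
}

# OR form, derived with NOT (a AND b) === (NOT a OR NOT b).
sub space_after_comma_or {
    my ($s) = @_;
    $s =~ s/(?<=,        # after a comma, but either
            (?:
              (?<!\d,)   #   no digit-comma before, OR
              |
              (?!\d)     #   no digit afterward
            )
          )/ /gx;
    return $s;
}

my $in = 'a,b 1,000 x,1 2,y';
print space_after_comma_and($in), "\n";   # a, b 1,000 x, 1 2, y
print space_after_comma_or($in),  "\n";   # a, b 1,000 x, 1 2, y
```

Both look-behinds are legal because everything nested inside them is zero-width, so the overall length is still fixed at one character (the comma).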

Capturing

It is sometimes useful to use capturing parentheses within a look-around. You might think that you wouldn't be able to do that, since you're just browsing, but [478043|you can]. But remember: the capturing parentheses must be within the look-around expression; from the enclosing expression's point of view, no actual matching was done by the zero-width look-around.
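A minimal sketch of capturing inside a look-ahead follows. The sub name and the sample string are invented for illustration. The unit is captured into $2 even though the match proper stops right after the digits, which you can confirm from the end offset in $+[0].

```perl
use strict;
use warnings;

# Hypothetical helper: capture a number, then peek at the word after it.
# Returns (number, following word, offset where the overall match ended).
sub amount_and_unit {
    my ($str) = @_;
    if ($str =~ /(\d+)(?=\s+(\w+))/) {
        return ($1, $2, $+[0]);   # $+[0] is the end of the whole match
    }
    return;
}

my ($amount, $unit, $end) = amount_and_unit('price: 42 USD');
print "amount=$amount unit=$unit match ended at offset $end\n";
# amount=42 unit=USD match ended at offset 9
```

The match ends at offset 9, just past the 42, so the space and USD were looked at and captured but never consumed.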

perlmonks.org content © perlmonks.org and Roy Johnson
