Optimizing regular expressions in Java

A guide to pattern matching without the performance drag

1 2 3 4 Page 3
Page 3 of 4

Another very useful way to automatically optimize a regular expression is to have the engine check the length of the input string against the expected length according to the regular expression. For example, the expression "\d{100}" is internally optimized such that if the input string is not 100 characters in length, the engine will report a failure without evaluating the entire regular expression.

Using benchmarks

After you have identified a possible improvement of a regular expression, even if you are certain that it will improve the speed, make a benchmark and compare the results against the previous expression. If the engine was able to internally optimize the previous expression better than the new one, it could lead to unexpected performance penalties.

For instance, the Java regex engine was not able to optimize the expression ".*abc.*". I expected it would search for "abc" in the input string and report a failure very quickly, but it didn't. On the same input string, using "String.indexOf("abc")" was three times faster then my improved regular expression. It seems that the engine can optimize this expression only when the known string is right at its beginning or at a predetermined position inside it. For example, if I re-write the expression as ".{100}abc.*" the engine will match it more than ten times faster. Why? Because now the mandatory string "abc" is at a known position inside the string (there should be exactly one hundred characters before it).

Whenever you write complex regular expressions, try to find a way to write them such that the regex engine will be able to recognize and optimize for these particular situations. For instance, don't hide mandatory strings inside groupings or alternations because the engine won't be able to recognize them. When possible, it is also helpful to specify the lengths of the input strings that you want to match, as shown in the example above.

Optimizing greedy and reluctant quantifiers

You have some basic ideas of how to optimize your regular expressions, as well as some of the ways you can let the regex engine do the work for you. Now let's talk about optimizing greedy and reluctant quantifiers. A greedy quantifier such as "*" or "+" will first try to match as many characters as possible from an input string, even if this means that the input string will not have sufficient characters left in it to match the rest of the regular expression. If this happens, the greedy quantifier will backtrack, returning characters until an overall match is found or until there are no more characters. A reluctant (or lazy) quantifier, on the other hand, will first try to match as few characters in the input string as possible.

So for example, say you want to optimize a sub-expression like ".*a". If the character a is located near the end of the input string it is better to use the greedy quantifier "*". If the character is located near the beginning of the input string it would be better to use the reluctant quantifier "*?" and change the sub-expression to ".*?a". Generally, I've noticed that the lazy quantifier is a little faster than its greedy counterpart.

Another tip is to be specific when writing a regular expression. Use general sub-constructs like ".*" sparingly because they can backtrack a lot, especially when the rest of the expression can't match the input string. For example, if you want to retrieve everything between two as in an input string, instead of using "a(.*)a", it's much better to use "a([^a]*)a".

Possessive quantifiers and independent grouping

Possessive quantifiers and independent grouping are the most useful operators for optimizing regular expressions. Use them whenever you can to dramatically improve the execution time of your expressions. Possessive quantifiers are denoted by the extra "+" sign, such as in the expression "X?+", "X*+", "X++". The notation for an independent grouping is "(?>X)".

I have successfully used both possessive quantifiers and independent grouping to reduce the execution time of regular expressions from a few minutes to a few seconds. Both operators are allowed to disable the backtracking behavior of the pattern-matching engine for the group to which they are applied. They will try to match their expression as any greedy quantifier would, but if they are able to match it, they will not give back what they have matched, even if this causes the overall regular expression to fail.

The difference between them is subtle. You can see it best by comparing the possessive quantifier "(X)*+" and the independent grouping "(?>X)*". In the former case, the possessive quantifier will disable backtracking for both the X sub-expression and the "*" quantifier. In the latter case, only backtracking for the X sub-expression will be disabled, while the "*" operator, being outside the group, is not affected by the independent grouping and is free to backtrack.

How would you optimize this regular expression?

Now let's consider an optimization example. Say you're trying to match the sub-expression "[^a]*a" on a long input string containing only the character b repeated many times. This expression will fail because the input string does not contain any instances of the character a. Because the pattern engine doesn't know this, it will try to match the expression "[^a]*". Because "*" is a greedy quantifier, it will grab all the characters until the end of the input string, and then it will backtrack, giving back one character at a time in the search for a match.

The expression will fail only when it can't backtrack anymore, which can take some time. Worse, because the "[^a]*" grabbed all characters that weren't a, even backtracking is useless.

1 2 3 4 Page 3
Page 3 of 4