Optimizing regular expressions in Java

A guide to pattern matching without the performance drag

1 2 3 4 Page 2
Page 2 of 4

The Java pattern-matching engine has several optimizations at its disposal and can apply them automatically. I will discuss some of them later in the article. Unfortunately you can't rely on the engine to optimize your regular expressions all the time. In the above example, the regular expression is actually matched pretty fast, but in many cases the expression is too complex and the input string too large for the engine to optimize.

Because of backtracking, regular expressions encountered in real-world application scenarios can sometimes take hours to completely match. Worse, it takes much longer for the engine to declare that a regular expression did not match an input string than it does to find a successful match. This is an important fact to remember. Whenever you want to test the speed of a regular expression, test it mostly on strings that it does not match. Among those, especially use strings that almost match, because those take the longest to complete.

Now let's consider some of the ways you can optimize your regular expressions for backtracking.

Simple ways to optimize regular expressions

Later in the article I'll get into the more involved ways you can optimize regular expressions in Java. To start, though, here are a few simple optimizations that could save you time:

  • If you will use a regular expression more than once in your program, be sure to compile the pattern using Pattern.compile() instead of the more direct Pattern.matches(). Not compiling the regular expression can be costly if Pattern.matches() is used over and over again with the same expression, for example in a loop, because the matches() method will re-compile the expression every time it is used. Also remember that you can re-use the Matcher object for different input strings by calling the method reset().
  • Beware of alternation. Regular expressions like "(X|Y|Z)" have a reputation for being slow, so watch out for them. First of all, the order of alternation counts, so place the more common options in the front so they can be matched faster. Also, try to extract common patterns; for example, instead of "(abcd|abef)" use "ab(cd|ef)". The latter is faster because the NFA will try to match ab and won't try any of the alternatives if it doesn't find it. (In this case there are only two alternatives. If there were many alternatives the gains in speed would be more impressive.) Alternation really can slow down your programs. The expression ".*(abcd|efgh|ijkl).*" was three times slower in my test than using three calls to String.indexOf(), one for each alternative in the regular expression.
  • Capturing groups incur a small-time penalty each time you use them. If you don't really need to capture the text inside a group, always use non-capturing groups. For example, use "(?:X)" instead of "(X)".

Let the engine do the work for you

As I mentioned before, the java.util.regex engine can optimize a regular expression several ways when it is compiled. For example, if the regular expression contains a string that must be present in the input string (or else the whole expression won't match), the engine can sometimes search that string first and report a failure if it doesn't find a match, without checking the entire regular expression.

1 2 3 4 Page 2
Page 2 of 4