Newsletter sign-up
View all newsletters

Sign up for our Enterprise Java Newsletter

Enterprise Java

Optimizing regular expressions in Java

A guide to pattern matching without the performance drag

  • Digg
  • Reddit
  • SlashDot
  • Stumble
  • del.icio.us
  • Technorati
  • dzone

Page 2 of 4

The Java pattern-matching engine has several optimizations at its disposal and can apply them automatically. I will discuss some of them later in the article. Unfortunately you can't rely on the engine to optimize your regular expressions all the time. In the above example, the regular expression is actually matched pretty fast, but in many cases the expression is too complex and the input string too large for the engine to optimize.

Because of backtracking, regular expressions encountered in real-world application scenarios can sometimes take hours to completely match. Worse, it takes much longer for the engine to declare that a regular expression did not match an input string than it does to find a successful match. This is an important fact to remember. Whenever you want to test the speed of a regular expression, test it mostly on strings that it does not match. Among those, especially use strings that almost match, because those take the longest to complete.

Now let's consider some of the ways you can optimize your regular expressions for backtracking.

Simple ways to optimize regular expressions

Later in the article I'll get into the more involved ways you can optimize regular expressions in Java. To start, though, here are a few simple optimizations that could save you time:

  • If you will use a regular expression more than once in your program, be sure to compile the pattern using Pattern.compile() instead of the more direct Pattern.matches(). Not compiling the regular expression can be costly if Pattern.matches() is used over and over again with the same expression, for example in a loop, because the matches() method will re-compile the expression every time it is used. Also remember that you can re-use the Matcher object for different input strings by calling the method reset().
  • Beware of alternation. Regular expressions like "(X|Y|Z)" have a reputation for being slow, so watch out for them. First of all, the order of alternation counts, so place the more common options in the front so they can be matched faster. Also, try to extract common patterns; for example, instead of "(abcd|abef)" use "ab(cd|ef)". The latter is faster because the NFA will try to match ab and won't try any of the alternatives if it doesn't find it. (In this case there are only two alternatives. If there were many alternatives the gains in speed would be more impressive.) Alternation really can slow down your programs. The expression ".*(abcd|efgh|ijkl).*" was three times slower in my test than using three calls to String.indexOf(), one for each alternative in the regular expression.
  • Capturing groups incur a small-time penalty each time you use them. If you don't really need to capture the text inside a group, always use non-capturing groups. For example, use "(?:X)" instead of "(X)".

Let the engine do the work for you

As I mentioned before, the java.util.regex engine can optimize a regular expression several ways when it is compiled. For example, if the regular expression contains a string that must be present in the input string (or else the whole expression won't match), the engine can sometimes search that string first and report a failure if it doesn't find a match, without checking the entire regular expression.

  • Digg
  • Reddit
  • SlashDot
  • Stumble
  • del.icio.us
  • Technorati
  • dzone
Comments (4)
Login
Forgot your account info?

Another thing, on the subject of NFA's.By Anonymous on January 25, 2010, 10:09 pmJava regular expressions, in general, cannot be represented using finite automata like NFA's, because they can use backreferences. That is why backtracking is...

Reply | Read entire comment

NFA is not backtracking.By Anonymous on January 25, 2010, 5:56 pmGo back and re-do computer science.

Reply | Read entire comment

lookaheads and behindBy Anonymous on October 20, 2009, 4:58 amGreat article, I haven't used lookaheads/behinds until yet.but there are enough situations where the expressions were too slow.

Reply | Read entire comment

regexBy Anonymous on October 9, 2009, 9:55 pmhey..hello just wanna ask when will you use regex..and please kindly give a simple instance or situation as to when you have to use it..tnx

Reply | Read entire comment

View all comments

Add comment
Anonymous comments subject to approval. Register here for member benefits.
Have a JavaWorld account? Log in here. Register now for a free account.
Resources