Optimizing regular expressions in Java

A guide to pattern matching without the performance drag

1 2 3 4 Page 4
Page 4 of 4

The solution is to change the expression "[^a]*a" to "[^a]*+a" using the possessive quantifier "*+". This new expression fails faster because once it has tried to match all the characters that are not a it doesn't backtrack; instead it fails right there.

Lookaround constructs

If you want to write a regular expression that matches any character except some, you could easily write something like "[^abc]*" which means: Match any characters except a or b or c. But what if you wanted it to match strings like "cab" or "cba", but not "abc"?

For this you could use the lookaround constructs. The java.util.regex package has four of them:

  • Positive lookahead: "(?=X)"
  • Negative lookahead: "(?!X)"
  • Positive lookbehind: "(?<=X)"
  • Negative lookbehind: "(?<!X)"

The word positive in this case means that you want the expression to match, while the word negative means that you don't want the expression to match. Lookahead means that you want to search to the right of your current position in the input string. Lookbehind means that you want to search to the left. Remember that the lookaround constructs only peek forward or backward; they don't actually change the current position in the input string. That said, you could use something like "((?!abc).)*" using the negative lookahead operator "?!" to match any sequence of characters but not "abc" in the given order.

Lookarounds in practice

Lookaround constructs help you to be more specific when writing regular expressions, which can have a big affect on matching performance. Listing 1 shows a very common example: using a regular expression to match HTML fields.

Listing 1. Matching HTML fields

Regular expression: "<img.*src=(\S*)/>"
Input string 1: "<img border=1 src=image.jpg />"
Input string 2: "<img src=src=src=src= .... many src= ... src=src="

With the regular expression in Listing 1, the goal is to match the contents of the "src" attribute from an HTML image tag. I especially simplified the expression, assuming that there will be no other attributes after "src", to be able to focus on its performance aspects.

Why not be lazy?

You might be thinking that I could have used the reluctant quantifier ".*?" to optimize the regular expression in Listing 1. In fact, "<img.*?src=(.*)/>" would easily match the first-encountered "src=". This solution works for cases where the regular expression matches. If it didn't match the input string, however, it would start to backtrack and would take just as long to match as the greedy quantifier. Remember to always test your regular expressions using non-matching strings first!

The expression is fast enough when matching the input "string 1", but it takes a very long time to declare failure in its attempt to match the input "string 2 (time growing exponentially with the length of the input string). It fails because there is no "/>" at the end of the input string. To optimize this expression, look at the first ".*" construct. It is supposed to match any attributes that come before "src" but is too generic and it matches too much. In fact, the construct should only match any attributes except "src".

The rewritten expression "<img((?!src=).)*src=(\S*)/>" will handle a large, non-matching string almost a hundred times faster then the previous one!

A note about the StackOverflowError

Sometimes the regex Pattern class will throw a StackOverflowError. This is a manifestation of the known bug #5050507, which has been in the java.util.regex package since Java 1.4. The bug is here to stay because it has "won't fix" status. This error occurs because the Pattern class compiles a regular expression into a small program which is then executed to find a match. This program is used recursively, and sometimes when too many recursive calls are made this error occurs. See the description of the bug for more details. It seems it's triggered mostly by the use of alternations.

If you encounter this error, try to rewrite the regular expression or split it into several sub-expressions and run them separately. The latter technique can also sometimes even increase performance.

In conclusion

Regular expressions shouldn't take hours to match, especially for applications that only have seconds to spare. In this article I've introduced some of the weak points of the java.util.regex package and shown you how to work around them. Simple bottlenecks like backtracking just require a little finesse whereas culprits like greedy and reluctant quantifiers require more careful consideration. In some cases you can replace them completely, in others you simply have to "lookaround" them. Either way, you've learned some good tricks for coaxing speed out of your regular expressions.

Let me know what you think about the workarounds I've proposed, and be sure to share your optimizing tips with other JavaWorld readers in the <a href="http://www.javaworld.com/javaforums/newpost.php?Cat=0&Board=112069">discussion thread about optimizing regular expressions in Java</a>.

Cristian Mocanu is a Java team leader at 1&1 Internet AG, Romania. He is a Sun Certified Programmer, Business Component Developer, and Architect with more than five years experience working with enterprise Java.

Learn more about this topic

1 2 3 4 Page 4
Page 4 of 4