Newsletter sign-up
View all newsletters

Sign up for our Enterprise Java Newsletter

Enterprise Java

Optimizing regular expressions in Java

A guide to pattern matching without the performance drag

  • Digg
  • Reddit
  • SlashDot
  • Stumble
  • del.icio.us
  • Technorati
  • dzone

If you've struggled with regular expressions that took hours to match when you needed them to complete in seconds, this article is for you. Java developer Cristian Mocanu explains where and why the regex pattern-matching engine tends to stall, then shows you how to make the most of backtracking rather than getting lost in it, how to optimize greedy and reluctant quantifiers, and why possessive quantifiers, independent grouping, and lookarounds are your friends.

Writing a regular expression is more than a skill -- it's an art.
-- Jeffrey Friedl

In this article I introduce some of the common weaknesses in regular expressions using the default java.util.regex package. I explain why backtracking is both the foundation of pattern matching with regular expressions and a frequent bottleneck in application code, why you should exercise caution when using greedy and reluctant quantifiers, and why it is essential to benchmark your regex optimizations. I then introduce several techniques for optimizing regular expressions, and discuss what happens when I run my new expressions through the Java pattern-matching engine.

For the purpose of this article I assume that you already have some experience using regular expressions and are most interested in learning how to optimize them in Java code. Topics covered include simple and automated optimization techniques as well as how to optimize greedy and reluctant quantifiers using possessive quantifiers, independent grouping, and lookarounds. See the Resources section for an introduction to regular expressions in Java.

Notation
I use double quotes ("") to delimit regular expressions and input strings, X, Y, Z to denote regular sub-expressions or a portion of a regular expression, and a, b, c, d (et-cetera) to denote single characters.

The Java pattern-matching engine and backtracking

The java.util.regex package uses a type of pattern-matching engine called a Nondeterministic Finite Automaton, or NFA. It's called nondeterministic because while trying to match a regular expression on a given string, each character in the input string might be checked several times against different parts of the regular expression. This is a widely used type of engine also found in .NET, PHP, Perl, Python, and Ruby. It puts great power into the hands of the programmer, offering a wide range of quantifiers and other special constructs such as lookarounds, which I'll discuss later in the article.

At heart, the NFA uses backtracking. Usually there isn't only one way to apply a regular expression on a given string, so the pattern-matching engine will try to exhaust all possibilities until it declares failure. To better understand the NFA and backtracking, consider the following example:

The regular expression is "sc(ored|ared|oring)x" The input string is "scared"

First, the engine will look for "sc" and find it immediately as the first two characters in the input string. It will then try to match "ored" starting from the third character in the input string. That won't match, so it will go back to the third character and try "ared". This will match, so it will go forward and try to match "x". Finding no match there, it will go back again to the third character and search for "oring". This won't match either, and so it will go back to the second character in the input string and try to search for another "sc". Upon reaching the end of the input string it will declare failure.

Optimization tips for backtracking

With the above example you've seen how the NFA uses backtracking for pattern matching, and you've also discovered one of the problems with backtracking. Even in the simple example above the engine had to backtrack several times while trying to match the input string to the regular expression. It's easy to imagine what could happen to your application performance if backtracking got out of hand. An important part of optimizing a regular expression is minimizing the amount of backtracking that it does.

  • Digg
  • Reddit
  • SlashDot
  • Stumble
  • del.icio.us
  • Technorati
  • dzone
Comments (4)
Login
Forgot your account info?

Another thing, on the subject of NFA's.By Anonymous on January 25, 2010, 10:09 pmJava regular expressions, in general, cannot be represented using finite automata like NFA's, because they can use backreferences. That is why backtracking is...

Reply | Read entire comment

NFA is not backtracking.By Anonymous on January 25, 2010, 5:56 pmGo back and re-do computer science.

Reply | Read entire comment

lookaheads and behindBy Anonymous on October 20, 2009, 4:58 amGreat article, I haven't used lookaheads/behinds until yet.but there are enough situations where the expressions were too slow.

Reply | Read entire comment

regexBy Anonymous on October 9, 2009, 9:55 pmhey..hello just wanna ask when will you use regex..and please kindly give a simple instance or situation as to when you have to use it..tnx

Reply | Read entire comment

View all comments

Add comment
Anonymous comments subject to approval. Register here for member benefits.
Have a JavaWorld account? Log in here. Register now for a free account.
Resources