If you've struggled with regular expressions that took hours to match when you needed them to complete in seconds, this article is for you. Java developer Cristian Mocanu explains where and why the regex pattern-matching engine tends to stall, then shows you how to make the most of backtracking rather than getting lost in it, how to optimize greedy and reluctant quantifiers, and why possessive quantifiers, independent grouping, and lookarounds are your friends.
Writing a regular expression is more than a skill -- it's an art.
-- Jeffrey Friedl
In this article I introduce some of the common weaknesses in regular expressions using the default
java.util.regex package. I explain why backtracking is both the foundation of pattern matching with regular expressions and a frequent bottleneck in application code, why you should exercise caution when using greedy and reluctant quantifiers, and why it is essential to benchmark your regex optimizations. I then introduce several techniques for optimizing regular expressions, and discuss what happens when I run my new expressions through the Java pattern-matching engine.
For the purpose of this article I assume that you already have some experience using regular expressions and are most interested in learning how to optimize them in Java code. Topics covered include simple and automated optimization techniques as well as how to optimize greedy and reluctant quantifiers using possessive quantifiers, independent grouping, and lookarounds. See the Resources section for an introduction to regular expressions in Java.
|I use double quotes ("") to delimit regular expressions and input strings, X, Y, Z to denote regular sub-expressions or a portion of a regular expression, and a, b, c, d (et-cetera) to denote single characters.|
The Java pattern-matching engine and backtracking
java.util.regex package uses a type of pattern-matching engine called a Nondeterministic Finite Automaton, or NFA. It's called nondeterministic because while trying to match a regular expression on a given string, each character in the input string might be checked several times against different parts of the regular expression. This is a widely used type of engine also found in .NET, PHP, Perl, Python, and Ruby. It puts great power into the hands of the programmer, offering a wide range of quantifiers and other special constructs such as lookarounds, which I'll discuss later in the article.
At heart, the NFA uses backtracking. Usually there isn't only one way to apply a regular expression on a given string, so the pattern-matching engine will try to exhaust all possibilities until it declares failure. To better understand the NFA and backtracking, consider the following example:
The regular expression is "
sc(ored|ared|oring)x" The input string is "
First, the engine will look for "
sc" and find it immediately as the first two characters in the input string. It will then try to match "
ored" starting from the third character in the input string. That won't match, so it will go back to the third character and try "
ared". This will match, so it will go forward and try to match "
x". Finding no match there, it will go back again to the third character and search for "
oring". This won't match either, and so it will go back to the second character in the input string and try to search for another "
sc". Upon reaching the end of the input string it will declare failure.
Optimization tips for backtracking
With the above example you've seen how the NFA uses backtracking for pattern matching, and you've also discovered one of the problems with backtracking. Even in the simple example above the engine had to backtrack several times while trying to match the input string to the regular expression. It's easy to imagine what could happen to your application performance if backtracking got out of hand. An important part of optimizing a regular expression is minimizing the amount of backtracking that it does.