Page 2 of 4
The Java pattern-matching engine has several optimizations at its disposal and can apply them automatically. I will discuss some of them later in the article. Unfortunately you can't rely on the engine to optimize your regular expressions all the time. In the above example, the regular expression is actually matched pretty fast, but in many cases the expression is too complex and the input string too large for the engine to optimize.
Because of backtracking, regular expressions encountered in real-world application scenarios can sometimes take hours to completely match. Worse, it takes much longer for the engine to declare that a regular expression did not match an input string than it does to find a successful match. This is an important fact to remember. Whenever you want to test the speed of a regular expression, test it mostly on strings that it does not match. Among those, especially use strings that almost match, because those take the longest to complete.
Now let's consider some of the ways you can optimize your regular expressions for backtracking.
Later in the article I'll get into the more involved ways you can optimize regular expressions in Java. To start, though, here are a few simple optimizations that could save you time:
Pattern.compile() instead of the more direct Pattern.matches(). Not compiling the regular expression can be costly if Pattern.matches() is used over and over again with the same expression, for example in a loop, because the matches() method will re-compile the expression every time it is used. Also remember that you can re-use the Matcher object for different input strings by calling the method reset().
(X|Y|Z)" have a reputation for being slow, so watch out for them. First of all, the order of alternation counts, so place the more
common options in the front so they can be matched faster. Also, try to extract common patterns; for example, instead of "(abcd|abef)" use "ab(cd|ef)". The latter is faster because the NFA will try to match ab and won't try any of the alternatives if it doesn't find it. (In this case there are only two alternatives. If there were
many alternatives the gains in speed would be more impressive.) Alternation really can slow down your programs. The expression
".*(abcd|efgh|ijkl).*" was three times slower in my test than using three calls to String.indexOf(), one for each alternative in the regular expression.
(?:X)" instead of "(X)".
As I mentioned before, the java.util.regex engine can optimize a regular expression several ways when it is compiled. For example, if the regular expression contains
a string that must be present in the input string (or else the whole expression won't match), the engine can sometimes search
that string first and report a failure if it doesn't find a match, without checking the entire regular expression.
| Subject |
|
|
java.util.regex package and demonstrates a practical application of regular expressions.
Pattern.matches throws StackOverFlow") is a known bug in the java.util.regex package since Java 1.4.
java.util.regex to learn more about the Pattern and Matcher classes and for a summary of regular-expression constructs.
Another thing, on the subject of NFA's.By Anonymous on January 25, 2010, 10:09 pmJava regular expressions, in general, cannot be represented using finite automata like NFA's, because they can use backreferences. That is why backtracking is...
Reply | Read entire comment
NFA is not backtracking.By Anonymous on January 25, 2010, 5:56 pmGo back and re-do computer science.
Reply | Read entire comment
lookaheads and behindBy Anonymous on October 20, 2009, 4:58 amGreat article, I haven't used lookaheads/behinds until yet.but there are enough situations where the expressions were too slow.
Reply | Read entire comment
regexBy Anonymous on October 9, 2009, 9:55 pmhey..hello just wanna ask when will you use regex..and please kindly give a simple instance or situation as to when you have to use it..tnx
Reply | Read entire comment
View all comments