Regular Expressions in Groovy (via Java)

The people who attended my presentations at RMOUG Training Days 2010 asked several good questions. One question from the Groovy presentation that I really wish I had covered on a slide was "Does Groovy support regular expressions?" The focus of my presentation was on using Groovy for scripting, so this was a very natural and relevant question. I answered that Groovy uses Java's regular expression support and adds a few nifty features of its own to make regular expressions easier to apply. In this blog post, I briefly summarize what I wish I had dedicated a slide to in that presentation. I do plan to have a slide on Groovy regular expression support, and to cover it, in my RMOUG Training Days 2011 presentation "Groovier Java Scripting."

One of my favorite software development quotes concerns regular expressions (see Source of the Famous 'Now You Have Two Problems' Quote for extensive background on this quote):

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

I personally have a sort of love/hate relationship with regular expressions. Regular expressions almost always allow for more concise code, but that does not always translate to "easier" or "more readable" code. There are times when I feel the regular expression solution is most concise, most elegant, and most readable and then there are the other times ...

There are a few things that make regular expressions seem difficult at times. When I use regular expressions regularly, I find myself increasingly fond of them. When I only use them sporadically, I don't find them as easy. Lack of familiarity with regular expression syntax due to infrequent use often makes them more difficult to write and read (I typically find them easier to write than to read). Another cause of the difficulty in reading and writing regular expressions stems from their very advantage: they can be almost too concise at times (especially when not used often). Finally, even when I get comfortable with regular expressions, it can be a minor annoyance to realize again (often the hard way) that there are multiple dialects of regular expressions and that the different dialects have differing syntaxes (Java's regular expression dialect is often said to be Perl-like). These differences make regular expressions less regular.

SIDE NOTE: One of the things I like about the book Regular Expressions Cookbook is that it lists which dialects (it calls them "flavors") of regular expressions work for each recipe (example) in the book. For example, Recipe 2.18 ("Add Comments to a Regular Expression") states that this particular recipe applies to the regular expression "flavors" of Java, Perl, Perl Compatible Regular Expressions, .NET, Python, and Ruby, but does not apply to the JavaScript flavor of regular expressions.

For those of us who used Java alongside other languages and tools that supported regular expressions (such as Unix, Linux, or vi), it was welcome news when it was announced that Java would add regular expression support with JDK 1.4.

Although the addition of regular expressions to Java was welcome, Java's regular expression support is not always the easiest to apply due to requirements of the Java Language Specification. In other words, Java language limitations add another layer of challenge to using regular expressions. Groovy goes a long way toward reducing this extra complexity of regular expressions in Java.

The Java Tutorials' lesson on regular expressions introduces Java's support for regular expressions via the java.util.regex package and highlights the two classes Pattern and Matcher.

The Java Pattern is described in its Javadoc documentation as "a compiled representation of a regular expression." The documentation further explains that "a regular expression, specified as a string, must first be compiled into an instance of this class [Pattern]." A typical approach for obtaining a compiled Pattern instance is to use Pattern p = Pattern.compile(""); with the relevant regular expression specified within the pair of double quotes. Groovy provides a shortcut here with the ~ symbol. Prefixing a String literal in Groovy with ~ creates a Pattern instance from that String. This means that Pattern p = Pattern.compile("a*b"); can be written in Groovy as def p = ~"a*b" (the example used in the Javadoc for Pattern).
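As a point of comparison for the Groovy shortcut, the Java side of that Javadoc example can be sketched as follows. This is only a minimal illustration: the class name PatternCompileSketch and the helper method matchesEntirely are hypothetical names chosen here, while the "a*b" pattern and the "aaaaab" input come from the Pattern Javadoc example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternCompileSketch
{
   /**
    * Compiles the provided regular expression and reports whether the
    * entire input matches it (hypothetical helper for illustration).
    */
   static boolean matchesEntirely(final String regex, final String input)
   {
      // In Groovy, the next line could be written as: def pattern = ~regex
      final Pattern pattern = Pattern.compile(regex);
      final Matcher matcher = pattern.matcher(input);
      return matcher.matches();
   }

   public static void main(final String[] arguments)
   {
      // Zero or more 'a' characters followed by a single 'b'.
      System.out.println(matchesEntirely("a*b", "aaaaab"));
      // The trailing 'c' means the entire input does not match.
      System.out.println(matchesEntirely("a*b", "aaaaac"));
   }
}
```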

The availability of ~ is a speck of syntactic sugar, but Groovy provides more regular expression sweetness than this. One of the least appealing parts of Java's regular expression support is the handling of backslashes within regular expressions. This is really more of a problem of backslash treatment in Java Strings. Groovy makes this much nicer to use by providing the ability to specify the regular expression used in a Pattern with "slashy syntax." This allows regular expressions to appear more natural than they do when the String must be made to comply with Java expectations.

I use the example provided by Regular Expressions Cookbook Recipe 3.1 ("Literal Regular Expressions in Source Code") to illustrate the advantages of Groovy in Pattern representation of a regular expression. This recipe provides the literal regular expression string [$"'\n\d/\\] for its example and explains what this represents: "This regular expression consists of a single character class that matches a dollar sign, a double quote, a single quote, a line feed, any digit between 0 and 9, a forward slash, or a backslash." The only character that must be escaped within the regular expression itself is the backslash: two backslashes are needed to represent a single literal backslash.

As the Regular Expressions Cookbook recipe explains, this regular expression is represented in Java as "[$\"'\n\\d/\\\\]". Ignoring the double quotes on either side of the Java representation, it is still clear that Java String treatment forces the regular expression String [$"'\n\d/\\] to be represented in Java as [$\"'\n\\d/\\\\]. Note that the Java representation must add a backslash in front of the double quote that is part of the regular expression to escape it, must do the same thing for the \d that represents a numeric digit, and then must provide four consecutive backslashes at the end to appropriately escape and represent the two that are actually meant for the regular expression. Regular expressions can be cryptic anyway and even the slightest typo can change everything, so the extra syntax needed for the Java version adds more that can go wrong.
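One way to gain some confidence that the escaped Java String really does express the character class described in the recipe is to test it against each character the recipe lists. The following is a rough sketch under that assumption; the class name CookbookCharacterClassSketch and the method matchesSingleCharacter are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Pattern;

class CookbookCharacterClassSketch
{
   /** Java source form of the regular expression [$"'\n\d/\\]. */
   static final Pattern COOKBOOK_PATTERN = Pattern.compile("[$\"'\n\\d/\\\\]");

   /** Reports whether the single provided character satisfies the character class. */
   static boolean matchesSingleCharacter(final char candidate)
   {
      return COOKBOOK_PATTERN.matcher(String.valueOf(candidate)).matches();
   }

   public static void main(final String[] arguments)
   {
      // The first seven characters are those the recipe lists; 'a' is a
      // control character that is not part of the character class.
      final char[] candidates = {'$', '"', '\'', '\n', '7', '/', '\\', 'a'};
      for (final char candidate : candidates)
      {
         // Print the numeric code so the line feed does not break the output.
         System.out.println(
            "Character code " + (int) candidate + " matches: "
            + matchesSingleCharacter(candidate));
      }
   }
}
```

Note that the \n in the Java String is interpreted by the Java compiler as an actual line feed character before the regular expression engine ever sees it, which is why it needs no additional backslash.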

I demonstrate this example more completely with the following simple Java code.

// Assumes: import java.util.regex.Pattern;
//          import static java.lang.System.out;

// Regular Expression: [$"'\n\d/\\]
//    For Java, must escape the double quote, the \d, and the \\
final String regExCookbook31RegExString = "[$\"'\n\\d/\\\\]";
final Pattern regExCookbook31Pattern = Pattern.compile(regExCookbook31RegExString);
out.println(
     "The original regular expression is: "
   + regExCookbook31Pattern.pattern());

Running the above code leads to the output demonstrated in the next screen snapshot.

Before looking at how Groovy improves on the handling of the regular expression, I round out the Java example that was started above to also include an example of Java's Matcher class in action. The Matcher is obtained from the Pattern instance and supports three types of matching: (1) matching the entire provided sequence against the regular expression pattern [Matcher.matches()], (2) matching at least the beginning portion of the provided sequence against the regular expression pattern [Matcher.lookingAt()], and (3) iterating over the provided sequence looking for one or more pattern matches [Matcher.find()].
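The difference between these three types of matching can be sketched with a small comparison. This is only an illustrative sketch: the class name MatcherModesSketch, its three helper methods, and the \d+ pattern with its sample inputs are all assumptions made here and are not taken from the Java Tutorial.

```java
import java.util.regex.Pattern;

class MatcherModesSketch
{
   /** True only if the entire input matches the regular expression. */
   static boolean matchesAll(final String regex, final String input)
   {
      return Pattern.compile(regex).matcher(input).matches();
   }

   /** True if the beginning of the input matches the regular expression. */
   static boolean startsWithMatch(final String regex, final String input)
   {
      return Pattern.compile(regex).matcher(input).lookingAt();
   }

   /** True if a match exists anywhere within the input. */
   static boolean containsMatch(final String regex, final String input)
   {
      return Pattern.compile(regex).matcher(input).find();
   }

   public static void main(final String[] arguments)
   {
      final String oneOrMoreDigits = "\\d+";
      for (final String input : new String[] {"123", "123abc", "abc123", "abc"})
      {
         System.out.println(
            input + ": matches()=" + matchesAll(oneOrMoreDigits, input)
            + ", lookingAt()=" + startsWithMatch(oneOrMoreDigits, input)
            + ", find()=" + containsMatch(oneOrMoreDigits, input));
      }
   }
}
```

With these inputs, only "123" satisfies all three methods, "123abc" satisfies lookingAt() and find() but not matches(), "abc123" satisfies only find(), and "abc" satisfies none of them.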

The third approach is demonstrated in the Java Tutorial on regular expressions. It provides a "test harness" that I have adapted here:

package dustin.examples;

import java.io.Console;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

import static java.lang.System.err;

/**
 * Regular expression test harness slightly adapted from example provided in
 * Java Tutorial on regular expressions. The location of the original is
 * http://download.oracle.com/javase/tutorial/essential/regex/test_harness.....
 */
public class RegExTestHarness
{
   /**
    * Simple executable method that provides demonstration of Java's regular
    * expression support with java.util.regex package and its classes Pattern
    * and Matcher.  Adapted from Java Tutorial regular expressions test harness.
    *
    * @param arguments Command-line arguments: none expected.
    */
   public static void main(final String[] arguments)
   {
      final Console console = System.console();
      if (console == null)
      {
         err.println("No console available; this application requires it.");
         System.exit(-1);
      }
      String regExInput;
      do
      {
         regExInput = console.readLine("%nEnter your regular expression: ");
         final Pattern pattern = Pattern.compile(regExInput);

         final String searchStringInput =
            console.readLine("Enter input string to search with regular expression: ");
         final Matcher matcher = pattern.matcher(searchStringInput);

         boolean found = false;
         while (matcher.find())
         {
            console.format(
                 "Text \"%s\" located starting at "
               + "index %d and ending at index %d.%n",
               matcher.group(), matcher.start(), matcher.end());
            found = true;
         }
         if (!found)
         {
            console.format("No match found.%n");
         }
      } while (!regExInput.isEmpty());
   }
}

The next screen snapshot demonstrates the output from this adapted Java-based regular expression test harness on the regular expression used above.

This example demonstrates the Matcher.find() method in action: it iterates over the provided input String and returns true whenever it evaluates a character that satisfies the single character regular expression. The Java code then uses other methods on Matcher (Matcher.group(), Matcher.start(), and Matcher.end()) to provide more details on the match.
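Because the test harness needs an interactive console, a non-interactive variation can be handy for quick experiments. The sketch below applies the same find(), group(), start(), and end() loop to a fixed input; the class name FindAllMatchesSketch, the method describeMatches, and the sample input "a$b1/c" are hypothetical choices made only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllMatchesSketch
{
   /**
    * Collects a short description of every match of the regular expression
    * within the input: the matched text, its start index (inclusive), and
    * its end index (exclusive).
    */
   static List<String> describeMatches(final String regex, final String input)
   {
      final Matcher matcher = Pattern.compile(regex).matcher(input);
      final List<String> descriptions = new ArrayList<>();
      while (matcher.find())
      {
         descriptions.add(
            String.format(
               "\"%s\" at [%d,%d)",
               matcher.group(), matcher.start(), matcher.end()));
      }
      return descriptions;
   }

   public static void main(final String[] arguments)
   {
      // Same character class as the cookbook example, in Java source form.
      final String cookbookRegEx = "[$\"'\n\\d/\\\\]";
      for (final String description : describeMatches(cookbookRegEx, "a$b1/c"))
      {
         System.out.println(description);
      }
   }
}
```

For the sample input, the dollar sign, the digit, and the forward slash are each reported as separate one-character matches, while the letters are skipped.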

The Matcher.find() method is the best choice when one wants to find and act upon any regular expression matches in a given input. However, if one is only interested in whether the given input begins with a match to the regular expression or whether the entire input is an exact match of the regular expression, then Matcher.lookingAt() or Matcher.matches(), respectively, is likely to be preferred. The three methods in the next code listing can be used to demonstrate these two Matcher methods along with providing another example of Matcher.find().
