Java 101: Regular expressions in Java, Part 2

Simplify common coding tasks with the Regex API


Capturing group-oriented methods

The source code for the RegexDemo application includes an m.group() method call. The group() method is one of several capturing group-oriented Matcher methods:

  • int groupCount() returns the number of capturing groups in a matcher's pattern. This count doesn't include the special capturing group number 0, which denotes the entire pattern.
  • String group() returns the previous match's characters. This method returns an empty string to indicate a successful match against the empty string. IllegalStateException is thrown when either the matcher hasn't yet attempted a match or the previous match operation failed.
  • String group(int group) resembles the previous method, except that it returns the previous match's characters as recorded by the capturing group number that group specifies. Note that group(0) is equivalent to group(). If no capturing group with the specified group number exists in the pattern, the method throws IndexOutOfBoundsException. It throws IllegalStateException when either the matcher hasn't yet attempted a match or the previous match operation failed.
  • String group(String name) returns the previous match's characters as recorded by the named capturing group. If there is no capturing group in the pattern with the given name, IllegalArgumentException is thrown. IllegalStateException is thrown when either the matcher hasn't yet attempted a match or the previous match operation failed.

The following example demonstrates the groupCount() and group(int group) methods:

Pattern p = Pattern.compile("(.(.(.)))");
Matcher m = p.matcher("abc");
m.find();
System.out.println(m.groupCount());
for (int i = 0; i <= m.groupCount(); i++)
   System.out.println(i + ": " + m.group(i));

It results in the following output:

3
0: abc
1: abc
2: bc
3: c
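
The group(String name) method isn't exercised by this example. The following minimal sketch shows it together with the IllegalStateException case. The date pattern and its year and month group names are assumptions made for illustration; they don't come from RegexDemo:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo
{
   // Hypothetical pattern for illustration: two named capturing groups.
   static final Pattern DATE =
      Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})");

   // Returns the text captured by the named group in the first match,
   // or null when the input contains no match.
   static String extract(String input, String groupName)
   {
      Matcher m = DATE.matcher(input);
      return m.find() ? m.group(groupName) : null;
   }

   public static void main(String[] args)
   {
      System.out.println(extract("released 2017-03", "year"));  // 2017
      System.out.println(extract("released 2017-03", "month")); // 03

      // Calling group() before any match attempt throws
      // IllegalStateException.
      Matcher m = DATE.matcher("2017-03");
      try
      {
         m.group();
      }
      catch (IllegalStateException ise)
      {
         System.out.println("no match attempted yet");
      }
   }
}
```

Named capturing groups require Java 7 or later.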

Match-position methods

Matcher provides several methods that return the start and end indexes of a match:

  • int start() returns the previous match's start index. IllegalStateException is thrown when either the matcher hasn't yet attempted a match or the previous match operation failed.
  • int start(int group) resembles the previous method, except that it returns the previous match's start index associated with the capturing group that group specifies. If no capturing group with the specified capturing group number exists in the pattern, IndexOutOfBoundsException is thrown. IllegalStateException is thrown when either the matcher hasn't yet attempted a match or the previous match operation failed.
  • int start(String name) resembles the previous method, except that it returns the previous match's start index associated with the capturing group that name specifies. If no capturing group with the specified name exists in the pattern, IllegalArgumentException is thrown. IllegalStateException is thrown when either the matcher hasn't yet attempted a match or the previous match operation failed.
  • int end() returns the index of the last matched character plus one in the previous match. IllegalStateException is thrown when either the matcher hasn't yet attempted a match or the previous match operation failed.
  • int end(int group) resembles the previous method, except that it returns the previous match's end index associated with the capturing group that group specifies. If no capturing group with the specified group number exists in the pattern, IndexOutOfBoundsException is thrown. IllegalStateException is thrown when either the matcher hasn't yet attempted a match or the previous match operation failed.
  • int end(String name) resembles the previous method, except that it returns the previous match's end index associated with the capturing group that name specifies. If no capturing group with the specified name exists in the pattern, IllegalArgumentException is thrown. IllegalStateException is thrown when either the matcher hasn't yet attempted a match or the previous match operation failed.

The following example demonstrates two of the match-position methods reporting start/end match positions for capturing group number 2:

Pattern p = Pattern.compile("(.(.(.)))");
Matcher m = p.matcher("abcabcabc");
while (m.find())
{
   System.out.println("Found " + m.group(2));
   System.out.println("  starting at index " + m.start(2) +
                      " and ending at index " + (m.end(2) - 1));
   System.out.println();
}

This example produces the following output:

Found bc
  starting at index 1 and ending at index 2

Found bc
  starting at index 4 and ending at index 5

Found bc
  starting at index 7 and ending at index 8
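
Because end() returns one past the last matched character, the start()/end() pair can be handed directly to String's substring() method, whose end index is also exclusive. Here is a small sketch of that relationship; the class and method names are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchPositionDemo
{
   // Returns "start-end" (end is exclusive) for the first match of regex
   // in input, or null when there is no match.
   static String firstSpan(String regex, String input)
   {
      Matcher m = Pattern.compile(regex).matcher(input);
      if (!m.find())
         return null;
      return m.start() + "-" + m.end();
   }

   public static void main(String[] args)
   {
      String input = "aabbbcc";
      Matcher m = Pattern.compile("b+").matcher(input);
      if (m.find())
      {
         // substring(start, end) yields the same text as group().
         String viaSubstring = input.substring(m.start(), m.end());
         System.out.println(viaSubstring.equals(m.group())); // true
      }
      System.out.println(firstSpan("b+", input)); // 2-5
   }
}
```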

PatternSyntaxException methods

An instance of the PatternSyntaxException class describes a syntax error in a regex. This exception is thrown from Pattern's compile() and matches() methods, and is constructed via the following constructor:

PatternSyntaxException(String desc, String regex, int index)

The constructor stores the specified description, regex, and index where the syntax error occurs in the regex. The index is set to -1 when the syntax error location isn't known.

Although you'll probably never need to instantiate PatternSyntaxException, you will need to extract the aforementioned values when creating a formatted error message. Invoke the following methods to accomplish this task:

  • String getDescription() returns the syntax error's description.
  • int getIndex() returns either the approximate index (within a regex) where the syntax error occurs or -1 when the index is unknown.
  • String getPattern() returns the erroneous regex.

Additionally, the inherited String getMessage() method returns a multiline string containing the values returned from the aforementioned methods along with a visual indication of the syntax error position in the pattern.

What constitutes a syntax error? Here's an example:

java RegexDemo (?itree Treehouse

In this case we've failed to specify the closing parenthesis metacharacter, ), in the embedded flag expression. The error results in the following output:

regex = (?itree
input = Treehouse
Bad regex: Unknown inline modifier near index 3
(?itree
   ^
Description: Unknown inline modifier
Index: 3
Incorrect pattern: (?itree
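
If you want to experiment outside of RegexDemo, the following sketch catches the exception and reads its three values. The class name and the describe() helper are hypothetical, and the exact description text may vary between Java versions:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxErrorDemo
{
   // Returns null when regex compiles; otherwise returns a one-line
   // summary built from the exception's accessor methods.
   static String describe(String regex)
   {
      try
      {
         Pattern.compile(regex);
         return null;
      }
      catch (PatternSyntaxException pse)
      {
         return pse.getDescription() + " at index " + pse.getIndex() +
                " in " + pse.getPattern();
      }
   }

   public static void main(String[] args)
   {
      System.out.println(describe("(?itree")); // missing closing parenthesis
      System.out.println(describe("(?i)tree")); // valid regex, prints null
   }
}
```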

Build useful regex-oriented applications with the Regex API

Regexes let you create powerful text-processing applications. This section presents a pair of useful applications that invite you to further explore the classes and methods in the Regex API. The second application also introduces Lexan: a reusable library for performing lexical analysis.

Regex for documentation

Documentation is one of the necessary tasks of developing professional-quality software. Fortunately, regexes can help with many aspects of documentation. The code in Listing 1 copies the lines containing single-line and multiline C-style comments from a source file to a target file. Each comment must fit on a single line for the code to work:

Listing 1. Extracting comments

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class ExtCmnt
{
   public static void main(String[] args)
   {
      if (args.length != 2)
      {
         System.err.println("usage: java ExtCmnt infile outfile");
         return;
      }

      Pattern p;
      try
      {
         // The following pattern defines multiline comments that appear on the
         // same line (e.g., /* same line */) and single-line comments (e.g., //
         // some line). The comment may appear anywhere on the line.

         p = Pattern.compile(".*/\\*.*\\*/|.*//.*$");
      }
      catch (PatternSyntaxException pse)
      {
         System.err.printf("Regex syntax error: %s%n", pse.getMessage());
         System.err.printf("Error description: %s%n", pse.getDescription());
         System.err.printf("Error index: %s%n", pse.getIndex());
         System.err.printf("Erroneous pattern: %s%n", pse.getPattern());
         return;
      }

      try (FileReader fr = new FileReader(args[0]);
           BufferedReader br = new BufferedReader(fr);
           FileWriter fw = new FileWriter(args[1]);
           BufferedWriter bw = new BufferedWriter(fw))
      {
         Matcher m = p.matcher("");
         String line;
         while ((line = br.readLine()) != null)
         {
            m.reset(line);
            if (m.matches()) /* entire line must match */
            {
               bw.write(line);
               bw.newLine();
            }
         }
      }
      catch (IOException ioe)
      {
         System.err.println(ioe.getMessage());
         return;
      }
   }
}

Listing 1's main() method first validates its command line and then compiles a regex for locating single-line and multiline comments into a Pattern object. Assuming no PatternSyntaxException arises, main() opens the source file, creates the target file, obtains a matcher for matching each line against the pattern, and reads the source file's contents line by line. For each line, the matcher tries to match the entire line against the comment pattern. If there's a match, main() writes the line (followed by a newline) to the target file. (We'll explore file I/O logic in a future Java 101 tutorial.)

Compile Listing 1 as follows:

javac ExtCmnt.java

Run the application against ExtCmnt.java:

java ExtCmnt ExtCmnt.java out

You should observe the following output in the out file:

         // The following pattern defines multiline comments that appear on the
         // same line (e.g., /* same line */) and single-line comments (e.g., //
         // some line). The comment may appear anywhere on the line.
         p = Pattern.compile(".*/\\*.*\\*/|.*//.*$");
            if (m.matches()) /* entire line must match */

In the ".*/\\*.*\\*/|.*//.*$" pattern string, the vertical bar metacharacter (|) acts as a logical OR operator: the matcher first tries to match its text against the operator's left operand. If that attempt fails, the matcher tries the operator's right operand. (The parentheses metacharacters in a capturing group form another logical operator.)
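
To see how each side of the vertical bar behaves, you can try Listing 1's pattern against a few sample lines. The following sketch uses a hypothetical hasComment() helper and made-up input lines. Note the last case: because matches() requires the entire line to match, a /* */ comment that is followed by code doesn't qualify.

```java
import java.util.regex.Pattern;

class AlternationDemo
{
   // Same pattern string as in Listing 1.
   static final Pattern COMMENT = Pattern.compile(".*/\\*.*\\*/|.*//.*$");

   // Returns true when the entire line matches either alternative.
   static boolean hasComment(String line)
   {
      return COMMENT.matcher(line).matches();
   }

   public static void main(String[] args)
   {
      // Matched by the left alternative (line ends with */).
      System.out.println(hasComment("int x = 0; /* counter */")); // true
      // Left alternative fails; matched by the right alternative.
      System.out.println(hasComment("int y = 1; // index"));      // true
      // Neither alternative matches.
      System.out.println(hasComment("int z = 2;"));               // false
      // Code follows the closing */, so the whole line can't match.
      System.out.println(hasComment("/* note */ int w = 3;"));    // false
   }
}
```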

Regex for lexical analysis

An even more useful application of regexes is a reusable library for performing lexical analysis, a key component of any code compiler or assembler. In this case, an input stream of characters is grouped into an output stream of tokens, which are names representing sequences of characters that have a collective meaning. For example, upon encountering the letter sequence c, o, u, n, t, e, r in the input stream, a lexical analyzer might output token ID (identifier). The character sequence associated with the token is known as the lexeme.

Regex-based lexical analyzers take much less effort to develop than state-based lexical analyzers, which must be written by hand and are typically not reusable. An example of a regex-based approach is JLex, the lexical analyzer generator for Java, which relies on regexes to specify the rules for breaking an input stream into tokens. Another example is Lexan.
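
Before looking at Lexan, it may help to see the general idea in miniature. The following sketch is not Lexan's code and not how JLex works; it's one simple way to drive a lexer from regexes, assuming a hypothetical set of four tokens (ID, NUM, PLUS, and WS) expressed as named capturing groups:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MiniLexer
{
   // Hypothetical token names and patterns, chosen only for illustration.
   static final String[] NAMES = { "ID", "NUM", "PLUS", "WS" };
   static final Pattern TOKENS = Pattern.compile(
      "(?<ID>[A-Za-z_]\\w*)|(?<NUM>\\d+)|(?<PLUS>\\+)|(?<WS>\\s+)");

   // Breaks input into "TOKEN:lexeme" strings, skipping whitespace.
   static List<String> lex(String input)
   {
      List<String> result = new ArrayList<>();
      Matcher m = TOKENS.matcher(input);
      int pos = 0;
      while (pos < input.length())
      {
         // Restrict matching to the unread part of the input and require
         // a token to begin exactly at pos.
         m.region(pos, input.length());
         if (!m.lookingAt())
            throw new IllegalArgumentException(
               "unexpected character at index " + pos);
         for (String name : NAMES)
            if (m.group(name) != null) // this alternative matched
            {
               if (!name.equals("WS"))
                  result.add(name + ":" + m.group(name));
               break;
            }
         pos = m.end(); // every token consumes at least one character
      }
      return result;
   }

   public static void main(String[] args)
   {
      System.out.println(lex("counter + 42"));
      // [ID:counter, PLUS:+, NUM:42]
   }
}
```

Each string in the result pairs a token name with its lexeme, which is the same pairing that the previous section describes.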

Getting to know Lexan

Lexan is a reusable Java library for lexical analysis. It's based on code in the Cogito Learning website's Writing a Parser in Java blog series. The library consists of the following classes, which you will find in the ca.javajeff.lexan package included with the source download for this article:

  • Lexan: the lexical analyzer
  • LexanException: an exception arising from Lexan's constructor
  • LexException: an exception arising from bad syntax during lexical analysis
  • Token: a name with a regex attribute
  • TokLex: a token/lexeme pair

The Lexan(java.lang.Class<?> tokensClass) constructor creates a new lexical analyzer. It requires a single java.lang.Class object argument denoting a class of static Token constants. Using the Reflection API, the constructor reads each Token constant into a Token[] array of values. If no Token constants are present, LexanException is thrown.
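
Lexan's constructor source isn't shown here, so the following is only a sketch of the general technique: using the Reflection API to collect static constants of a given type from a class. The Tok and MyTokens classes are hypothetical stand-ins for Token and a class of Token constants, and the exception thrown for an empty result is an ordinary IllegalArgumentException rather than LexanException:

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

class TokenScanDemo
{
   // Hypothetical stand-in for a token type: a name with a regex attribute.
   static final class Tok
   {
      final String name;
      final String pattern;

      Tok(String name, String pattern)
      {
         this.name = name;
         this.pattern = pattern;
      }

      @Override
      public String toString()
      {
         return name;
      }
   }

   // Hypothetical class of static token constants.
   static final class MyTokens
   {
      public static final Tok ID = new Tok("ID", "[A-Za-z_]\\w*");
      public static final Tok NUM = new Tok("NUM", "\\d+");
      public static final String VERSION = "1.0"; // not a Tok, so skipped
   }

   // Collects every static field of type Tok declared in tokensClass.
   static List<Tok> readTokens(Class<?> tokensClass)
   {
      List<Tok> tokens = new ArrayList<>();
      for (Field f : tokensClass.getDeclaredFields())
      {
         if (!Modifier.isStatic(f.getModifiers()) || f.getType() != Tok.class)
            continue;
         try
         {
            tokens.add((Tok) f.get(null)); // null receiver: static field
         }
         catch (IllegalAccessException iae)
         {
            throw new IllegalStateException("cannot read " + f.getName(), iae);
         }
      }
      if (tokens.isEmpty())
         throw new IllegalArgumentException(
            "no token constants in " + tokensClass.getName());
      return tokens;
   }

   public static void main(String[] args)
   {
      for (Tok t : readTokens(MyTokens.class))
         System.out.println(t + " -> " + t.pattern);
   }
}
```

Note that getDeclaredFields() doesn't guarantee any particular field order, so code that depends on token priority shouldn't rely on declaration order alone.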

Lexan also provides the following pair of methods:

  • List<TokLex> getTokLexes() returns this lexical analyzer's list of TokLexes.
  • void lex(String str) lexes an input string into a list of TokLexes. LexException is thrown if a character is encountered that doesn't match any of the Token[] array's patterns.

LexanException declares no methods of its own; it relies on its inherited getMessage() method to return the exception's message. In contrast, LexException also provides the following methods:

  • int getBadCharIndex() returns the index of the character that didn't match any token patterns.
  • String getText() returns the text that was being lexed when the exception occurred.

Token overrides the toString() method to return the token's name. It also provides a String getPattern() method that returns the token's regex attribute.

TokLex provides a Token getToken() method that returns its token. It also provides a String getLexeme() method that returns its lexeme.
