Java 101: Regular expressions in Java, Part 2

Simplify common coding tasks with the Regex API


Demonstrating Lexan

I've created a LexanDemo application that demonstrates the library. This application consists of the LexanDemo, BinTokens, MathTokens, and NoTokens classes. Listing 2 presents LexanDemo's source code.

Listing 2. Demonstrating Lexan

import ca.javajeff.lexan.Lexan;
import ca.javajeff.lexan.LexanException;
import ca.javajeff.lexan.LexException;
import ca.javajeff.lexan.TokLex;

public final class LexanDemo
{
   public static void main(String[] args)
   {
      lex(MathTokens.class, " sin(x) * (1 + var_12) ");
      lex(BinTokens.class, " 1 0 1 0 1");
      lex(BinTokens.class, "110");
      lex(BinTokens.class, "1 20");
      lex(NoTokens.class, "");
   }

   private static void lex(Class<?> tokensClass, String text)
   {
      try
      {
         Lexan lexan = new Lexan(tokensClass);
         lexan.lex(text);
         for (TokLex tokLex: lexan.getTokLexes())
            System.out.printf("%s: %s%n", tokLex.getToken(), 
                              tokLex.getLexeme());
      }
      catch (LexanException le)
      {
         System.err.println(le.getMessage());
      }
      catch (LexException le)
      {
         System.err.println(le.getText());
         for (int i = 0; i < le.getBadCharIndex(); i++)
            System.err.print("-");
         System.err.println("^");
         System.err.println(le.getMessage());
      }
      System.out.println();
   }
}

Listing 2's main() method invokes the lex() utility method to demonstrate lexical analysis via Lexan. Each call to this method passes a Class object for a class of tokens and a string to analyze.

The lex() method first instantiates the Lexan class, passing the Class object to Lexan's constructor. It then invokes Lexan's lex() method on the string.

If lexical analysis succeeds, Lexan's getTokLexes() method is called to return a list of TokLex objects. For each object, TokLex's getToken() method is called to return the token and its getLexeme() method is called to return the lexeme. Both values are output. If construction or lexical analysis fails, LexanException or LexException is thrown, respectively. The first handler prints the exception's message. The second prints the text being lexed, a caret marking the offending character, and the message.

For brevity, let's consider only MathTokens out of the remaining classes making up this application. Listing 3 presents this class's source code.

Listing 3. Describing a set of tokens for a small math language

import ca.javajeff.lexan.Token;

public final class MathTokens
{
   public final static Token FUNC = new Token("FUNC", "sin|cos|exp|ln|sqrt");
   public final static Token LPAREN = new Token("LPAREN", "\\(");
   public final static Token RPAREN = new Token("RPAREN", "\\)");
   public final static Token PLUSMIN = new Token("PLUSMIN", "[+-]");
   public final static Token TIMESDIV = new Token("TIMESDIV", "[*/]");
   public final static Token CARET = new Token("CARET", "\\^");
   public final static Token INTEGER = new Token("INTEGER", "[0-9]+");
   public final static Token ID = new Token("ID", "[a-zA-Z][a-zA-Z0-9_]*");
}

Listing 3 reveals that MathTokens defines a sequence of Token constants. Each constant is initialized to a Token object. That object's constructor receives a string naming the token, along with a regex that describes all character strings belonging to that token. The string-based token name should match the name of the constant (for clarity), but this isn't mandatory.

The order in which the Token constants are declared is important. Constants declared earlier in the class take precedence over constants declared later. For example, when sin is encountered, Lexan chooses FUNC instead of ID as the token. If ID were declared before FUNC, ID would be chosen.
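The precedence rule can be illustrated without Lexan itself. The following is a minimal sketch that tries a list of regexes in order and reports the first one that matches at the start of the input. The PrecedenceDemo class and its firstMatch() method are hypothetical names introduced for illustration; they are not part of Lexan.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a stand-in for an ordered token list, not Lexan's code.
final class PrecedenceDemo
{
   // Returns the name of the first token whose regex matches at the start of
   // the input, or null if no token matches. Each row is { name, regex }.
   static String firstMatch(String[][] tokens, String input)
   {
      for (String[] token: tokens)
      {
         Matcher m = Pattern.compile(token[1]).matcher(input);
         if (m.lookingAt()) // the match must begin at index 0
            return token[0];
      }
      return null;
   }

   public static void main(String[] args)
   {
      String[][] funcFirst = { { "FUNC", "sin|cos|exp|ln|sqrt" },
                               { "ID", "[a-zA-Z][a-zA-Z0-9_]*" } };
      String[][] idFirst = { funcFirst[1], funcFirst[0] };
      System.out.println(firstMatch(funcFirst, "sin")); // prints FUNC
      System.out.println(firstMatch(idFirst, "sin"));   // prints ID
   }
}
```

Because both regexes match sin, the only thing that decides the outcome is which one is tried first.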

Compiling and running LexanDemo

The source download for this article includes the lexan.zip archive, which contains all the distribution files for Lexan. Unzip this archive and set the current directory to the lexan home directory's demos subdirectory.

If you're using Windows, execute the following command to compile the demo's source files (on Unix-like systems, replace \ with / in both commands and ; with : in the run command):

javac -cp ..\library\lexan.jar *.java

Following a successful compilation, execute this command to run the demo:

java -cp ..\library\lexan.jar;. LexanDemo

You should observe the following output:

FUNC: sin
LPAREN: (
ID: x
RPAREN: )
TIMESDIV: *
LPAREN: (
INTEGER: 1
PLUSMIN: +
ID: var_12
RPAREN: )
ONE: 1
ZERO: 0
ONE: 1
ZERO: 0
ONE: 1
ONE: 1
ONE: 1
ZERO: 0
1 20
--^
Unexpected character in input: 20
no tokens

The Unexpected character in input: 20 message arises from a thrown LexException, which is caused by BinTokens not defining a Token constant whose regex matches 2. Note the exception handler's output of the text being lexed and the location of the offending character. The no tokens message arises from a thrown LexanException because NoTokens defines no Token constants.
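To see how the index of the offending character can be found and turned into the dashes-and-caret line, consider the following self-contained sketch. It is not Lexan's code: the BadCharDemo class, its two-token table (named ONE and ZERO to mirror the output above), and the lex() and pointer() helpers are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a stand-in for a lex()-style loop, not Lexan's code.
final class BadCharDemo
{
   // Hypothetical binary token set; each row is { name, regex }.
   static final String[][] TOKENS = { { "ONE", "1" }, { "ZERO", "0" } };

   // Returns the index of the first character that no token matches, or -1
   // if the whole string lexes cleanly. "NAME: lexeme" strings go into out.
   static int lex(String text, List<String> out)
   {
      int pos = 0;
      while (true)
      {
         // skip whitespace between tokens
         while (pos < text.length() &&
                Character.isWhitespace(text.charAt(pos)))
            pos++;
         if (pos == text.length())
            return -1;
         boolean match = false;
         for (String[] token: TOKENS)
         {
            Matcher m = Pattern.compile(token[1]).matcher(text);
            m.region(pos, text.length());
            if (m.lookingAt()) // the match must begin at pos
            {
               out.add(token[0] + ": " + m.group());
               pos = m.end();
               match = true;
               break;
            }
         }
         if (!match)
            return pos;
      }
   }

   // Builds the two-line diagnostic: the text, then dashes and a caret.
   static String pointer(String text, int badIndex)
   {
      StringBuilder sb = new StringBuilder(text).append('\n');
      for (int i = 0; i < badIndex; i++)
         sb.append('-');
      return sb.append('^').toString();
   }

   public static void main(String[] args)
   {
      List<String> out = new ArrayList<>();
      int bad = lex("1 20", out);
      System.out.println(out);                   // prints [ONE: 1]
      System.out.println(pointer("1 20", bad));  // prints 1 20, then --^
   }
}
```

The sketch tracks a position in the original string rather than repeatedly trimming a shrinking copy, which keeps the reported index tied directly to the text that the user supplied.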

Behind the scenes

Lexan relies on the Lexan class as its engine. Check out Listing 4 to see how this class is implemented and how regexes contribute to the engine's reusability.

Listing 4. Architecting a regex-based lexical analyzer

package ca.javajeff.lexan;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;

/**
 *  A lexical analyzer. You can use this class to transform an input stream of 
 *  characters into an output stream of tokens.
 *
 *  @author Jeff Friesen
 */

public final class Lexan
{
   private List<TokLex> tokLexes;

   private Token[] values;

   /**
    *  Initialize a lexical analyzer to a set of Token objects.
    *
    *  @param tokensClass the Class object of a class containing a set of Token
    *         objects
    *
    *  @throws LexanException unable to construct a Lexan object, possibly 
    *          because there are no Token objects in the class
    */

   public Lexan(Class<?> tokensClass) throws LexanException
   {
      try
      {
         tokLexes = new ArrayList<>();
         List<Token> _values = new ArrayList<>();
         Field[] fields = tokensClass.getDeclaredFields();
         for (Field field: fields)
            if (field.getType().getName().equals("ca.javajeff.lexan.Token"))
               _values.add((Token) field.get(null));
         values = _values.toArray(new Token[0]);
         if (values.length == 0)
            throw new LexanException("no tokens");
      }
      catch (IllegalAccessException iae)
      {
         throw new LexanException(iae.getMessage());
      }
   }

   /**
    *  Get this lexical analyzer's list of toklexes.
    *
    *  @return list of toklexes
    */

   public List<TokLex> getTokLexes()
   {
      return tokLexes;
   }

   /**
    *  Lex an input string into a list of toklexes.
    *
    *  @param str the string being lexed
    *
    *  @throws LexException unexpected character found in input
    */

   public void lex(String str) throws LexException
   {
      String s = str.trim(); // remove leading and trailing whitespace
      int index = str.indexOf(s); // offset of first non-whitespace character
      tokLexes.clear();
      while (!s.equals(""))
      {
         boolean match = false;
         for (int i = 0; i < values.length; i++)
         {
            Token token = values[i];
            Matcher m = token.getPattern().matcher(s);
            if (m.find())
            {
               match = true;
               tokLexes.add(new TokLex(token, m.group().trim()));
               String t = s;
               s = m.replaceFirst("").trim(); // remove leading whitespace
               index += (t.length() - s.length());
               break;
            }
         }
         if (!match)
            throw new LexException("Unexpected character in input: " + s, str,
                                   index);
      }
   }
}

The code in the lex() method is based on the code presented in the blog post "Writing a Parser in Java: The Tokenizer" from Cogito Learning. Check out that post to learn more about how Lexan leverages the Regex API for code compilation.
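Listing 4's constructor finds the Token constants by reflecting over the fields of the class it's given. The following sketch shows the same technique in isolation, with String constants standing in for Token constants. The Colors and ReflectDemo classes and the constantsOf() method are hypothetical names used only for this illustration.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Illustration only: String constants stand in for Token constants.
final class Colors
{
   public static final String RED = "#f00";
   public static final String GREEN = "#0f0";
   public static final int COUNT = 2; // different type, so it's skipped
}

final class ReflectDemo
{
   // Collects "NAME=value" for every static field of the given type that
   // the holder class declares.
   static List<String> constantsOf(Class<?> holder, Class<?> type)
   {
      List<String> found = new ArrayList<>();
      try
      {
         for (Field field: holder.getDeclaredFields())
            if (field.getType() == type &&
                Modifier.isStatic(field.getModifiers()))
               found.add(field.getName() + "=" + field.get(null));
      }
      catch (IllegalAccessException iae)
      {
         throw new IllegalStateException(iae);
      }
      return found;
   }

   public static void main(String[] args)
   {
      for (String constant: constantsOf(Colors.class, String.class))
         System.out.println(constant);
   }
}
```

Passing null to Field.get() works only for static fields, which is why the sketch checks for the static modifier first. Note also that the documentation for getDeclaredFields() states that the returned array isn't in any particular order, so code that depends on declaration order is relying on typical JVM behavior rather than on a guarantee.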

In conclusion

Regular expressions are a useful tool that every developer needs to understand. Java's Regex API makes it easy to integrate them into your applications and libraries. Now that you possess a basic understanding of regexes and this API, study java.util.regex's SDK documentation to learn even more about regexes and additional API methods.
