Looking for lex and yacc for Java? You don't know Jack

How to get and use Sun's free automatic Java parser generator -- a unique new tool that's a must for Java compiler developers

1 2 3 Page 2
Page 2 of 3

The second definition defines a token, EOL, that identifies the end of line. Because the IGNORE_IN_BNF token is discarding carriage return characters as they are read from the input stream, only the new line characters -- not carriage returns -- will ever be seen by this parser. The definition of the end-of-line token takes advantage of those discards by specifying that an end-of-line token is simply a new line character. On Unix systems, the new line character is already the standard indication of an end of line, but on Windows-based systems, both new lines and carriage returns are present to denote an end of line. When the carriage returns are discarded, both systems appear to have the same end-of-line indication.

The third token definition defines the characters that this grammar will interpret as operators. The characters are obviously the most common ones used for these operations.

The last definition takes into account the numbers used in our example. This definition of CONSTANT matches integers composed of the digits 0 through 9.

There are two constructs in the definition of CONSTANT that are interesting enough to merit a bit of discussion. The definition of DIGIT is a character in the range of character "0" and character "9" inclusive. Using the dash between the character literals for 0 and 9 tells Jack that you mean all characters in the range. The alternative would be to specify "0", "1", "2", ... , "9," which requires a lot of typing. Jack uses the combination of the symbols "[" and "]" with the character literals separated by either "," or "-", to define tokens that are one of the characters enclosed by brackets. This definition would suffice if we never needed to represent a number above 9. However, numbers can have more than one digit in them so the definition of CONSTANT is <DIGIT> surrounded by "(" and ")+". The parentheses around DIGIT group it as a token that may repeat in the definition of a CONSTANT. The closing parenthesis followed by plus (+) means that a CONSTANT is composed of at least one but possibly more DIGIT tokens right next to each other. Thus, this definition will accept "0", "123", and "124512561255610697" as numbers, but it will consider "1 2 3" as three numbers (they are separated by spaces).

After our lexical tokens we define the non-terminals. These are defined from the top down, meaning the most general definition is first, followed by increasingly more detailed productions until we get to the handling of constants.

The first production is defined as:

one_line ::= sum <EOL>
| <EOL>
| <EOF>

which defines one "line" of input to be a sum non-terminal that is followed by an end of line, a single end of line, or a single end of file. This is expressed in the Jack grammar as follows:

int one_line() :
{}
{
    sum() <EOL>
    { return 1; }
  | <EOL>
    { return 0; }
  | <EOF>
    { return -1; }
}

Don't get confused by the Java code interspersed in the definition; basically it is there to help the parser distinguish between the three kinds of results. Regardless of the particular production matched, the return result is based on the production that actually matched the input stream. If you go back and look at the main parser code, you will notice how it handles each return.

The next production is defined as follows:

sum ::= term
          | term "+" term
          | term "-" term

which means that a sum is one of a term, two term productions separated by a plus (+) character, or two term productions separated by a minus(-) character. This is expressed in the Jack grammar as follows:

void sum() :
{ }
{
    term() (( <PLUS> | <MINUS> ) term())*
}

In the sum Jack production, the parenthesis around the terminals <PLUS> | <MINUS> means "match one of these two terminals." The parenthesis around this sub-production and term groups the sub expression and term into an expansion unit. Following the closing parenthesis the character star (*) means that there can be zero or more occurrences of this trailing group. Thus if we represent T as a term, the following groups will match the sum production: T, T+T, T-T, T+T-T, and T-T+T, but not T+T*T. The last sequence, T+T*T, does not match the sum production because that production does not allow the MULTIPLY (*) token to be present in the input stream. We add the mulitplicitive operators in the next definition as follows:

void term() :
{ }
{
    unary() (( <MULTIPLY> | <DIVIDE> ) unary())*
}

You will notice that the definition for term is just like sum, except that the operator terminals were changed to <MULTIPLY> and <DIVIDE>. Building the productions this way has a very useful property: The term production must be completely satisfied before the sum production is processed. Thus, if you have the input to this parser of T+T*U, this is processed as T+(T*U), as one might expect. The precedence relationship of the '*' operation is established by making it a subexpansion in the parsing structure of sum. It is also useful to note that these two productions are left associative -- the expression T*T*T*T is parsed as (((T*T)*T)*T).

There are two final definitions, one for unary and one for element. These are shown below.

void unary() :
{ }
{
    <MINUS> element()
  |     element()
}
void element() :
{}
{
    <CONSTANT>
|   "(" sum() ")"
}

The unary production provides a definition for the unary minus operation. The minus symbol is not parsed as part of the CONSTANT token; rather, if the characters "-55" are read on input, they are parsed as two tokens, "-" "55". This makes the definition of the CONSTANT token cleaner, and this code can parse "-55" just as easily as it parses "- ( 50 + 5 )." The element production returns a number to be operated on by the other productions, unless it sees a left parenthesis. If it does, it looks for another expression inside the parenthesis and then passes on the result of that expression. This expansion then treats ( T +T ) as an element rather than as a sum. So the expression T * (T + T) parses as you would expect it to.

You can, of course, compile this grammar with Jack, and it would generate a functional Calc1 class for you, however, all that class would do would be to throw an exception when an invalid expression was given to it. That's no fun, so let's give our creation some life and learn a bit more about Jack in the process.

Animating the Calc grammar

Before I begin, I will include the next example grammar here.

The first step is to make the base class do something useful. In the case of the calculating grammar, I modified it as follows. (The name Calc1i is a convention that I've adopted for this article to indicate that the grammar is an implementation of Calc1.)

PARSER_BEGIN(Calc1i)
public class Calc1i {
    static java.util.Stack argStack = new java.util.Stack();
    public static void main(String args[]) throws ParseError {
    Calc1i parser = new Calc1i(System.in);
    while (true) {
        System.out.print("Enter Expression: ");
        System.out.flush();
        try {
        switch (parser.one_line()) {
            case -1:
            System.exit(0);
            case 0:
            break;
            case 1:
            int x = ((Integer) argStack.pop()).intValue();
            System.out.println("Total = " + x);
            break;
        }
        } catch (ParseError x) {
        System.out.println("Exiting.");
        throw x;
        }
    }
    }
}
PARSER_END(Calc1i)

As you can see, the class now declares a new static class variable argStack that is a java.util.Stack object. The main method's while loop that repeatedly calls one_line is modified to pop the top value off the argument stack and print it. If the ParseError exception is caught, the message "Exiting" is printed, followed by a stack trace of where the error occurred. In a production grammar you might simply print "Error: Exiting" and be done. If you recall, the one_line production was specified as follows:

int one_line() :
{}
{
    sum() <EOL>
    { return 1; }
  |  <EOL>
    { return 0; }
  |  <EOF>
    { return -1; }
}

The first thing to notice is that the method name is declared as returning an integer. There is a minor limitation on the number of types that this method can return; check the Jack documentation for a complete description. The Java code is enclosed in the characters "{" and "}". Jack's truly remarkable flexibility is apparent in that every terminal in the production can be followed by a bit of Java code. Furthermore, after every terminal, you are assured that all previous terminals in the current expansion also have been parsed. Thus, if you wanted to, you could write sum() { do_something } EOL { do_something_else } and this would not interfere with writing | EOL { do_yet_another_thing }. While both EOL tokens would be followed by some Java code, only the tokens that were successfully parsed have their code executed.

Refer back to the definition of the main method, which runs the parser in an endless while loop. When the method one_line returns a value of 1, it indicates that an expression has been successfully parsed and that the result of the expression is available on the Stack object named argStack. The main then pops off the topmost value on the stack and prints it out. If you are wondering how the result ended up on argStack, consider the implementation of sum shown in the code below:

void sum() :
{ Token x; } /* This is declared for the whole method */
{
    term() ( 
    ( x = <PLUS> | x = <MINUS> ) term()
    {   /* these int variables are locally defined. */
        int a = ((Integer) argStack.pop()).intValue();
        int b = ((Integer) argStack.pop()).intValue();
        if ( x.kind == PLUS )
        argStack.push(new Integer(b + a));
        else
        argStack.push(new Integer(b - a));
    }
    )*
}

As you can see in the code above, the Java block is the first thing after the sum definition. Up to now, that block hadn't been used, but in this case I put a declaration for the variable x in it that I wanted to use in the production later. This variable is a reference to a Token object; the Token class is defined by Jack. Tokens are the objects that Jack stores all of its terminals in, and they are defined by the generated classes.

Next, you will notice that the construction "( <PLUS> | <MINUS> )" has be augmented to read "( x = <PLUS> | x = <MINUS> )." When Jack lexically analyzes a token, the result is returned as a Token object. In the case of the sum production, if the "+" character was seen, Jack will return a token whose instance variable kind will contain PLUS and whose instance variable image will contain the parsed string "+". Conversely if the "-" character was seen, the token returned will be of kind MINUS. The tricky bit is that while this production defines three possible choices, term, term + term, and term - term, only one of them will match. So when the second term production is parsed, x will contain a token with either a plus or a minus, but not both.

Following the second term terminal there is another Java code block that is still contained inside the "(" and ")*" characters. This block pops two values off the argStack and then, depending on whether x was a plus or minus, executes the add or subtract operation and pushes the result back onto the stack. Where did the values come from? We'll see in a minute, but before we do, think for a minute what the new version of the term EBNF looks like. When you are done, try comparing your result with the source. In the interest of brevity, I'll skip it and move on to element, which is shown below.

void element() :
{}
{
    <CONSTANT>
    {   try {
            int x = Integer.parseInt(token.image);
            argStack.push(new Integer(x));
        } catch (NumberFormatException ee) {
        argStack.push(new Integer(0));
        }
    }
|
    "(" sum() ")"
}

As you can see, element does not declare any local variables; there are only two legal expansions. First the parser may have seen a Constant, which, if you recall from the token declarations before, was a string of one or more digits between 0 and 9. If this is the expansion that Jack is parsing, the code block following Constant is executed. This block converts the string in token into an integer using the parseInt method in the Integer class. Note that token is effectively a global variable that holds the current token, and the field image holds the string representing this token. Another item of interest is that while this code catches NumberFormatException, we can be assured that this exception will never occur because Jack's lexical analyzer wouldn't call it a Constant if it wasn't a legal integer.

Once the number is parsed, it is pushed onto argStack to be used by the other productions in this grammar. Alternatively, the parser may have seen a left parenthesis, in which case it will recursively parse another sum production. When the sum terminal is parsed, no Java code is executed as a result of that match. The code is not needed as the result of parsing an element terminal is to leave the value of the element on the top of the argument stack. Since the implementation of sum leaves its result on the argument stack, there is no need to do any additional work.

If you haven't figured it out completely, this walkthrough shows how this grammar parses the expression "10 + 2 * 8 / (14 + 2 )."

Moving deeper

There is no limit to what you can do with Jack, however a couple of items will help you move ahead in your experiments. The Calc1 grammar used integer constants and left associative infix operators. Of course my .99 calculator can do floating point. So how might one extend the grammar to handle floating-point numbers? The answer is shown in the Calc2 grammar, and it is implemented in the Calc2i grammar. To summarize the differences, the Calc2 grammar defines the lexical token CONSTANT to contain a floating-point number, and the grammar adds both an exponentiation operator, represented by "**", as well as an EBNF production to parse it.

The new definition of Constant is as follows:

1 2 3 Page 2
Page 2 of 3