Looking for lex and yacc for Java? You don't know Jack

How to get and use Sun's free automatic Java parser generator -- a unique new tool that's a must for Java compiler developers

Sun has released Jack, a new tool written in Java that automatically generates parsers by compiling a high-level grammar specification stored in a text file. This article will serve as an introduction to this new tool. The first part of the article covers a brief introduction to automatic parser generation, and my first experiences with them. Then the article will focus on Jack and how you can use it to generate parsers and applications built with those parsers, based on your high-level grammar.

Automatic compiler parser generation

A parser is one of the most common components of a computer application. It converts text that can be read by humans into data structures known as parse trees, which are understood by the computer. I distinctly remember my introduction to automatic parser generation: In college I had completed a class on compiler construction. With the help of my wife to be, I had written a simple compiler that could turn programs written in a language made up for the class into executable programs. I remember feeling very accomplished at that point.

In my first "real" job after college, I got an assigment to create a new graphics processing language to compile into commands for a graphics coprocessor. I started with a freshly composed grammar and prepared to launch into the multiweek project of putting together a compiler. Then a friend showed me the Unix utilities lex and yacc. Lex built lexical analyzers from regular expressions, and yacc reduced a grammar specification into a table-driven compiler that could produce code when it had successfully parsed productions from that grammar. I used lex and yacc, and in less than a week my compiler was up and running! Later, the Free Software Foundation's GNU project produced "improved" versions of lex and yacc -- named flex and bison -- for use on platforms that did not run a derivative of the Unix operating system.

The world of automatic parser generation advanced again when Terrence Parr, then a student at Purdue University, created the Purdue Compiler Construction Tool Set or PCCTS. Two components of PCCTS -- DFA and ANTLR -- provide the same functions as lex and yacc; however the grammars that ANTLR accepts are LL(k) grammars as opposed to the LALR grammars used by yacc. Furthermore, the code that PCCTS generates is much more readable than the code generated by yacc. By generating code that is easier to read, PCCTS makes it easier for a human reading the code to understand what the various pieces are doing. This understanding can be essential when trying to diagnose errors in the grammar specification. PCCTS quickly developed a following of folks who found its files easier to use than yacc.

The power of automatic parser generation is that it allows users to concentrate on the grammar and not worry about the correctness of the implementation. This can be a tremendous time-saver in both simple and complex projects.

Jack steps up to the plate

I rate tools by the generality of the problem they solve. As the requirement to parse text input comes up again and again, automatic parser generation rates pretty highly in my toolbox. Combined with the rapid development cycle of Java, automatic parser generation provides a tool for compiler design that is hard to beat.

Jack (rhymes with yacc) is a parser generator, in the spirit of PCCTS, that Sun has released for free to the Java programming community. Jack is an exceptionally easy tool to describe: Simply put, you give it a set of combined grammatical and lexing rules in the form of a .jack file and run the tool, and it gives you back a Java class that will parse that grammar. What could be easier?

Getting ahold of Jack is also quite easy. First you download a copy from the Jack home page. This comes to you in the form of a self-unpacking Java class called install. To install Jack you need to invoke this install class, which, on a Windows 95 machine is done using the command: C:>java install.

The command shown above assumes that the java command is in your command path and that the class path has been set up appropriately. If the above command did not work, or if you aren't sure whether or not you have things set up properly, open an MS-DOS window by traversing the Start->Programs->MS-DOS Prompt menu items. If you have the Sun JDK installed, you can type these commands:

C:> path C:\java\bin;%path%
C:> set CLASSPATH=.;c:\java\lib\classes.zip

If Symantec Cafe version 1.2 or later is installed, you can type these commands:

C:> path C:\cafe\java\bin;%path%

The class path should already be set up in a file called sc.ini in the bin directory of Cafe.

Next, type the java install command from above. The install program will ask you what directory you wish to install into, and the Jack subdirectory will be created below that.

Using Jack

Jack is written entirely in Java, so having the Jack classes means that this tool is instantly available on every platform that supports the Java virtual machine. However, it also means that on Windows boxes you have to run Jack from the command line. Let's say you chose the directory name JavaTools when you installed Jack on your system. To use Jack you will need to add Jack's classes to your class path. You can do this in your autoexec.bat file or in your .cshrc file if you are a Unix user. The critical command is something like the line shown below:

C:> set CLASSPATH=.;C:\JavaTools\Jack\java;C:\java\lib\classes.zip

Note that Symantec Cafe users can edit the sc.ini file and include the Jack classes there, or they can set CLASSPATH explicitly as shown above.

Setting the environment variable as shown above puts the Jack classes in the CLASSPATH between "." (the current directory) and the base system classes for Java. The main class for Jack is COM.sun.labs.jack.Main. Capitalization is important! There are exactly four capital letters in the command ('C', 'O', 'M', and another 'M'). To run Jack manually, type the command:

C:> java COM.sun.labs.jack.Main parser-input.jack

If you don't have the Jack files in your class path, you can use this command:

C:> java -classpath .;C:\JavaTools\Jack\java;c:\java\lib\classes.zip COM.sun.labs.jack.Main parser-input.jack

As you can see, this gets a bit long. To minimize typing, I put the invocation into a .bat file named Jack.bat. At some point in the future, a simple C wrapper program will become available, perhaps even as you read this. Check out the Jack home page for the availability of this and other programs.

When Jack is run, it creates several files in the current directory that you will later compile into your parser. Most are either prefixed with the name of your parser or are common to all parsers. One of these, however, ASCII_CharStream.java, may collide with other parsers, so it is probably a good idea to start off in a directory containing only the .jack file you are going to use to generate the parser.

Once you run Jack, if the generation has gone smoothly you will have a bunch of .java files in the current directory with various interesting names. These are your parsers. I encourage you to open them up with an editor and look them over. When you are ready you can compile them with the command

C:> javac -d . 
ParserName.java

where ParserName is the name you gave your parser in the input file. More on that in a bit. If all of the files for your parser don't compile, you can use the brute force method of typing:

C:> javac *.java

This will compile everything in the directory. At this point your new parser is ready to use.

Jack parser descriptions

Jack parser description files have the extension .jack and are divided into three basic parts: options and base class; lexical tokens; and non-terminals. Let's look at a simple parser description (this is included in the examples directory that comes with Jack).

options {
    LOOKAHEAD = 1;
}
PARSER_BEGIN(Simple1)
public class Simple1 {
    public static void main(String args[]) throws ParseError {
    Simple1 parser = new Simple1(System.in);
    parser.Input();
    }
}
PARSER_END(Simple1)

The first few lines above describe options for the parser; in this case LOOKAHEAD is set to 1. There are other options, such as diagnostics, Java Unicode handling, and so on, that also can be set here. Following the options comes the base class of the parser. The two tags PARSER_BEGIN and PARSER_END bracket the class that becomes the base Java code for the resulting parser. Note that the class name used in the parser specification must be the same in the beginning, middle, and ending part of this section. In the example above, I've put the class name in bold face to make this clear. As you can see in the code above, this class defines a static main method so that the class can be invoked by the Java interpreter on the command line. The main method simply instantiates a new parser with an input stream (in this case System.in) and then invokes the Input method. The Input method is a non-terminal in our grammar, and it is defined in the form of an EBNF element. EBNF stands for Extended Backus-Naur Form. The Backus-Naur form is a method for specifying context-free grammars. The specification consists of a terminal on the left-hand side, a production symbol, which is typically "::=", and one or more productions on the right-hand side. The notation used is typically something like this:

 Keyword ::= "if" |
"then" | "else"

This would be read as, "The Keyword terminal is one of the string literals 'if', 'then', or 'else.'" In Jack, this form is extended to allow the left-hand part to be represented by a method, and the alternate expansions may be represented by regular expressions or other non-terminals. Continuing with our simple example, the file contains the following definitions:

void Input() :
{}
{
    MatchedBraces() "\n" <EOF>
}
void MatchedBraces() :
{}
{
    "{" [ MatchedBraces() ] "}"
}

This simple parser parses the grammar shown below:

Input::=MatchedBraces "\n" <EOF>
MatchedBraces::="{" [ MatchedBraces ] "}"

I've used italics to show the non-terminals on the right side of the productions and boldface to show literals. As you can see, the grammar simply parses matched sets of brace "{" and "}" characters. There are two productions in the Jack file to describe this grammar. The first terminal, Input, is defined by this definition to be three items in sequence: a MatchedBraces terminal, a newline character, and an end-of-file token. The <EOF> token is defined by Jack so that you don't have to specify it for your platform.

When this grammar is generated, the left-hand sides of the productions are turned into methods inside the Simple1 class; when compiled, the Simple1 class reads characters from System.in and verifies that they contain a matching set of braces. This is accomplished by invoking the generated method Input, which is transformed by the generation process into a method that parses an Input non-terminal. If the parse fails, the method throws the exception ParseError, which the main routine can catch and then complain about if it so chooses.

Of course there's more. The block delineated by "{" and "}" after the terminal name -- which is empty in this example -- can contain arbitrary Java code that is inserted at the front of the generated method. Then, after each expansion, there is another optional block that can contain arbitrary Java code to be executed when the parser successfully matches that expansion.

A more complicated example

So how about an example that's a bit more complicated? Consider the following grammar, again broken down into pieces. This grammar is designed to interpret mathematical equations using the four basic operators -- addition, multiplication, subtraction, and division. The source can be found here:

options {
    LOOKAHEAD=1;
}
PARSER_BEGIN(Calc1)
public class Calc1 {
    public static void main(String args[]) throws ParseError {
    Calc1 parser = new Calc1(System.in);
    while (true) {
        System.out.print("Enter Expression: ");
        System.out.flush();
        try {
        switch (parser.one_line()) {
            case -1:
            System.exit(0);
            default:
            break;
        }
        } catch (ParseError x) {
        System.out.println("Exiting.");
        throw x;
        }
    }
    }
}
PARSER_END(Calc1)

The first part is nearly the same as Simple1, except that the main routine now calls the terminal one_line repeatedly until it fails to parse. Next comes the following code:

IGNORE_IN_BNF :
{}
{
    " " | "\r" | "\t"
}
TOKEN :
{ }
{
    < EOL: "\n" >
}
TOKEN : /* OPERATORS */
{ }
{
    < PLUS: "+" >
|   < MINUS: "-" >
|   < MULTIPLY: "*" >
|   < DIVIDE: "/" >
}
TOKEN :
{ }
{
    < CONSTANT: ( <DIGIT> )+ >
|   < #DIGIT: ["0" - "9"] >
}

These definitions cover the basic terminals in which the grammar is specified. The first, named IGNORE_IN_BNF, is a special token. Any tokens read by the parser that match the characters defined in an IGNORE_IN_BNF token are silently discarded. As you can see in our example, this causes the parser to ignore space characters, tabs, and carriage return characters in the input.

The second definition defines a token, EOL, that identifies the end of line. Because the IGNORE_IN_BNF token is discarding carriage return characters as they are read from the input stream, only the new line characters -- not carriage returns -- will ever be seen by this parser. The definition of the end-of-line token takes advantage of those discards by specifying that an end-of-line token is simply a new line character. On Unix systems, the new line character is already the standard indication of an end of line, but on Windows-based systems, both new lines and carriage returns are present to denote an end of line. When the carriage returns are discarded, both systems appear to have the same end-of-line indication.

The third token definition defines the characters that this grammar will interpret as operators. The characters are obviously the most common ones used for these operations.

The last definition takes into account the numbers used in our example. This definition of CONSTANT matches integers composed of the digits 0 through 9.

There are two constructs in the definition of CONSTANT that are interesting enough to merit a bit of discussion. The definition of DIGIT is a character in the range of character "0" and character "9" inclusive. Using the dash between the character literals for 0 and 9 tells Jack that you mean all characters in the range. The alternative would be to specify "0", "1", "2", ... , "9," which requires a lot of typing. Jack uses the combination of the symbols "[" and "]" with the character literals separated by either "," or "-", to define tokens that are one of the characters enclosed by brackets. This definition would suffice if we never needed to represent a number above 9. However, numbers can have more than one digit in them so the definition of CONSTANT is <DIGIT> surrounded by "(" and ")+". The parentheses around DIGIT group it as a token that may repeat in the definition of a CONSTANT. The closing parenthesis followed by plus (+) means that a CONSTANT is composed of at least one but possibly more DIGIT tokens right next to each other. Thus, this definition will accept "0", "123", and "124512561255610697" as numbers, but it will consider "1 2 3" as three numbers (they are separated by spaces).

After our lexical tokens we define the non-terminals. These are defined from the top down, meaning the most general definition is first, followed by increasingly more detailed productions until we get to the handling of constants.

The first production is defined as:

one_line ::= sum <EOL>
| <EOL>
| <EOF>

which defines one "line" of input to be a sum non-terminal that is followed by an end of line, a single end of line, or a single end of file. This is expressed in the Jack grammar as follows:

int one_line() :
{}
{
    sum() <EOL>
    { return 1; }
  | <EOL>
    { return 0; }
  | <EOF>
    { return -1; }
}

Don't get confused by the Java code interspersed in the definition; basically it is there to help the parser distinguish between the three kinds of results. Regardless of the particular production matched, the return result is based on the production that actually matched the input stream. If you go back and look at the main parser code, you will notice how it handles each return.

The next production is defined as follows:

sum ::= term
          | term "+" term
          | term "-" term

which means that a sum is one of a term, two term productions separated by a plus (+) character, or two term productions separated by a minus(-) character. This is expressed in the Jack grammar as follows:

void sum() :
{ }
{
    term() (( <PLUS> | <MINUS> ) term())*
}

In the sum Jack production, the parenthesis around the terminals <PLUS> | <MINUS> means "match one of these two terminals." The parenthesis around this sub-production and term groups the sub expression and term into an expansion unit. Following the closing parenthesis the character star (*) means that there can be zero or more occurrences of this trailing group. Thus if we represent T as a term, the following groups will match the sum production: T, T+T, T-T, T+T-T, and T-T+T, but not T+T*T. The last sequence, T+T*T, does not match the sum production because that production does not allow the MULTIPLY (*) token to be present in the input stream. We add the mulitplicitive operators in the next definition as follows:

void term() :
{ }
{
    unary() (( <MULTIPLY> | <DIVIDE> ) unary())*
}

You will notice that the definition for term is just like sum, except that the operator terminals were changed to <MULTIPLY> and <DIVIDE>. Building the productions this way has a very useful property: The term production must be completely satisfied before the sum production is processed. Thus, if you have the input to this parser of T+T*U, this is processed as T+(T*U), as one might expect. The precedence relationship of the '*' operation is established by making it a subexpansion in the parsing structure of sum. It is also useful to note that these two productions are left associative -- the expression T*T*T*T is parsed as (((T*T)*T)*T).

There are two final definitions, one for unary and one for element. These are shown below.

void unary() :
{ }
{
    <MINUS> element()
  |     element()
}
void element() :
{}
{
    <CONSTANT>
|   "(" sum() ")"
}

The unary production provides a definition for the unary minus operation. The minus symbol is not parsed as part of the CONSTANT token; rather, if the characters "-55" are read on input, they are parsed as two tokens, "-" "55". This makes the definition of the CONSTANT token cleaner, and this code can parse "-55" just as easily as it parses "- ( 50 + 5 )." The element production returns a number to be operated on by the other productions, unless it sees a left parenthesis. If it does, it looks for another expression inside the parenthesis and then passes on the result of that expression. This expansion then treats ( T +T ) as an element rather than as a sum. So the expression T * (T + T) parses as you would expect it to.

You can, of course, compile this grammar with Jack, and it would generate a functional Calc1 class for you, however, all that class would do would be to throw an exception when an invalid expression was given to it. That's no fun, so let's give our creation some life and learn a bit more about Jack in the process.

Animating the Calc grammar

Before I begin, I will include the next example grammar here.

The first step is to make the base class do something useful. In the case of the calculating grammar, I modified it as follows. (The name Calc1i is a convention that I've adopted for this article to indicate that the grammar is an implementation of Calc1.)

PARSER_BEGIN(Calc1i)
public class Calc1i {
    static java.util.Stack argStack = new java.util.Stack();
    public static void main(String args[]) throws ParseError {
    Calc1i parser = new Calc1i(System.in);
    while (true) {
        System.out.print("Enter Expression: ");
        System.out.flush();
        try {
        switch (parser.one_line()) {
            case -1:
            System.exit(0);
            case 0:
            break;
            case 1:
            int x = ((Integer) argStack.pop()).intValue();
            System.out.println("Total = " + x);
            break;
        }
        } catch (ParseError x) {
        System.out.println("Exiting.");
        throw x;
        }
    }
    }
}
PARSER_END(Calc1i)

As you can see, the class now declares a new static class variable argStack that is a java.util.Stack object. The main method's while loop that repeatedly calls one_line is modified to pop the top value off the argument stack and print it. If the ParseError exception is caught, the message "Exiting" is printed, followed by a stack trace of where the error occurred. In a production grammar you might simply print "Error: Exiting" and be done. If you recall, the one_line production was specified as follows:

int one_line() :
{}
{
    sum() <EOL>
    { return 1; }
  |  <EOL>
    { return 0; }
  |  <EOF>
    { return -1; }
}

The first thing to notice is that the method name is declared as returning an integer. There is a minor limitation on the number of types that this method can return; check the Jack documentation for a complete description. The Java code is enclosed in the characters "{" and "}". Jack's truly remarkable flexibility is apparent in that every terminal in the production can be followed by a bit of Java code. Furthermore, after every terminal, you are assured that all previous terminals in the current expansion also have been parsed. Thus, if you wanted to, you could write sum() { do_something } EOL { do_something_else } and this would not interfere with writing | EOL { do_yet_another_thing }. While both EOL tokens would be followed by some Java code, only the tokens that were successfully parsed have their code executed.

Refer back to the definition of the main method, which runs the parser in an endless while loop. When the method one_line returns a value of 1, it indicates that an expression has been successfully parsed and that the result of the expression is available on the Stack object named argStack. The main then pops off the topmost value on the stack and prints it out. If you are wondering how the result ended up on argStack, consider the implementation of sum shown in the code below:

void sum() :
{ Token x; } /* This is declared for the whole method */
{
    term() ( 
    ( x = <PLUS> | x = <MINUS> ) term()
    {   /* these int variables are locally defined. */
        int a = ((Integer) argStack.pop()).intValue();
        int b = ((Integer) argStack.pop()).intValue();
        if ( x.kind == PLUS )
        argStack.push(new Integer(b + a));
        else
        argStack.push(new Integer(b - a));
    }
    )*
}

As you can see in the code above, the Java block is the first thing after the sum definition. Up to now, that block hadn't been used, but in this case I put a declaration for the variable x in it that I wanted to use in the production later. This variable is a reference to a Token object; the Token class is defined by Jack. Tokens are the objects that Jack stores all of its terminals in, and they are defined by the generated classes.

Next, you will notice that the construction "( <PLUS> | <MINUS> )" has be augmented to read "( x = <PLUS> | x = <MINUS> )." When Jack lexically analyzes a token, the result is returned as a Token object. In the case of the sum production, if the "+" character was seen, Jack will return a token whose instance variable kind will contain PLUS and whose instance variable image will contain the parsed string "+". Conversely if the "-" character was seen, the token returned will be of kind MINUS. The tricky bit is that while this production defines three possible choices, term, term + term, and term - term, only one of them will match. So when the second term production is parsed, x will contain a token with either a plus or a minus, but not both.

Following the second term terminal there is another Java code block that is still contained inside the "(" and ")*" characters. This block pops two values off the argStack and then, depending on whether x was a plus or minus, executes the add or subtract operation and pushes the result back onto the stack. Where did the values come from? We'll see in a minute, but before we do, think for a minute what the new version of the term EBNF looks like. When you are done, try comparing your result with the source. In the interest of brevity, I'll skip it and move on to element, which is shown below.

void element() :
{}
{
    <CONSTANT>
    {   try {
            int x = Integer.parseInt(token.image);
            argStack.push(new Integer(x));
        } catch (NumberFormatException ee) {
        argStack.push(new Integer(0));
        }
    }
|
    "(" sum() ")"
}

As you can see, element does not declare any local variables; there are only two legal expansions. First the parser may have seen a Constant, which, if you recall from the token declarations before, was a string of one or more digits between 0 and 9. If this is the expansion that Jack is parsing, the code block following Constant is executed. This block converts the string in token into an integer using the parseInt method in the Integer class. Note that token is effectively a global variable that holds the current token, and the field image holds the string representing this token. Another item of interest is that while this code catches NumberFormatException, we can be assured that this exception will never occur because Jack's lexical analyzer wouldn't call it a Constant if it wasn't a legal integer.

Once the number is parsed, it is pushed onto argStack to be used by the other productions in this grammar. Alternatively, the parser may have seen a left parenthesis, in which case it will recursively parse another sum production. When the sum terminal is parsed, no Java code is executed as a result of that match. The code is not needed as the result of parsing an element terminal is to leave the value of the element on the top of the argument stack. Since the implementation of sum leaves its result on the argument stack, there is no need to do any additional work.

If you haven't figured it out completely, this walkthrough shows how this grammar parses the expression "10 + 2 * 8 / (14 + 2 )."

Moving deeper

There is no limit to what you can do with Jack, however a couple of items will help you move ahead in your experiments. The Calc1 grammar used integer constants and left associative infix operators. Of course my .99 calculator can do floating point. So how might one extend the grammar to handle floating-point numbers? The answer is shown in the Calc2 grammar, and it is implemented in the Calc2i grammar. To summarize the differences, the Calc2 grammar defines the lexical token CONSTANT to contain a floating-point number, and the grammar adds both an exponentiation operator, represented by "**", as well as an EBNF production to parse it.

The new definition of Constant is as follows:

TOKEN : /* A Floating point numeric literal */
{ }
{
    < CONSTANT: <FLOAT>
        | <FLOAT> ( ["e","E"] ([ "-","+"])? <INTEGER> )?
        >
  |     < #FLOAT: <INTEGER> 
        | <INTEGER> ( "." <INTEGER> )?
        | "." <INTEGER>
        >
  |     < #INTEGER: ( <DIGIT> )+ >
  |     < #DIGIT: ["0" - "9"] >
}

Reading from the bottom of the definition to the top, you will first note that the previous definition of CONSTANT has been rechristened as INTEGER. The definition of INTEGER is used to create a new definition for a token named FLOAT. The FLOAT token is matched when the lexical analyzer takes one of the three possible choices:

  • The first choice is simply an INTEGER token, or in plain English, a whole number. This choice matches numbers of the form 1, 2, 345, and so on.

  • The second choice is an INTEGER, optionally followed by a period and another INTEGER. This choice matches numbers of the form 1.0, 999.112, and so on.

  • The final choice is an INTEGER that is immediately preceded by a decimal point (.). This choice matches numbers of the form .1, .355, and so on.

However, since representing extremely large numbers or extremely small numbers is cumbersome using just the floating-point notation, the new definition of the CONSTANT token defines numbers as being optionally represented in scientific notion. Numbers represented by scientific notation are typically written with a floating-point mantissa and an exponent. CONSTANT is defined to match the sequence: a FLOAT token, the letter "e" or "E", optionally the symbol plus (+) or minus(-), and finally an INTEGER token representing the exponent. This representation for numbers was chosen because it matches the numbers that the method valueOf in the Double class can parse.

The other significant change in the new grammar was the addition of an infix operator '**' to represent exponentiation. This operator is right associative so that 2**2**2 is parsed as (2**(2**2)). To achieve right associativity we can use a right-recursive definition. The EBNF for the exp production is as follows:

void exp() :
{ }
{
    unary() ( LOOKAHEAD( <EXP> ) <EXP> exp() )*
}

This production is inserted between the term production and the unary production. Note that on the right-hand side, the production refers to itself -- thus each match of the expansion "< EXP> exp()" means that right-hand expansions have already been parsed. While not strictly necessary for this grammar, the "Lookahead( <EXP> )" command tells Jack to look ahead in the incoming stream to see if there is an exponentiation operator "next" in line to be parsed. If there is, then this expansion is taken and the following token must match an exp terminal, otherwise a ParseError is thrown. If the next token in the stream that will be read is not an <EXP> token (the lookahead failed), then the parser can safely ignore this expansion since it requires the <EXP> to be a possible match. By carefully using Lookahead, ambiguities in the grammar can be distinguished.

Exercises

If you've read through this article, and you think you understand how all of this works but would like to test your knowledge, here is a solved problem for you. Try to augment the Calc2 grammar so that it will accept functions like sin and cos. The BNF for this can be written as follows:

function ::= <function-name> "(" [ <expression> ("," <expression>)* ] "}"

In English a function is a function name followed by an open parenthesis, followed by zero or more expressions separated by commas, followed by a close parenthesis. You can see my solution in Calc3. For extra credit, parse the function as a lexical token. What does that change and why?

Final thoughts and conclusions

Jack is an extremely useful tool to add to the Java arsenal. With a tool like Jack, experimenting with language design becomes fairly straightforward. As a way to validate the design and implementation of Jack, the designers include a grammar for the Java language itself. This grammar can be used as the basis for building any tools that might need to parse Java files in its operation. Using the supplied Java grammar as a starting point, writing your own version of javadoc, for example, becomes much easier. Experimenting with other syntactic constructs becomes easier as well. Finding the Java grammar in the distribution can be tough though; it isn't in the examples or in the contrib directory. It is down in the class directory in Jack\java\COM\sun\labs\javaparser.

Jack includes another useful package -- JackDoc. This tool (written using Jack, of course) parses a Jack file and produces an HTML page with the grammar. Here is a sample of the Calc3 grammar, post-processed into HTML by JavaDoc. This is very useful for documentation except for the fact that it doesn't include the definition of terminals in the resulting grammar; only non-terminals are included. Still, for turning Jack files into formatted BNF, it can't be beat.

The initial release of Jack, version 0.5, indicates that the authors are about "halfway" to a finished product. Don't let the less-than-1.0 version number fool you; this is a tool that is very useful as it stands now. Experience with Jack will no doubt guide its authors, as will comments from you. Send these to jack-interest@asap.eng.sun.com.

It is important to note that Jack is not very efficient in its generation of code for simple grammars. My simple calculator grammar generated nearly 35 kilobytes of Java source code, much of it general-purpose parsing code. Adding another pair of operators only increased the source code by about 6 percent and the runtime code by slightly more than 4 percent. More importantly, however, my modified grammar worked the first time, and I have a high degree of confidence in its correctness. Handcrafted grammars can be achieved with fewer lines of code, but the value of having code that both works and is correct cannot be underestimated.

Chuck McManis is currently the director of system software at FreeGate Corp. FreeGate is a venture-funded start-up that is exploring opportunities in the Internet marketplace. Before joining FreeGate, McManis was a member of the Java group. He joined the Java group just after the formation of FirstPerson Inc. and was a member of the portable OS group (the group responsible for the OS portion of Java). Later, when FirstPerson was dissolved, he stayed with the group through the development of the alpha and beta versions of the Java platform. He created the first "all Java" homepage on the Internet when he did the programming for the Java version of the Sun home page in May 1995. He also developed a cryptographic library for Java and versions of the Java class loader that could screen classes based on digital signatures. Before joining FirstPerson, Chuck worked in the operating systems area of SunSoft developing networking applications, where he did the initial design of NIS+.

Learn more about this topic

Join the discussion
Be the first to comment on this article. Our Commenting Policies