Looking for lex and yacc for Java? You don't know Jack

How to get and use Sun's free automatic Java parser generator -- a unique new tool that's a must for Java compiler developers

Sun has released Jack, a new tool written in Java that automatically generates parsers by compiling a high-level grammar specification stored in a text file. This article will serve as an introduction to this new tool. The first part of the article covers a brief introduction to automatic parser generation, and my first experiences with them. Then the article will focus on Jack and how you can use it to generate parsers and applications built with those parsers, based on your high-level grammar.

Automatic compiler parser generation

A parser is one of the most common components of a computer application. It converts text that can be read by humans into data structures known as parse trees, which are understood by the computer. I distinctly remember my introduction to automatic parser generation: In college I had completed a class on compiler construction. With the help of my wife to be, I had written a simple compiler that could turn programs written in a language made up for the class into executable programs. I remember feeling very accomplished at that point.

In my first "real" job after college, I got an assigment to create a new graphics processing language to compile into commands for a graphics coprocessor. I started with a freshly composed grammar and prepared to launch into the multiweek project of putting together a compiler. Then a friend showed me the Unix utilities lex and yacc. Lex built lexical analyzers from regular expressions, and yacc reduced a grammar specification into a table-driven compiler that could produce code when it had successfully parsed productions from that grammar. I used lex and yacc, and in less than a week my compiler was up and running! Later, the Free Software Foundation's GNU project produced "improved" versions of lex and yacc -- named flex and bison -- for use on platforms that did not run a derivative of the Unix operating system.

The world of automatic parser generation advanced again when Terrence Parr, then a student at Purdue University, created the Purdue Compiler Construction Tool Set or PCCTS. Two components of PCCTS -- DFA and ANTLR -- provide the same functions as lex and yacc; however the grammars that ANTLR accepts are LL(k) grammars as opposed to the LALR grammars used by yacc. Furthermore, the code that PCCTS generates is much more readable than the code generated by yacc. By generating code that is easier to read, PCCTS makes it easier for a human reading the code to understand what the various pieces are doing. This understanding can be essential when trying to diagnose errors in the grammar specification. PCCTS quickly developed a following of folks who found its files easier to use than yacc.

The power of automatic parser generation is that it allows users to concentrate on the grammar and not worry about the correctness of the implementation. This can be a tremendous time-saver in both simple and complex projects.

Jack steps up to the plate

I rate tools by the generality of the problem they solve. As the requirement to parse text input comes up again and again, automatic parser generation rates pretty highly in my toolbox. Combined with the rapid development cycle of Java, automatic parser generation provides a tool for compiler design that is hard to beat.

Jack (rhymes with yacc) is a parser generator, in the spirit of PCCTS, that Sun has released for free to the Java programming community. Jack is an exceptionally easy tool to describe: Simply put, you give it a set of combined grammatical and lexing rules in the form of a .jack file and run the tool, and it gives you back a Java class that will parse that grammar. What could be easier?

Getting ahold of Jack is also quite easy. First you download a copy from the Jack home page. This comes to you in the form of a self-unpacking Java class called install. To install Jack you need to invoke this install class, which, on a Windows 95 machine is done using the command: C:>java install.

The command shown above assumes that the java command is in your command path and that the class path has been set up appropriately. If the above command did not work, or if you aren't sure whether or not you have things set up properly, open an MS-DOS window by traversing the Start->Programs->MS-DOS Prompt menu items. If you have the Sun JDK installed, you can type these commands:

C:> path C:\java\bin;%path%
C:> set CLASSPATH=.;c:\java\lib\classes.zip

If Symantec Cafe version 1.2 or later is installed, you can type these commands:

C:> path C:\cafe\java\bin;%path%

The class path should already be set up in a file called sc.ini in the bin directory of Cafe.

Next, type the java install command from above. The install program will ask you what directory you wish to install into, and the Jack subdirectory will be created below that.

Using Jack

Jack is written entirely in Java, so having the Jack classes means that this tool is instantly available on every platform that supports the Java virtual machine. However, it also means that on Windows boxes you have to run Jack from the command line. Let's say you chose the directory name JavaTools when you installed Jack on your system. To use Jack you will need to add Jack's classes to your class path. You can do this in your autoexec.bat file or in your .cshrc file if you are a Unix user. The critical command is something like the line shown below:

C:> set CLASSPATH=.;C:\JavaTools\Jack\java;C:\java\lib\classes.zip

Note that Symantec Cafe users can edit the sc.ini file and include the Jack classes there, or they can set CLASSPATH explicitly as shown above.

Setting the environment variable as shown above puts the Jack classes in the CLASSPATH between "." (the current directory) and the base system classes for Java. The main class for Jack is COM.sun.labs.jack.Main. Capitalization is important! There are exactly four capital letters in the command ('C', 'O', 'M', and another 'M'). To run Jack manually, type the command:

C:> java COM.sun.labs.jack.Main parser-input.jack

If you don't have the Jack files in your class path, you can use this command:

C:> java -classpath .;C:\JavaTools\Jack\java;c:\java\lib\classes.zip COM.sun.labs.jack.Main parser-input.jack

As you can see, this gets a bit long. To minimize typing, I put the invocation into a .bat file named Jack.bat. At some point in the future, a simple C wrapper program will become available, perhaps even as you read this. Check out the Jack home page for the availability of this and other programs.

When Jack is run, it creates several files in the current directory that you will later compile into your parser. Most are either prefixed with the name of your parser or are common to all parsers. One of these, however, ASCII_CharStream.java, may collide with other parsers, so it is probably a good idea to start off in a directory containing only the .jack file you are going to use to generate the parser.

Once you run Jack, if the generation has gone smoothly you will have a bunch of .java files in the current directory with various interesting names. These are your parsers. I encourage you to open them up with an editor and look them over. When you are ready you can compile them with the command

C:> javac -d . 
ParserName.java

where ParserName is the name you gave your parser in the input file. More on that in a bit. If all of the files for your parser don't compile, you can use the brute force method of typing:

C:> javac *.java

This will compile everything in the directory. At this point your new parser is ready to use.

Jack parser descriptions

Jack parser description files have the extension .jack and are divided into three basic parts: options and base class; lexical tokens; and non-terminals. Let's look at a simple parser description (this is included in the examples directory that comes with Jack).

options {
    LOOKAHEAD = 1;
}
PARSER_BEGIN(Simple1)
public class Simple1 {
    public static void main(String args[]) throws ParseError {
    Simple1 parser = new Simple1(System.in);
    parser.Input();
    }
}
PARSER_END(Simple1)

The first few lines above describe options for the parser; in this case LOOKAHEAD is set to 1. There are other options, such as diagnostics, Java Unicode handling, and so on, that also can be set here. Following the options comes the base class of the parser. The two tags PARSER_BEGIN and PARSER_END bracket the class that becomes the base Java code for the resulting parser. Note that the class name used in the parser specification must be the same in the beginning, middle, and ending part of this section. In the example above, I've put the class name in bold face to make this clear. As you can see in the code above, this class defines a static main method so that the class can be invoked by the Java interpreter on the command line. The main method simply instantiates a new parser with an input stream (in this case System.in) and then invokes the Input method. The Input method is a non-terminal in our grammar, and it is defined in the form of an EBNF element. EBNF stands for Extended Backus-Naur Form. The Backus-Naur form is a method for specifying context-free grammars. The specification consists of a terminal on the left-hand side, a production symbol, which is typically "::=", and one or more productions on the right-hand side. The notation used is typically something like this:

 Keyword ::= "if" |
"then" | "else"

This would be read as, "The Keyword terminal is one of the string literals 'if', 'then', or 'else.'" In Jack, this form is extended to allow the left-hand part to be represented by a method, and the alternate expansions may be represented by regular expressions or other non-terminals. Continuing with our simple example, the file contains the following definitions:

void Input() :
{}
{
    MatchedBraces() "\n" <EOF>
}
void MatchedBraces() :
{}
{
    "{" [ MatchedBraces() ] "}"
}

This simple parser parses the grammar shown below:

Input::=MatchedBraces "\n" <EOF>
MatchedBraces::="{" [ MatchedBraces ] "}"

I've used italics to show the non-terminals on the right side of the productions and boldface to show literals. As you can see, the grammar simply parses matched sets of brace "{" and "}" characters. There are two productions in the Jack file to describe this grammar. The first terminal, Input, is defined by this definition to be three items in sequence: a MatchedBraces terminal, a newline character, and an end-of-file token. The <EOF> token is defined by Jack so that you don't have to specify it for your platform.

When this grammar is generated, the left-hand sides of the productions are turned into methods inside the Simple1 class; when compiled, the Simple1 class reads characters from System.in and verifies that they contain a matching set of braces. This is accomplished by invoking the generated method Input, which is transformed by the generation process into a method that parses an Input non-terminal. If the parse fails, the method throws the exception ParseError, which the main routine can catch and then complain about if it so chooses.

Of course there's more. The block delineated by "{" and "}" after the terminal name -- which is empty in this example -- can contain arbitrary Java code that is inserted at the front of the generated method. Then, after each expansion, there is another optional block that can contain arbitrary Java code to be executed when the parser successfully matches that expansion.

A more complicated example

So how about an example that's a bit more complicated? Consider the following grammar, again broken down into pieces. This grammar is designed to interpret mathematical equations using the four basic operators -- addition, multiplication, subtraction, and division. The source can be found here:

options {
    LOOKAHEAD=1;
}
PARSER_BEGIN(Calc1)
public class Calc1 {
    public static void main(String args[]) throws ParseError {
    Calc1 parser = new Calc1(System.in);
    while (true) {
        System.out.print("Enter Expression: ");
        System.out.flush();
        try {
        switch (parser.one_line()) {
            case -1:
            System.exit(0);
            default:
            break;
        }
        } catch (ParseError x) {
        System.out.println("Exiting.");
        throw x;
        }
    }
    }
}
PARSER_END(Calc1)

The first part is nearly the same as Simple1, except that the main routine now calls the terminal one_line repeatedly until it fails to parse. Next comes the following code:

IGNORE_IN_BNF :
{}
{
    " " | "\r" | "\t"
}
TOKEN :
{ }
{
    < EOL: "\n" >
}
TOKEN : /* OPERATORS */
{ }
{
    < PLUS: "+" >
|   < MINUS: "-" >
|   < MULTIPLY: "*" >
|   < DIVIDE: "/" >
}
TOKEN :
{ }
{
    < CONSTANT: ( <DIGIT> )+ >
|   < #DIGIT: ["0" - "9"] >
}

These definitions cover the basic terminals in which the grammar is specified. The first, named IGNORE_IN_BNF, is a special token. Any tokens read by the parser that match the characters defined in an IGNORE_IN_BNF token are silently discarded. As you can see in our example, this causes the parser to ignore space characters, tabs, and carriage return characters in the input.

1 2 3 Page
Join the discussion
Be the first to comment on this article. Our Commenting Policies
See more