Looking for lex and yacc for Java? You don't know Jack

How to get and use Sun's free automatic Java parser generator -- a unique new tool that's a must for Java compiler developers

1 2 3 Page 3
Page 3 of 3
TOKEN : /* A Floating point numeric literal */
{ }
{
    < CONSTANT: <FLOAT>
        | <FLOAT> ( ["e","E"] ([ "-","+"])? <INTEGER> )?
        >
  |     < #FLOAT: <INTEGER> 
        | <INTEGER> ( "." <INTEGER> )?
        | "." <INTEGER>
        >
  |     < #INTEGER: ( <DIGIT> )+ >
  |     < #DIGIT: ["0" - "9"] >
}

Reading from the bottom of the definition to the top, you will first note that the previous definition of CONSTANT has been rechristened as INTEGER. The definition of INTEGER is used to create a new definition for a token named FLOAT. The FLOAT token is matched when the lexical analyzer takes one of the three possible choices:

  • The first choice is simply an INTEGER token, or in plain English, a whole number. This choice matches numbers of the form 1, 2, 345, and so on.

  • The second choice is an INTEGER, optionally followed by a period and another INTEGER. This choice matches numbers of the form 1.0, 999.112, and so on.

  • The final choice is an INTEGER that is immediately preceded by a decimal point (.). This choice matches numbers of the form .1, .355, and so on.

However, since representing extremely large numbers or extremely small numbers is cumbersome using just the floating-point notation, the new definition of the CONSTANT token defines numbers as being optionally represented in scientific notion. Numbers represented by scientific notation are typically written with a floating-point mantissa and an exponent. CONSTANT is defined to match the sequence: a FLOAT token, the letter "e" or "E", optionally the symbol plus (+) or minus(-), and finally an INTEGER token representing the exponent. This representation for numbers was chosen because it matches the numbers that the method valueOf in the Double class can parse.

The other significant change in the new grammar was the addition of an infix operator '**' to represent exponentiation. This operator is right associative so that 2**2**2 is parsed as (2**(2**2)). To achieve right associativity we can use a right-recursive definition. The EBNF for the exp production is as follows:

void exp() :
{ }
{
    unary() ( LOOKAHEAD( <EXP> ) <EXP> exp() )*
}

This production is inserted between the term production and the unary production. Note that on the right-hand side, the production refers to itself -- thus each match of the expansion "< EXP> exp()" means that right-hand expansions have already been parsed. While not strictly necessary for this grammar, the "Lookahead( <EXP> )" command tells Jack to look ahead in the incoming stream to see if there is an exponentiation operator "next" in line to be parsed. If there is, then this expansion is taken and the following token must match an exp terminal, otherwise a ParseError is thrown. If the next token in the stream that will be read is not an <EXP> token (the lookahead failed), then the parser can safely ignore this expansion since it requires the <EXP> to be a possible match. By carefully using Lookahead, ambiguities in the grammar can be distinguished.

Exercises

If you've read through this article, and you think you understand how all of this works but would like to test your knowledge, here is a solved problem for you. Try to augment the Calc2 grammar so that it will accept functions like sin and cos. The BNF for this can be written as follows:

function ::= <function-name> "(" [ <expression> ("," <expression>)* ] "}"

In English a function is a function name followed by an open parenthesis, followed by zero or more expressions separated by commas, followed by a close parenthesis. You can see my solution in Calc3. For extra credit, parse the function as a lexical token. What does that change and why?

Final thoughts and conclusions

Jack is an extremely useful tool to add to the Java arsenal. With a tool like Jack, experimenting with language design becomes fairly straightforward. As a way to validate the design and implementation of Jack, the designers include a grammar for the Java language itself. This grammar can be used as the basis for building any tools that might need to parse Java files in its operation. Using the supplied Java grammar as a starting point, writing your own version of javadoc, for example, becomes much easier. Experimenting with other syntactic constructs becomes easier as well. Finding the Java grammar in the distribution can be tough though; it isn't in the examples or in the contrib directory. It is down in the class directory in Jack\java\COM\sun\labs\javaparser.

Jack includes another useful package -- JackDoc. This tool (written using Jack, of course) parses a Jack file and produces an HTML page with the grammar. Here is a sample of the Calc3 grammar, post-processed into HTML by JavaDoc. This is very useful for documentation except for the fact that it doesn't include the definition of terminals in the resulting grammar; only non-terminals are included. Still, for turning Jack files into formatted BNF, it can't be beat.

The initial release of Jack, version 0.5, indicates that the authors are about "halfway" to a finished product. Don't let the less-than-1.0 version number fool you; this is a tool that is very useful as it stands now. Experience with Jack will no doubt guide its authors, as will comments from you. Send these to jack-interest@asap.eng.sun.com.

It is important to note that Jack is not very efficient in its generation of code for simple grammars. My simple calculator grammar generated nearly 35 kilobytes of Java source code, much of it general-purpose parsing code. Adding another pair of operators only increased the source code by about 6 percent and the runtime code by slightly more than 4 percent. More importantly, however, my modified grammar worked the first time, and I have a high degree of confidence in its correctness. Handcrafted grammars can be achieved with fewer lines of code, but the value of having code that both works and is correct cannot be underestimated.

Chuck McManis is currently the director of system software at FreeGate Corp. FreeGate is a venture-funded start-up that is exploring opportunities in the Internet marketplace. Before joining FreeGate, McManis was a member of the Java group. He joined the Java group just after the formation of FirstPerson Inc. and was a member of the portable OS group (the group responsible for the OS portion of Java). Later, when FirstPerson was dissolved, he stayed with the group through the development of the alpha and beta versions of the Java platform. He created the first "all Java" homepage on the Internet when he did the programming for the Java version of the Sun home page in May 1995. He also developed a cryptographic library for Java and versions of the Java class loader that could screen classes based on digital signatures. Before joining FirstPerson, Chuck worked in the operating systems area of SunSoft developing networking applications, where he did the initial design of NIS+.

Learn more about this topic

1 2 3 Page 3
Page 3 of 3