Page 2 of 4
GRAMMAR_ELEMENT := list of grammar elements
| alternate list of grammar elements
As an example, let's look at the grammar rules for a small language that describes basic arithmetic expressions:
expr := number
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| '(' expr ')'
| - expr
number := digit+ ('.' digit+)?
digit := '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
Three production rules define the grammar elements:
exprnumberdigitThe language defined by that grammar allows us to specify arithmetic expressions. An expr is either a number or one of the four infix operators applied to two exprs, an expr in parenthesis, or a negative expr. A number is a floating-point number with optional decimal fraction. We define a digit to be one of the familiar decimal digits.
Once the parser successfully parses the program without error, it exists in an internal representation that is easy to process by the compiler. It is now relatively easy to generate machine code (or Java bytecode for that matter) from the internal representation or to execute the internal representation directly. If we do the former, we are compiling; in the latter case, we talk about interpreting.
JavaCC, available for free, is a parser generator. It provides a Java language extension for specifying a programming language's
grammar. JavaCC was developed initially by Sun Microsystems, but it's now maintained by MetaMata. Like any decent programming tool, JavaCC was actually used to specify the grammar of the JavaCC input format.
Moreover, JavaCC allows us to define grammars in a fashion similar to EBNF, making it easy to translate EBNF grammars into the JavaCC format. Further, JavaCC is the most popular parser generator for Java, with a host of predefined JavaCC grammars available to use as a starting point.
We now revisit our little arithmetic language to build a simple command-line calculator in Java using JavaCC. First, we have to translate the EBNF grammar into JavaCC format and save it in the file Arithmetic.jj:
options
{
LOOKAHEAD=2;
}
PARSER_BEGIN(Arithmetic)
public class Arithmetic
{
}
PARSER_END(Arithmetic)
SKIP :
{
" "
| "\r"
| "\t"
}
TOKEN:
{
< NUMBER: (<DIGIT>)+ ( "." (<DIGIT>)+ )? >
| < DIGIT: ["0"-"9"] >
}
double expr():
{
}
{
term() ( "+" expr() | "-" expr() )*
}
double term():
{
}
{
unary() ( "*" term() | "/" term() )*
}
double unary():
{
}
{
"-" element() | element()
}
double element():
{
}
{
<NUMBER> | "(" expr() ")"
}
The code above should give you an idea on how to specify a grammar for JavaCC. The options section at the top specifies a set of options for that grammar. We specify a lookahead of 2. Additional options control JavaCC's debugging features and more. Those options can alternatively be specified on the JavaCC command line.
The PARSER_BEGIN clause specifies that the parser class definition follows. JavaCC generates a single Java class for each parser. We call the parser class Arithmetic. For now, we require only an empty class definition; JavaCC will add parsing-related declarations to it later. We end the class definition with the PARSER_END clause.
The SKIP section identifies the characters we want to skip. In our case, those are the white-space characters. Next, we define the
tokens of our language in the TOKEN section. We define numbers and digits as tokens. Note that JavaCC differentiates between definitions for tokens and definitions for other production rules, which differs from EBNF. The SKIP and TOKEN sections specify this grammar's lexical analysis.
Next, we define the production rule for expr, the top-level grammar element. Notice how that definition markedly differs from the definition of expr in EBNF. What's happening? Well, it turns out that the EBNF definition above is ambiguous, as it allows multiple representations
of the same program. For example, let us examine the expression 1+2*3. We can match 1+2 into an expr yielding expr*3, as in Figure 1.

Figure 1. EBNF parse tree of 1+2*3
Or, alternatively, we could first match 2*3 into an expr resulting in 1+expr, as shown in Figure 2.

Figure 2. Alternative EBNF parse tree of 1+2*3
With JavaCC, we have to specify the grammar rules unambiguously. As a result, we break out the definition of expr into three production rules, defining the grammar elements expr, term, unary, and element. Now, the expression 1+2*3 is parsed as shown in Figure 3.

Figure 3. Parse tree of 1+2*3
From the command line we can run JavaCC to check our grammar:
javacc Arithmetic.jj Java Compiler Compiler Version 1.1 (Parser Generator) Copyright (c) 1996-1999 Sun Microsystems, Inc. Copyright (c) 1997-1999 Metamata, Inc. (type "javacc" with no arguments for help) Reading from file Arithmetic.jj . . . Warning: Lookahead adequacy checking not being performed since option LOOKAHEAD is more than 1. Set option FORCE_LA_CHECK to true to force checking. Parser generated with 0 errors and 1 warnings.
The following checks our grammar definition for problems and generates a set of Java source files:
TokenMgrError.java ParseException.java Token.java ASCII_CharStream.java Arithmetic.java ArithmeticConstants.java ArithmeticTokenManager.java
Together these files implement the parser in Java. You can invoke this parser by instantiating an instance of the Arithmetic class:
public class Arithmetic implements ArithmeticConstants
{
public Arithmetic(java.io.InputStream stream) { ... }
public Arithmetic(java.io.Reader stream) { ... }
public Arithmetic(ArithmeticTokenManager tm) { ... }
static final public double expr() throws ParseException { ... }
static final public double term() throws ParseException { ... }
static final public double unary() throws ParseException { ... }
static final public double element() throws ParseException { ... }
static public void ReInit(java.io.InputStream stream) { ... }
static public void ReInit(java.io.Reader stream) { ... }
public void ReInit(ArithmeticTokenManager tm) { ... }
static final public Token getNextToken() { ... }
static final public Token getToken(int index) { ... }
static final public ParseException generateParseException() { ... }
static final public void enable_tracing() { ... }
static final public void disable_tracing() { ... }
}
If you wanted to use this parser, you must create an instance using one of the constructors. The constructors allow you to
pass in either an InputStream, a Reader, or an ArithmeticTokenManager as the source of the program source code. Next, you specify the main grammar element of your language, for example: