Featured Whitepapers
Newsletter sign-up
View all newsletters

Sign up for our technology specific newsletters.

Enterprise Java
Email Address:

Build your own languages with JavaCC

JavaCC makes it a snap to write your own compiler or interpreter for languages of your own design in Java

  • Digg
  • Reddit
  • SlashDot
  • Stumble
  • del.icio.us
  • Technorati
  • dzone
Do you ever wonder how the Java compiler works? Do you need to write parsers for markup documents that do not subscribe to standard formats such as HTML or XML? Or do you want to implement your own little programming language just for the heck of it? JavaCC allows you to do all of that in Java. So whether you're just interested in learning more about how compilers and interpreters work, or you have concrete ambitions of creating the successor to the Java programming language, please join me on this month's quest to explore JavaCC, highlighted by the construction of a handy little command-line calculator.

Compiler construction fundamentals

Programming languages are often divided, somewhat artificially, into compiled and interpreted languages, although the boundaries have become blurred. As such, don't worry about it. The concepts discussed here apply equally well to compiled as well as interpreted languages. We will use the word compiler below, but for the scope of this article, that shall include the meaning of interpreter.

Compilers have to perform three major tasks when presented with a program text (source code):

  1. Lexical analysis
  2. Syntactic analysis
  3. Code generation or execution


The bulk of the compiler's work centers around steps 1 and 2, which involve understanding the program source code and ensuring its syntactical correctness. We call that process parsing, which is the parser's responsibility.

Lexical analysis (lexing)

Lexical analysis takes a cursory look at the program source code and divides it into proper tokens. A token is a significant piece of a program's source code. Token examples include keywords, punctuation, literals such as numbers, and strings. Nontokens include white space, which is often ignored but used to separate tokens, and comments.

Syntactic analysis (parsing)

During syntactic analysis, a parser extracts meaning from the program source code by ensuring the program's syntactical correctness and by building an internal representation of the program.

Computer language theory speaks of programs, grammar, and languages. In that sense, a program is a sequence of tokens. A literal is a basic computer language element that cannot be further reduced. A grammar defines rules for building syntactically correct programs. Only programs that play by the rules defined in the grammar are correct. The language is simply the set of all programs that satisfy all your grammar rules.

During syntactic analysis, a compiler examines the program source code with respect to the rules defined in the language's grammar. If any grammar rule is violated, the compiler displays an error message. Along the way, while examining the program, the compiler creates an easily processed internal representation of the computer program.

A computer language's grammar rules can be specified unambiguously and in their entirety with the EBNF (Extended Backus-Naur-Form) notation (for more on EBNF, see Resources). EBNF defines grammars in terms of production rules. A production rule states that a grammar element -- either literals or composed elements -- can be composed of other grammar elements. Literals, which are irreducible, are keywords or fragments of static program text, such as punctuation symbols. Composed elements are derived by applying production rules. Production rules have the following general format:

  • Digg
  • Reddit
  • SlashDot
  • Stumble
  • del.icio.us
  • Technorati
  • dzone
Comment
Login
Forgot your account info?
Add comment
Anonymous comments subject to approval. Register here for member benefits.
Have a JavaWorld account? Log in here. Register now for a free account.
Resources