How to build an interpreter in Java, Part 1: The BASICs

For complex applications requiring a scripting language, Java can be used to implement the interpreter, adding scripting abilities to any Java app

When I told a friend that I had written a BASIC interpreter in Java, he laughed so hard he nearly spilled the soda he was holding all over his clothes. "Why in the world would you build a BASIC interpreter in Java?" was the predictable first question out of his mouth. The answer is both simple and complex. The simple response is that it was fun to write an interpreter in Java, and if I were going to write an interpreter, I might as well write one about which I have fond memories from the early days of personal computing. On the complex side, I've noticed that many people using Java today have gotten past the point of creating tumbling Duke applets and are moving on to serious applications. Often, in building an application, you would like it to be configurable. The mechanism of choice for re-configuration is some sort of dynamic execution engine.

Known as macro languages, or configuration languages, dynamic execution is the feature that allows an application to be "programmed" by the user. The benefit of having a dynamic execution engine is that tools and applications can be customized to perform complex tasks without replacing the tool. The Java platform offers a wide variety of dynamic execution engine options.

HotJava and other hot options

Let's explore briefly some of the dynamic execution engine options available and then look at the implementation of my interpreter in depth. A dynamic execution engine is an embedded interpreter. An interpreter requires three facilities in order to operate:

  1. A means of being loaded with instructions
  2. A module format, for storing instructions to be executed
  3. A model or environment for interacting with the host program

HotJava

The most famous embedded interpreter has to be the HotJava "applet" environment that has completely reshaped the way people look at Web browsers.

The HotJava "applet" model was based on the notion that a Java application could create a generic base class with a known interface, and then dynamically load subclasses of that class and execute them at run time. These applets provided new capabilities and, within the confines of the base class, provided dynamic execution. This dynamic execution capability is a fundamental part of the Java environment and one of the things that makes it so special. We'll look at this particular environment in depth in a later column.

GNU EMACS

Before HotJava arrived, perhaps the most successful application with dynamic execution was GNU EMACS. This editor's LISP-like macro language has become a staple for many programmers. Briefly, the EMACS LISP environment consists of a LISP interpreter and many editing-type functions that can be used to compose the most complex macros. It should not be considered surprising that the EMACS editor originally was written in macros designed for an editor called TECO. Thus, the availability of a rich (if unreadable) macro language in TECO allowed an entirely new editor to be constructed. Today, GNU EMACS is the base editor, and entire games have been written in nothing more than the EMACS LISP code, known as el-code. This configuration ability has made GNU EMACS a mainstay editor, while the VT-100 terminals it was designed to run on have become mere footnotes in a writer's column.

REXX

One of my favorite languages, which never quite made the splash it deserved, was REXX, designed by Mike Cowlishaw of IBM. The company needed a language to control applications on large mainframes running the VM operating system. I discovered REXX on the Amiga where it was tightly coupled with a wide variety of applications through "REXX ports." These ports allowed applications to be driven remotely via the REXX interpreter. This coupling of interpreter and application created a much more powerful system than was possible with its component parts. Fortunately, the language lives on in NETREXX, a version Mike wrote that was compiled into Java code.

As I was looking at NETREXX and a much earlier language (LISP in Java), it struck me that these languages formed important parts of the Java application story. What better way to tell this part of the story than to do something fun here -- like resurrect BASIC-80? More importantly, it would be useful to show one way in which scripting languages can be written in Java and, through their integration with Java, show how they can enhance the capabilities of your Java applications.

BASIC requirements for enhancing your Java apps

BASIC is, quite simply, a basic language. There are two schools of thought on how one might go about writing an interpreter for it. One approach is to write a programming loop in which the interpreter program reads one line of text from the interpreted program, parses it, and then calls a subroutine to execute it. The sequence of reading, parsing, and executing is repeated until one of the interpreted program's statements tells the interpreter to stop.

The second and much more interesting way to tackle the project actually is to parse the language into a parse tree and then execute the parse tree "in place." This is how tokenizing interpreters operate and the way I chose to proceed. Tokenizing interpreters also are faster as they don't need to re-scan the input every time they execute a statement.

As I mentioned above, the three components necessary to achieve dynamic execution are a means of being loaded, a module format, and the execution environment.

The first component, a means of being loaded, will be dealt with by a Java InputStream. As input streams are fundamental in the I/O architecture of Java, the system is designed to read in a program from an InputStream and convert it into executable form. This represents a very flexible way of feeding code into the system. Of course, the protocol for the data going over the input stream will be BASIC source code. It is important to note that any language can be used; don't make the mistake of thinking this technique can't be applied to your application.

After the source code of the interpreted program is entered into the system, the system converts the source code into an internal representation. I chose to use the parse tree as the internal representation format for this project. Once the parse tree is created, it can be manipulated or executed.

The third component is the execution environment. As we'll see, the requirements for this component are fairly simple, but the implementation has a couple of interesting twists.

A very quick BASIC tour

For those of you who may never have heard of BASIC, I'll give you a brief glimpse of the language, so you can understand the parsing and execution challenges ahead. For more information on BASIC, I highly recommend the resources at the end of this column.

BASIC stands for Beginners All-purpose Symbolic Instructional Code, and it was developed at Dartmouth University to teach computation concepts to undergraduate students. Since its development, BASIC has evolved into a variety of dialects. The simplest of these dialects are used as control languages for industrial process controllers; the most complex dialects are structured languages that incorporate some aspects of object-oriented programming. For my project, I chose a dialect known as BASIC-80 that was popular on the CP/M operating system in the late seventies. This dialect is only moderately more complex than the simplest dialects.

Statement syntax

All statement lines are of the form

<Line> <Keyword> <Parameters> [ : <Statement> [ : ... ] ]

where "Line" is a statement line number, "Keyword" is a BASIC statement keyword, and "Parameters" are a set of parameters associated with that keyword.

The line number has two purposes: It serves as a label for statements that control execution flow, such as a goto statement, and it serves as a sorting tag for statements inserted into the program. As a sorting tag, the line number facilitates a line editing environment in which editing and command processing are mixed in a single interactive session. By the way, this was required when all you had was a teletype. :-)

While not very elegant, line numbers do give the interpreter environment the ability to update the program one statement at a time. This ability stems from the fact that a statement is a single parsed entity and can be linked in a data structure with line numbers. Without line numbers, often it is necessary to re-parse the entire program when a line changes.

The keyword identifies the BASIC statement. In the example, our interpreter will support a slightly extended set of BASIC keywords, including goto, gosub, return, print, if, end, data, restore, read, on, rem, for, next, let, input, stop, dim, randomize, tron, and troff. Obviously, we won't go over all of these in this article, but there will be some documentation online in my next month's "Java In Depth" for you to explore.

Each keyword has a set of legal keyword parameters that can follow it. For example, the goto keyword must be followed by a line number, the if statement must be followed by a conditional expression as well as the keyword then -- and so on. The parameters are specific to each keyword. I'll cover a couple of these parameter lists in detail a bit later.

Expressions and operators

Often, a parameter specified in a statement is an expression. The version of BASIC I'm using here supports all of the standard mathematical operations, logical operations, exponentiation, and a simple function library. The most important component of the expression grammar is the ability to call functions. The expressions themselves are fairly standard and similar to the ones parsed by the example in my previous StreamTokenizer column.

Variables and data types

Part of the reason BASIC is such a simple language is because it has only two data types: numbers and strings. Some scripting languages, such as REXX and PERL, don't even make this distinction between data types until they are used. But with BASIC, a simple syntax is used to identify data types.

Variable names in this version of BASIC are strings of letters and numbers that always start with a letter. Variables are not case-sensitive. Thus A, B, FOO, and FOO2 are all valid variable names. Furthermore, in BASIC, the variable FOOBAR is equivalent to FooBar. To identify strings, a dollar sign ($) is appended to the variable name; thus, the variable FOO$ is a variable containing a string.

Finally, this version of the language supports arrays using the dim keyword and a variable syntax of the form NAME(index1, index2, ...) for up to four indices.

Program structure

Programs in BASIC start by default at the lowest numbered line and continue until there are either no more lines to process or the stop or end keywords are executed. A very simple BASIC program is shown below:

100 REM This is probably the canonical BASIC example
110 REM Program. Note that REM statements are ignored.
120 PRINT "This is a test program."
130 PRINT "Summing the values between 1 and 100"
140 LET total = 0
150 FOR I = 1 TO 100
160     LET total = total + i
170 NEXT I
180 PRINT "The total of all digits between 1 and 100 is " total
190 END

The line numbers above indicate the lexical order of the statements. When they are run, lines 120 and 130 print messages to the output, line 140 initializes a variable, and the loop in lines 150 through 170 update the value of that variable. Finally, the results are printed out. As you can see, BASIC is a very simple programming language and therefore an ideal candidate for teaching computation concepts.

Organizing the approach

Typical of scripting languages, BASIC involves a program composed of many statements that run in a particular environment. The design challenge, then, is to construct the objects to implement such a system in a useful way.

When I looked at the problem, a straightforward data structure fairly leaped out at me. That structure is as follows:

The public interface to the scripting language shall consist of

  • A factory method that takes source code as input and returns an object representing the program.
  • An environment that provides the framework in which the program executes, including "I/O" devices for text input and text output.
  • A standard way of modifying that object, perhaps in the form of an interface, that allows the program and the environment to be combined to achieve useful results.

Internally, the structure of the interpreter was a bit more complicated. The question was how to go about factoring the two facets of the scripting language, parsing and execution? Three groups of classes resulted -- one for parsing, one for the structural framework of representing parsed and executable programs, and one that formed the base environment class for execution.

In the parsing group, the following objects are required:

  • Lexical analysis for processing the code as text
  • Expression parsing, to construct parse trees of the expressions
  • Statement parsing, to construct parse trees of the statements themselves
  • Error classes for reporting errors in parsing

The framework group consists of objects that hold the parse trees and the variables. These include:

  • A statement object with many specialized subclasses to represent parsed statements
  • An expression object to represent expressions for evaluation
  • A variable object with many specialized subclasses to represent atomic instances of data
Related:
1 2 Page 1
Page 1 of 2