Jul 1, 1997 1:00 AM PT

Build an interpreter in Java -- Implement the execution engine

Here's how to take the interpreter classes and run with them

For those of you just joining us, in my "Java In Depth" column over the past couple months, I've been discussing how one might go about building an interpreter in Java. In the first interpreter column, we covered some of the desirable attributes of an interpreter; in the second column we discussed both parsing and the layout of a class package for implementing the interpreter. In this column we'll look at running the interpreter, and the support classes necessary to accomplish that. Finally, I'll wrap up the series here with a discussion of how an interpreter can be hooked into other Java classes, thus enhancing their abilities.

Reviewing the relevant bits

Let me start by sketching out what we've covered so far and point out those parts of the design that will become more important as we discuss execution mode. For a more detailed description of these classes, refer to my previous columns or to the source code links that are in the

Resources section

below.

There are three foundation classes in the implementation of the interpreter, Program, Statement, and Expression. The following shows how the three are related:

Program

This Program class glues together the parsing and execution components of the parser. This class defines two principal methods, load and run. The load method reads statements from an input stream and parses them into a collection of statements, the run method iterates over the collection and executes each of the statements. The Program class also provides a collection of variables for the program to use as well as a stack for storing data.

Statement

The Statement class contains a single parsed statement. This class is actually subclassed into a specific type of statement (PRINT, GOTO, IF, and so on) but all statements contain the method execute which is called to execute the statement in the context of a Program class instance.

Expression

The Expression class contains the parse tree of an expression. During execution, the value method is used to evaluate the expression and return its value. Like Statement, the Expression class is primarily designed to be subclassed by specific types of expressions.

All of these classes work together to form the basis of an interpreter. The Program class simultaneously encapsulates the parsing operation and execution operation, whereas the Statement and Expression classes encapsulate the actual computational concepts of the language we've implemented. For these three articles on building interpreters, the example language has been BASIC.

Facilities for computation

There are two executable classes in the interpreter,

Statement

and

Expression

. First let's take a look at

Expression

.

The instances of Expression are created by the method expression in the class ParseExpression. The ParseExpression class implements the expression parser in this interpreter. This class is a peer of the ParseStatement class, which uses the statement method to parse BASIC statements. Instances of Expression have an internal type that identifies what operator the instance represents, and two methods, value and stringValue, that return the computed value of the expression. Additionally, when an Expression instance is created, it is nominally given two parameters representing the left and right side of the expression's operation. Shown in source form, the first part of the Expression is as follows:

class Expression {
    Expression arg1, arg2;
    int oper;
    final static int OP_ADD  = 1;   // Addition '+'
    final static int OP_SUB  = 2;   // Subtraction '-'
    final static int OP_MUL  = 3;   // Multiplication '*'
    final static int OP_DIV  = 4;   // Division '/'
    [... etc for all of the operators ...]
    final static int OP_BNOT = 19;  // Boolean negation '.NOT.'
    final static int OP_NEG  = 20;  // Unary minus

As you can see in the code above, there are the instance variables, an operator type named oper, and two halves of the operation in arg1 and arg2, and then some constants to define the various types. Also, there are two constructors that are used by the expression parser; these create a new expression from two expressions. Their source is shown below:

protected Expression(int op, Expression a, Expression b) throws BASICSyntaxError {
        arg1 = a;
        arg2 = b;
        oper = op;
        /*
         * If the operator is a boolean, both arguments must be boolean.
         */
        if (op > OP_GE) {
            if ( (! (arg1 instanceof BooleanExpression)) ||
                 (! (arg2 instanceof BooleanExpression)) )
                throw new BASICSyntaxError(typeError);
        } else {
            if ((arg1 instanceof BooleanExpression) || (arg2 instanceof BooleanExpression))
                throw new BASICSyntaxError(typeError);
        }
    }
    protected Expression(int op, Expression a) throws BASICSyntaxError {
        arg2 = a;
        oper = op;
        if ((oper == OP_BNOT) && (! (arg2 instanceof BooleanExpression)))
            throw new BASICSyntaxError(typeError);
    }

The first constructor builds an arbitrary expression object, and the second one builds a "unary" expression object -- such as unary minus. One thing to note is that if there is only one argument, arg2 is used to store its value.

The method that is used the most frequently in the Expression class is value, which is defined as follows:

        double value(Program pgm) throws BASICRuntimeError {
        switch (oper) {
            case OP_ADD :
                return arg1.value(pgm) + arg2.value(pgm);
            case OP_SUB :
                return arg1.value(pgm) - arg2.value(pgm);
    ... etc for all of the other operator types. ...

You can see that each expression object represents a tuple consisting of an operator and one or two arguments. The fun part of designing the expression execution engine in this way is that when you construct this set of expression tuples based on the Expression object, you can compute the value of the expression by simply invoking the value method. The value method recursively invokes the value method of the two arguments that compose this expression, applies the operation to them, and returns the result. This design was used so that expressions would be easy to understand.

To keep the class structure clean, all computational units -- from constants to trigonometric functions -- are subclasses of Expression. This idea, stolen shamelessly from Lisp, completely encapsulates the notion of "causing" an evaluation to occur, from the actual implementation of "how" that evaluation occurs. To demonstrate how this principle is applied, we need only look at some of the specialized subclasses of Expression.

Constants in my version of BASIC, which I named COCOA are represented by the class ConstantExpression, which subclasses Expression and simply stores the numeric value in a member value. The source code to ConstantExpression is shown conceptually below. I say "conceptually" because I did choose to bundle what would have been StringConstantExpression and NumericConstantExpression into a single class. So the real class includes a constructor for creating a constant with a string argument and for returning its value as a string. The following code shows how the ConstantExpression class handles numeric constants.

class ConstantExpression extends Expression {
    private double v;
    ConstantExpression(double a) {
        super();
        v = a;
    }
    double value(Program pgm) throws BASICRuntimeError {
        return v;
    }
}

The code shown above replaces the more complicated constructors of Expression with a simple store of an instance variable; the value method is simply replaced with a return of the stored value.

It is true that you could code the Expression class to accept in its constructors a constant that would save you a class. However, one advantage of designing Expression the way I have is that the code in Expression remains maximally generic. I find this style of coding helps me eliminate the complexity of special cases and thus when I am "done" with the Expression code, I can move on to other aspects of expressions without revisiting the base class again and again. The advantage becomes clearer when we delve into another subclass of Expression named FunctionExpression.

In the class FunctionExpression, there were two design requirements I felt should be met to keep the interpreter flexible. The first was to implement the standard BASIC functions; the other was to encapsulate the parsing of the functions arguments into the same class that implemented these functions. The second requirement, parsing, was motivated by a desire to make this BASIC extensible by creating additional function libraries that could be passed to the parser as subclasses of FunctionExpression. Further, those passed-in classes could be used by the parser to increase the number of functions available to the user's program.

The FunctionExpression class is only moderately more complicated than a ConstantExpression and is shown in condensed form below:

 1 class FunctionExpression extends Expression {
 2     [ ... parsing stuff removed ...]
 3    double value(Program p) throws BASICRuntimeError {
 4        try {
 5            switch (oper) {
 6                case RND :
 7                    if (r == null)
 8                        r = p.getRandom();
 9                    return (r.nextDouble() * arg2.value(p));
10                case INT :
11                    return Math.floor(arg2.value(p));
12                case SIN :
13                    return Math.sin(arg2.value(p));
14    [ ... and so on for all of the functions implemented ... ]
15                default :
16                    throw new BASICRuntimeError("Unknown or non-numeric function.");
17            }
18        } catch (Exception e) {
19            if (e instanceof BASICRuntimeError)
20                throw (BASICRuntimeError) e;
21            else
22                throw new BASICRuntimeError("Arithmetic Exception.");
23        }
24    }

The above source shows how the value method is implemented. The oper variable is reused to hold the function identity, and the Expression objects referenced by arg1 and arg2 are used as arguments to the functions themselves. Finally, there is a large switch statement that dispatches the request. One interesting aspect is that the value method does catch the potential arithmetic exceptions and converts them into instances of BASICRuntimeError. The parsing code in FunctionExpression is shown below, again condensed to save space. (Remember, all of the source code is available using links in the Resources section.)

 1   static FunctionExpression parse(int ty, LexicalTokenizer lt) throws BASICSyntaxError {
 2       FunctionExpression result;
 3       Expression a;
 4       Expression b;
 5       Expression se;
 6       Token t;
 7
 8       t = lt.nextToken();
 9       if (! t.isSymbol('(')) {
10           if (ty == RND) {
11               lt.unGetToken();
12               return new FunctionExpression(ty, new ConstantExpression(1));
13           } else if (ty == FRE) {
14               lt.unGetToken();
15               return new FunctionExpression(ty, new ConstantExpression(0));
16           }
17           throw new BASICSyntaxError("Missing argument for function.");
18       }
19       switch (ty) {
20           case RND:
21           case INT:
22           case SIN:
23           case COS:
24           case TAN:
25           case ATN:
26           case SQR:
27           case ABS:
28           case CHR:
29           case VAL:
30           case STR:
31           case SPC:
32           case TAB:
33           case LOG:
34               a = ParseExpression.expression(lt);
35               if (a instanceof BooleanExpression) {
36                   throw new BASICSyntaxError(functions[ty].toUpperCase()+" function cannot accept boolean expression.");
37               }
38               if ((ty == VAL) && (! a.isString()))
39                   throw new BASICSyntaxError(functions[ty].toUpperCase()+" requires a string valued argument.");
40               result = new FunctionExpression(ty, a);
41               break;
       [ ... other cases omitted for brevity ... ]
42           default:
43               throw new BASICSyntaxError("Unknown function on input.");
44
45       }
46       t = lt.nextToken();
47       if (! t.isSymbol(')')) {
48           throw new BASICSyntaxError("Missing closing parenthesis for function.");
49       }
50       return result;
51   }

Note that this code takes advantage of the fact that the expression parser in ParseStatement has already figured out that it is looking at an expression and has passed the identity of the expression in as parameter ty. This parser then need only locate the opening parenthesis and the closing parenthesis that contain the argument(s). But look carefully: In lines #9 through #18, the parser allows some functions to have no arguments (in this case RND and FRE). This demonstrates the flexibility afforded by having the function subparser built into this class, rather than forcing all functions to conform to some predefined template. Given a function type in the parameter ty, the switch statement selects a branch that can parse the arguments required for that function, be they strings, numbers, other expressions, and so on.

Other aspects: Strings and arrays

Two other parts of the BASIC language are implemented by the COCOA interpreter: strings and arrays. Let's look at the implementation of strings first.

To implement strings as variables, the Expression class was modified to include the notion of "string" expressions. This modification took the form of two additions: isString and stringValue. The source for these two new methods is shown below.

        String stringValue(Program pgm) throws BASICRuntimeError {
        throw new BASICRuntimeError("No String representation for this.");
    }
    boolean isString() { return false; }

Clearly, it isn't too useful to a BASIC program to get the string value of a base expression (which is always either numeric or boolean expression). You might conclude from the lack of utility that these methods then did not belong in Expression and belonged in a subclass of Expression instead. However, by putting these two methods in the base class, all Expression objects can be tested to see if, in fact, they are strings.

Another design approach is to return the numeric values as strings using a StringBuffer object to generate a value. So, for example, the same code could be rewritten as:

        String stringValue(Program pgm) throws BASICRuntimeError {
    StringBuffer sb = new StringBuffer();
        sb.append(this.value(pgm));
    return sb.toString();
    }

And if the code above is used, you can eliminate the use of isString because every expression can return a string value. Further, you can modify the value method to try to return a number if the expression evaluates to a string by running it through the valueOf method of java.lang.Double. In many languages such as Perl, TCL, and REXX, this sort of amorphous typing is used to great advantage. Both approaches are valid, and you should make your choice based on the design of your interpreter. In BASIC, the interpreter needs to return an error when a string is assigned to a numeric variable, so I chose the first approach (returning an error).

As for arrays, there are different ways in which you can design your language to interpret them. C uses the square brackets around array elements to distinguish the array's index references from function references that have parentheses around their arguments. However, the language designers for BASIC chose to use parentheses for both functions and arrays so when the text NAME(V1, V2) is seen by the parser, it could be either a function call or an array reference.

The lexical analyzer discriminates between tokens that are followed by parentheses by first assuming they are functions and testing for that. Then it goes on to see if they are keywords or variables. It is this decision that prevents your program from defining a variable named "SIN." Any variable whose name matched a function name would be returned by the lexical analyzer as a function token instead. The second trick the lexical analyzer uses is to check to see if the variable name is immediately followed by `('. If it is, the analyzer assumes it is an array reference. By parsing this in the lexical analyzer, we eliminate the string `MYARRAY ( 2 )' from being interpreted as a valid array (note the space between the variable name and the open parenthesis).

The final trick to implementing arrays is in the Variable class. This class is used for an instance of a variable, and as I discussed in last month's column, it is a subclass of Token. However, it also has some machinery to support arrays and that is what I'll show below:

class Variable extends Token {
    // Legal variable sub types
    final static int NUMBER = 0;
    final static int STRING  = 1;
    final static int NUMBER_ARRAY = 2;
    final static int STRING_ARRAY = 4;
    String name;
    int subType;
    /*
     * If the variable is in the symbol table these values are
     * initialized.
     */
    int ndx[];  // array indices.
    int mult[]; // array multipliers
    double nArrayValues[];
    String sArrayValues[];

The above code shows the instance variables associated with a variable, as in the ConstantExpression class. One has to make a choice about the number of classes to be used versus the complexity of a class. One design choice might be to build a Variable class that holds only scalar variables and then add an ArrayVariable subclass to deal with the intricacies of arrays. I chose to combine them, turning scalar variables essentially into arrays of length 1.

If you read the above code, you will see array indices and multipliers. These are here because multidimensional arrays in BASIC are implemented using a single linear Java array. The linear index into the Java array is computed manually using the elements of the multiplier array. The indices used in the BASIC program are checked for validity by comparing them to the maximum legal index in the indices' ndx array.

For example, a BASIC array with three dimensions of 10, 10, and 8, would have the values 10, 10, and 8 stored in ndx. This allows the expression evaluator to test for an "index out of bounds" condition by comparing the number used in the BASIC program to the maximum legal number that is now stored in ndx. The multiplier array in our example would contain the values 1, 10, and 100. These constants represent the numbers one uses to map from a multidimensional array index specification into a linear array index specification. The actual equation is:

Java Index = Index1 + Index2 * Max Size of Index1 + Index3 * (MaxSize of Index1 * MaxSizeIndex 2)

The next Java array in the Variable class is shown below.

    Expression expns[];

The expns array is used to deal with arrays that are written as "A(10*B, i)." In that case, the indices are actually expressions rather than constants, so the reference has to contain pointers to those expressions that are evaluated at run time. Finally there is this fairly ugly-looking piece of code that calculates the index depending on what was passed in the program. This private method is shown below.

        private int computeIndex(int ii[]) throws BASICRuntimeError {
        int offset = 0;
        if ((ndx == null) || (ii.length != ndx.length))
            throw new BASICRuntimeError("Wrong number of indices.");
        for (int i = 0; i < ndx.length; i++) {
            if ((ii[i] < 1) || (ii[i] > ndx[i]))
                throw new BASICRuntimeError("Index out of range.");
            offset = offset +  (ii[i]-1) * mult[i];
        }
        return offset;
    }

Looking at the code above, you'll note that the code first checks to see that the correct number of indices were used when referencing the array, and then that each index was within the legal range for that index. If an error is detected, an exception is thrown to the interpreter. The methods numValue and stringValue return a value from the variable as a number or a string respectively. These two methods are shown below.

        double numValue(int ii[]) throws BASICRuntimeError {
        return nArrayValues[computeIndex(ii)];
    }
    String stringValue(int ii[]) throws BASICRuntimeError {
        if (subType == NUMBER_ARRAY)
            return ""+nArrayValues[computeIndex(ii)];
        return sArrayValues[computeIndex(ii)];
    }

There are additional methods for setting the value of a variable that are not shown here.

By hiding much of the complexity of how each piece is implemented, when it finally comes time to execute the BASIC program, the Java code is quite straightforward.

Running the code

The code to interpret the BASIC statements and execute them is contained in the

run

method of the

Program

class. The code for this method is shown below, and I'll step through it to point out the interesting parts.

 1    public void run(InputStream in, OutputStream out) throws BASICRuntimeError {
 2        PrintStream pout;
 3        Enumeration e = stmts.elements();
 4        stmtStack = new Stack();    // assume no stacked statements ...
 5        dataStore = new Vector();   // ...  and no data to be read.
 6        dataPtr = 0;
 7        Statement s;
 8
 9        vars = new RedBlackTree();
10
11        // if the program isn't yet valid.
12        if (! e.hasMoreElements())
13            return;
14
15        if (out instanceof PrintStream) {
16            pout = (PrintStream) out;
17        } else {
18            pout = new PrintStream(out);
19        }

The above code shows that the run method takes an InputStream and an OutputStream for use as the "console" for the executing program. In line 3, the enumeration object e is set to the set of statements from the collection named stmts. For this collection I used a variation on a binary search tree called a "red-black" tree. (For further information on binary search trees, see my previous column on building generic collections.) Following that, two additional collections are created -- one using a Stack and one using a Vector. The stack is used like the stack in any computer, but the vector is used expressly for the DATA statements in the BASIC program. The final collection is another red-black tree that holds the references for the variables defined by the BASIC program. This tree is the symbol table that is used by the program while it is executing.

Following the initialization, the input and output streams are set up, and then if e is not null, we start by collecting any data that has been declared. That is done as shown in the following code.

        /* First we load all of the data statements */
        while (e.hasMoreElements()) {
            s = (Statement) e.nextElement();
            if (s.keyword == Statement.DATA) {
                s.execute(this, in, pout);
            }
        }

The above loop simply looks at all of the statements, and any DATA statements it finds are then executed. The execution of each DATA statement inserts the values declared by that statement into the dataStore vector. Next we execute the program proper, which is done using this next piece of code:

        e = stmts.elements();
        s = (Statement) e.nextElement();
        do {
            int yyy;
            /* While running we skip Data statements. */
            try {
                yyy = in.available();
            } catch (IOException ez) {
                yyy = 0;
            }
            if (yyy != 0) {
                pout.println("Stopped at :"+s);
                push(s);
                break;
            }
            if (s.keyword != Statement.DATA) {
                if (traceState) {
                    s.trace(this, (traceFile != null) ? traceFile : pout);
                }
                s = s.execute(this, in, pout);
            } else
                s = nextStatement(s);
        } while (s != null);
    }

As you can see in the code above, the first step is to reinitialize e. The next step is to fetch the first statement into the variable s and then to enter the execution loop. There is some code to check for pending input on the input stream to allow the progress of the program to be interrupted by typing at the program, and then the loop checks to see if the statement to execute would be a DATA statement. If it is, the loop skips the statement as it was already executed. The rather convoluted technique of executing all data statements first is required because BASIC allows the DATA statements that satisfy a READ statement to appear anywhere in the source code. Finally, if tracing is enabled, a trace record is printed and the very uninpressive statement s = s.execute(this, in, pout); is invoked. The beauty is that all the effort of encapsulating the base concepts into easy-to-understand classes makes the final code trivial. If it isn't trivial then perhaps you have a clue that there might be another way to split your design.

Wrapping up and further thoughts

The interpreter was designed so that it could run as a thread, thus there can be several COCOA interpreter threads running simultaneously in your program space at the same time. Further, with the use of function expansion we can provide a means whereby those threads can interact with each other. There was a program for the Apple II and later for the PC and Unix called C-robots that was a system of interacting "robotic" entities that were programmed using a simple BASIC derivative language. The game provided me and others with many hours of entertainment but was also an excellent way to introduce the basic principles of computation to younger students (who mistakenly believed they were just playing and not learning). Java based interpreter subsystems are much more powerful than their pre-Java counterparts because they are instantly available on any Java platform. COCOA ran on Unix systems and Macintoshes the same day I got working on a Windows 95 based PC. While Java gets beaten up by incompatibilities in the thread or window toolkit implementations, what is often overlooked is this: A lot of code "just works."

Chuck McManis currently is the director of system software at FreeGate Corp., a venture-funded start-up that is exploring opportunities in the Internet marketplace. Before joining FreeGate, Chuck was a member of the Java Group. He joined the Java Group just after the formation of FirstPerson Inc. and was a member of the portable OS group (the group responsible for the OS portion of Java). Later, when FirstPerson was dissolved, he stayed with the group through the development of the alpha and beta versions of the Java platform. He created the first "all Java" home page on the Internet when he did the programming for the Java version of the Sun home page in May 1995. He also developed a cryptographic library for Java and versions of the Java class loader that could screen classes based on digital signatures. Before joining FirstPerson, Chuck worked in the operating systems area of SunSoft, developing networking applications, where he did the initial design of NIS+. Also, check out his home page. :END_BIO

Learn more about this topic