How to build an interpreter in Java, Part 2: The structure

The trick to assembling the foundation classes for a simple interpreter

In last month's column I discussed the idea of building an interpreter in Java and introduced my version of BASIC, named Cocoa. In this month's column we will jump right into the source code for the Cocoa interpreter. However, as the emphasis here is on building interpreters in Java, more attention will be paid to the structure than to the actual code. As always, the full source code for the column is available in the Resources section below.

The Program class

All the constituent classes of the interpreter are collected into a single Java package. For this interpreter, that package is named

basic

. (Here is

Cocoa's programming documentation

.) This does contravene Sun's suggested naming convention for packages for no good reason, but then again Sun's naming scheme, as described in the Java Language spec, isn't particularly well motivated either. :-) However, there is good reason to put the classes together in a single package, and that is to provide visibility of interpreter private methods to other interpreter classes without exposing them to the interpreter's clients.

As I discussed last month, the primary class for the interpreter is a public class with a factory method to create executable programs. I chose to implement this load method in a class named basic.Program. (For the rest of the article we'll assume the class names are all preceded by the "basic" package name without expressly calling it out.) Program exports a static method whose signature is:

public static Program load(InputStream source, PrintStream out)
        throws IOException, BASICSyntaxError { ... }

This method is responsible for getting the text in an InputStream parsed and collected into something that can be interpreted. It returns an instance of a Program object, which is the parsed program. The other important public method in Program is run, whose signature is as follows:

public void run(InputStream in, OutputStream out) throws BASICRuntimeError {
        PrintStream pout;

The run method is not static; it actually runs the instance of the program using the input and output streams that are passed for doing character I/O.

As you can see, both of these methods throw exceptions. The first throws an IOException if an I/O error occurs while reading the source code from the input stream. The load method also can throw a BASICSyntaxError exception if the source code it is reading fails to parse. The second method, run, throws the exception BASICRuntimeError when an error occurs during the execution of the program. The kinds of situations that would throw this error include dividing by zero or reading from an uninitialized variable.

These two methods, load and run, define the two halves of any interpreter, loading and executing.

Getting the program loaded

The

load

method in the

Program

class begins as follows:

public static Program load(InputStream source, PrintStream out) throws IOException, BASICSyntaxError {
        DataInputStream dis = null;
        char data[] = new char[256];
        LexicalTokenizer lt = new LexicalTokenizer(data);
        String lineData;
        Statement s;
        Token t;
        Program result = new Program();
    if (! (source instanceof DataInputStream))
        dis = new DataInputStream(new BufferedInputStream(source));
    else
        dis = source;

In the code above, the variables for the parsing section of the interpreter are initialized. These include a lexical analyzer (in the form of the LexicalTokenizer class) and a DataInputStream. There is a simple optimization that checks to see whether or not the method was passed an instance of a DataInputStream. The data array is an array of characters the lexical analyzer will use while reading in the source code. Note that it puts a limit of about 256 characters on the input line.

The bulk of the loading method is simply a loop that reads in data (in the form of strings) and then parses it. Here is the source code for this loop:

1      while (true) {
2        lineData = dis.readLine();
3
4        if (lineData == null) return result;
5
6        if (lineData.length() == 0) continue;
7
8        lt.reset(lineData);
9        t = lt.nextToken();
10        if (t.typeNum() != Token.CONSTANT) {
11              throw new BASICSyntaxError("Line failed to start with a line number.");
12        }
13
14        try {
15              s = ParseStatement.statement(lt);
16        } catch (BASICSyntaxError bse) {
17              out.println("Syntax error: "+bse.getMsg());
18              out.println(lt.showError());
19              throw bse;
20        }
21        s.addText(lineData);
22        s.addLine((int) t.numValue());
23        result.add((int) t.numValue(), s);
24     }
25  }

This loop will run until the return statement in line 4 returns a result, or the catch statement is executed in line 16 and the exception is thrown. The most important statements in this loop are lines #15, #8, and #23. In line 15 a valid BASIC statement is parsed. In line 8 the lexical analyzer is preloaded with the string containing a BASIC statement. In line 23 the parsed statement is stored in result (which is an instance of program) by adding the parsed statement to a binary search tree, indexed by the parsed line number.

All interpreters will need a collection of, or container for, statements. BASIC provides a convenient hook in the form of line numbers. My version uses a modified red-black tree, you may wish to refer to my column on containers if you are building your own interpreter.

In lines 21 and 22 I store a copy of the text that was read into a holding variable of the statement. Doing this allows me to implement a very simple program listing function where I simply iterate over the statement objects and print out the originally read line for each one.

Lexing BASIC statements

Decomposing the string into components that can be interpreted by the parser is the job of the lexical analyzer. For a review on lexical analysis see my

previous Java In Depth columns

in

JavaWorld

.

The lexical analyzer I designed for this interpreter falls right in the middle of the complexity range. It is too complicated to be handled by either StringTokenizer or StreamTokenizer, but it is simple enough that I found using a parser generator such as JavaCC was overkill. (See my article "Looking for lex and yacc for Java? You don't know Jack" in the December issue of JavaWorld.) Added to the complexity issue, or lack thereof, was the fact that customizing the lexical analyzers that are automatically generated is still evolving as a feature in the automatic generation space. So it was harder to write the analyzer code but easier to plug it into the interpreter.

I won't go into great detail on the lexical analyzer, but I will cover two aspects of it that make the interpreter easier to write.

The first "trick" one can play is to design a sophisticated Token class. You will recall from the discussion of lexical analyzers that the component parts produced are tokens. Some of the tokens you will parse will end up as variables in your interpreted code. I created a subclass of my Token class, named Variable, that was derived from Token by also including a setValue method. This allows me to store the variable directly after I receive it in the parser, without having to create a new object instance to hold it.

The second technique I used was to incorporate a mark and resetToMark feature in the analyzer so that I could parse several possible choices, first by marking my position and trying a parse, then resetting and trying a different parse path.

The most commonly used method in the lexical analyzer is nextToken. This method encapsulates all the rules regarding how a stream of ASCII characters should be divided into individual tokens. A version of this method is shown below to demonstrate the flow, although I've edited out redundant parts to save space. See the Resources section for links to the complete source.

Token nextToken() {
        Token r;
    ....
        if (currentPos >= buffer.length)
            return EOLToken;

The conditional statement above guarantees that every time nextToken is called, it returns something, even if it is only an end-of-line indication.

The statement below saves our position in the input stream so the lexical analyzer can return to the current position should the analysis of the incoming character stream either fail to produce a valid token, or the token that is returned needs to be pushed back to be re-read by another part of the interpreter. This technique of pushing back the last token to be re-read is used extensively in the expression evaluator. (See the source code in ParseExpression.java.)

previousPos = currentPos;

The next step is to create a loop to consume all of the white space (tabs, spaces, and so on) under the current parse point. The definition of white space is encapsulated in the private isSpace method:

while (isSpace(buffer[currentPos]))
            currentPos++;

The bulk of the code is a giant switch statement, based on the character value, first checking for the operator characters as shown here,

switch (buffer[currentPos]) {
            case '+' :
                currentPos++;
                return new Token(Token.OPERATOR, "+", Expression.OP_ADD);

and then the multiple character operators as shown below. In the following code the analyzer is trying to match the tokens '<', '<=', or '<>'.

case '<' :
                if (buffer[currentPos+1] == '=') {
                    currentPos += 2;
                    return new Token(Token.OPERATOR, "<=", Expression.OP_LE);
                } else if (buffer[currentPos+1] == '>') {
                    currentPos += 2;
                    return new Token(Token.OPERATOR, "<>", Expression.OP_NE);
                }
                currentPos++;
                return new Token(Token.OPERATOR, "<", Expression.OP_LT);

After the operators, various symbols are identified as shown below:

case '(' :
            case '\'':
        ...
            case ',' :
                return new Token(Token.SYMBOL, (double) buffer[currentPos++]);

Once all possible symbols are checked for, the analyzer checks to see if the token could be a numeric constant. A number always starts with either a digit or a decimal point so the analyzer checks for these characters before calling a helper method to actually convert the number into a Java number:

case '.' :
                if (r != null)
                    return r;
            case '0':
            case '1':
        ...
            case '9':
                r = parseNumericConstant();
                if (r != null)
                    return r;
                return new Token(Token.SYMBOL, (double) buffer[currentPos++]);

Once numeric characters and symbols are out of the way, the class attempts some "universal" end of line processing, hiding from the user the differences in line endings between Unix, DOS, and MacOS.

case '\r' :
            case '\n' :
                while (currentPos < buffer.length)
                    currentPos++;
                return EOLToken;

Next, if the current character isn't one of the above symbols, or a numeric constant, or an end of line, the penultimate step is to check to see if perhaps it is a double quote character, which indicates the beginning of a quoted string constant. If the character is a double quote, the code below extracts the string between the quotes and returns a string constant token:

case '"' :
                StringBuffer sb = new StringBuffer();
        ... parse quoted string ...
                }

Finally, if the code above has not been able to identify the token, we can safely assume that the input character the lexical analyzer is looking at is a letter. This assumption is predicated on the fact that any non-letter character would have already been consumed by the earlier code.

The remaining code consumes the letters and numbers following the initial letter up to the next non-alphabetic character and processes the resulting string potentially as one of the following token types:

  • Function name -- particularly if the string ended in "(", the list of function names is checked for a match.

  • Keyword -- one of BASIC's reserved words such as print, if, and so on.

  • Command -- primarily for interactive mode such as run, list, new, and so on.

  • Array Variable -- if the token is followed by "(" and it isn't a function, then it is assumed to be an array.

  • Scalar Variable -- the last "catch-all" choice.

The final result is a relatively large amount of code converting the characters into a token object that is easily manipulated by the next stage, the statement parser.

Collecting tokens into BASIC statements

The next major class in the interpreter is a class used to parse statements, named

ParseStatement

.

The ParseStatement class is a subclass of the Statement class. The implementation of the Statement class was affected by a design issue: I wanted some basic methods that all statements shared, such as trace, execute, and unparse, and I didn't want a single monolithic class that defined the entire set of all statements. By designing it in this way, I was able to easily add additional statement types with a minimum impact on the classes that were already in place.

1 2 Page 1
Page 1 of 2