Jun 1, 1997 1:00 AM PT

How to build an interpreter in Java, Part 2: The structure

The trick to assembling the foundation classes for a simple interpreter

In last month's column I discussed the idea of building an interpreter in Java and introduced my version of BASIC, named Cocoa. In this month's column we will jump right into the source code for the Cocoa interpreter. However, as the emphasis here is on building interpreters in Java, more attention will be paid to the structure than to the actual code. As always, the full source code for the column is available in the Resources section below.

The Program class

All the constituent classes of the interpreter are collected into a single Java package. For this interpreter, that package is named basic. (Here is Cocoa's programming documentation.) This does contravene Sun's suggested naming convention for packages for no good reason, but then again Sun's naming scheme, as described in the Java Language spec, isn't particularly well motivated either. :-) However, there is good reason to put the classes together in a single package, and that is to provide visibility of interpreter-private methods to other interpreter classes without exposing them to the interpreter's clients.
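To make the visibility point concrete, here is a minimal sketch of the arrangement (the class bodies are placeholders, not the real Cocoa declarations):

// Program.java (sketch) -- one public entry point; everything else
// gets package (default) access so clients never see the internals.
package basic;

class LexicalTokenizer {
    // package access: visible to Program and friends, invisible to clients
}

public class Program {
    // public access: the only class clients of the interpreter touch
}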

As I discussed last month, the primary class for the interpreter is a public class with a factory method to create executable programs. I chose to implement this load method in a class named basic.Program. (For the rest of the article we'll assume the class names are all preceded by the "basic" package name without expressly calling it out.) Program exports a static method whose signature is:

public static Program load(InputStream source, PrintStream out)
        throws IOException, BASICSyntaxError { ... }

This method is responsible for getting the text in an InputStream parsed and collected into something that can be interpreted. It returns an instance of a Program object, which is the parsed program. The other important public method in Program is run, whose signature is as follows:

public void run(InputStream in, OutputStream out)
        throws BASICRuntimeError { ... }

The run method is not static; it actually runs the instance of the program using the input and output streams that are passed for doing character I/O.

As you can see, both of these methods throw exceptions. The first throws an IOException if an I/O error occurs while reading the source code from the input stream. The load method also can throw a BASICSyntaxError exception if the source code it is reading fails to parse. The second method, run, throws the exception BASICRuntimeError when an error occurs during the execution of the program. The kinds of situations that would throw this error include dividing by zero or reading from an uninitialized variable.
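Neither exception class needs to be elaborate. Here is a minimal sketch of what they might look like; the getMsg method is assumed from its use later in this article, and the real Cocoa classes may carry more context:

// Sketch only: simple checked exceptions for the two failure modes.
class BASICSyntaxError extends Exception {
    BASICSyntaxError(String s) { super(s); }
    String getMsg() { return getMessage(); }  // used when reporting parse errors
}

class BASICRuntimeError extends Exception {
    BASICRuntimeError(String s) { super(s); }
}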

These two methods, load and run, define the two halves of any interpreter, loading and executing.
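Putting the two halves together, a client of the interpreter might look something like the sketch below. The file name is only an example, and the error handling is deliberately minimal:

import java.io.*;
import basic.*;

public class RunBasic {
    public static void main(String args[]) {
        try {
            // First half: load (parse) the program from a file.
            FileInputStream src = new FileInputStream("hello.bas");
            Program p = Program.load(src, System.out);
            src.close();
            // Second half: run the parsed program.
            p.run(System.in, System.out);
        } catch (Exception e) {
            System.out.println("Error: " + e.getMessage());
        }
    }
}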

Getting the program loaded

The load method in the Program class begins as follows:

public static Program load(InputStream source, PrintStream out)
        throws IOException, BASICSyntaxError {
    DataInputStream dis;
    char data[] = new char[256];
    LexicalTokenizer lt = new LexicalTokenizer(data);
    String lineData;
    Statement s;
    Token t;
    Program result = new Program();

    if (!(source instanceof DataInputStream))
        dis = new DataInputStream(new BufferedInputStream(source));
    else
        dis = (DataInputStream) source;

In the code above, the variables for the parsing section of the interpreter are initialized. These include a lexical analyzer (in the form of the LexicalTokenizer class) and a DataInputStream. A simple optimization checks whether the method was passed an instance of DataInputStream already; if so, it is cast and used directly rather than wrapped in a new stream. The data array is an array of characters the lexical analyzer will use while reading in the source code. Note that this puts a limit of 256 characters on an input line.

The bulk of the loading method is simply a loop that reads in data (in the form of strings) and then parses it. Here is the source code for this loop:

 1      while (true) {
 2          lineData = dis.readLine();
 3
 4          if (lineData == null) return result;
 5
 6          if (lineData.length() == 0) continue;
 7
 8          lt.reset(lineData);
 9          t = lt.nextToken();
10          if (t.typeNum() != Token.CONSTANT) {
11              throw new BASICSyntaxError("Line failed to start with a line number.");
12          }
13
14          try {
15              s = ParseStatement.statement(lt);
16          } catch (BASICSyntaxError bse) {
17              out.println("Syntax error: " + bse.getMsg());
18              out.println(lt.showError());
19              throw bse;
20          }
21          s.addText(lineData);
22          s.addLine((int) t.numValue());
23          result.add((int) t.numValue(), s);
24      }
25  }

This loop will run until either the return statement in line 4 returns a result or the exception caught in line 16 is rethrown. The most important statements in this loop are lines 15, 8, and 23. In line 15 a valid BASIC statement is parsed. In line 8 the lexical analyzer is preloaded with the string containing a BASIC statement. In line 23 the parsed statement is stored in result (which is an instance of Program) by adding the parsed statement to a binary search tree, indexed by the parsed line number.

All interpreters will need a collection of, or container for, statements. BASIC provides a convenient hook in the form of line numbers. My version uses a modified red-black tree; you may wish to refer to my column on containers if you are building your own interpreter.
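If you want something simpler than a red-black tree, even a plain (unbalanced) binary search tree keyed on the line number captures the idea. A minimal sketch, not the Cocoa implementation:

// Sketch: statements stored in a binary search tree keyed by line number.
// Unbalanced, so programs typed in line-number order degrade to a list;
// Cocoa's red-black tree avoids that.
class StatementTree {
    int lineNo;
    Statement stmt;
    StatementTree left, right;

    StatementTree(int lineNo, Statement stmt) {
        this.lineNo = lineNo;
        this.stmt = stmt;
    }

    void add(int n, Statement s) {
        if (n == lineNo) {
            stmt = s;                 // entering a line again replaces it
        } else if (n < lineNo) {
            if (left == null) left = new StatementTree(n, s);
            else left.add(n, s);
        } else {
            if (right == null) right = new StatementTree(n, s);
            else right.add(n, s);
        }
    }

    Statement find(int n) {
        if (n == lineNo) return stmt;
        StatementTree t = (n < lineNo) ? left : right;
        return (t == null) ? null : t.find(n);
    }
}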

In lines 21 and 22 I store a copy of the text that was read into a holding variable of the statement. Doing this allows me to implement a very simple program listing function where I simply iterate over the statement objects and print out the originally read line for each one.

Lexing BASIC statements

Decomposing the string into components that can be interpreted by the parser is the job of the lexical analyzer. For a review of lexical analysis, see my previous Java In Depth columns in JavaWorld.

The lexical analyzer I designed for this interpreter falls right in the middle of the complexity range. It is too complicated to be handled by either StringTokenizer or StreamTokenizer, but it is simple enough that using a parser generator such as JavaCC would have been overkill. (See my article "Looking for lex and yacc for Java? You don't know Jack" in the December issue of JavaWorld.) Beyond the complexity question, customizing an automatically generated lexical analyzer is still an evolving feature of those tools. So it was harder to write the analyzer code by hand, but easier to plug the result into the interpreter.

I won't go into great detail on the lexical analyzer, but I will cover two aspects of it that make the interpreter easier to write.

The first "trick" one can play is to design a sophisticated Token class. You will recall from the discussion of lexical analyzers that the component parts produced are tokens. Some of the tokens you will parse will end up as variables in your interpreted code. I created a subclass of my Token class, named Variable, that was derived from Token by also including a setValue method. This allows me to store the variable directly after I receive it in the parser, without having to create a new object instance to hold it.

The second technique I used was to incorporate a mark and resetToMark feature in the analyzer so that I could parse several possible choices, first by marking my position and trying a parse, then resetting and trying a different parse path.
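The mechanism itself is tiny; the payoff is in how the parser uses it. A sketch, assuming the analyzer tracks its read position in a currentPos field (the two helper parse methods named here are hypothetical):

// In LexicalTokenizer (sketch): remember and restore the read position.
int markPos;
void mark()        { markPos = currentPos; }
void resetToMark() { currentPos = markPos; }

// In a parser (sketch): try one interpretation, back up, try another.
Statement parseAmbiguous(LexicalTokenizer lt) throws BASICSyntaxError {
    lt.mark();
    Statement s = tryParseAsLet(lt);      // hypothetical helper
    if (s == null) {
        lt.resetToMark();                 // rewind the token stream
        s = tryParseAsPrint(lt);          // hypothetical helper
    }
    return s;
}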

The most commonly used method in the lexical analyzer is nextToken. This method encapsulates all the rules regarding how a stream of ASCII characters should be divided into individual tokens. A version of this method is shown below to demonstrate the flow, although I've edited out redundant parts to save space. See the Resources section for links to the complete source.

Token nextToken() {
    Token r;
    ...
    if (currentPos >= buffer.length)
        return EOLToken;

The conditional statement above guarantees that every time nextToken is called, it returns something, even if it is only an end-of-line indication.

The statement below saves our position in the input stream so the lexical analyzer can return to the current position should the analysis of the incoming character stream either fail to produce a valid token, or the token that is returned needs to be pushed back to be re-read by another part of the interpreter. This technique of pushing back the last token to be re-read is used extensively in the expression evaluator. (See the source code in ParseExpression.java.)

previousPos = currentPos;
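Given that single line of bookkeeping, pushing a token back is equally small. A sketch of the idea:

// Sketch: re-reading the last token just means rewinding the character
// position to where that token began; the next nextToken() call will
// scan it again.
void unGetToken() {
    currentPos = previousPos;
}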

The next step is a loop that consumes all of the white space (tabs, spaces, and so on) at the current parse point. The definition of white space is encapsulated in the private isSpace method:

while (isSpace(buffer[currentPos]))
            currentPos++;

The bulk of the code is a giant switch statement, based on the character value, first checking for the operator characters as shown here,

switch (buffer[currentPos]) {
            case '+' :
                currentPos++;
                return new Token(Token.OPERATOR, "+", Expression.OP_ADD);

and then the multiple character operators as shown below. In the following code the analyzer is trying to match the tokens '<', '<=', or '<>'.

case '<' :
                if (buffer[currentPos+1] == '=') {
                    currentPos += 2;
                    return new Token(Token.OPERATOR, "<=", Expression.OP_LE);
                } else if (buffer[currentPos+1] == '>') {
                    currentPos += 2;
                    return new Token(Token.OPERATOR, "<>", Expression.OP_NE);
                }
                currentPos++;
                return new Token(Token.OPERATOR, "<", Expression.OP_LT);

After the operators, various symbols are identified as shown below:

case '(' :
            case '\'':
        ...
            case ',' :
                return new Token(Token.SYMBOL, (double) buffer[currentPos++]);

Once all possible symbols are checked for, the analyzer checks to see if the token could be a numeric constant. A number always starts with either a digit or a decimal point, so the analyzer checks for these characters before calling a helper method to convert the number into a Java number:

case '.' :
            case '0':
            case '1':
        ...
            case '9':
                r = parseNumericConstant();
                if (r != null)
                    return r;
                return new Token(Token.SYMBOL, (double) buffer[currentPos++]);
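The helper itself is not shown here, but a sketch of the basic idea follows. Everything beyond the method name is an assumption; in particular, the real Cocoa routine also has to handle exponent notation:

// Sketch: scan digits and at most one decimal point, then let
// Double.valueOf do the conversion. Returns null when the characters
// at currentPos do not form a number (for example, a lone '.').
private Token parseNumericConstant() {
    int start = currentPos;
    boolean sawDigit = false, sawDot = false;
    while (currentPos < buffer.length) {
        char c = buffer[currentPos];
        if (c >= '0' && c <= '9')     { sawDigit = true; currentPos++; }
        else if (c == '.' && !sawDot) { sawDot = true;   currentPos++; }
        else break;
    }
    if (!sawDigit) {
        currentPos = start;    // not a number after all; rewind
        return null;
    }
    String num = new String(buffer, start, currentPos - start);
    return new Token(Token.CONSTANT, Double.valueOf(num).doubleValue());
}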

Once numeric characters and symbols are out of the way, the class attempts some "universal" end of line processing, hiding from the user the differences in line endings between Unix, DOS, and MacOS.

case '\r' :
            case '\n' :
                while (currentPos < buffer.length)
                    currentPos++;
                return EOLToken;

Next, if the current character isn't one of the above symbols, or a numeric constant, or an end of line, the penultimate step is to check to see if perhaps it is a double quote character, which indicates the beginning of a quoted string constant. If the character is a double quote, the code below extracts the string between the quotes and returns a string constant token:

case '"' :
                StringBuffer sb = new StringBuffer();
        ... parse quoted string ...
                }

Finally, if the code above has not been able to identify the token, we can safely assume that the input character the lexical analyzer is looking at is a letter. This assumption is predicated on the fact that any non-letter character would have already been consumed by the earlier code.

The remaining code consumes the letters and digits that follow the initial letter, up to the next character that is neither, and processes the resulting string as potentially one of the following token types (a sketch of the decision order follows the list):

  • Function name -- particularly if the string ended in "(", the list of function names is checked for a match.

  • Keyword -- one of BASIC's reserved words such as print, if, and so on.

  • Command -- primarily for interactive mode such as run, list, new, and so on.

  • Array Variable -- if the token is followed by "(" and it isn't a function, then it is assumed to be an array.

  • Scalar Variable -- the last "catch-all" choice.
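A sketch of that decision order is below. The table-lookup helpers (isFunction, isKeyword, isCommand) and the extra token-type constants are assumptions standing in for the real lookup tables in the Cocoa source:

// Sketch: classifying an identifier once its characters are scanned.
Token classify(String name, boolean followedByParen) {
    if (followedByParen && isFunction(name))
        return new Token(Token.FUNCTION, name);   // e.g. SIN(
    if (isKeyword(name))
        return new Token(Token.KEYWORD, name);    // e.g. PRINT, GOTO
    if (isCommand(name))
        return new Token(Token.COMMAND, name);    // e.g. RUN, LIST
    // Array or scalar variable: an array is simply a variable that
    // turned out to be followed by '(' and not to be a function name.
    return new Variable(name);
}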

The final result is a relatively large amount of code converting the characters into a token object that is easily manipulated by the next stage, the statement parser.

Collecting tokens into BASIC statements

The next major class in the interpreter is the class used to parse statements, named ParseStatement.

The ParseStatement class is a subclass of the Statement class. The implementation of the Statement class was affected by a design issue: I wanted some basic methods that all statements shared, such as trace, execute, and unparse, and I didn't want a single monolithic class that defined the entire set of all statements. By designing it in this way, I was able to easily add additional statement types with a minimum impact on the classes that were already in place.
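In outline, the base class looks something like the following sketch. The addText, addLine, and nxt members are taken from their uses elsewhere in this article; the exact signatures of execute, trace, and unparse are assumptions:

// Sketch: the behavior every statement shares. Each keyword gets its own
// subclass, so adding a statement type means adding one new class.
abstract class Statement {
    int keyword;       // which statement this is (PRINT, GOTO, ...)
    int line;          // the BASIC line number this statement carries
    String text;       // original source text, kept for program listings
    Statement nxt;     // next statement on the same source line, if any

    Statement(int keyword) { this.keyword = keyword; }

    void addText(String t) { text = t; }
    void addLine(int l)    { line = l; }

    // Each concrete statement supplies its own behavior for these.
    abstract void execute(Program pgm) throws BASICRuntimeError;
    abstract void trace(Program pgm, java.io.PrintStream ps);
    abstract String unparse();
}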

ParseStatement's factory method is named doParse; it takes as an argument an instance of the LexicalTokenizer class and returns an instance of the Statement class. This structure is what makes the simple "compiler" loop in the Program class's load method so trivial to write.

The implementation of the doParse method is fairly mechanical. There are three valid initial token streams for BASIC statements:

  1. The single quote (') and question mark (?) symbols, which are shorthand for the rem and print keywords respectively.

  2. A keyword identifying a BASIC statement.

  3. A variable, which would identify a shorthand version of the let statement.

The requirement for case #1 in the list above stems from old BASIC interpreters that used these symbols in lieu of keywords to save space in precious random-access memory. The doParse method treats lines with these initial symbols exactly as it would if those lines started with the corresponding keyword. The code that handles case #1 is shown below:

1    static Statement doParse(LexicalTokenizer lt) throws BASICSyntaxError {
2        Statement s;
3        Token   t;
4
5        t = lt.nextToken();
6        if (t.typeNum() == Token.SYMBOL) {
7            switch ((int) t.numValue()) {
8                case '?':
9                    s = new PRINTStatement(lt);
10                    t = lt.nextToken();
11                    if ((t == null) || (t.typeNum() == Token.EOL))
12                        return s;
13
14                    if (t.isSymbol(':')) {
15                        s.nxt = statement(lt);
16                        return s;
17                    }
18                    throw new BASICSyntaxError(extraError);
19                case '\'':
20                    s = new REMStatement(lt);
21                    return s;
22                default :
23                    throw new BASICSyntaxError("Illegal statement symbol start?");
24            }
25        }

As you can see from the code listing, once a symbol is parsed as the initial token, the only two exits from this code are either a return statement, or the throwing of an exception. Take the case of the "?" symbol. If it is returned by the lexical analyzer, the doParse method creates a new statement by instantiating an instance of the PRINTStatement class. As we will see a bit later, this invokes a parser specific to print statements on the remaining tokens on the input line. The next interesting bit is the way in which the colon (":") character is handled. This character is used by the BASIC interpreter to separate multiple statements on a line, so once the print statement is parsed, the next token must be either an end of line (EOL) or a symbol token (whose value is ":"). Otherwise, there is unparsed data after the statement proper, and this always results in an error.
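For example, given the line below, the parser returns a print statement (created via the '?' shorthand) whose nxt field points to a second, fully spelled-out print statement:

10 ? "HELLO" : PRINT "WORLD"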

Parsing keyword statements for case #2, in the three-part list above, is very similar. As the code below shows, if the token is a keyword, then a large switch statement is entered. The cases for the switch statement are those keywords that can legally start a BASIC statement. The code parsing keyword statements is shown here:

1        if (t.typeNum() == Token.KEYWORD) {
2            switch ((int) t.numValue()) {
3                case TRON:
4                    s = new TRONStatement(lt);
5                    t = lt.nextToken();
6                    if (t.isSymbol(':')) {
7                        s.nxt = statement(lt);
8                    } else if (t.typeNum() != Token.EOL)
9                        throw new BASICSyntaxError(extraError);
10                    return s;
11
12                case TROFF:
13                    s = new TROFFStatement(lt);
14                    t = lt.nextToken();
15                    if (t.isSymbol(':')) {
16                        s.nxt = statement(lt);
17                    } else if (t.typeNum() != Token.EOL)
18                        throw new BASICSyntaxError(extraError);
19                    return s;
...     ...

This code demonstrates the "cookie cutter" nature of parsers -- a feature that makes them so amenable to automatic generation. In pseudo code each case can be expressed as:

  • For each valid keyword:
    • Parse the tokens specifically for this keyword
    • Identify the next token, allow one of:
      • End of line
      • A follow-on statement

Other than the class used to parse the tokens, the only variable is whether or not it makes sense to allow additional statements after the current statement. Some statements, such as end and rem, do not allow another statement to follow them on the same line. Finally, this version of BASIC allows you to specify a let statement (let is the BASIC keyword for assignment statements) without actually specifying the let keyword. Typically a statement that is entered as,

100 A = 10

is interpreted as the let statement shown below:

100 LET A = 10

The let keyword is implicit in this case. The implicit let statement is parsed by the following code:

21        } else if (t.typeNum() == Token.VARIABLE) {
22            lt.unGetToken();
23            s = new LETStatement(lt);
24            t = lt.nextToken();
25            if ((t == null) || (t.typeNum() == Token.EOL)) {
26                return s;
27            } else if (t.isSymbol(':')) {
28                s.nxt = statement(lt);
29                return s;
30            }
31            throw new BASICSyntaxError(extraError);
32        }
33        throw new BASICSyntaxError("Unrecognized statement");

As the code above shows, if the first token on the line is a variable, it is first "pushed back" onto the input stream in line 22, and then a new LETStatement object is instantiated as if the parser had actually seen the let keyword.

All the keyword statement subclasses have the parser built into the constructor. An example of this is the GOTOStatement class, whose constructor is simply:

GOTOStatement(LexicalTokenizer lt) throws BASICSyntaxError {
        super(GOTO);
        parse(this, lt);
    }

Here you can see that the GOTOStatement class invokes the Statement constructor that takes a keyword identifier as a parameter. Then it invokes a static parse method, which parses the parameters that can follow the GOTO keyword in a BASIC program. This class provides an example of a constructor that throws an exception (in this case a syntax error exception).

The parse method is shown below, and as you can read in the code, it is a private static method responsible for parsing the necessary parameters for a GOTO. In the case of GOTO, the only valid token that can follow the keyword is a line number; however, it could just as easily have been an expression.

private static void parse(GOTOStatement s, LexicalTokenizer lt) throws BASICSyntaxError {
        Token t = lt.nextToken();
        if (t.typeNum() != Token.CONSTANT) {
            throw new BASICSyntaxError("Line number required after GOTO.");
        }
        s.lineTarget = (int) t.numValue();
    }

Once complete, the parse method leaves the lexical analyzer ready to fetch the first token after the GOTO's parameters. This next token must be an end-of-line token rather than a follow-on statement because a GOTO transfers control of the BASIC program unconditionally to the line whose number follows the keyword. Any statement that followed the GOTO on the same line would be unreachable, as it would have no line number associated with it.

Wrapping up

Of course, there are many subclasses of Statement, one for each keyword. The advantage of designing your interpreter using a technique like the one described in this column is that you can work on a statement with the confidence and knowledge that you aren't causing other statements problems with side effects. This encapsulation is one of the real benefits of an object-oriented language, and it really shines when writing applications such as this one. Check out the pointers in the Resources section and, in particular, the complete classes to get a feel for how the entire parsing structure clips together.

In the next and final installment in the BASIC series, I'll take a look at how you can execute these statements now that they are parsed and how you can hook the resulting interpreter into another Java class.

Chuck McManis currently is the director of system software at FreeGate Corp., a venture-funded start-up that is exploring opportunities in the Internet marketplace. Before joining FreeGate, Chuck was a member of the Java Group. He joined the Java Group just after the formation of FirstPerson Inc. and was a member of the portable OS group (the group responsible for the OS portion of Java). Later, when FirstPerson was dissolved, he stayed with the group through the development of the alpha and beta versions of the Java platform. He created the first "all Java" home page on the Internet when he did the programming for the Java version of the Sun home page in May 1995. He also developed a cryptographic library for Java and versions of the Java class loader that could screen classes based on digital signatures. Before joining FirstPerson, Chuck worked in the operating systems area of SunSoft, developing networking applications, where he did the initial design of NIS+.
