How to build an interpreter in Java, Part 2: The structure

The trick to assembling the foundation classes for a simple interpreter

1 2 Page 2
Page 2 of 2

ParseStatement's factory method is named doParse; it takes as an argument an instance of the LexicalTokenizer class and returns an instance of the Statement class. This structure is what makes the simple "compiler" loop in the Program class's load method so trivial to write.

The implementation of the doParse method is fairly mechanical. There are three valid initial token streams for BASIC statements:

  1. The single quote (') and question mark (?) symbols, which are shorthand for the rem and print keywords respectively.

  2. A keyword identifying a BASIC statement.

  3. A variable, which would identify a shorthand version of the let statement.

The requirement for case #1 in the list above stems from old BASIC interpreters that used these symbols in lieu of keywords to save space in precious random-acess memory. The doParse method treats lines with these initial symbols exactly as it would if those lines started with the keyword. The code that handles case #1 is shown below:

1    static Statement doParse(LexicalTokenizer lt) throws BASICSyntaxError {
2        Statement s;
3        Token   t;
4
5        t = lt.nextToken();
6        if (t.typeNum() == Token.SYMBOL) {
7            switch ((int) t.numValue()) {
8                case '?':
9                    s = new PRINTStatement(lt);
10                    t = lt.nextToken();
11                    if ((t == null) || (t.typeNum() == Token.EOL))
12                        return s;
13
14                    if (t.isSymbol(':')) {
15                        s.nxt = statement(lt);
16                        return s;
17                    }
18                    throw new BASICSyntaxError(extraError);
19                case '\'':
20                    s = new REMStatement(lt);
21                    return s;
22                default :
23                    throw new BASICSyntaxError("Illegal statement symbol start?");
24            }
25        }

As you can see from the code listing, once a symbol is parsed as the initial token, the only two exits from this code are either a return statement, or the throwing of an exception. Take the case of the "?" symbol. If it is returned by the lexical analyzer, the doParse method creates a new statement by instantiating an instance of the PRINTStatement class. As we will see a bit later, this invokes a parser specific to print statements on the remaining tokens on the input line. The next interesting bit is the way in which the colon (":") character is handled. This character is used by the BASIC interpreter to separate multiple statements on a line, so once the print statement is parsed, the next token must be either an end of line (EOL) or a symbol token (whose value is ":"). Otherwise, there is unparsed data after the statement proper, and this always results in an error.

Parsing keyword statements for case #2, in the three-part list above, is very similar. As the code below shows, if the token is a keyword, then a large switch statement is entered. The cases for the switch statement are those keywords that can legally start a BASIC statement. The code parsing keyword statements is shown here:

1        if (t.typeNum() == Token.KEYWORD) {
2            switch ((int) t.numValue()) {
3                case TRON:
4                    s = new TRONStatement(lt);
5                    t = lt.nextToken();
6                    if (t.isSymbol(':')) {
7                        s.nxt = statement(lt);
8                    } else if (t.typeNum() != Token.EOL)
9                        throw new BASICSyntaxError(extraError);
10                    return s;
11
12                case TROFF:
13                    s = new TROFFStatement(lt);
14                    t = lt.nextToken();
15                    if (t.isSymbol(':')) {
16                        s.nxt = statement(lt);
17                    } else if (t.typeNum() != Token.EOL)
18                        throw new BASICSyntaxError(extraError);
19                    return s;
...     ...

This code demonstrates the "cookie cutter" nature of parsers -- a feature that makes them so amenable to automatic generation. In pseudo code each case can be expressed as:

  • For each valid keyword:
    • Parse the tokens specifically for this keyword
    • Identify the next token, allow one of:
      • End of line
      • A follow-on statement

Other than the class used to parse the tokens, the only variable is whether or not it makes sense to allow additional statements after the current statement. Some statements, such as end and rem, do not allow another statement to follow them on the same line. Finally, this version of BASIC allows you to specify a let statement (let is the BASIC keyword for assignment statements) without actually specifying the let keyword. Typically a statement that is entered as,

100 A = 10

is interpreted as the let statement shown below:

100 LET A = 10

The let keyword is implicit in this case. The implicit let statement is parsed by the following code:

21        } else if (t.typeNum() == Token.VARIABLE) {
22            lt.unGetToken();
23            s = new LETStatement(lt);
24            t = lt.nextToken();
25            if ((t == null) || (t.typeNum() == Token.EOL)) {
26                return s;
27            } else if (t.isSymbol(':')) {
28                s.nxt = statement(lt);
29                return s;
30            }
31            throw new BASICSyntaxError(extraError);
32        }
33        throw new BASICSyntaxError("Unrecognized statement");

As the code above shows, if the first token on the line is a variable, it is first "pushed back" onto the input stream in line 22, and then a new LETStatement object is instantiated as if the parser had actually seen the let keyword.

All the keyword statement subclasses have the parser built into the constructor. An example of this is the GOTOStatement class, whose constructor is simply:

GOTOStatement(LexicalTokenizer lt) throws BASICSyntaxError {
        super(GOTO);
        parse(this, lt);
    }

Here you can see that the GOTOStatement class invokes the Statement constructor that takes a keyword identifier as a parameter. Then it invokes a static parse statement, which parses the parameters that can follow the GOTO keyword in a BASIC program. This class provides an example of a constructor that throws an exception (in this case a syntax error exception).

The parse method is shown below, and as you can read in the code, it is a private static method responsible for parsing the necessary parameters for a GOTO. In the case of GOTO, the only valide token that could follow the keyword is simply a line number; however it could just as easily have been an expression.

private static void parse(GOTOStatement s, LexicalTokenizer lt) throws BASICSyntaxError {
        Token t = lt.nextToken();
        if (t.typeNum() != Token.CONSTANT) {
            throw new BASICSyntaxError("Line number required after GOTO.");
        }
        s.lineTarget = (int) t.numValue();
    }

Once complete, the parse method leaves the lexical analyzer ready to fetch the first token after the GOTO's parameters. This next token must be an end-of-line token rather than a follow on statement because a GOTO transfers control of the BASIC program unconditionally to the line whose number follows the keyword. Any statement that followed the GOTO on the same line would be inaccessible as it would have no line number associated with it.

Wrapping up

Of course, there are many subclasses of

Statement

, one for each keyword. The advantage of designing your interpreter using a technique like the one described in this column is that you can work on a statement with the confidence and knowledge that you aren't causing other statements problems with side effects. This encapsulation is one of the real benefits of an object-oriented language and it really shines when writing applications such as this one. Check out the pointers in the

Resources section

, and in particular, check out the complete classes to get a feel for how the entire parsing structure fairly clips together.

In the next and final installment in the BASIC series, I'll take a look at how you can execute these statements now that they are parsed and how you can hook the resulting interpreter into another Java class.

Chuck McManis currently is the director of system software at FreeGate Corp., a venture-funded start-up that is exploring opportunities in the Internet marketplace. Before joining FreeGate, Chuck was a member of the Java Group. He joined the Java Group just after the formation of FirstPerson Inc. and was a member of the portable OS group (the group responsible for the OS portion of Java). Later, when FirstPerson was dissolved, he stayed with the group through the development of the alpha and beta versions of the Java platform. He created the first "all Java" home page on the Internet when he did the programming for the Java version of the Sun home page in May 1995. He also developed a cryptographic library for Java and versions of the Java class loader that could screen classes based on digital signatures. Before joining FirstPerson, Chuck worked in the operating systems area of SunSoft, developing networking applications, where he did the initial design of NIS+.

Learn more about this topic

1 2 Page 2
Page 2 of 2