Lexical analysis, Part 2: Build an application

How to use the StreamTokenizer object to implement an interactive calculator

Last month I looked at the classes that Java provides to do basic lexical analysis. This month I'll walk through a simple application that uses StreamTokenizer to implement an interactive calculator.

To review last month's article briefly, there are two lexical-analyzer classes that are included with the standard Java distribution: StringTokenizer and StreamTokenizer. These analyzers convert their input into discrete tokens that a parser can use to understand a given input. The parser implements a grammar, which is defined as one or more goal states reached by seeing various sequences of tokens. When a parser's goal state is reached, it executes some action. When the parser detects that there are no possible goal states given the current sequence of tokens, it defines this as an error state. When a parser reaches an error state, it executes a recovery action, which gets the parser back to a point at which it can begin parsing again. Typically, this is implemented by consuming tokens until the parser is back to a valid starting point.

Last month I showed you some methods that used a StringTokenizer to parse some input parameters. This month I'll show you an application that uses a StreamTokenizer object to parse an input stream and implement an interactive calculator.

Building an application

Our example is an interactive calculator that is similar to the Unix bc(1) command. As you'll see, it pushes the StreamTokenizer class right to the edge of its utility as a lexical analyzer. Thus, it serves as a good demonstration of where the line between "simple" and "complex" analyzers can be drawn. This example is a Java application and therefore runs best from the command line.

As a quick summary of its abilities, the calculator accepts expressions in the form

[variable name] "=" expression 

The variable name is optional and can be any string of characters in the default word range. (You can use the exerciser applet from last month's article to refresh your memory on these characters.) If the variable name is omitted, the value of the expression simply is printed. If the variable name is present, the value of the expression is assigned to the variable. Once variables have been assigned to, they can be used in later expressions. Thus, they fill the role of "memories" on a modern hand-held calculator.

The expression is composed of operands in the form of numeric constants (double-precision, floating-point constants) or variable names, operators, and parentheses for grouping particular computations. The legal operators are addition (+), subtraction (-), multiplication (*), division (/), bitwise AND (&), bitwise OR (|), bitwise XOR (#), exponentiation (^), and unary negation with either minus (-) for the twos complement result or bang (!) for the ones complement result.

In addition to these statements, our calculator application also can take one of four commands: "dump," "clear," "help," and "quit." The dump command prints out all of the variables that are currently defined as well as their values. The clear command erases all of the currently-defined variables. The help command prints out a few lines of help text to get the user started. The quit command causes the application to exit.

The entire example application consists of two parsers -- one for commands and statements, and one for expressions.

Building a command parser

The command parser is implemented in the application class for the example STExample.java. (See the Resources section for a pointer to the code.) The main method for that class is defined below. I'll walk through the pieces for you.

 
 1    public static void main(String args[]) throws IOException {
 2        Hashtable variables = new Hashtable();
 3        StreamTokenizer st = new StreamTokenizer(System.in);
 4        st.eolIsSignificant(true);
 5        st.lowerCaseMode(true);
 6        st.ordinaryChar('/');
 7        st.ordinaryChar('-');

In the code above the first thing I do is allocate a java.util.Hashtable class to hold the variables. After that I allocate a StreamTokenizer and adjust it slightly from its defaults. The rationale for the changes are as follows:

  • eolIsSignificant is set to true so that the tokenizer will return an indication of end of line. I use the end of the line as the point where the expression ends.

  • lowerCaseMode is set to true so that the variable names will always be returned in lowercase. This way, variable names are not case-sensitive.

  • The slash character (/) is set to be an ordinary character so that it will not be used to indicate the start of a comment, and can be used instead as the division operator.

  • The minus character (-) is set to be an ordinary character so that the string "3-3" will segment into three tokens -- "3", "-", and "3" -- rather than just "3" and "-3." (Remember, number parsing is set to "on" by default.)

Once the tokenizer is set up, the command parser runs in an infinite loop (until it recognizes the "quit" command at which point it exits). This is shown below.

 
 8        while (true) {
 9            Expression res;
10            int c = StreamTokenizer.TT_EOL;
11            String varName = null;
12
13            System.out.println("Enter an expression...");
14            try {
15                while (true) {
16                    c = st.nextToken();
17                    if (c == StreamTokenizer.TT_EOF) {
18                        System.exit(1);
19                    } else if (c == StreamTokenizer.TT_EOL) {
20                        continue;
21                    } else if (c == StreamTokenizer.TT_WORD) {
22                        if (st.sval.compareTo("dump") == 0) {
23                            dumpVariables(variables);
24                            continue;
25                        } else if (st.sval.compareTo("clear") == 0) {
26                            variables = new Hashtable();
27                            continue;
28                        } else if (st.sval.compareTo("quit") == 0) {
29                            System.exit(0);
30                        } else if (st.sval.compareTo("exit") == 0) {
31                            System.exit(0);
32                        } else if (st.sval.compareTo("help") == 0) {
33                            help();
34                            continue;
35                        }
36                        varName = st.sval;
37                        c = st.nextToken();
38                    }
39                    break;
40                }
41                if (c != '=') {
42                    throw new SyntaxError("missing initial '=' sign.");
43                }

As you can see in line 16, the first token is called by invoking nextToken on the StreamTokenizer object. This returns a value indicating the kind of token that was scanned. The return value either will be one of the defined constants in the StreamTokenizer class or it will be a character value. The "meta" tokens (those that are not simply character values) are defined as follows:

  • TT_EOF -- This indicates you are at the end of the input stream. Unlike StringTokenizer, there is no hasMoreTokens method.

  • TT_EOL -- This tells you that the object has just passed an end-of-line sequence.

  • TT_NUMBER -- This token type tells your parser code that a number has been seen on the input.

  • TT_WORD -- This token type indicates an entire "word" was scanned.

When the result is not one of the above constants, it is either the character value representing a character in the "ordinary" character range that was scanned or one of the quote characters you have set. (In my case, no quote character is set.) When the result is one of your quote characters, the quoted string can be found in the string instance variable sval of the StreamTokenizer object.

The code in lines 17 through 20 deals with end-of-line and end-of-file indications, whereas in line 21 the if clause is taken if a word token was returned. In this simple example, the word is either a command or a variable name. Lines 22 through 35 deal with the four possible commands. If line 36 is reached, then it must be a variable name; consequently, the program keeps a copy of the variable name and gets the next token, which must be an equal sign.

If in line 41 the token wasn't an equal sign, our simple parser detects an error state and throws an exception to signal it. I created two generic exceptions, SyntaxError and ExecError, to distinguish parse-time errors from run-time errors. The main method continues with line 44 below.

44                res = ParseExpression.expression(st);
45            } catch (SyntaxError se) {
46                res = null;
47                varName = null;
48                System.out.println("\nSyntax Error detected! - "+se.getMsg());
49                while (c != StreamTokenizer.TT_EOL)
50                    c = st.nextToken();
51                continue;
52            }

In line 44 the expression to the right of the equal sign is parsed with the expression parser defined in the ParseExpression class. Note that lines 14 through 44 are wrapped in a try/catch block that traps syntax errors and deals with them. When an error is detected, the parser's recovery action is to consume all tokens up to and including the next end-of-line token. This is shown in lines 49 and 50 above.

At this point, if an exception was not thrown, the application has successfully parsed a statement. The final check is to see that the next token is the end of line. If it isn't, an error has gone undetected. The most common error will be mismatched parentheses. This check is shown in lines 53 through 60 of the code below.

53            c = st.nextToken();
54            if (c != StreamTokenizer.TT_EOL) {
55                if (c == ')')
56                    System.out.println("\nSyntax Error detected! - To many closing parens.");
57                else
58                    System.out.println("\nBogus token on input - "+c);
59                while (c != StreamTokenizer.TT_EOL)
60                    c = st.nextToken();
61            } else {

When the next token is an end of line, the program executes lines 62 through 69 (shown below). This section of the method evaluates the parsed expression. If the variable name was set in line 36, the result is stored in the symbol table. In either case, if no exception is thrown, the expression and its value are printed to the System.out stream so that you can see what the parser decoded.

62                try {
63                    Double z;
64                    System.out.println("Parsed expression : "+res.unparse());
65                    z = new Double(res.value(variables));
66                    System.out.println("Value is : "+z);
67                    if (varName != null) {
68                        variables.put(varName, z);
69                        System.out.println("Assigned to : "+varName);
70                    }
71                } catch (ExecError ee) {
72                    System.out.println("Execution error, "+ee.getMsg()+"!");
73                }
74            }
75        }
76    }

In the STExample class, the StreamTokenizer is being used by a command-processor parser. This type of parser commonly would be used in a shell program or in any situation in which the user issues commands interactively. The second parser is encapsulated in the ParseExpression class. (See the Resources section for the complete source.) This class parses the calculator's expressions and is invoked in line 44 above. It is here that StreamTokenizer faces its stiffest challenge.

Building an expression parser

The grammar for the calculator's expressions defines an algebraic syntax of the form "[item] operator [item]." This type of grammar comes up again and again and is called an operator grammar. A convenient notation for an operator grammar is:

id ( "OPERATOR" id )*

The code above would be read "An ID terminal followed by zero or more occurrences of an operator-id tuple." The StreamTokenizer class would seem pretty ideal for analyzing such streams, because the design naturally breaks the input stream into word, number, and ordinary character tokens. As I'll show you, this is true up to a point.

The ParseExpression class is a straightforward, recursive-descent parser for expressions, right out of an undergraduate compiler-design class. The Expression method in this class is defined as follows:

 
 1    static Expression expression(StreamTokenizer st) throws SyntaxError {
 2        Expression result;
 3        boolean done = false;
 4
 5        result = sum(st);
 6        while (! done) {
 7            try {
 8                switch (st.nextToken()) {
 9                    case '&' :
10                        result = new Expression(OP_AND, result, sum(st));
11                        break;
12                    case '|' :
13                        result = new Expression(OP_IOR, result, sum(st));
14                        break;
15                    case '#' :
16                        result = new Expression(OP_XOR, result, sum(st));
17                        break;
18                    default:
19                        done = true;
20                        st.pushBack();
21                        break;
22                }
23            } catch (IOException ioe) {
24                throw new SyntaxError("Got an I/O Exception.");
25            }
26        }
27        return result;
28    }
1 2 Page 1
Page 1 of 2