Lexical analysis, Part 2: Build an application

How to use the StreamTokenizer object to implement an interactive calculator

1 2 Page 2
Page 2 of 2

The above method is responsible for parsing the bitwise logical operators. If you recall the article about Jack, this grammar is nearly identical except for some of the operators. In this grammar the logical operators are presumed to exist between two sum non-terminals. The sum non-terminals are parsed by calling the sum method in line 5 and in lines 10, 13, and 16. To understand why this parser isn't written as one giant switch statement for all of the operators, you will have to go back and read the Jack article.

The expression method works by presuming that the input stream contains the sequence "sum [ logical operator ] sum." It starts by parsing a sum non-terminal in line 5; then it looks at the next token in line 8. If the token is one of the logical operators, the code creates a new Expression tuple consisting of the operation code and the previous result, and then parses another sum non-terminal. If the token was not one of the logical operators, the method pushBack in the StreamTokenizer class is used to put the last token read "back" into the queue to be read again, and the loop exits.

The sum method is identical to the expression method, except that instead of calling sum and expecting logical operators, it calls term and expects the plus or minus operators. This is repeated for the term method, which calls factor and trys to parse the multiplication and division operators. Defining the parser this way establishes the precedence relationship between the operators.

The factor method called by the term method is slightly different and is shown below.

 
 1    static Expression factor(StreamTokenizer st) throws SyntaxError {
 2        Expression result;
 3
 4        result = primary(st);
 5        try {
 6            switch (st.nextToken()) {
 7                case '^':
 8                    result = new Expression(OP_EXP, result, factor(st));
 9                    break;
10                default:
11                    st.pushBack();
12                    break;
13            }
14        } catch (IOException ioe) {
15            throw new SyntaxError("Caught an I/O Exception.");
16        }
17        return result;
18    }

The factor method starts out similarly to the expression method, except that it is recursive to itself -- that is, it parses a primary in line 4, and then if it sees the exponentiation operator (^) in line 7, it creates a new Expression tuple in line 8. In the process of creating that tuple, it recurses to itself. By coding the method this way, the parse tree is constructed so that this operator is right-associative (that is, 2^3^4 is parsed as (2 ^ (3 ^ 4)) rather than ((2 ^ 3) ^ 4)).

The factor method was the method where the limitations of the StreamTokenizer became clear. With the factor method, I originally wanted to use the token "**" as the exponentiation operator. However, the implementation of StreamTokenizer made this impossible!

The problem lies in what is known as "lookahead." A lexical analyzer that supports lookahead can match a token whose first characters are identical to another token by looking ahead in the input stream to see which token is the "best" match. For lexical analyzers, the best match is defined as the longest match, or, stated a different way, the match that consumes the most input characters. So in this case, the tokens "*" for multiply and "**" for exponentiation are ambiguous unless the analyzer can look ahead when it encounters a "*" character. If the next character is also a "*," both are consumed, and the token is the exponentiation operator. If the next character is not a "*," then only one input character is consumed, and the token is the multiplication operator.

The concept of lookahead does not fit into the StreamTokenizer notion of the world. I tried three different ways of finessing this, all of which failed. Attempt #1 temporarily changed the value of "*" to a word character so that the tokenizer would return a TT_WORD token if either "*" or "**" was scanned. However, this also caused the expression 2**a to be parsed as "2" and "**a," which was unacceptable. Attempt #2 involved using the mark and reset methods of InputStream to "reset" the stream into the StreamTokenizer after looking ahead manually. However, this had the deleterious side effect of causing duplicate accumulation of characters. The final solution attempt was to invoke the pushBack method multiple times to back up the stream to a point before the first "*" was encountered. But the implementation of pushBack in StreamTokenizer is such that regardless of the number of times you call it, only the last token is pushed back into the stream.

I concluded that this conceptual issue of lookahead is the dividing line between a "simple" lexical analyzer and a "complex" one. Thus, I simply changed the operators so that they were all one character. In a more complex environment where I might want to have && and || be operators in my grammar, using StreamTokenizer clearly would not be possible.

Wrapping up

The example application shows two ways of using the StreamTokenizer class in conjunction with a parser. In the expression parsing component, the limitations of StreamTokenizer were reached, and some analysis of the implications of those limits was provided. Understanding both StringTokenizer and StreamTokenizer and applying them will save you time and effort in your Java programming.

In the example, the return result of the expression method of the ParseExpression class is a parse tree. In a compiler, this data structure would be used to generate code that evaluated the expression. In my example, I use it simply to evaluate the expression in place. If this were a spreadsheet application, the contents of a cell would be the parse tree that could be reevaluated when the contents of the variables (other cells) changed. The method unparse in the Expression class can be used to re-create the expression from the parse tree. Because the original parentheses were lost when the expression was parsed, the unparse method explicitly adds back more parentheses than are strictly necessary to convey the evaluation order of the expression.

Chuck McManis currently is the director of system software at FreeGate Corp., a venture-funded start-up that is exploring opportunities in the Internet marketplace. Before joining FreeGate, Chuck was a member of the Java Group. He joined the Java Group just after the formation of FirstPerson Inc. and was a member of the portable OS group (the group responsible for the OS portion of Java). Later, when FirstPerson was dissolved, he stayed with the group through the development of the alpha and beta versions of the Java platform. He created the first "all Java" home page on the Internet when he did the programming for the Java version of the Sun home page in May 1995. He also developed a cryptographic library for Java and versions of the Java class loader that could screen classes based on digital signatures. Before joining FirstPerson, Chuck worked in the operating systems area of SunSoft, developing networking applications, where he did the initial design of NIS+.

Learn more about this topic

  • Source for the example consists of several files, these are:
  • STExample.java (application class). http://www.javaworld.com/javaworld/jw-02-1997/indepth/STExample.java
  • Expression.java (defines the basic Expression tuple in the expression parse tree). http://www.javaworld.com/javaworld/jw-02-1997/indepth/Expression.java
  • ParseExpression.java (implements the recursive-descent expression parser). http://www.javaworld.com/javaworld/jw-02-1997/indepth/ParseExpression.java
  • ConstantExpression.java (leaf node on the parse tree representing a numerical constant). http://www.javaworld.com/javaworld/jw-02-1997/indepth/ConstantExpression.java
  • VariableExpression.java (leaf node on the parse tree representing a variable value). http://www.javaworld.com/javaworld/jw-02-1997/indepth/VariableExpression.java
  • These two classes define the exception thrown for parsing errors and evaluation errors respectively. SyntaxError.java, http://www.javaworld.com/javaworld/jw-02-1997/indepth/SyntaxError.java
    ExecError.java http://www.javaworld.com/javaworld/jw-02-1997/ indepth/ExecError.java

Related:
1 2 Page 2
Page 2 of 2