Jan 1, 1997 12:00 AM PT

Lexical analysis and Java: Part 1

Learn how to convert human-readable text into machine-readable data using the StringTokenizer and StreamTokenizer classes

Lexical analysis and parsing

When writing Java applications, one of the more common things you will be required to produce is a parser. Parsers range from simple to complex and are used for everything from looking at command-line options to interpreting Java source code. In JavaWorld's December issue, I showed you Jack, an automatic parser generator that converts high-level grammar specifications into Java classes that implement the parser described by those specifications. This month I'll show you the resources that Java provides to write targeted lexical analyzers and parsers. These somewhat simpler parsers fill the gap between simple string comparison and the complex grammars that Jack compiles.

The purpose of a lexical analyzer is to take a stream of input characters and decode them into higher-level tokens that a parser can understand. A parser consumes the output of the lexical analyzer and operates by analyzing the sequence of tokens returned, matching that sequence against one of possibly many end states. The end states define the goals of the parser. When an end state is reached, the program using the parser takes some action -- either setting up data structures or executing some action-specific code. Additionally, a parser can detect, from the sequence of tokens that have been processed, when no legal end state can be reached; at that point the parser identifies the current state as an error state. It is up to the application to decide what action to take when the parser identifies either an end state or an error state.

The standard Java class library includes a couple of lexical analyzer classes; however, it does not define any general-purpose parser classes. In this column I'll take an in-depth look at the lexical analyzers that come with Java.

Java's lexical analyzers

The Java class library, version 1.0.2, defines two lexical analyzer classes: StringTokenizer (in the java.util package) and StreamTokenizer (in java.io). From their names you can deduce that StringTokenizer uses String objects as its input, and StreamTokenizer uses InputStream objects.

The StringTokenizer class

Of the two available lexical analyzer classes, the easiest to understand is StringTokenizer. When you construct a new StringTokenizer object, the constructor method nominally takes two values -- an input string and a delimiter string. The class then constructs a sequence of tokens that represents the characters between the delimiter characters.

As a lexical analyzer, StringTokenizer could be formally defined as shown below.

[~delim1,delim2,...,delimN]    ::  Token

This definition consists of a regular expression that matches every character except the delimiter characters. All adjacent matching characters are collected into a single token and returned as a Token.
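
For example, here is a minimal sketch (the input and delimiter strings are illustrative, and the class name is mine) that splits a comma-separated list:

    import java.util.StringTokenizer;

    public class SplitDemo {
        public static void main(String args[]) {
            // Split on commas; the commas themselves are consumed
            // and only the text between them is returned.
            StringTokenizer st = new StringTokenizer("10,20,30", ",");
            while (st.hasMoreTokens())
                System.out.println(st.nextToken()); // prints 10, 20, 30
        }
    }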

The most common use of the StringTokenizer class is for separating out a set of parameters -- such as a comma-separated list of numbers. StringTokenizer is ideal in this role because it removes the separators and returns the data. The StringTokenizer class also provides a mechanism for identifying lists in which there are "null" tokens. You would use null tokens in applications in which some parameters either have default values or are not required to be present in all cases.

The applet below is a simple StringTokenizer exerciser. The source to the StringTokenizer applet is here. To use the applet, type some text to be analyzed into the input string area, then type a string consisting of separator characters in the Separator String area. Finally, click on the Tokenize! button. The result will show up in the token list below the input string and will be organized as one token per line.

You need a Java-enabled browser to see this applet.

Consider as an example the string "a,b,,d", passed to a StringTokenizer object that has been constructed with a comma (,) as the separator character. If you put these values in the exerciser applet above, you will see that the Tokenizer object returns the strings "a," "b," and "d." If your intention was to note that one parameter was missing, you may have been surprised to see no indication of this in the token sequence. The ability to detect missing tokens is enabled by the Return Separator boolean that can be set when you create a Tokenizer object. With this parameter set when the Tokenizer is constructed, each separator is also returned. Click the checkbox for Return Separator in the applet above, and leave the string and the separator alone. Now the Tokenizer returns "a," comma, "b," comma, comma, and "d." By noting that you get two separator characters in sequence, you can determine that a "null" token was included in the input string.
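
Here is a minimal sketch of this null-token detection (the input string is illustrative, and the class name is mine):

    import java.util.StringTokenizer;

    public class NullTokenDemo {
        public static void main(String args[]) {
            // Passing true as the third constructor argument makes the
            // tokenizer return each separator as a token of its own.
            StringTokenizer st = new StringTokenizer("a,b,,d", ",", true);
            boolean lastWasSeparator = true; // a leading comma also means a null token
            while (st.hasMoreTokens()) {
                String tok = st.nextToken();
                if (tok.equals(",")) {
                    if (lastWasSeparator)
                        System.out.println("(null token)");
                    lastWasSeparator = true;
                } else {
                    System.out.println(tok);
                    lastWasSeparator = false;
                }
            }
        }
    }

Running this prints a, b, (null token), and d, making the missing parameter visible.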

The trick to successfully using StringTokenizer in a parser is defining the input in such a way that the delimiter character does not appear in the data. You can meet this restriction by designing for it in your application. The method definition below can be used as part of an applet that accepts a color in the form of red, green, and blue values in its parameter stream.

    
    /**
     * Parse a parameter of the form "10,20,30" as an
     * RGB tuple for a color value.
     */
 1    Color getColor(String name) {
 2        String data;
 3        StringTokenizer st;
 4        int red, green, blue;
 5        
 6        data = getParameter(name);
 7        if (data == null)
 8            return null;
 9            
10        st = new StringTokenizer(data, ",");
11        try {
12            red = Integer.parseInt(st.nextToken());
13            green = Integer.parseInt(st.nextToken());
14            blue = Integer.parseInt(st.nextToken());
15        } catch (Exception e) {
16            return null; // (ERROR STATE) could not parse it
17        }
18        return new Color(red, green, blue); // (END STATE) done.
19    }             

The code above implements a very simple parser that reads the string "number, number, number" and returns a new Color object. In line 10, the code creates a new StringTokenizer object that contains the parameter data (assume this method is part of an applet), with a separator list consisting of the comma character. Then in lines 12, 13, and 14, each token is extracted from the string and converted into a number using the Integer.parseInt method. These conversions are surrounded by a try/catch block in case one of the strings is not a valid number or the Tokenizer runs out of tokens and throws an exception. If all of the numbers convert, the end state is reached and a Color object is returned; otherwise the error state is reached and null is returned.
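
In an applet, this method would typically be driven from init, with the color supplied by a <param> tag in the host HTML page. A sketch of such a call (the parameter name bgcolor is hypothetical):

    // In the HTML: <param name="bgcolor" value="128,128,255">
    public void init() {
        Color c = getColor("bgcolor");
        if (c != null)
            setBackground(c); // end state reached; use the parsed color
        // otherwise the error state was reached; keep the default background
    }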

One feature of the StringTokenizer class is that it is easily stacked: the tokens returned by one tokenizer can be fed to another. Look at the version of getColor below, which is essentially lines 10 through 18 of the method above, repackaged to take its input string directly.

      
      /**
       * Parse a color tuple "r,g,b" into an AWT Color object.
       */
 1    Color getColor(String data) {
 2        int red, green, blue;
 3        StringTokenizer st = new StringTokenizer(data, ",");
 4        try {
 5            red = Integer.parseInt(st.nextToken());
 6            green = Integer.parseInt(st.nextToken());
 7            blue = Integer.parseInt(st.nextToken());
 8        } catch (Exception e) {
 9            return null; // (ERROR STATE) could not parse it
10        }
11        return new Color(red, green, blue); // (END STATE) done.
12    }      

A slightly more complex parser is shown in the code below. This parser is implemented in the method getColors, which is defined to return an array of Color objects.

      
      /**
       * Parse a set of colors "r1,g1,b1:r2,g2,b2:...:rn,gn,bn" into
       * an array of AWT Color objects.
       */
 1    Color[] getColors(String data) {
 2        Vector accum = new Vector();
 3        Color cl, result[];
 4        StringTokenizer st = new StringTokenizer(data, ": ");
 5        while (st.hasMoreTokens()) {
 6            cl = getColor(st.nextToken());
 7            if (cl != null) {
 8                accum.addElement(cl);
 9            } else {
10                System.out.println("Error - bad color.");
11            }
12        }
13        if (accum.size() == 0)
14            return null;
15        result = new Color[accum.size()];
16        for (int i = 0; i < accum.size(); i++) {
17            result[i] = (Color) accum.elementAt(i);
18        }
19        return result;
20    }

In the method above, which is only slightly different from the getColor method, the code in lines 4 through 12 creates a new Tokenizer that extracts tokens delimited by the colon (:) character (the delimiter set also includes the space character, so stray spaces around tuples are absorbed). As the documentation comment for the method indicates, this method expects color tuples to be separated by colons. Each call to nextToken in the StringTokenizer class returns a new token until the string has been exhausted. The tokens returned are the strings of numbers separated by commas; these token strings are fed to getColor, which extracts a color from the three numbers. Creating a new StringTokenizer object using a token returned by another StringTokenizer object allows the parser code we've written to be a bit more sophisticated about how it interprets the string input.
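
For example, a call such as the following (the tuple string is illustrative) yields a three-element array:

    Color palette[] = getColors("255,0,0:0,255,0:0,0,255");
    if (palette != null)
        System.out.println(palette.length + " colors parsed."); // prints "3 colors parsed."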

As useful as it is, you will eventually exhaust the abilities of the StringTokenizer class and have to move on to its big brother StreamTokenizer.

The StreamTokenizer class

As the name of the class suggests, a StreamTokenizer object expects its input to come from an InputStream class. Like the StringTokenizer above, this class converts the input stream into chunks that your parsing code can interpret, but that is where the similarity ends.

StreamTokenizer is a table-driven lexical analyzer. This means that every possible input character is assigned a significance, and the scanner uses the significance of the current character to decide what to do. In the implementation of this class, characters are assigned one of three categories. These are:

  • Whitespace characters -- their lexical significance is limited to separating words

  • Word characters -- they should be aggregated when they are adjacent to another word character

  • Ordinary characters -- they should be returned immediately to the parser
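
You assign these categories yourself through the wordChars, whitespaceChars, and ordinaryChar methods. The sketch below (the ranges and input are illustrative; it uses the Reader-based constructor of later JDKs, where JDK 1.0 code would wrap an InputStream) builds a small syntax table from scratch:

    import java.io.*;

    public class SyntaxDemo {
        public static void main(String args[]) throws IOException {
            StreamTokenizer st = new StreamTokenizer(new StringReader("one,two"));
            st.resetSyntax();           // start from an empty table
            st.whitespaceChars(0, ' '); // controls and space separate words
            st.wordChars('a', 'z');     // lowercase letters aggregate into words
            st.wordChars('A', 'Z');     // so do uppercase letters
            st.ordinaryChar(',');       // commas are returned one at a time
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD)
                    System.out.println("W " + st.sval);
                else
                    System.out.println("O " + (char) st.ttype);
            }
        }
    }

This prints W one, O ,, and W two.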

Imagine the implementation of this class as a simple state machine that has two states -- idle and accumulate. In each state the input is a character from one of the above categories. The class reads the character, checks its category and does some action, and moves on to the next state. The following table shows this state machine.

State        Input                 Action                                    New state
idle         word character        push back character                       accumulate
idle         ordinary character    return character                          idle
idle         whitespace character  consume character                         idle
accumulate   word character        add to current word                       accumulate
accumulate   ordinary character    return current word; push back character  idle
accumulate   whitespace character  return current word; consume character    idle

On top of this simple mechanism the StreamTokenizer class adds several heuristics. These include number processing, quoted string processing, comment processing, and end-of-line processing.

The first example is number processing. Certain character sequences can be interpreted as representing a numerical value. For example, the sequence of characters 1, 0, 0, ., and 0 adjacent to each other in the input stream represents the numerical value 100.0. When all of the digit characters (0 through 9), the dot character (.), and the minus (-) character are specified as being part of the word set, the StreamTokenizer class can be told to interpret the word it is about to return as a possible number. You set this mode by calling the parseNumbers method on the tokenizer object that you instantiated (number parsing is on by default). If the analyzer is in the accumulate state, and the next character would not be part of a number, the currently accumulated word is checked to see if it is a valid number. If it is valid, it is returned, and the scanner moves to the next appropriate state.
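
A sketch of number processing (the input string is illustrative): when a token parses as a number, the ttype field is TT_NUMBER and the value appears in the nval field.

    import java.io.*;

    public class NumberDemo {
        public static void main(String args[]) throws IOException {
            StreamTokenizer st = new StreamTokenizer(new StringReader("width 100.0"));
            st.parseNumbers(); // on by default; shown here for emphasis
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_NUMBER)
                    System.out.println("N " + st.nval); // prints N 100.0
                else if (st.ttype == StreamTokenizer.TT_WORD)
                    System.out.println("W " + st.sval); // prints W width
            }
        }
    }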

The next example is quoted string processing. It is often desirable to pass a string that is surrounded by a quotation character (typically the double (") or single (') quote) as a single token. The StreamTokenizer class allows you to specify any character as a quoting character; by default these are the single quote (') and double quote (") characters. The state machine is modified to consume characters in the accumulate state until either another quote character or an end-of-line character is processed. To let you include the quote character inside a quoted string, the analyzer treats a quote character preceded by a backslash (\) inside a quotation as a word character.
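
A sketch of quote processing (the input string is illustrative): when a quoted string is read, ttype holds the quote character itself and sval holds the body of the string.

    import java.io.*;

    public class QuoteDemo {
        public static void main(String args[]) throws IOException {
            StreamTokenizer st =
                new StreamTokenizer(new StringReader("say \"hello there\""));
            st.quoteChar('"'); // the default; shown here for emphasis
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == '"')
                    System.out.println("Q " + st.sval); // prints Q hello there
                else if (st.ttype == StreamTokenizer.TT_WORD)
                    System.out.println("W " + st.sval); // prints W say
            }
        }
    }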

The third example is the processing of comments. Comments are generally considered to be text that is inserted into the input stream for the human reader and is irrelevant to the machine consumer of the data. The StreamTokenizer class supports comments by eliminating them from the input and never returning them. The default comment processing uses the slash character (/) to delineate the start of a comment and the end of the line to delineate the end of the comment. In many situations this is fine; however, at times you may want to process C-like comments. The class supports this if you turn off generic comment processing and then enable processing of either slash-star (/* ... */) comments, slash-slash (// ...) comments, or both. For these methods to work, the slash character (/) must not be set as the comment character. As in the case of quotes, zero or more characters can be designated as comment characters. When a comment character is encountered, the rest of the line is silently discarded.
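
A sketch of C-like comment processing (the input string is illustrative):

    import java.io.*;

    public class CommentDemo {
        public static void main(String args[]) throws IOException {
            StreamTokenizer st = new StreamTokenizer(
                new StringReader("real /* ignored */ tokens // also ignored"));
            st.ordinaryChar('/');        // '/' must not remain the comment character
            st.slashStarComments(true);  // strip /* ... */ comments
            st.slashSlashComments(true); // strip // ... comments
            while (st.nextToken() != StreamTokenizer.TT_EOF)
                System.out.println(st.sval); // prints only: real, tokens
        }
    }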

The final feature you can enable is special treatment of the end of each input line. If it is important to know when the end of a line is reached, you can tell the tokenizer to return an indication to that effect. You could simply declare the ASCII linefeed character (0x0a) as an ordinary character, except that on platforms that end lines with just an ASCII carriage return (0x0d), you would never see end-of-line indications. So the analyzer notes internally when an appropriate line-terminating character has been reached and then returns that indication to your class. This feature has the additional benefit of hiding a small piece of platform-dependent behavior.
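
A sketch of end-of-line processing (the input string is illustrative); note that a carriage-return/linefeed pair is reported as a single end-of-line indication:

    import java.io.*;

    public class EolDemo {
        public static void main(String args[]) throws IOException {
            StreamTokenizer st =
                new StreamTokenizer(new StringReader("one\ntwo\r\nthree"));
            st.eolIsSignificant(true); // report line ends as TT_EOL tokens
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_EOL)
                    System.out.println("<EOL>");
                else
                    System.out.println(st.sval); // word tokens: one, two, three
            }
        }
    }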

I know this all sounds tremendously complicated, so to help you out I've put together a StreamTokenizer exerciser applet along the same lines as the StringTokenizer exerciser above. The source to the StreamTokenizer exerciser applet is here. This applet is much larger than the StringTokenizer exerciser, as it offers many additional capabilities. The applet is on the page below, and you may want to open up a new copy of the browser on this page so that you can keep the applet up while rereading the text.

It's pretty straightforward to operate the applet. On the left-hand side is an input text area; type in one or more lines of text to be analyzed here. The middle box is where the list of tokens will appear; these are prefaced with the characters W for a word token, O for an ordinary character token, Q for a quote token, N for a number token, and <EOL> or <EOF> for the end-of-line and end-of-file meta tokens.

The third box shows how the ASCII characters are divided up into "O" ordinary, "W" word, and "B" blank (or whitespace) characters. To read this list, note that each entry is applied in sequence, starting with the first one and moving down. So the item "B[0, 32]" is read "The characters with values 0 through 32 are treated as whitespace," and the item "W[48, 122]" is read "The characters with values 48 through 122 are treated as word characters." Later in the list you will see the item "O[91, 96]," which means that characters 91 through 96 are treated as ordinary characters. Because this item is lower in the list than the word item above it, it overrides that word item for characters in this range.

These character ranges and the checkboxes on the right-hand side are used only if the check box labelled "custom syntax" is selected; however, they are set to the "default" syntax that StreamTokenizer uses, which lets you see the rules that are in effect even when you don't have a custom syntax selected. On the bottom of the applet are three sets of boxes; you can use these to add new characters to the word, ordinary, and blank character ranges. Finally, the row of command buttons in the middle carry out the exact functions described by their names.

You need a Java-enabled browser to see this applet.

You may want to play around a bit with the applet. Here are some ideas to get you started.

  • Click on the custom syntax button. In the left-hand White Space Character box, type the letter e and then click Add Blank Chars. Now type in the phrase "this test is eerie," and click the Tokenize! button.

  • Click Reset Syntax! and then type in "this is /* a comment */ a test," and tokenize that. Now change the comment character to c and click the Tokenize! button. Now clear the comment character and click the box labelled "/* Comments," and tokenize it again.

Wrapping up

The two classes -- StringTokenizer and StreamTokenizer -- are very useful for parsing information that is in textual form. Once you get to know them, they may become a standard part of your programming technique. Next month I will walk through the design of a simple application that uses a StreamTokenizer class in its operation.

Chuck McManis is currently the director of system software at FreeGate Corp. FreeGate is a venture-funded start-up that is exploring opportunities in the Internet marketplace. Before joining FreeGate, McManis was a member of the Java group. He joined the Java group just after the formation of FirstPerson Inc. and was a member of the portable OS group (the group responsible for the OS portion of Java). Later, when FirstPerson was dissolved, he stayed with the group through the development of the alpha and beta versions of the Java platform. He created the first "all Java" home page on the Internet when he did the programming for the Java version of the Sun home page in May 1995. He also developed a cryptographic library for Java and versions of the Java class loader that could screen classes based on digital signatures. Before joining FirstPerson, Chuck worked in the operating systems area of SunSoft developing networking applications, where he did the initial design of NIS+.
