
Lexical analysis and Java: Part 1

Learn how to convert human-readable text into machine-readable data using the StringTokenizer and StreamTokenizer classes


The applet below is a simple StringTokenizer exerciser. The source to the StringTokenizer applet is here. To use the applet, type some text to be analyzed into the input string area, then type a string consisting of separator characters in the Separator String area. Finally, click on the Tokenize! button. The result will show up in the token list below the input string and will be organized as one token per line.

You need a Java-enabled browser to see this applet.


Consider as an example the string "a,b,,d" passed to a StringTokenizer object that has been constructed with a comma (,) as the separator character. If you put these values into the exerciser applet above, you will see that the tokenizer returns the strings "a", "b", and "d". If your intention was to note that one parameter was missing, you may have been surprised to see no indication of this in the token sequence. Detecting missing tokens is made possible by the Return Separator boolean, which can be set when you create a tokenizer object. With this parameter set when the tokenizer is constructed, each separator is also returned as a token. Click the Return Separator checkbox in the applet above, and leave the string and the separator alone. Now the tokenizer returns "a", comma, "b", comma, comma, and "d". Two separator characters in sequence tell you that a "null" token was included in the input string.
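The missing-token check described above can be sketched in a few lines. This is only an illustration: the class name MissingTokenDemo and the method fields are made up for this example, and the sketch assumes the Return Separator option corresponds to the boolean third argument of the StringTokenizer constructor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class MissingTokenDemo {
    /** Splits input on commas, recording a missing field as null (hypothetical helper). */
    static List<String> fields(String input) {
        List<String> out = new ArrayList<>();
        // Third argument true: separators are returned as tokens too.
        StringTokenizer st = new StringTokenizer(input, ",", true);
        boolean lastWasSeparator = true; // treat the start of input like a separator
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            if (tok.equals(",")) {
                if (lastWasSeparator) {
                    out.add(null); // two separators in a row: a field is missing
                }
                lastWasSeparator = true;
            } else {
                out.add(tok);
                lastWasSeparator = false;
            }
        }
        if (lastWasSeparator) {
            out.add(null); // input ended on a separator, or was empty
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("a,b,,d")); // prints [a, b, null, d]
    }
}
```

Without the third constructor argument, the same input would simply yield three tokens and the gap would go unnoticed.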

The trick to using StringTokenizer successfully in a parser is to define the input so that the delimiter character never appears in the data. You can work within this restriction by designing for it in your application. The method definition below can be used as part of an applet that accepts a color, in the form of red, green, and blue values, in its parameter stream.

    
    /**
     * Parse a parameter of the form "10,20,30" as an
     * RGB tuple for a color value.
     */
 1    Color getColor(String name) {
 2        String data;
 3        StringTokenizer st;
 4        int red, green, blue;
 5        
 6        data = getParameter(name);
 7        if (data == null)
 8            return null;
 9            
10        st = new StringTokenizer(data, ",");
11        try {
12            red = Integer.parseInt(st.nextToken());
13            green = Integer.parseInt(st.nextToken());
14            blue = Integer.parseInt(st.nextToken());
15        } catch (Exception e) {
16            return null; // (ERROR STATE) could not parse it
17        }
18        return new Color(red, green, blue); // (END STATE) done.
19    }             


The code above implements a very simple parser that reads a string of the form "number,number,number" and returns a new Color object. In line 10, the code creates a new StringTokenizer object that contains the parameter data (assume this method is part of an applet) and a separator character list that consists of a comma. Then, in lines 12, 13, and 14, each token is extracted from the string and converted into a number using the Integer.parseInt method. These conversions are wrapped in a try/catch block in case a token is not a valid number (NumberFormatException) or the tokenizer throws an exception because it has run out of tokens (NoSuchElementException). If all of the numbers convert, the end state is reached and a Color object is returned; otherwise the error state is reached and null is returned.
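To see the two failure modes separately, here is a rough sketch that catches each exception on its own. The class RgbParseDemo and the method parseRgb are hypothetical names for this illustration, and the sketch returns a plain int array rather than a Color so that it runs outside an applet.

```java
import java.util.NoSuchElementException;
import java.util.StringTokenizer;

public class RgbParseDemo {
    /** Returns {red, green, blue}; throws IllegalArgumentException naming the problem. */
    static int[] parseRgb(String data) {
        StringTokenizer st = new StringTokenizer(data, ",");
        int[] rgb = new int[3];
        try {
            for (int i = 0; i < 3; i++) {
                // No trimming here, as in the original method: " 20" is not a valid int
                rgb[i] = Integer.parseInt(st.nextToken());
            }
        } catch (NoSuchElementException e) {
            throw new IllegalArgumentException("fewer than three values: " + data);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("not a number in: " + data);
        }
        return rgb;
    }

    public static void main(String[] args) {
        String[] samples = { "10,20,30", "10,20", "10, 20, 30" };
        for (String s : samples) {
            try {
                int[] c = parseRgb(s);
                System.out.println(s + " -> " + c[0] + " " + c[1] + " " + c[2]);
            } catch (IllegalArgumentException e) {
                System.out.println(s + " -> error: " + e.getMessage());
            }
        }
    }
}
```

Note that the third sample fails because Integer.parseInt does not accept the leading space; an application that wants to allow spaces would need to trim each token or add a space to the separator list.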

One feature of the StringTokenizer class is that it is easily stacked. Look at the method named getColor below, which is essentially lines 10 through 18 of the method above, except that it takes the string to parse directly as its argument.

      
      /**
       * Parse a color tuple "r,g,b" into an AWT Color object.
       */
      Color getColor(String data) {
          int red, green, blue;
          StringTokenizer st = new StringTokenizer(data, ",");
          try {
              red = Integer.parseInt(st.nextToken());
              green = Integer.parseInt(st.nextToken());
              blue = Integer.parseInt(st.nextToken());
          } catch (Exception e) {
              return null; // (ERROR STATE) could not parse it
          }
          return new Color(red, green, blue); // (END STATE) done.
      }


A slightly more complex parser is shown in the code below. This parser is implemented in the method getColors, which is defined to return an array of Color objects.

      
      /**
       * Parse a set of colors "r1,g1,b1:r2,g2,b2:...:rn,gn,bn" into
       * an array of AWT Color objects.
       */
 1    Color[] getColors(String data) {
 2        Vector accum = new Vector();
 3        Color cl, result[];
 4        StringTokenizer st = new StringTokenizer(data, ": ");
 5        while (st.hasMoreTokens()) {
 6            cl = getColor(st.nextToken());
 7            if (cl != null) {
 8                accum.addElement(cl);
 9            } else {
10                System.out.println("Error - bad color.");
11            }
12        }
13        if (accum.size() == 0)
14            return null;
15        result = new Color[accum.size()];
16        for (int i = 0; i < accum.size(); i++) {
17            result[i] = (Color) accum.elementAt(i);
18        }
19        return result;
20    }


In the method above, which is only slightly different from the getColor method, the code in lines 4 through 12 creates a new tokenizer to extract the tokens delimited by the colon (:) character. (The separator string in line 4 also contains a space, so a space ends a token as well.) As you can read in the documentation comment for the method, this method expects color tuples to be separated by colons. Each call to nextToken in the StringTokenizer class returns a new token until the string has been exhausted. The tokens returned are the strings of numbers separated by commas; these token strings are fed to getColor, which then extracts a color from the three numbers. Creating a new StringTokenizer object from a token returned by another StringTokenizer object allows the parser code we've written to be a bit more sophisticated about how it interprets the string input.
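The stacking idea can be tried outside an applet with a small stand-alone sketch. The names StackedTokenizerDemo, parseTuple, and parseTuples are invented for this example, and a plain int array stands in for the AWT Color object so the code runs without AWT. Unlike getColor above, this sketch also rejects tuples with extra values, which is an assumption about what a stricter parser might want.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class StackedTokenizerDemo {
    /** Inner parser: "r,g,b" -> {r, g, b}, or null if the tuple is malformed. */
    static int[] parseTuple(String tuple) {
        StringTokenizer st = new StringTokenizer(tuple, ",");
        if (st.countTokens() != 3) {
            return null; // stricter than getColor, which ignores extra values
        }
        int[] rgb = new int[3];
        try {
            for (int i = 0; i < 3; i++) {
                rgb[i] = Integer.parseInt(st.nextToken());
            }
        } catch (NumberFormatException e) {
            return null;
        }
        return rgb;
    }

    /** Outer parser: splits on ':' and hands each piece to the inner parser. */
    static List<int[]> parseTuples(String data) {
        List<int[]> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(data, ":");
        while (st.hasMoreTokens()) {
            int[] rgb = parseTuple(st.nextToken());
            if (rgb != null) {
                out.add(rgb); // bad tuples are skipped rather than reported
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] rgb : parseTuples("255,0,0:0,255,0:oops:0,0,255")) {
            System.out.println(rgb[0] + " " + rgb[1] + " " + rgb[2]);
        }
    }
}
```

Each tokenizer only needs to know about its own separator, which is what keeps the two levels independent.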

As useful as the StringTokenizer class is, you will eventually exhaust its abilities and have to move on to its big brother, StreamTokenizer.

The StreamTokenizer class

As the name of the class suggests, a StreamTokenizer object expects its input to come from an InputStream object. Like the StringTokenizer above, this class converts the input stream into chunks that your parsing code can interpret, but that is where the similarity ends.
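As a first taste of the class, the following sketch reads tokens with the default settings. It is a minimal illustration with invented names (StreamTokenizerBasics, scan). It also wraps the text in a StringReader, because in later JDK releases the InputStream constructor is deprecated in favor of the Reader-based one.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class StreamTokenizerBasics {
    /** Scans text with StreamTokenizer's default syntax and labels each token. */
    static List<String> scan(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        out.add("word:" + st.sval);   // text of the word
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        out.add("number:" + st.nval); // numbers arrive as doubles
                        break;
                    default:
                        // any other character is returned as its own token type
                        out.add("char:" + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for a StringReader
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [word:width, char:=, number:42.0, char:;]
        System.out.println(scan("width = 42 ;"));
    }
}
```

Unlike StringTokenizer, the token type travels in the ttype field, with the token's value in sval or nval.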

StreamTokenizer is a table-driven lexical analyzer. This means that every possible input character is assigned a significance, and the scanner uses the significance of the current character to decide what to do. In the implementation of this class, characters are assigned one of three categories. These are:

  • Whitespace characters -- their lexical significance is limited to separating words

  • Word characters -- they should be aggregated when they are adjacent to another word character

  • Ordinary characters -- they should be returned immediately to the parser
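The three categories above map onto methods of StreamTokenizer that fill in its table. The sketch below, with the invented names SyntaxTableDemo and scan, assigns the categories by hand; the particular character ranges are only an example, not a recommended configuration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class SyntaxTableDemo {
    static List<String> scan(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.resetSyntax();            // start with every character ordinary
        st.whitespaceChars(0, ' ');  // control characters and space only separate words
        st.wordChars('a', 'z');      // letters aggregate into words
        st.wordChars('A', 'Z');
        st.wordChars('0', '9');      // digits are word characters here, not numbers
        st.ordinaryChar('+');        // already ordinary after resetSyntax; shown for clarity

        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add("word:" + st.sval);
                } else {
                    out.add("char:" + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for a StringReader
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan("x1+ y2")); // prints [word:x1, char:+, word:y2]
    }
}
```

Changing a single table entry, for example making '+' a word character, changes how the same input is split without touching the scanning loop.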


Imagine the implementation of this class as a simple state machine that has two states -- idle and accumulate. In each state the input is a character from one of the above categories. The class reads the character, checks its category, performs an action, and moves on to the next state. The following table shows this state machine.
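The same two-state idea can also be sketched directly in code. The following is one reading of the description above, not the actual StreamTokenizer implementation; the class TwoStateScanner and the classify rule (letters and digits as word characters) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class TwoStateScanner {
    enum Category { WHITESPACE, WORD, ORDINARY }
    enum State { IDLE, ACCUMULATE }

    /** Hypothetical classification; a table-driven scanner would look this up. */
    static Category classify(char c) {
        if (Character.isWhitespace(c)) return Category.WHITESPACE;
        if (Character.isLetterOrDigit(c)) return Category.WORD;
        return Category.ORDINARY;
    }

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        State state = State.IDLE;
        for (char c : input.toCharArray()) {
            Category cat = classify(c);
            if (state == State.ACCUMULATE && cat != Category.WORD) {
                tokens.add(word.toString()); // a non-word character ends the word
                word.setLength(0);
                state = State.IDLE;
            }
            switch (cat) {
                case WORD:
                    word.append(c);          // aggregate adjacent word characters
                    state = State.ACCUMULATE;
                    break;
                case ORDINARY:
                    tokens.add(String.valueOf(c)); // returned immediately
                    break;
                case WHITESPACE:
                    break;                   // only separates words
            }
        }
        if (state == State.ACCUMULATE) {
            tokens.add(word.toString());     // end of input also ends a word
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("sum = a1+b")); // prints [sum, =, a1, +, b]
    }
}
```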

Resources
  • The two applets included above were designed and implemented using Visual Cafe PR2m, which is available for Windows 95 for free at:
    http://cafe.symantec.com
  • A good discussion of lexical analysis can be found in the book Compilers: Principles, Techniques, and Tools, by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, 1986, ISBN 0-201-10088-6:
    http://www.awl.com
  • The source for past Java In Depth columns can be found at:
    http://www.mcmanis.com/~cmcmanis/java/javaworld