Lexical analysis and Java: Part 1

Learn how to convert human-readable text into machine-readable data using the StringTokenizer and StreamTokenizer classes

Lexical analysis and parsing

When writing Java applications, one of the more common things you will be required to produce is a parser. Parsers range from simple to complex and are used for everything from looking at command-line options to interpreting Java source code. In JavaWorld's December issue, I showed you Jack, an automatic parser generator that converts high-level grammar specifications into Java classes that implement the parser described by those specifications. This month I'll show you the resources that Java provides to write targeted lexical analyzers and parsers. These somewhat simpler parsers fill the gap between simple string comparison and the complex grammars that Jack compiles.

The purpose of a lexical analyzer is to take a stream of input characters and decode them into higher-level tokens that a parser can understand. Parsers consume the output of the lexical analyzer and operate by analyzing the sequence of tokens returned. The parser matches these sequences to an end state, which may be one of possibly many end states. The end states define the goals of the parser. When an end state is reached, the program using the parser performs some action -- either setting up data structures or executing some action-specific code. Additionally, parsers can detect -- from the sequence of tokens that have been processed -- when no legal end state can be reached; at that point the parser identifies the current state as an error state. It is up to the application to decide what action to take when the parser identifies either an end state or an error state.
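To make the division of labor concrete, here is a minimal sketch of a lexical analyzer feeding a parser. It assumes a made-up one-line grammar (a name, an equals sign, and an integer, separated by whitespace); the class name TinyParser, the State enum, and the grammar itself are hypothetical and chosen only for illustration. It also uses language features (enums) from Java releases later than the one this column describes.

```java
import java.util.StringTokenizer;

public class TinyParser {
    // States of the (hypothetical) parser; END is the single end state.
    enum State { EXPECT_NAME, EXPECT_EQUALS, EXPECT_VALUE, END, ERROR }

    // Consumes tokens one at a time and advances a small state machine.
    static State parse(String input) {
        // Lexical analysis: whitespace-separated tokens (a simplifying assumption).
        StringTokenizer lexer = new StringTokenizer(input);
        State state = State.EXPECT_NAME;
        while (lexer.hasMoreTokens()) {
            String token = lexer.nextToken();
            switch (state) {
                case EXPECT_NAME:
                    state = Character.isLetter(token.charAt(0))
                            ? State.EXPECT_EQUALS : State.ERROR;
                    break;
                case EXPECT_EQUALS:
                    state = token.equals("=") ? State.EXPECT_VALUE : State.ERROR;
                    break;
                case EXPECT_VALUE:
                    state = isInteger(token) ? State.END : State.ERROR;
                    break;
                default:
                    // Extra tokens after END: no legal end state can be reached.
                    return State.ERROR;
            }
            if (state == State.ERROR) {
                return State.ERROR;
            }
        }
        // Running out of tokens before END is also an error.
        return state == State.END ? State.END : State.ERROR;
    }

    static boolean isInteger(String s) {
        try {
            Integer.parseInt(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("width = 42")); // prints END
        System.out.println(parse("width 42"));   // prints ERROR
    }
}
```

The caller decides what to do with the result: on END it might store the name and value, and on ERROR it might report the offending line.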

The standard Java class library includes a couple of lexical analyzer classes; however, it does not define any general-purpose parser classes. In this column I'll take an in-depth look at the lexical analyzers that come with Java.

Java's lexical analyzers

The Java Language Specification, version 1.0.2, defines two lexical analyzer classes, StringTokenizer and StreamTokenizer. From their names you can deduce that StringTokenizer uses String objects as its input, and StreamTokenizer uses InputStream objects.
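As a rough preview of the stream-based class, the sketch below runs StreamTokenizer over a short input with its default settings. One assumption to note: it wraps the text in a StringReader and uses the Reader-based constructor, because the InputStream constructor described here was deprecated in later Java releases. The class name StreamTokenDemo and the describe helper are hypothetical.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class StreamTokenDemo {
    // Hypothetical helper: labels each token by the type StreamTokenizer reports.
    static List<String> describe(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        out.add("WORD(" + st.sval + ")");
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        out.add("NUMBER(" + st.nval + ")"); // nval is a double
                        break;
                    default:
                        // Ordinary characters are reported as their own character code.
                        out.add("CHAR(" + (char) st.ttype + ")");
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("x = 42")); // prints [WORD(x), CHAR(=), NUMBER(42.0)]
    }
}
```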

The StringTokenizer class

Of the two available lexical analyzer classes, the easiest to understand is StringTokenizer. When you construct a new StringTokenizer object, the constructor method nominally takes two values -- an input string and a delimiter string. The class then constructs a sequence of tokens that represents the characters between the delimiter characters.

As a lexical analyzer, StringTokenizer could be formally defined as shown below.

[~delim1,delim2,...,delimN]    ::  Token


This definition consists of a regular expression that matches every character except the delimiter characters. All adjacent matching characters are collected into a single token and returned as a Token.
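A short sketch of that behavior, using the two-argument constructor on a comma-separated list: the class name TokenizeDemo and the tokenize helper are hypothetical, and the List return type relies on collection classes from later Java releases. Note that the adjacent commas in the sample input do not produce an empty token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeDemo {
    // Hypothetical helper: collects every token between delimiter characters.
    static List<String> tokenize(String input, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, delims);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Runs of non-delimiter characters become tokens; delimiters are dropped.
        System.out.println(tokenize("10,20,,30", ",")); // prints [10, 20, 30]
    }
}
```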

The most common use of the StringTokenizer class is for separating out a set of parameters -- such as a comma-separated list of numbers. StringTokenizer is ideal in this role because it removes the separators and returns the data. The StringTokenizer class also provides a mechanism for identifying lists in which there are "null" tokens. You would use null tokens in applications in which some parameters either have default values or are not required to be present in all cases.
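One way to detect null tokens, sketched below, is the three-argument constructor, whose boolean flag asks StringTokenizer to return the delimiters themselves as one-character tokens. Two delimiters in a row (or a delimiter at the start or end of the input) then signal an empty field. I am assuming this is the mechanism meant here; the class name NullTokenDemo, the fields helper, and the choice to represent an empty field as a Java null are all illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class NullTokenDemo {
    // Hypothetical helper: returns one entry per field, with null for empty fields.
    static List<String> fields(String input, String delims) {
        List<String> result = new ArrayList<>();
        // Third argument true: delimiters are returned as tokens too.
        StringTokenizer st = new StringTokenizer(input, delims, true);
        boolean lastWasDelim = true; // the start of input acts as a field boundary
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            boolean isDelim = tok.length() == 1 && delims.indexOf(tok.charAt(0)) >= 0;
            if (isDelim) {
                if (lastWasDelim) {
                    result.add(null); // nothing between two boundaries: a null token
                }
                lastWasDelim = true;
            } else {
                result.add(tok);
                lastWasDelim = false;
            }
        }
        if (lastWasDelim) {
            result.add(null); // input ended on a boundary: trailing empty field
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fields("10,,30", ",")); // prints [10, null, 30]
    }
}
```

A caller could then substitute a default value wherever it finds a null entry.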

Resources
  • The two applets included above were designed and implemented using Visual Cafe PR2, which is available for Windows 95 for free at:
    http://cafe.symantec.com.
  • A good discussion of lexical analysis can be found in the book Compilers: Principles, Techniques, and Tools, by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, 1986, ISBN 0-201-10088-6:
    http://www.awl.com
  • The source for past Java In Depth columns can be found at:
    http://www.mcmanis.com/~cmcmanis/java/javaworld.