Lexical analysis and Java: Part 1

Learn how to convert human-readable text into machine-readable data using the StringTokenizer and StreamTokenizer classes

State       Input                 Action                                     New state
idle        word character        push back character                        accumulate
idle        ordinary character    return character                           idle
idle        whitespace character  consume character                          idle
accumulate  word character        add to current word                        accumulate
accumulate  ordinary character    return current word, push back character   idle
accumulate  whitespace character  return current word, consume character     idle
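
The transitions in this table can be sketched as a small hand-written scanner. This is only an illustration of the two-state mechanism, not the actual StreamTokenizer source. The class name TinyScanner and the scan method are hypothetical, and the sketch assumes that letters and digits are the word characters and that every non-whitespace character outside that set is ordinary.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the idle/accumulate state machine described above.
final class TinyScanner {
    private enum State { IDLE, ACCUMULATE }

    // Assumption for this sketch: letters and digits form the word set.
    private static boolean isWord(int c) {
        return Character.isLetterOrDigit(c);
    }

    static List<String> scan(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        State state = State.IDLE;
        try (PushbackReader in = new PushbackReader(new StringReader(text))) {
            int c;
            while ((c = in.read()) != -1) {
                if (state == State.IDLE) {
                    if (isWord(c)) {
                        in.unread(c);                  // push back character
                        state = State.ACCUMULATE;
                    } else if (!Character.isWhitespace(c)) {
                        tokens.add(String.valueOf((char) c)); // return character
                    }
                    // whitespace: consume character and stay idle
                } else {
                    if (isWord(c)) {
                        word.append((char) c);         // add to current word
                    } else {
                        tokens.add(word.toString());   // return current word
                        word.setLength(0);
                        if (!Character.isWhitespace(c)) {
                            in.unread(c);              // push back ordinary character
                        }
                        state = State.IDLE;            // whitespace is simply consumed
                    }
                }
            }
            if (state == State.ACCUMULATE) {
                tokens.add(word.toString());           // flush the last word at end of input
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("a+b c"));
    }
}
```

Pushing the character back, rather than handling it on the spot, lets each state deal only with its own inputs: the ordinary character that ends a word is seen again in the idle state, where it is returned as a token of its own.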


On top of this simple mechanism the StreamTokenizer class adds several heuristics. These include number processing, quoted string processing, comment processing, and end-of-line processing.

The first example is number processing. Certain character sequences can be interpreted as representing a numerical value. For example, the characters 1, 0, 0, ., and 0, adjacent to each other in the input stream, represent the numerical value 100.0. When all of the digit characters (0 through 9), the dot character (.), and the minus character (-) are specified as being part of the word set, the StreamTokenizer class can be told to interpret the word it is about to return as a possible number. You set this mode by calling the parseNumbers method on the tokenizer object that you instantiated (this is the default). If the analyzer is in the accumulate state and the next character would not be part of a number, the currently accumulated word is checked to see if it is a valid number. If it is valid, it is returned as a number, and the scanner moves to the next appropriate state.
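
As a minimal sketch of number processing, the following reads a short string with a default StreamTokenizer and labels each token. The class name NumberDemo, the describe method, and the sample input are made up for illustration; the tokenizer calls and fields (nextToken, ttype, nval, sval, and the TT_ constants) are the standard java.io.StreamTokenizer ones.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: shows how numbers come back from the tokenizer.
final class NumberDemo {
    static List<String> describe(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        // parseNumbers() is in effect by default, so no explicit call is needed here.
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add("NUMBER " + st.nval);      // numeric value as a double
                } else if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add("WORD " + st.sval);        // text of the word
                } else {
                    out.add("CHAR " + (char) st.ttype); // ordinary character
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("width 100.0 -5"));
    }
}
```

Numbers are always delivered as double values in nval, so an input of -5 is reported as -5.0; a caller that needs integers has to convert the value itself.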

The next example is quoted string processing. It is often desirable to pass a string that is surrounded by a quotation character (typically the double quote (") or the single quote (')) as a single token. The StreamTokenizer class allows you to specify any character as a quoting character. By default, the quoting characters are the single quote (') and the double quote ("). The state machine is modified to consume characters in the accumulate state until either the closing quote character or an end-of-line character is processed. To allow you to quote the quote character itself, the analyzer treats a quote character that is preceded by a backslash (\) inside a quotation as a word character.
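
A short sketch of quoted string processing follows. It relies on the two default quote characters and registers a third one with quoteChar to show that any character can serve. The class name QuoteDemo, the tokens method, the choice of the vertical bar as an extra quote character, and the sample input are all illustrative assumptions.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: shows how quoted strings come back as single tokens.
final class QuoteDemo {
    static List<String> tokens(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.quoteChar('|'); // assumption for this demo: also treat | as a quote character
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == '"' || st.ttype == '\'' || st.ttype == '|') {
                    // For a quoted string, ttype is the quote character
                    // and sval holds the body of the string.
                    out.add("QUOTED " + st.sval);
                } else if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add("WORD " + st.sval);
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add("NUMBER " + st.nval);
                } else {
                    out.add("CHAR " + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // Input as the tokenizer sees it: say "hello world" 'it is' |a b| "q \"x\""
        System.out.println(tokens("say \"hello world\" 'it is' |a b| \"q \\\"x\\\"\""));
    }
}
```

Note that the token type of a quoted string is the quote character itself, not TT_WORD, so the caller can tell which kind of quotation was used.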

The third example is the processing of comments. Comments are generally considered to be text that is inserted into the input stream for the human reader and is irrelevant to the machine consumer of the data. The StreamTokenizer class supports comments by discarding them from the input and never returning them. The default comment processing uses the slash character (/) to mark the start of a comment and the end of the line to mark the end of the comment. In many situations this is fine; at times, however, you may want to process C-like comments. The class supports this if you turn off generic comment processing and then enable processing of slash star (/* ... */) comments, slash slash (// ...) comments, or both. For these methods to work, the slash character (/) must not be set as a comment character. As in the case of quotes, zero or more characters can be specified as comment characters. When a comment character is encountered, the rest of the line is silently discarded.
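
The switch to C-like comments can be sketched as follows. The slash is first reset to an ordinary character, and then both comment styles are enabled with slashSlashComments and slashStarComments. The class name CommentDemo, the tokens method, and the sample input are hypothetical.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: shows C-like comment handling.
final class CommentDemo {
    static List<String> tokens(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.ordinaryChar('/');        // turn off the default single-slash comment
        st.slashSlashComments(true); // recognize // ... comments
        st.slashStarComments(true);  // recognize /* ... */ comments
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add("WORD " + st.sval);
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add("NUMBER " + st.nval);
                } else {
                    out.add("CHAR " + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("x = 10 / 2; // halve\n/* block\ncomment */ y"));
    }
}
```

With this setup a lone slash, as in 10 / 2, comes back as an ordinary character, while both comment forms disappear from the token stream.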

Resources
  • The two applets included above were designed and implemented using Visual Cafe PR2m, which is available on Windows 95 for free at:
    http://cafe.symantec.com.
  • A good discussion of lexical analysis can be found in the book Compilers: Principles, Techniques, and Tools, by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, 1986, ISBN 0-201-10088-6:
    http://www.awl.com
  • The source for past Java In Depth columns can be found at:
    http://www.mcmanis.com/~cmcmanis/java/javaworld.