Lexical analysis and Java: Part 1

Learn how to convert human readable text into machine readable data using the StringTokenizer and StreamTokenizer classes



State       Input                 Action                                     New state
idle        word character        push back character                        accumulate
idle        ordinary character    return character                           idle
idle        whitespace character  consume character                          idle
accumulate  word character        add to current word                        accumulate
accumulate  ordinary character    return current word, push back character   idle
accumulate  whitespace character  return current word, consume character     idle
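
The transitions in the table above can be sketched as a small hand-written scanner. This is only an illustration, not the StreamTokenizer source: the class name SimpleScanner, the method scan, and the choice of letters and digits as the word set are assumptions made for the example. Flushing a pending word at end of input is also an assumption, since the table does not cover that case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical two-state scanner mirroring the idle/accumulate table.
class SimpleScanner {
    private enum State { IDLE, ACCUMULATE }

    // Assumption: letters and digits form the word set.
    private static boolean isWord(char c) {
        return Character.isLetterOrDigit(c);
    }

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        State state = State.IDLE;
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (state == State.IDLE) {
                if (isWord(c)) {
                    // Push back: leave i alone so ACCUMULATE re-reads c.
                    state = State.ACCUMULATE;
                } else if (Character.isWhitespace(c)) {
                    i++;                             // consume character
                } else {
                    tokens.add(String.valueOf(c));   // return character
                    i++;
                }
            } else {
                if (isWord(c)) {
                    word.append(c);                  // add to current word
                    i++;
                } else {
                    tokens.add(word.toString());     // return current word
                    word.setLength(0);
                    state = State.IDLE;
                    if (Character.isWhitespace(c)) {
                        i++;                         // consume character
                    }
                    // Ordinary character: pushed back, handled in IDLE.
                }
            }
        }
        // Assumption: a word still being accumulated at end of input is returned.
        if (state == State.ACCUMULATE) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("x = y+1"));
    }
}
```

The "push back" action is modeled here by simply not advancing the index, so the same character is examined again in the new state.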


On top of this simple mechanism the StreamTokenizer class adds several heuristics. These include number processing, quoted string processing, comment processing, and end-of-line processing.

The first example is number processing. Certain character sequences can be interpreted as representing a numerical value. For example, the characters 1, 0, 0, ., and 0, adjacent to each other in the input stream, represent the numerical value 100.0. When all of the digit characters (0 through 9), the dot character (.), and the minus character (-) are specified as being part of the word set, the StreamTokenizer class can be told to interpret the word it is about to return as a possible number. This mode is set by calling the parseNumbers method on the tokenizer object that you instantiated (it is on by default). If the analyzer is in the accumulate state and the next character cannot be part of a number, the currently accumulated word is checked to see whether it is a valid number. If it is, it is returned as a number token (ttype is set to TT_NUMBER and the value is placed in nval), and the scanner moves to the next appropriate state.
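
As a minimal sketch of number processing, the following reads a short string and labels each token. The class name NumberDemo and the helper describe are invented for the example; the StreamTokenizer calls and fields (parseNumbers, nextToken, ttype, nval, sval) are the real API. The call to parseNumbers is redundant because the mode is on by default, and is included only to make the setting visible.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; only the StreamTokenizer calls are real API.
class NumberDemo {
    static List<String> describe(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.parseNumbers(); // already the default, shown for clarity
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add("NUMBER " + st.nval);   // numeric value as a double
                } else if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add("WORD " + st.sval);     // text of the word
                } else {
                    out.add("CHAR " + (char) st.ttype); // ordinary character
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("total 100.0 -5"));
    }
}
```

Note that nval is always a double, so an input of -5 is reported as -5.0.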

The next example is quoted string processing. It is often desirable to pass a string that is surrounded by a quotation character (typically the double quote (") or the single quote (')) as a single token. The StreamTokenizer class allows you to specify any character as a quoting character. By default, the quoting characters are the single quote (') and the double quote ("). The state machine is modified to consume characters in the accumulate state until either a matching quote character or an end-of-line character is processed. To allow you to quote the quote character itself, the analyzer treats a quote character that is preceded by a backslash (\) inside a quotation as a word character.
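
Quoted string processing can be illustrated with the default quoting characters. In this sketch the class name QuoteDemo and the helper tokens are made up for the example. For a quoted string, StreamTokenizer sets ttype to the quote character itself and places the body of the string, without the quotes, in sval.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; only the StreamTokenizer calls are real API.
class QuoteDemo {
    static List<String> tokens(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == '"' || st.ttype == '\'') {
                    out.add("QUOTED " + st.sval);   // body without the quotes
                } else if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add("WORD " + st.sval);
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add("NUMBER " + st.nval);
                } else {
                    out.add("CHAR " + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // The input text is: say "hello world" 'it\'s'
        System.out.println(tokens("say \"hello world\" 'it\\'s'"));
    }
}
```

A different quoting character can be registered with the quoteChar method, for example st.quoteChar('|').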

The third example is the processing of comments. Comments are generally considered to be text that is inserted into the input stream for the human reader and is irrelevant to the machine consumer of the data. The StreamTokenizer class supports ignoring comments by removing them from the input and never returning them. The default comment processing uses the slash character (/) to mark the start of a comment and the end of the line to mark its end. In many situations this is fine; at times, however, you may want to process C-like comments. The class supports this if you turn off generic comment processing and then enable processing of slash-star (/* ... */) comments, slash-slash (// ...) comments, or both. For these methods to work, the slash character (/) must not be set as a comment character. As in the case of quotes, zero or more characters can be specified as comment characters. When a comment character is encountered, the rest of the line is silently discarded.
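
The difference between the default comment handling and C-like comments might be demonstrated as follows. The class name CommentDemo, the helper tokens, and its cStyle flag are assumptions for the example. Here generic comment processing is turned off by making the slash an ordinary character with ordinaryChar, which is one way to do it, before slashSlashComments and slashStarComments are enabled.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; only the StreamTokenizer calls are real API.
class CommentDemo {
    static List<String> tokens(String input, boolean cStyle) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        if (cStyle) {
            st.ordinaryChar('/');        // slash is no longer a comment character
            st.slashSlashComments(true); // recognize // ... to end of line
            st.slashStarComments(true);  // recognize /* ... */
        }
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add(String.valueOf(st.nval));
                } else {
                    out.add(String.valueOf((char) st.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // Default: everything after a single slash on a line is dropped.
        System.out.println(tokens("a / b\nc", false));
        // C-like: comments are dropped, a lone slash is an ordinary token.
        System.out.println(tokens("a /* note */ b // tail\nc / d", true));
    }
}
```

With the default settings a lone slash can never be returned as a token, which is why expressions that use division need the C-like configuration or a different comment character.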
