Lexical analysis and Java: Part 1

Learn how to convert human readable text into machine readable data using the StringTokenizer and StreamTokenizer classes


The final feature you can enable is special treatment of the end of each input line. If it is important to know when the end of a line has been reached, you can tell the tokenizer to return an indication to that effect. You might think you could simply declare the ASCII linefeed character (0x0a) to be an ordinary character, but on platforms that end lines with just an ASCII carriage return (0x0d) you would never see an end-of-line indication. Instead, the tokenizer notes internally when an appropriate line-terminating character has been reached and then returns that indication to your class. This feature has the additional benefit of hiding a small piece of platform-dependent behavior.
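As a minimal sketch of this behavior (the input string is my own example), you enable end-of-line reporting with StreamTokenizer's eolIsSignificant method, and the tokenizer then returns the TT_EOL meta token once per line, whether the line ends in a linefeed, a carriage return, or both:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class EolDemo {
    public static void main(String[] args) throws IOException {
        // Three lines with three different line terminators: LF, CR+LF, and CR.
        StreamTokenizer st =
            new StreamTokenizer(new StringReader("one two\nthree\r\nfour"));
        st.eolIsSignificant(true); // report end-of-line as a token

        int type;
        while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
            if (type == StreamTokenizer.TT_EOL) {
                System.out.println("<EOL>");
            } else if (type == StreamTokenizer.TT_WORD) {
                System.out.println("W " + st.sval);
            }
        }
        // Prints: W one, W two, <EOL>, W three, <EOL>, W four
    }
}
```

Note that the CR+LF pair is collapsed into a single TT_EOL token, which is exactly the platform-dependent detail the tokenizer hides from you.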

I know this all sounds tremendously complicated, so to help you out I've put together a StreamTokenizer exerciser applet along the same lines as the StringTokenizer exerciser above. The source to the StreamTokenizer exerciser applet is here. This applet is much larger than the StringTokenizer exerciser because it offers many additional capabilities. The applet appears on the page below; you may want to open a second browser window on this page so that you can keep the applet visible while rereading the text.

It's pretty straightforward to operate the applet. On the left-hand side is an input text area; type one or more lines of text to be analyzed there. The middle box is where the list of tokens will appear, prefaced with the characters W for a word token, O for an ordinary character token, Q for a quote token, N for a number token, and <EOL> or <EOF> for the end-of-line and end-of-file meta tokens.

The third box shows how the ASCII characters are divided into "O" ordinary, "W" word, and "B" blank (or whitespace) characters. To read this list, note that each entry is applied in sequence, starting with the first one and moving down. So the item "B[0, 32]" is read "The characters with values 0 through 32 are treated as whitespace," and the item "W[48, 122]" is read "The characters with values 48 through 122 are treated as word characters." Later in the list you will see the item "O[91, 96]," which means that characters 91 through 96 are treated as ordinary characters. Because this item is lower in the list than the word item above it, it overrides that word item for characters in this range.

These character ranges and the checkboxes on the right-hand side are used only if the checkbox labeled "custom syntax" is selected; however, they are initialized to the "default" syntax that StreamTokenizer uses. This lets you see the rules that are in effect even when no custom syntax is selected. On the bottom of the applet are three sets of boxes; you can use these to add new characters to the word, ordinary, and blank character ranges. Finally, the row of command buttons in the middle carries out the exact functions described by their names.
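The ranges shown in the applet's third box map directly onto StreamTokenizer's syntax-table methods. A minimal sketch (the input string is my own example) that applies the three ranges described above, in the same order, so the later ordinary range overrides the overlapping word range:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class SyntaxDemo {
    public static void main(String[] args) throws IOException {
        StreamTokenizer st =
            new StreamTokenizer(new StringReader("abc[def]ghi"));

        // Start from a clean slate (every character ordinary), then apply
        // the ranges in order; later calls override earlier ones where they overlap.
        st.resetSyntax();
        st.whitespaceChars(0, 32);   // B[0, 32]:  control characters and space are blanks
        st.wordChars(48, 122);       // W[48, 122]: digits, letters, and some punctuation
        st.ordinaryChars(91, 96);    // O[91, 96]:  [ \ ] ^ _ ` override the word range

        int type;
        while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
            if (type == StreamTokenizer.TT_WORD) {
                System.out.println("W " + st.sval);
            } else {
                // Ordinary characters come back as their own character value.
                System.out.println("O " + (char) type);
            }
        }
        // Prints: W abc, O [, W def, O ], W ghi
    }
}
```

The square brackets, sitting in the 91-96 ordinary range, break the run of word characters into separate tokens, just as the applet's token list would show.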

You need a Java-enabled browser to see this applet.

You may want to play around a bit with the applet. Here are some ideas to get you started.


Resources
  • The two applets included above were designed and implemented using Visual Cafe PR2m, which is available free for Windows 95 at:
    http://cafe.symantec.com.
  • A good discussion of lexical analysis can be found in the book Compilers: Principles, Techniques, and Tools, by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, 1986, ISBN 0-201-10088-6:
    http://www.awl.com
  • The source for past Java In Depth columns can be found at:
    http://www.mcmanis.com/~cmcmanis/java/javaworld.