Lexical analysis and Java: Part 1

Learn how to convert human readable text into machine readable data using the StringTokenizer and StreamTokenizer classes

Page 2 of 2

The third example is the processing of comments. Comments are generally text inserted into the input stream for the human reader and are irrelevant to the machine consumer of the data. The StreamTokenizer supports this by stripping comments from the input and never returning them. By default, the slash character (/) marks the start of a comment and the end of the line marks its end. In many situations this is fine; at times, however, you may want to process C-like comments. The class supports this if you turn off generic comment processing and then enable processing of slash star (/* ... */) comments, slash slash (// ...) comments, or both. For these methods to work, the slash character (/) must not be set as the comment character. As in the case of quotes, zero or more characters can be designated as comment characters. When a comment character is encountered, the rest of the line is silently discarded.
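
To make the difference concrete, here is a minimal sketch that tokenizes the same line twice, once with the default syntax and once with C-like comment handling switched on. The class name CommentDemo and the helper words are made up for this illustration; only the StreamTokenizer calls come from the JDK.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the article's applet.
class CommentDemo {

    // Returns only the word tokens found in the input.
    static List<String> words(String input, boolean cStyle) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        if (cStyle) {
            st.ordinaryChar('/');        // '/' no longer starts a to-end-of-line comment
            st.slashStarComments(true);  // recognize /* ... */
            st.slashSlashComments(true); // recognize // ...
        }
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "this is /* a comment */ a test";
        System.out.println(words(line, false)); // prints [this, is]
        System.out.println(words(line, true));  // prints [this, is, a, test]
    }
}
```

With the default syntax the first slash discards the rest of the line, so only two words survive. With C-like comments enabled, just the text between /* and */ is dropped.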

The final feature you can enable is special treatment of the end of each input line. If it is important to know when the end of a line is reached, you can tell the tokenizer to return an indication to that effect. You could instead simply declare the ASCII linefeed character (0x0a) as an ordinary character, but on platforms that end lines with just an ASCII carriage return (0x0d) you would then never see an end-of-line indication. So the analyzer notes internally when an appropriate line-terminating character has been reached and then returns that indication to your class. This feature has the additional benefit of hiding a small piece of platform-dependent behavior.
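
As a small sketch of the end-of-line feature, the code below feeds the tokenizer a string that mixes all three common line endings and records an <EOL> marker whenever the tokenizer reports one. The class name EolDemo and the helper tokens are invented for this example, and the <EOL> label is simply borrowed from the applet's display.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the article's applet.
class EolDemo {

    // Lists words, ordinary characters, and (if requested) end-of-line markers.
    static List<String> tokens(String input, boolean eolSignificant) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.eolIsSignificant(eolSignificant);
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_EOL) {
                    out.add("<EOL>");
                } else if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                } else {
                    out.add(String.valueOf((char) st.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // Linefeed, carriage return + linefeed, and bare carriage return.
        String text = "one\ntwo\r\nthree\rfour";
        System.out.println(tokens(text, true));
        // prints [one, <EOL>, two, <EOL>, three, <EOL>, four]
        System.out.println(tokens(text, false));
        // prints [one, two, three, four]
    }
}
```

Each line ending produces exactly one marker, including the two-character carriage return plus linefeed pair, which is the platform-hiding behavior described above.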

I know this all sounds tremendously complicated, so to help you out I've put together a StreamTokenizer exerciser applet along the same lines as the StringTokenizer exerciser above. The source to the StreamTokenizer exerciser applet is here. This applet is much larger than the StringTokenizer exerciser, as it offers many additional capabilities. The applet is on the page below, and you may want to open up a new copy of the browser on this page so that you can keep the applet up while rereading the text.

The applet is pretty straightforward to operate. On the left-hand side is an input text area; type in one or more lines of text to be analyzed here. The middle box is where the list of tokens will appear. Each token is prefaced with W for a word token, O for an ordinary character token, Q for a quote token, or N for a number token, and <EOL> and <EOF> mark the end-of-line and end-of-file meta tokens.

The third box shows how the ASCII characters are divided up into "O" ordinary, "W" word, and "B" blank (or whitespace) characters. To read this list, note that each entry is applied in sequence starting with the first one and moving down. So the item "B[0, 32]" is read "The characters with values 0 through 32 are treated as whitespace." The item "W[48, 122]" is read "The characters with values 48 through 122 are treated as word characters." Later in the list you will see the item "O[91, 96]," which means that characters 91 through 96 are treated as ordinary characters. Because this item is lower in the list than the word item above it, it overrides that word item for characters in this range.

These character ranges and the check boxes on the right-hand side are used only if the check box labelled "custom syntax" is selected; however, they are set to the "default" syntax that StreamTokenizer uses. This allows you to see the rules that are in effect, even if you don't have a custom syntax selected. On the bottom of the applet are three sets of boxes; you can use these to add new characters to the word, ordinary, and blank character ranges. Finally, each command button in the middle row carries out exactly the function its name describes.
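
The way later entries override earlier ones can be sketched directly against the StreamTokenizer API. The code below clears the syntax table and then applies only the three items quoted above, in the same order; it does not reproduce the applet's full list, and the class name RangeDemo and the "W:" and "O:" output prefixes are assumptions made for this illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not the applet's source.
class RangeDemo {

    static List<String> tokens(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.resetSyntax();          // start from a table where every character is ordinary
        st.whitespaceChars(0, 32); // B[0, 32]
        st.wordChars(48, 122);     // W[48, 122]: digits, letters, and the characters between
        st.ordinaryChars(91, 96);  // O[91, 96]: overrides the word range for [ \ ] ^ _ `
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add("W:" + st.sval);
                } else {
                    out.add("O:" + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("abc_def x1 [y]"));
        // prints [W:abc, O:_, W:def, W:x1, O:[, W:y, O:]]
    }
}
```

The underscore (character 95) falls inside both ranges, but because the ordinary range is applied last it splits "abc_def" into two words. Note also that after resetSyntax the digits are plain word characters, so "x1" comes back as a single word rather than a word and a number.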

You need a Java-enabled browser to see this applet.

You may want to play around a bit with the applet. Here are some ideas to get you started.

  • Click on the custom syntax check box. In the left-hand White Space Character box, type the letter e and then click Add Blank Chars. Now type in the phrase "this test is eerie," and click the Tokenize! button.

  • Click Reset Syntax! and then type in "this is /* a comment */ a test," and tokenize that. Now change the comment character to c and click the Tokenize! button. Now clear the comment character and click the box labelled "/* Comments," and tokenize it again.
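
If you cannot run the applet, the first two experiments can be approximated in a few lines of code. This is only a sketch: the class name AppletExperiments and its methods are invented here, and it assumes that "changing the comment character to c" in the applet corresponds to making the slash ordinary and calling commentChar('c').

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the applet experiments.
class AppletExperiments {

    // Experiment 1: treat the letter e as a blank (whitespace) character.
    static List<String> eAsBlank(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.whitespaceChars('e', 'e');
        return collectWords(st);
    }

    // Experiment 2 (assumed mapping): use c instead of / as the comment character.
    static List<String> cAsComment(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.ordinaryChar('/');
        st.commentChar('c');
        return collectWords(st);
    }

    // Gathers the word tokens and ignores everything else.
    static List<String> collectWords(StreamTokenizer st) {
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(eAsBlank("this test is eerie"));
        // prints [this, t, st, is, ri]
        System.out.println(cAsComment("this is /* a comment */ a test"));
        // prints [this, is, a]
    }
}
```

In the first run every e acts as a separator, so "test" breaks into "t" and "st" and "eerie" shrinks to "ri". In the second, the c in "comment" starts a comment that swallows the rest of the line.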

Wrapping up

The two classes -- StringTokenizer and StreamTokenizer -- are very useful for parsing information that is in textual form. Once you get to know them, they may become a standard part of your programming technique. Next month I will walk through the design of a simple application that uses the StreamTokenizer class in its operation.

Chuck McManis is currently the director of system software at FreeGate Corp. FreeGate is a venture-funded start-up that is exploring opportunities in the Internet marketplace. Before joining FreeGate, McManis was a member of the Java group. He joined the Java group just after the formation of FirstPerson Inc. and was a member of the portable OS group (the group responsible for the OS portion of Java). Later, when FirstPerson was dissolved, he stayed with the group through the development of the alpha and beta versions of the Java platform. He created the first "all Java" home page on the Internet when he did the programming for the Java version of the Sun home page in May 1995. He also developed a cryptographic library for Java and versions of the Java class loader that could screen classes based on digital signatures. Before joining FirstPerson, Chuck worked in the operating systems area of SunSoft developing networking applications, where he did the initial design of NIS+.
