Java 101: Java's character and assorted string classes support text-processing

Explore Character, String, StringBuffer, and StringTokenizer

Page 4 of 5

For another practical illustration of StringBuffer's append(String str) method, as well as StringBuffer(int length), append(char c), and deleteCharAt(int index), I created an Editor application that demonstrates a basic line-oriented text editor:

Listing 5: Editor.java

// Editor.java
import java.io.IOException;
class Editor
{
   public static final int MAXLINES = 100;
   static int curline = -1; // Current line.
   static int lastline = -1; // Last appended line index.
   // The following array holds all lines of text. (Maximum is MAXLINES.)
   static StringBuffer [] lines = new StringBuffer [MAXLINES];
   static
   {
      // We assume 80-character lines. But who knows? Because StringBuffers
      // dynamically expand, you could end up with some very long lines.
      for (int i = 0; i < lines.length; i++)
           lines [i] = new StringBuffer (80);
   }
   public static void main (String [] args)
   {
      do
      {
          // Prompt user to enter a command
          System.out.print ("C: ");
          // Obtain the command, and make sure there is no leading/trailing
          // white space
          String cmd = readString ().trim ();
          // Process command
          if (cmd.equalsIgnoreCase ("QUIT"))
              break;
          if (cmd.equalsIgnoreCase ("ADD"))
          {
              if (lastline == MAXLINES - 1)
              {
                  System.out.println ("FULL");
                  continue;
              }
              String line = readString ();
              lines [++lastline].append (line);
              curline = lastline;
              continue;
          }
          if (cmd.equalsIgnoreCase ("DELFCH"))
          {
              if (curline > -1 && lines [curline].length () > 0)
                  lines [curline].deleteCharAt (0);
              continue;
          }
          if (cmd.equalsIgnoreCase ("DUMP"))
              for (int i = 0; i <= lastline; i++)
                   System.out.println (i + ": " + lines [i]);
      }
      while (true);
   }
   static String readString ()
   {
      StringBuffer sb = new StringBuffer (80);
      try
      {
         do
         {
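             int ch = System.in.read ();
             if (ch == '\n')
                 break;
             // Skip the carriage return that precedes '\n' on Windows so
             // that it does not end up in the stored text.
             if (ch != '\r')
                 sb.append ((char) ch);
         }
         while (true);
      }
      catch (IOException e)
      {
         // Ignore the error and return whatever was read so far.
      }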
      return sb.toString ();
   }
}

To see how Editor works, type java Editor. Here is one example of this program's output:

C: add
some text
C: dump
0: some text
C: delfch
C: dump
0: ome text
C: quit

Among Editor's commands, add appends a line of text to the array of StringBuffers, dump prints all lines to the standard output device, delfch removes the current line's first character, and quit ends the program. Obviously, delfch is not very useful: a better program would accept an index after the command name and delete the character at that index. However, before you can accomplish that task, you must learn about the StringTokenizer class.
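As a preview of that better program, here is a minimal sketch of an index-based delete command. It leans on StringTokenizer, which the next section explains, only to split the command into a name and an index. The DELCH command name, the DelChSketch class, and the delch() helper are hypothetical and not part of Editor.

```java
import java.util.StringTokenizer;

class DelChSketch
{
   // Hypothetical helper: applies a command of the form "DELCH index" to
   // line. Returns true only if a character was actually deleted.
   static boolean delch (String cmd, StringBuffer line)
   {
      // Default delimiters (white space) separate the name from the index.
      StringTokenizer st = new StringTokenizer (cmd);
      if (st.countTokens () != 2)
          return false;
      if (!st.nextToken ().equalsIgnoreCase ("DELCH"))
          return false;
      int index;
      try
      {
         index = Integer.parseInt (st.nextToken ());
      }
      catch (NumberFormatException e)
      {
         return false; // The second token is not an integer.
      }
      if (index < 0 || index >= line.length ())
          return false; // The index does not refer to an existing character.
      line.deleteCharAt (index);
      return true;
   }

   public static void main (String [] args)
   {
      StringBuffer line = new StringBuffer ("some text");
      // Index 4 is the space between the two words.
      System.out.println (delch ("delch 4", line) + ": " + line);
   }
}
```

Checking the index before calling deleteCharAt() avoids the StringIndexOutOfBoundsException that method throws for an invalid index.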

The StringTokenizer class

What do the Java compiler, a text-based adventure game, and a Linux shell program have in common? Each program contains code that extracts, from user-specified text, the fundamental character sequences, or tokens, such as identifiers and punctuation (compiler), game-play instructions (adventure game), or a command name and arguments (Linux shell). Java supports this token extraction process, known as string tokenizing because user-specified text exists as one or more character strings, via the StringTokenizer class.

Unlike the frequently used Character, String, and StringBuffer language classes, the less frequently used StringTokenizer utility class exists in package java.util. A program must therefore either import that class explicitly or refer to it by its fully qualified name, java.util.StringTokenizer.

StringTokenizer objects

Before a program can extract tokens from a string, the program must create a StringTokenizer object by calling one of the following constructors:

  • public StringTokenizer(String s), which creates a StringTokenizer that extracts tokens from the s-referenced String. Furthermore, the constructor specifies the space character (' '), tab character ('\t'), new-line character ('\n'), carriage-return character ('\r'), and form-feed character ('\f') as delimiters—characters that separate tokens from each other. Delimiters do not return as tokens.
  • public StringTokenizer(String s, String delim), which is identical to the previous constructor except you also specify a string of delimiter characters via the delim-referenced String. During string tokenizing, StringTokenizer ignores all delimiter characters as it searches for the next token's beginning. Delimiters do not return as tokens.
  • public StringTokenizer(String s, String delim, boolean returnDelim), which resembles the previous constructors except you also specify whether delimiter characters should return as tokens. Delimiter characters return when you pass true to returnDelim.

Examine the following fragment to learn how these constructors create StringTokenizer objects:

String s = "A sentence to tokenize.|A second sentence.";
StringTokenizer stok1 = new StringTokenizer (s);
StringTokenizer stok2 = new StringTokenizer (s, "|");
StringTokenizer stok3 = new StringTokenizer (s, " |", true);

stok1 references a StringTokenizer that extracts tokens from the s-referenced String and recognizes space, tab, new-line, carriage-return, and form-feed characters as delimiters. stok2 references a StringTokenizer that also extracts tokens from s. This time, however, only the vertical bar character (|) classifies as a delimiter. Finally, in the stok3-referenced StringTokenizer, the space and vertical bar characters classify as delimiters and return as tokens. Now that these StringTokenizers exist, how do you extract tokens from their s-referenced Strings? Let's find out.

Token extraction

StringTokenizer provides four methods for extracting tokens:

  • public int countTokens(), which returns the number of tokens that remain to be extracted using the current set of delimiters. Use this return value to determine the maximum number of tokens to extract. However, you should call hasMoreTokens() to determine when to end tokenizing because countTokens() is undependable when the delimiters change (as you will see).
  • public boolean hasMoreTokens(), which returns true if at least one more token exists to extract. Otherwise, that method returns false.
  • public String nextToken(), which returns the next token. If no more tokens are available, this method throws a NoSuchElementException object.
  • public String nextToken(String delim), which behaves like nextToken() except that it first resets the StringTokenizer's delimiter characters to those characters in the delim-referenced String. The new delimiters remain in effect for subsequent calls.

Given this information, the following code, which builds on the previous fragment, shows how to use the previous three StringTokenizers to extract a string's tokens:

System.out.println ("count1 = " + stok1.countTokens ());
while (stok1.hasMoreTokens ())
   System.out.println ("token = " + stok1.nextToken ());
System.out.println ("\r\ncount2 = " + stok2.countTokens ());
while (stok2.hasMoreTokens ())
   System.out.println ("token = " + stok2.nextToken ());
System.out.println ("\r\ncount3 = " + stok3.countTokens ());
while (stok3.hasMoreTokens ())
   System.out.println ("token = " + stok3.nextToken ());

The fragment above divides into three parts. The first part focuses on stok1. After retrieving and printing a token count, a while loop calls nextToken() to extract tokens for as long as hasMoreTokens() returns true. The second and third parts use identical logic for the other StringTokenizers. If you execute the code fragment, you observe the following output:

count1 = 6
token = A
token = sentence
token = to
token = tokenize.|A
token = second
token = sentence.

count2 = 2
token = A sentence to tokenize.
token = A second sentence.

count3 = 13
token = A
token =  
token = sentence
token =  
token = to
token =  
token = tokenize.
token = |
token = A
token =  
token = second
token =  
token = sentence.

The output above reveals three different token counts for the same string. The counts differ because the sets of delimiters differ. For stok1, the default delimiter set applies. For stok2, only one delimiter is present: the vertical bar. For stok3, both the space and the vertical bar are delimiters. The output's final portion reveals that those delimiters return as tokens because true was passed as returnDelim's value in stok3's constructor call.
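One consequence of how StringTokenizer skips delimiters is easy to miss: adjacent delimiters never produce an empty token. The following sketch illustrates that behavior and shows how returnDelim changes the picture. The AdjacentDelims class, its tokens() helper, and the sample string are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class AdjacentDelims
{
   // Collects every token that a StringTokenizer yields for s.
   static List<String> tokens (String s, String delim, boolean returnDelim)
   {
      List<String> result = new ArrayList<String> ();
      StringTokenizer st = new StringTokenizer (s, delim, returnDelim);
      while (st.hasMoreTokens ())
         result.add (st.nextToken ());
      return result;
   }

   public static void main (String [] args)
   {
      // Two adjacent commas: the empty field between them is not reported.
      System.out.println (tokens ("a,,b", ",", false)); // [a, b]
      // With returnDelim true, each comma comes back as its own token.
      System.out.println (tokens ("a,,b", ",", true));  // [a, ,, ,, b]
   }
}
```

If a program must detect empty fields, passing true as returnDelim and watching for two delimiter tokens in a row is one way to do so.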

Earlier, I cautioned you against relying on countTokens() for determining the number of tokens to extract. countTokens()'s return value is often meaningless when a program dynamically changes a StringTokenizer's delimiters with a nextToken(String delim) method call, as the following fragment demonstrates:

String record = "Ricard Santos,Box 99,'Sacramento,CA'";
StringTokenizer st = new StringTokenizer (record, ",");
int ntok = st.countTokens ();
System.out.println ("Number of tokens = " + ntok);
for (int i = 0; i < ntok; i++)
{
     String token = st.nextToken ();
     System.out.println (token);
     if (token.startsWith ("Box"))
         st.nextToken ("'"); // Throw away comma between Box 99 and
                             // 'Sacramento,CA'
}

The code creates a String that simulates a database record. Within that record, commas delimit fields (record portions). Although there are three commas, only three fields exist: a name, a box number, and a city-state. A pair of single quotes surrounds the city-state field to indicate that the comma between Sacramento and CA is part of the field.

After creating a StringTokenizer that recognizes only comma characters as delimiters, the code counts the tokens and prints that count. It then uses the count to control the loop that extracts and prints tokens. When the Box 99 token returns, the code executes st.nextToken ("'"); to change the delimiter from a comma to a single quote and to discard the comma between Box 99 and 'Sacramento,CA'. That comma returns as a token because st.nextToken ("'"); replaces the comma delimiter with a single-quote delimiter before it extracts the next token, so the comma no longer counts as a delimiter. The code produces this output:

Number of tokens = 4
Ricard Santos
Box 99
Sacramento,CA
Exception in thread "main" java.util.NoSuchElementException
        at java.util.StringTokenizer.nextToken(StringTokenizer.java:232)
        at STDemo.main(STDemo.java:18)

The output indicates four tokens because three commas imply four tokens. But after displaying three tokens, a NoSuchElementException object is thrown from st.nextToken ();. The exception occurs because the program assumes that countTokens()'s return value indicates the exact number of tokens to extract. However, countTokens() can only base its count on the current set of delimiters. Because the fragment changes those delimiters during the loop, via st.nextToken ("'");, method countTokens()'s return value is no longer valid.

Caution
Do not use countTokens()'s return value to control a string tokenization loop's duration if the loop changes the set of delimiters via a nextToken(String delim) method call. Failure to heed that advice often leads to one of the nextToken() methods throwing a NoSuchElementException object and the program terminating prematurely.
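To make the caution concrete, here is a sketch of the same record-parsing loop controlled by hasMoreTokens() instead of countTokens(). The STDemoFixed class name and the fields() helper are hypothetical; the record and the delimiter switch come from the fragment above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class STDemoFixed
{
   // Extracts the fields of a comma-delimited record whose last field is
   // enclosed in single quotes.
   static List<String> fields (String record)
   {
      List<String> result = new ArrayList<String> ();
      StringTokenizer st = new StringTokenizer (record, ",");
      // hasMoreTokens () always consults the current set of delimiters, so
      // the loop ends correctly even after the delimiters change.
      while (st.hasMoreTokens ())
      {
         String token = st.nextToken ();
         result.add (token);
         if (token.startsWith ("Box"))
             st.nextToken ("'"); // Throw away comma between Box 99 and
                                 // 'Sacramento,CA'
      }
      return result;
   }

   public static void main (String [] args)
   {
      String record = "Ricard Santos,Box 99,'Sacramento,CA'";
      for (String field : fields (record))
         System.out.println (field);
   }
}
```

This version prints Ricard Santos, Box 99, and Sacramento,CA on separate lines and then ends without throwing NoSuchElementException.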

For a practical demonstration of StringTokenizer's methods, I created a PigLatin application that translates English text to its pig Latin equivalent. For those unfamiliar with the pig Latin game, this coded language moves a word's first letter to its end and then adds ay. For example: computer becomes omputercay; Java becomes Avajay, etc. Punctuation is not affected. Listing 6 presents PigLatin's source code:

Listing 6: PigLatin.java

// PigLatin.java
import java.util.StringTokenizer;
class PigLatin
{
   public static void main (String [] args)
   {
      if (args.length != 1)
      {
          System.err.println ("usage: java PigLatin phrase");
          return;
      }
      StringTokenizer st = new StringTokenizer (args [0], " \t:;,.-?!");
      while (st.hasMoreTokens ())
      {
         StringBuffer sb = new StringBuffer (st.nextToken ());
         sb.append (sb.charAt (0));
         sb.append ("ay");
         sb.deleteCharAt (0);
         System.out.print (sb.toString () + " ");
      }
      System.out.println ();
   }
}

To see what Hello, World! looks like in pig Latin, execute java PigLatin "Hello, World!". You see the following output:

elloHay orldWay 

According to pig Latin's rules, the output is not quite correct. First, the wrong letters are capitalized. Second, the punctuation is missing. The correct output is:

Ellohay, Orldway! 

Use what you've learned in this article to fix those problems.
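One possible approach, offered as a sketch rather than the definitive answer, is to pass true as returnDelim so that punctuation and spaces come back as tokens and can be copied to the output unchanged. The PigLatin2 class name, the DELIMS constant, and the translate() helper are hypothetical.

```java
import java.util.StringTokenizer;

class PigLatin2
{
   static final String DELIMS = " \t:;,.-?!";

   static String translate (String phrase)
   {
      StringBuffer out = new StringBuffer (phrase.length () * 2);
      StringTokenizer st = new StringTokenizer (phrase, DELIMS, true);
      while (st.hasMoreTokens ())
      {
         String token = st.nextToken ();
         // With returnDelim true, each delimiter arrives as a
         // one-character token. Copy it to the output unchanged.
         if (token.length () == 1 && DELIMS.indexOf (token.charAt (0)) != -1)
         {
             out.append (token);
             continue;
         }
         // Move the first letter (in lowercase) to the end and add ay.
         char first = token.charAt (0);
         StringBuffer sb = new StringBuffer (token);
         sb.deleteCharAt (0);
         sb.append (Character.toLowerCase (first));
         sb.append ("ay");
         // Carry the original capitalization over to the new first letter.
         if (Character.isUpperCase (first))
             sb.setCharAt (0, Character.toUpperCase (sb.charAt (0)));
         out.append (sb.toString ());
      }
      return out.toString ();
   }

   public static void main (String [] args)
   {
      System.out.println (translate ("Hello, World!")); // Ellohay, Orldway!
   }
}
```

The sketch handles only the simple rule described earlier (first letter to the end, then ay) and leaves the rest of each word's capitalization as it was.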

Review

Java's Character, String, StringBuffer, and StringTokenizer classes support text-processing programs. Such programs use Character to indirectly store char variables in data structure objects and access a variety of character-oriented utility methods; use String to represent and manipulate immutable strings; use StringBuffer to represent and manipulate mutable strings; and use StringTokenizer to extract a string's tokens.
