Jun 15, 2001 1:00 AM PT

Java Tip 112: Improve tokenization of information-rich strings

Exploit StringTokenizer to build a powerful tokenizer

Most Java programmers have used the java.util.StringTokenizer class at some time or another. It is a handy class that basically tokenizes (breaks) the input string based on a separator, and supplies the tokens upon request. (Tokenization is the act of turning sequences of characters into tokens that are understood by your program.)

Although handy, StringTokenizer's functionality is limited. The class simply looks for the delimiter in the input string and breaks the string wherever the delimiter is found. It does not check whether the delimiter lies within a quoted substring, nor does it return an empty token "" (a string of length 0) when two consecutive delimiters appear in the input. To overcome these limitations, the Java 2 platform (JDK 1.2 onwards) comes with the BreakIterator class, which is an improved tokenizer over StringTokenizer. Since such a class is not present in JDK 1.1.x, developers often spend a lot of time writing their own tokenizer that meets their requirements. In a large project involving data format handling, it's not uncommon to find many such customized classes floating around.

This tip aims to guide you through writing a sophisticated tokenizer, using the existing StringTokenizer.

StringTokenizer limitations

You can create a StringTokenizer by using any one of the following three constructors:

  1. StringTokenizer(String sInput): Breaks on white space (" ", "\t", "\n", "\r", "\f").
  2. StringTokenizer(String sInput, String sDelimiter): Breaks on sDelimiter.
  3. StringTokenizer(String sInput, String sDelimiter, boolean bReturnTokens): Breaks on sDelimiter, but if bReturnTokens is set to true, then the delimiter is also returned as a token.
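As a quick illustration of the three constructors, here is a minimal sketch. The class name ConstructorDemo, the helper drain(), and the sample input are made up for this example and are not part of the tip's source code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class ConstructorDemo {
    // Hypothetical helper: drains a tokenizer into a list so the tokens are easy to compare.
    static List<String> drain(StringTokenizer tokenizer) {
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "a b,c";  // made-up sample input

        // 1. Default white-space delimiters: tokens are "a" and "b,c".
        System.out.println(drain(new StringTokenizer(input)));

        // 2. Break on ",": tokens are "a b" and "c".
        System.out.println(drain(new StringTokenizer(input, ",")));

        // 3. Break on "," and return it as a token: "a b", ",", "c".
        System.out.println(drain(new StringTokenizer(input, ",", true)));
    }
}
```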

The first constructor doesn't check whether the input string contains quoted substrings. When the string "hello. Today \"I am \" going to my home town" is tokenized on white space, the result begins with the tokens hello., Today, "I, am, ", going, instead of hello., Today, "I am ", going.
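The behavior is easy to reproduce. The sketch below (class and method names are hypothetical) joins the white-space tokens with a | character so that the split points inside the quoted phrase are visible.

```java
import java.util.StringTokenizer;

class QuotedWhitespaceDemo {
    static final String INPUT = "hello. Today \"I am \" going to my home town";

    // Joins the white-space tokens with '|' so the split points are visible.
    static String splitOnWhitespace(String input) {
        StringTokenizer tokenizer = new StringTokenizer(input);
        StringBuilder joined = new StringBuilder();
        while (tokenizer.hasMoreTokens()) {
            if (joined.length() > 0) {
                joined.append('|');
            }
            joined.append(tokenizer.nextToken());
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // The quoted phrase "I am " comes back as three separate tokens:
        // hello.|Today|"I|am|"|going|to|my|home|town
        System.out.println(splitOnWhitespace(INPUT));
    }
}
```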

The second constructor doesn't check for consecutive delimiters. When the string "book, author, publication,,,date published" is tokenized on ",", the StringTokenizer returns four tokens with values book, author, publication, and date published instead of the six values book, author, publication, "", "", and date published, where "" means a string of length 0. To recover all six, you must set the StringTokenizer's bReturnTokens parameter to true, so that the delimiters themselves come back as tokens and consecutive ones can be detected.
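A short check of the token counts, as a sketch (the class and method names are hypothetical): without returned delimiters the empty values vanish, while with bReturnTokens set to true the five commas show up as tokens of their own.

```java
import java.util.StringTokenizer;

class ConsecutiveDelimiterDemo {
    static final String INPUT = "book, author, publication,,,date published";

    // Counts the tokens StringTokenizer reports for a comma delimiter.
    static int count(String input, boolean returnDelims) {
        return new StringTokenizer(input, ",", returnDelims).countTokens();
    }

    public static void main(String[] args) {
        System.out.println(count(INPUT, false)); // 4: the two empty values are dropped
        System.out.println(count(INPUT, true));  // 9: four values plus five commas
    }
}
```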

Setting the parameter to true is important because it reveals the presence of consecutive delimiters. Suppose the data is obtained dynamically and used to update a table in a database, with each input token mapping to a column value. Without the delimiters, we can't map the tokens to the database columns, because we don't know which columns should be set to "". For example, say we want to add records to a table with six columns, and the input data contains two consecutive delimiters. The result from StringTokenizer in this case is five tokens (two consecutive delimiters represent the token "", which StringTokenizer ignores), yet we have to set six fields. We also don't know where the consecutive delimiters appear, and thus which column should be set to "".
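One way to recover the empty values is to walk the tokens with the delimiters returned and emit "" whenever two delimiters are adjacent. The following is a minimal sketch of that idea, assuming a single-character delimiter and ignoring leading or trailing delimiters; the class and method names are hypothetical and this is not the tip's own source code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class EmptyFieldDemo {
    // Returns the values between delimiters, inserting "" between adjacent delimiters.
    // Assumption: delim is a single character.
    static List<String> fieldsWithEmpties(String input, String delim) {
        StringTokenizer tokenizer = new StringTokenizer(input, delim, true);
        List<String> fields = new ArrayList<>();
        boolean prevWasDelim = false;
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (token.equals(delim)) {
                if (prevWasDelim) {
                    fields.add("");   // two delimiters in a row: an empty value
                }
                prevWasDelim = true;
            } else {
                fields.add(token);    // an ordinary value (not trimmed here)
                prevWasDelim = false;
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Six values, so each one lines up with one of six columns.
        System.out.println(fieldsWithEmpties("book, author, publication,,,date published", ","));
    }
}
```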

The third constructor won't work if a token that equals the delimiter (in length and value) appears inside a quoted substring. When the string "book, author, publication,\",\",date published" (which contains , as a token, the same character as its delimiter) is tokenized on ,, the result is book, author, publication, ", ", date published (six tokens) instead of book, author, publication, , (the comma character), date published (five tokens). Setting bReturnTokens (the third parameter to StringTokenizer) to true won't help in this case either.
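The sketch below (hypothetical class and method names) shows the quoted comma being torn apart: the two double quotes come back as separate tokens whether or not the delimiters are returned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class QuotedDelimiterDemo {
    static final String INPUT = "book, author, publication,\",\",date published";

    // Collects the tokens StringTokenizer produces for a comma delimiter.
    static List<String> split(String input, boolean returnDelims) {
        StringTokenizer tokenizer = new StringTokenizer(input, ",", returnDelims);
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Six tokens; the fourth and fifth are each a lone double quote.
        System.out.println(split(INPUT, false));
        // Eleven tokens: the same six plus the five commas.
        System.out.println(split(INPUT, true));
    }
}
```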

Basic needs of a tokenizer

Before dealing with the code, you'll need to know the basic needs of a good tokenizer. Since Java developers are used to the StringTokenizer class, a good tokenizer should have all the useful methods that class provides, such as hasMoreTokens(), nextToken(), and countTokens().

The code for this tip is simple and mostly self-explanatory. I have used the StringTokenizer class (created with bReturnTokens set to true) internally and provided the methods mentioned above. Since the delimiter is required as a token in some (very rare) cases and not in others, the tokenizer must supply the delimiter as a token upon request. When you create a PowerfulTokenizer object, passing only the input string and the delimiter, it internally uses a StringTokenizer with bReturnTokens set to true. (The reason is that a StringTokenizer created without bReturnTokens set to true cannot overcome the problems stated earlier.) To handle the tokens properly, the code checks whether bReturnTokens is set to true in a few places (when calculating the total number of tokens and in nextToken()).

As you might have observed, PowerfulTokenizer implements the Enumeration interface, thus implementing the hasMoreElements() and nextElement() methods that simply delegate the call to hasMoreTokens() and nextToken(), respectively. (By implementing the Enumeration interface, PowerfulTokenizer becomes backward-compatible with StringTokenizer.) Let's consider an example. Say the input string is "hello, Today,,, \"I, am \", going to,,, \"buy, a, book\"" and the delimiter is ,. This string when tokenized returns values as shown in Table 1:

Table 1: Values Returned by Tokenized String

In the Tokens entries, the character : separates the tokens, and "" means a string of length 0.

Type: StringTokenizer (bReturnTokens = true)
Number of Tokens: 19
Tokens: hello:,: Today:,:,:,: "I:,: am ":,: going to:,:,:,: "buy:,: a:,: book"

Type: PowerfulTokenizer (bReturnTokens = true)
Number of Tokens: 13
Tokens: hello:,:Today:,:"":"":I, am:,:going to:,:"":"":buy, a, book

Type: PowerfulTokenizer (bReturnTokens = false)
Number of Tokens: 9
Tokens: hello:Today:"":"":I, am:going to:"":"":buy, a, book

The input string contains 11 comma (,) characters, out of which three are inside substrings and four appear consecutively (as Today,,, makes two consecutive comma appearances, the first comma being Today's delimiter). Here is the logic in calculating the number of tokens in the PowerfulTokenizer case:

  1. In the case of bReturnTokens=true, multiply the number of delimiters inside substrings by 2 and subtract that amount from the actual total to get the token count (here, 19 - 2 * 3 = 13). The reason is that, for the substring "buy, a, book", StringTokenizer returns five tokens (i.e., buy:,:a:,:book), while PowerfulTokenizer returns one token (i.e., buy, a, book). The difference is four (i.e., 2 * number of delimiters inside the substring). This formula holds for any substring containing delimiters. Be aware of the special case where the token itself equals the delimiter; this should not decrement the count value.
  2. Similarly, for the case of bReturnTokens=false, subtract the value of the expression [total delimiters (11) - consecutive delimiters (4) + number of delimiters inside substrings (3)] from the actual total (19) to get the token count (19 - 10 = 9). Since we don't return the delimiters in this case, those that appear neither consecutively nor inside substrings are of no use to us, and the above formula gives us the total number of tokens (9).
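The two formulas can be checked against the example string with a few lines of code. This is only a sketch for verifying the arithmetic: the class TokenCountFormulas and its methods are hypothetical, and the delimiter statistics are gathered with a simple character scan rather than with the index-based loop used in the tip's source.

```java
import java.util.StringTokenizer;

class TokenCountFormulas {
    static final String INPUT =
        "hello, Today,,, \"I, am \", going to,,, \"buy, a, book\"";

    // Returns {total delimiters, delimiters inside quotes, consecutive delimiters}.
    static int[] delimiterStats(String input, char delim) {
        int total = 0, inside = 0, consecutive = 0;
        boolean inQuotes = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == delim) {
                total++;
                if (inQuotes) {
                    inside++;
                } else if (i > 0 && input.charAt(i - 1) == delim) {
                    consecutive++;
                }
            }
        }
        return new int[] {total, inside, consecutive};
    }

    // The "actual total": tokens reported by StringTokenizer with delimiters returned.
    static int rawCount(String input, char delim) {
        return new StringTokenizer(input, String.valueOf(delim), true).countTokens();
    }

    // Formula 1 (bReturnTokens = true): raw - 2 * inside.
    static int countWithDelims(String input, char delim) {
        int[] stats = delimiterStats(input, delim);
        return rawCount(input, delim) - 2 * stats[1];
    }

    // Formula 2 (bReturnTokens = false): raw - (total - consecutive + inside).
    static int countWithoutDelims(String input, char delim) {
        int[] stats = delimiterStats(input, delim);
        return rawCount(input, delim) - (stats[0] - stats[2] + stats[1]);
    }

    public static void main(String[] args) {
        System.out.println(rawCount(INPUT, ','));           // 19
        System.out.println(countWithDelims(INPUT, ','));    // 19 - 2 * 3 = 13
        System.out.println(countWithoutDelims(INPUT, ',')); // 19 - (11 - 4 + 3) = 9
    }
}
```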

Remember these two formulas, which are the heart of the PowerfulTokenizer. These formulas work for almost all respective cases. However, if you have more complex requirements that aren't suited for these formulas, then you must consider various examples to develop your own formula before rushing into coding.

    // check whether the delimiter is within a substring
    for (int i=1; i<aiIndex.length; i++)
    {
        iIndex = sInput.indexOf(sDelim, iIndex+1);
        if (iIndex == -1)
            break;
        // if the delimiter is within a substring, then parse up to the
        // end of the substring.
        while (sInput.substring(iIndex-iLen, iIndex).equals(sDelim))
        {
            iNextIndex = sInput.indexOf(sDelim, iIndex+1);
            if (iNextIndex == -1)
                break;
            iIndex = iNextIndex;
        }
        aiIndex[i] = iIndex;
        //System.out.println("aiIndex[" + i + "] = " + iIndex);
        if (isWithinQuotes(iIndex))
        {
            if (bIncludeDelim)
                iTokens -= 2;
            else
                iTokens -= 1;
        }
    }

The countTokens() method checks whether the input string contains double quotes. If it does, then it decrements the count and updates the index to the index of the next double quote in that string (as shown in the above code segment). If bReturnTokens is false, then it also decrements the count by the total number of nonconsecutive delimiters present in the input string.

    // return "" as token if consecutive delimiters are found.
    if ( (sPrevToken.equals(sDelim)) && (sToken.equals(sDelim)) )
    {
        sPrevToken = sToken;
        iTokenNo++;
        return "";
    }
    // check whether the token itself is equal to the delimiter
    if ( (sToken.trim().startsWith("\"")) && (sToken.length() == 1) )
    {
        // this is a special case when token itself is equal to delimiter
        String sNextToken = oTokenizer.nextToken();
        while (!sNextToken.trim().endsWith("\""))
        {
            sToken += sNextToken;
            sNextToken = oTokenizer.nextToken();
        }
        sToken += sNextToken;
        sPrevToken = sToken;
        iTokenNo++;
        return sToken.substring(1, sToken.length()-1);
    }
    // check whether there is a substring inside the string
    else if ( (sToken.trim().startsWith("\"")) 
                     && (!((sToken.trim().endsWith("\"")) 
                     && (!sToken.trim().endsWith("\"\"")))) )
    {
        if (oTokenizer.hasMoreTokens())
        {
            String sNextToken = oTokenizer.nextToken();
            // check for presence of "\"\"" 
            while (!((sNextToken.trim().endsWith("\"")) 
                         && (!sNextToken.trim().endsWith("\"\""))) )
            {
                sToken += sNextToken;
                if (!oTokenizer.hasMoreTokens())
                {
                    sNextToken = "";
                    break;
                }
                sNextToken = oTokenizer.nextToken();
            }
            sToken += sNextToken;
        }
    }

The nextToken() method gets tokens by using StringTokenizer.nextToken(), and checks for the double quote character in each token. If a token opens a quoted substring, the method keeps appending further tokens until it finds one that ends with a double quote. It also stores the token in a variable (sPrevToken; see source code) for checking consecutive delimiter appearances. If nextToken() finds consecutive tokens that are equal to the delimiter, then it returns "" (a string of length 0) as the token.
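To see the whole idea in one place, here is a simplified, self-contained sketch of the same approach. It is not the PowerfulTokenizer source: the class name and helper are hypothetical, it assumes a single-character delimiter, it detects an open quoted substring by counting double quotes instead of using the startsWith()/endsWith() checks above, and it trims white space around tokens (a choice made for this example).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class QuoteAwareTokenizerSketch {
    // True while the text holds an odd number of double quotes,
    // i.e. a quoted substring has been opened but not yet closed.
    static boolean hasOpenQuote(CharSequence text) {
        int quotes = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '"') {
                quotes++;
            }
        }
        return quotes % 2 == 1;
    }

    // Assumption: delim is a single character.
    static List<String> tokenize(String input, String delim, boolean includeDelims) {
        StringTokenizer raw = new StringTokenizer(input, delim, true);
        List<String> tokens = new ArrayList<>();
        boolean prevWasDelim = false;
        while (raw.hasMoreTokens()) {
            String piece = raw.nextToken();
            if (piece.equals(delim)) {
                if (prevWasDelim) {
                    tokens.add("");      // consecutive delimiters: empty token
                } else if (includeDelims) {
                    tokens.add(piece);   // delimiter requested as a token
                }
                prevWasDelim = true;
                continue;
            }
            // Glue raw pieces (delimiters included) until the quote closes.
            StringBuilder merged = new StringBuilder(piece);
            while (hasOpenQuote(merged) && raw.hasMoreTokens()) {
                merged.append(raw.nextToken());
            }
            String token = merged.toString().trim();
            if (token.length() >= 2 && token.startsWith("\"") && token.endsWith("\"")) {
                token = token.substring(1, token.length() - 1);  // drop the quotes
            }
            tokens.add(token);
            prevWasDelim = false;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "hello, Today,,, \"I, am \", going to,,, \"buy, a, book\"";
        System.out.println(tokenize(input, ",", true).size());   // 13
        System.out.println(tokenize(input, ",", false).size());  // 9
    }
}
```

For the Table 1 input, this sketch yields 13 tokens with delimiters included and 9 without, matching the counts given earlier.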

Similarly, the hasMoreTokens() method checks whether the number of tokens already requested is less than the total number of tokens.
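The counting check and the Enumeration delegation described earlier can be sketched as follows. This is an illustration only: the class CountingTokenizer is hypothetical, and for brevity it receives its tokens as a ready-made list instead of computing them from an input string.

```java
import java.util.Enumeration;
import java.util.List;
import java.util.NoSuchElementException;

class CountingTokenizer implements Enumeration<Object> {
    private final List<String> tokens;  // all tokens, supplied up front (a simplification)
    private int requested = 0;          // number of tokens already handed out

    CountingTokenizer(List<String> tokens) {
        this.tokens = tokens;
    }

    // Tokens still available, mirroring StringTokenizer.countTokens().
    public int countTokens() {
        return tokens.size() - requested;
    }

    // More tokens exist while fewer have been requested than the total.
    public boolean hasMoreTokens() {
        return requested < tokens.size();
    }

    public String nextToken() {
        if (!hasMoreTokens()) {
            throw new NoSuchElementException();
        }
        return tokens.get(requested++);
    }

    // Enumeration methods simply delegate to the tokenizer methods.
    @Override
    public boolean hasMoreElements() {
        return hasMoreTokens();
    }

    @Override
    public Object nextElement() {
        return nextToken();
    }
}
```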

Save development time

This article has taught you how to easily write a powerful tokenizer. Using these concepts, you can write complex tokenizers quickly, thus saving you significant development time.

Bhabani Padhi is a Java architect and programmer currently working on Web and enterprise application development using Java technology at UniteSys, Australia. Previously he worked at Baltimore Technologies, Australia on e-security product development and at Fujitsu, Australia on an EJB server development project. Bhabani's interests include distributed computing, mobile, and Web application development using Java technology.
