Java: A platform for platforms
Sun's reorg may seem promising to shareholders but it's also a scramble for position. The question now is whether Sun can, or wants to, maintain its hold on Java technology. Especially with enterprise leaders like SpringSource and RedHat investing heavily in Java's future as a platform for platforms

Also see:

Discuss: Tim Bray on 'What Sun Should Do'

Featured Whitepapers
Newsletter sign-up
View all newsletters

Sign up for our technology specific newsletters.

Enterprise Java
Email Address:

Plant your data in a ternary search tree

Create an English dictionary that checks spelling and matches words as you type

  • Digg
  • Reddit
  • SlashDot
  • Stumble
  • del.icio.us
  • Technorati
  • dzone
The ternary search tree (TST) is the champion of data structure acrobatics -- it finds all keys having a given prefix, suffix, or infix. It even finds those keys that closely match a given pattern. You can easily search the tree for partial matches -- auto matches autograph, automatic, and so on. In addition, you can implement near-match functions, which gives you the ability to suggest alternatives for misspelled words -- e.g., impliment matches implement.

A TST stores key-value pairs, where keys are strings and values are objects. TST keys are stored and retrieved in sorted order, regardless of the order in which they are inserted into the tree. In addition, TSTs use memory efficiently to store large quantities of data. Best of all, all of this magic is performed lightning fast. The tremendous flexibility of TSTs provides ample opportunity for creative programming.

This article demonstrates three creative applications of the TST:

  • An English dictionary that matches words as you type and checks spelling
  • A flexible array that can assume any size or dimension on the fly
  • A database that stores all information in the same place (regardless of which record or column the information belongs to), thereby decreasing access time and reducing storage requirements


Let's begin with a demonstration of the dictionary application.

The dictionary applet

The applet below simulates an English dictionary that uses a TST to store dictionary data. To create this applet, I started with a list of roughly 48,000 words that I found on the Internet. (I can't promise that the list is complete or that all of the words are spelled accurately.) For each word, I created a definition equal to the string "Definition of the word " + currentWord + ".", and stored these key-value pairs (word, definition) in the TST.

Figure 1. An English dictionary built on a ternary search tree


TST structure

From an external perspective, the TST behaves like a HashMap or a HashTable. That is, objects stored in the tree are indexed using string keys. Internally, the TST stores each key character in a separate node.

Figure 2. A ternary search tree
Click on thumbnail to view
full-size image. (19 KB)



Figure 2 illustrates two types of objects: objects that represent tree nodes (no color) and value objects (yellow or gray) encapsulated by the nodes. Arrows represent references -- if a bidirectional arrow connects nodes "s" and "o," then "s" has a reference to "o," and "o" has a reference to "s." Each TST node encapsulates one character from a key (its splitchar), one value object, a reference to its parent node, and a reference to each of three child nodes -- lo kid, equal kid, and hi kid. A lo kid is below and to the left of its parent, an equal kid is directly below its parent, and a hi kid is below and to the right of its parent. The highest node in the tree is the root node; its parent is null.

The TST in Figure 2 stores five key-value pairs. The keys are so, tab, to, too, and tot. The string too, for example, points to an instance of an object labeled too data in the diagram.

TST algorithms

Let's turn to the manipulation of data stored in the TST. You need a means for using a string key to retrieve data from the tree, using a string key to insert data into the tree, retrieving any data whose key matches a given prefix, and for retrieving any data whose key closely matches a given string.

The following algorithm retrieves the object indexed by a string key.

Object get(String key) {
    TSTNode currentNode = rootNode;
    int charIndex = 0;
    while(true) {
        if(currentNode == null) return null;
        if (key.charAt(charIndex) == currentNode.splitchar) {
            charIndex++;
            if(charIndex == key.length()) return currentNode.data;
            currentNode = currentNode.relatives[TSTNode.EQKID];
        } else if(key.charAt(charIndex) < currentNode.splitchar) {
            currentNode = currentNode.relatives[TSTNode.LOKID];
        } else {
            currentNode = currentNode.relatives[TSTNode.HIKID];
        }
    }
}


Within a given iteration, this algorithm focuses on a certain node within the tree -- the currentNode -- and on a certain character within the key -- though not explicitly named above, let's call it currentChar.

If currentChar comes before the currentNode's splitchar in the alphabet, the algorithm shifts its focus to the currentNode's lo kid, does not change the value of currentChar, and repeats the process. If currentChar comes after the currentNode's splitchar, the algorithm shifts to the currentNode's hi kid, does not change currentChar's value, and repeats the process. If currentChar is equal to the currentNode's splitchar, the algorithm shifts to the currentNode's equal kid, sets the value of currentChar to the next char in the key, and repeats the process. When the currentChar's value changes, the algorithm checks to see if the end of the key has been reached; if it has, it returns the data object -- the value -- stored in the currentNode.

Refer back to Figure 2 as I walk you through the process of retrieving the data indexed by tab. First, set currentChar to the first letter in the key -- tab -- and set currentNode to the root node -- the "t" node, or highest node in Figure 2. Compare currentChar with the rootNode's splitchar; since both are equal to "t," shift to the "o" node and set currentChar to the next character in tab. Since "a" comes before "o" in the alphabet, move to the lo kid -- the "a" node.

Since both currentChar and splitchar are now "a," move to the equal kid -- the "b" node -- and set currentChar to the next character in tab. Now, both currentChar and splitchar equal "b." You've reached the end of the key, so the data object stored in the "b" node is the data you're after.

Consider the problem of storing the key-value pair: tab, myObject. To store the pair, you can use the same algorithm with one slight modification -- once it finds the tab node, instead of returning the data object, the algorithm stores myObject in the node (overwriting any data already stored in that node).

Suppose you wish to store a key-value pair and the tree lacks some or all of the nodes required to represent the key. Again, you use nearly the same algorithm. Upon entering a new iteration of the while loop, if the currentNode is null, simply create the missing node and add it to the tree. The new node's position should be the position in which you expected to find it.

  • Digg
  • Reddit
  • SlashDot
  • Stumble
  • del.icio.us
  • Technorati
  • dzone
Comment
Login
Forgot your account info?
Add comment
Anonymous comments subject to approval. Register here for member benefits.
Have a JavaWorld account? Log in here. Register now for a free account.
Resources