Use search engine technology for object persistence

How a seemingly unrelated technology can help solve some typical problems

Java developers are often required to provide a simple persistence mechanism for their Java classes. Many of the problems related to that task fall into this great gray area where property-file-based persistence is simply not enough, but database-oriented persistence (and related object-relational mapping) is definite overkill.

The typical solution is to create a simple datastore using object serialization or XML binding. Attractive as it is, this solution often does not scale well enough to handle even a few thousand objects, especially when it comes to providing decent search performance.

So how would one create a datastore capable of persisting a significant quantity of JavaBeans, and provide speedy search and retrieval of those objects without resorting to a relational database or complex memory caching schemes?

In this article, I show you how to approach this problem from a different angle: by treating individual objects as documents being indexed by an Internet search engine. I demonstrate how to index individual attributes of standard JavaBeans using a popular third-party indexing/searching library and how to quickly retrieve those attributes from storage. The usual database API consisting of find, retrieve, store, and delete methods is provided.

As an example of a real-world application of this approach, I describe a Unix-like permission system (users and groups) for plain Java objects.

The API

Essentially, we are developing a library—something other programmers will hopefully use in their own projects. A good library starts with a good interface. It is imperative to design the interface (or interfaces) first, before a single line of "real" code is written. So what do we want our library to do?

Obviously, we want it to store and retrieve Java objects. Keep in mind that our objects will actually be persisted on disk to be retrieved later, long after the program that created them is gone. We need something that helps us differentiate one object instance from another. Thus, all objects that flow through our library must have a unique object ID. It would also be convenient if we could immediately identify whether an object is compatible with our library. The simple way to achieve this goal is to have all storable objects implement this interface:

 public interface StorableInterface {
    public String getObjectId();
    public void setObjectId(String id);
}
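
To make the contract concrete, here is a minimal sketch of a bean that could be passed to the library. The Person class and its properties are hypothetical and exist only for illustration; the article does not prescribe how object IDs are generated, so the caller assigns one here.

```java
// The interface from the article, repeated so the sketch compiles on its own.
interface StorableInterface {
    String getObjectId();
    void setObjectId(String id);
}

// Hypothetical JavaBean; the class name and its properties are illustrative only.
class Person implements StorableInterface {
    private String objectId;
    private String firstName;
    private String lastName;

    // A public no-argument constructor lets the bean be re-created reflectively.
    public Person() { }

    public String getObjectId() { return objectId; }
    public void setObjectId(String id) { this.objectId = id; }

    public String getFirstName() { return firstName; }
    public void setFirstName(String firstName) { this.firstName = firstName; }

    public String getLastName() { return lastName; }
    public void setLastName(String lastName) { this.lastName = lastName; }
}
```

Every property is exposed through a matching getter/setter pair, which is what allows a reflection-based library to read and restore it without knowing the class in advance.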

Now, let's think about what our actual storage service will do. We use the following interface to define it:

 public interface StorageServiceInterface {
    public void store(StorableInterface object) throws StorageException;
    public StorableInterface retrieve(String objectId) throws StorageException;
    public StorableInterface[] find(String key, String value) throws StorageException;
    public StorableInterface[] find(Class clazz) throws StorageException;
    public StorableInterface[] find(String query) throws StorageException;
    public void delete(String objectId) throws StorageException;
    public void delete(String key, String value) throws StorageException;
}

As you can see, we provide the methods to:

  • Store a Storable object to disk.
  • Retrieve an object given its objectId.
  • Return an array of storable objects matching a given property name (key) and value (so we can execute queries like "firstName" = "Joe").
  • Return all objects of a given class.
  • Return all objects matching some free-form query. (We are cheating a little bit with this one. For now, we are just talking about queries like "firstName" = "Joe" and "lastName" = "Smith". If later we implement more complex functionality, a free-form query will accommodate it as well.)
  • Delete an object by objectId.
  • Delete several objects matching a given property name and value.

We wrap (and possibly log) any exception thrown by the meat of our methods and rethrow it as a custom StorageException.
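
The wrap-and-rethrow pattern can be sketched as follows. The article does not show the StorageException class itself, so the definition below is an assumption, and riskyOperation() is a hypothetical stand-in for whatever a real service method does internally.

```java
import java.io.IOException;

// Assumed shape of the custom exception; the article does not show its source.
class StorageException extends Exception {
    StorageException(String message) {
        super(message);
    }

    StorageException(String message, Throwable cause) {
        super(message, cause);
    }
}

class WrappingDemo {
    // Hypothetical stand-in for the "meat" of a service method; it fails on purpose.
    static void riskyOperation(String objectId) throws IOException {
        throw new IOException("index unavailable for " + objectId);
    }

    // Catches the low-level failure and rethrows it as a StorageException,
    // keeping the original exception as the cause so no detail is lost.
    static void delete(String objectId) throws StorageException {
        try {
            riskyOperation(objectId);
        } catch (Exception ex) {
            throw new StorageException("Could not delete object " + objectId, ex);
        }
    }

    public static void main(String[] args) {
        try {
            delete("42");
        } catch (StorageException ex) {
            System.out.println(ex.getMessage() + " (cause: " + ex.getCause().getMessage() + ")");
        }
    }
}
```

Passing the original exception as the cause is a design choice rather than a requirement: callers see only StorageException in the method signatures, yet the full stack trace remains available for logging.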

This API is simple but powerful. Many applications can benefit from having it available. Now let's see how we can actually implement such a service using search engine technology.

Lucene to the rescue

Fundamentally, the problem at hand is the storage of large quantities of arbitrary information and its subsequent retrieval. A technology has emerged to address this need: the search engine. Perhaps the most visible aspect of today's computing is the interaction between a user and an Internet search engine such as Google. Suddenly, mountains of data are at your fingertips. The advance of search technology has certainly captured the mindshare of software developers, and numerous solutions have appeared that add Internet-like search capabilities to everyday applications.

One of the most mature, successful, and celebrated search engine toolkits available to today's Java programmer is Jakarta Lucene. According to its Website, "Lucene is a high-performance, full-featured text search engine library written entirely in Java." I have used Lucene on several occasions and am continually amazed at the speed and accuracy of the implementation. Lucene deserves articles (and books!) of its own, so I won't discuss the details of its use here. Let's just say that Lucene allows programmers to index arbitrary content and later find and retrieve references to it.

If you want to index something with Lucene, first you need to create an IndexWriter:

   IndexWriter writer = new IndexWriter("where_our_index_is_stored", new StandardAnalyzer(), true);

Then you need to create an instance of the org.apache.lucene.document.Document object, one that represents something you want to index and search for later, be it a Webpage, a text document, or anything else:

 Document doc = new Document();

You populate your Document by adding some Fields to it. A Field represents a property of a Document being indexed. When creating a Field, you must supply a name and a value:

   Field field = new Field("name", "value", true, true, true);
   doc.add(field);
   writer.addDocument(doc);

(Note: The Boolean values at the end of the constructor call tell Lucene whether to store, index, and tokenize the Field's value; the details are not relevant to our discussion.)

At this point, we are ready to search Lucene for our documents. The following simple code snippet illustrates the process:

   IndexSearcher searcher = new IndexSearcher("where_our_index_is_stored");
   // Matches any Document whose field "name" has the value "value"
   String queryString = "name:value";
   Analyzer analyzer = new StandardAnalyzer();
   Query query = (new QueryParser("some field name", analyzer)).parse(queryString);
   Hits hits = searcher.search(query);
   for(int i=0; i < hits.length(); i++) {
      Document doc = hits.doc(i);
      ...
   }

At this point, programmers can examine the returned Document objects and use them. Lucene stores the indexed documents in its own highly organized file format. If you run a Lucene search yourself, look at the contents of your hard drive afterward: in the directory called where_our_index_is_stored, you will find the actual Lucene index.

So far, you have learned how easy Lucene is to use for indexing and searching arbitrary documents. Read on to see how to use Lucene in our quest for simple yet powerful Java object persistence.

The gory details

Programmers using our library will need to have a simple way to obtain a reference to it. We already defined our service. Let's now define a factory users can use to obtain an instance of our service. A factory produces an object conforming to the service interface, and we don't need to know how the interface is implemented:

 public class StorageFactory {
   private StorageFactory() {
   // Private constructor prevents object creation
   }
   public static StorageServiceInterface getStorageService(String key) throws StorageException {
      try {
         return new StorageServiceImpl(key);
      } catch(Exception ex) {
         throw new StorageException(ex.getMessage());
      }
   }    
}

A string key specifies where our index is created on the filesystem to separate one database from another. A StorageServiceImpl object is our concrete implementation of StorageServiceInterface. Let's look at its design details.

We will use a Lucene Document to represent an instance of a Java object being stored. A Document is simply a collection of name/value pairs represented by Fields. We now need a few utility methods to manage the conversion of our Storable Java objects to and from Lucene Document entries.

Below is a method that converts a Storable object to a Document and adds it to the Lucene index. This method uses reflection to extract all JavaBean properties present in an object. Each Field represents one key/value pair, where the key is the name of a getter method (without "get") and the value is a string representation of that property.

   private void addDocument(StorableInterface o) throws Exception {
      Document doc = new Document();
      doc.add(new Field("id", o.getObjectId(), true, true, true));
      Class clazz = o.getClass();
      // Record the concrete class so the object can be re-created later
      doc.add(new Field("Class", clazz.getName(), true, true, true));
      Method[] methods = clazz.getMethods();
      for(int i=0; i < methods.length; i++)  {
         String name = methods[i].getName();
         // Only zero-argument getters are bean properties; getClass() is skipped
         if(name.startsWith("get") && methods[i].getParameterTypes().length == 0
               && !name.equals("getClass")) {
            Object val = methods[i].invoke(o, new Object[0]);
            if(val != null)
               doc.add(new Field(name.substring(3), val.toString(), true, true, true));
         }
      }
      writer.addDocument(doc);
   }
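
The reflection step can be tried in isolation, without Lucene on the classpath. The following sketch collects the same name/value pairs into a plain Map instead of a Document. BeanReader and SampleBean are hypothetical names introduced for this example, and the code uses current Java syntax rather than the older style of the listings in this article.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

class BeanReader {
    // Collects name/value pairs from public, non-static, zero-argument getters.
    // Keys are the getter names without "get"; null values are left out.
    static Map<String, String> toProperties(Object bean) {
        Map<String, String> props = new TreeMap<>();
        try {
            for (Method m : bean.getClass().getMethods()) {
                String name = m.getName();
                boolean isGetter = name.startsWith("get") && name.length() > 3
                        && m.getParameterCount() == 0
                        && m.getReturnType() != void.class
                        && !Modifier.isStatic(m.getModifiers())
                        && !name.equals("getClass");
                if (!isGetter) {
                    continue;
                }
                Object value = m.invoke(bean);
                if (value != null) {
                    props.put(name.substring(3), value.toString());
                }
            }
        } catch (ReflectiveOperationException ex) {
            throw new IllegalStateException("Cannot read " + bean.getClass().getName(), ex);
        }
        return props;
    }
}

// Hypothetical bean used only to exercise the sketch.
class SampleBean {
    public String getFirstName() { return "Joe"; }
    public int getAge() { return 30; }
    public String getNickname() { return null; }              // skipped: null value
    public String getLabel(String prefix) { return prefix; }  // skipped: takes an argument
}
```

Calling BeanReader.toProperties(new SampleBean()) yields a map with the keys Age and FirstName. Each entry corresponds to one Field that the Lucene-based method would add to the Document.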

Now we need a method capable of converting the Lucene Document object back to our Storable object:

 

   private StorableInterface documentToObject(Document doc) throws Exception {
      String id = doc.getField("id").stringValue();
      Class clazz = Class.forName(doc.getField("Class").stringValue());
      Object o = clazz.newInstance();
      Method[] methods = clazz.getMethods();
      for(int i=0; i < methods.length; i++) {
         if(methods[i].getName().startsWith("set")) {
            Class argType = methods[i].getParameterTypes()[0];
            Field field = doc.getField(methods[i].getName().substring(3));
            String argValue = null;
            if(field != null)
               argValue = field.stringValue();
            if(argValue != null) {
               Object arg = null;
               if(argType.equals(boolean.class)) {
                  if(argValue.equals("true"))
                     arg = new Boolean(true);
                  else
                     arg = new Boolean(false);
               } else {
                  arg = argType.getConstructor(new Class[] { String.class })
                               .newInstance(new String[] {argValue} );
               }
               methods[i].invoke(o, new Object[] {arg} );
            }
         }
      }
      StorableInterface result = (StorableInterface)o;
      result.setObjectId(id);
      return result;
   }
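
The reverse conversion can also be sketched without Lucene, reading string values from a Map instead of a Document. BeanWriter and Account are hypothetical names. Note one deliberate difference: the method above special-cases only boolean, whereas this sketch also handles int, long, and double, on the assumption that beans commonly carry numeric primitives, which have no String constructor to look up.

```java
import java.lang.reflect.Method;
import java.util.Map;

class BeanWriter {
    // Re-creates a bean of the given class and fills it through its setters.
    // Keys in props are setter names without "set", for example "Owner".
    static <T> T fromProperties(Class<T> type, Map<String, String> props) {
        try {
            T bean = type.getDeclaredConstructor().newInstance();
            for (Method m : type.getMethods()) {
                if (!m.getName().startsWith("set") || m.getParameterCount() != 1) {
                    continue;
                }
                String raw = props.get(m.getName().substring(3));
                if (raw != null) {
                    m.invoke(bean, convert(raw, m.getParameterTypes()[0]));
                }
            }
            return bean;
        } catch (ReflectiveOperationException ex) {
            throw new IllegalStateException("Cannot rebuild " + type.getName(), ex);
        }
    }

    // Converts text to the setter's parameter type. Primitives need explicit
    // handling; other types must offer a public constructor taking a String
    // (String and java.math.BigDecimal do, for example).
    static Object convert(String raw, Class<?> target) throws ReflectiveOperationException {
        if (target == boolean.class) return Boolean.valueOf(raw);
        if (target == int.class) return Integer.valueOf(raw);
        if (target == long.class) return Long.valueOf(raw);
        if (target == double.class) return Double.valueOf(raw);
        return target.getConstructor(String.class).newInstance(raw);
    }
}

// Hypothetical bean used only to exercise the sketch.
class Account {
    private String owner;
    private int balance;
    private boolean active;

    public Account() { }

    public String getOwner() { return owner; }
    public void setOwner(String owner) { this.owner = owner; }
    public int getBalance() { return balance; }
    public void setBalance(int balance) { this.balance = balance; }
    public boolean isActive() { return active; }
    public void setActive(boolean active) { this.active = active; }
}
```

Because everything is stored as text, the round trip works only for property types that can be rebuilt from their toString() output; collections and nested beans would need extra handling.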

And here is another utility method used for searching the Lucene index and converting the search results back to Java objects:

  private StorableInterface[] search(String queryString) throws Exception {
      searcher = new IndexSearcher(idx);
      Analyzer analyzer = new StandardAnalyzer();
      Query query = (new QueryParser("content", analyzer)).parse(queryString);
      Hits hits = searcher.search(query);
      StorableInterface[] results = new StorableInterface[hits.length()];
      for(int i=0; i < hits.length(); i++)  {
         Document doc = hits.doc(i);
         results[i] = documentToObject(doc);
      }      
      return results;
   }

Having coded all those utility methods, we are now ready to complete our StorageServiceImpl class. In the following code, I omit the necessary exception handling for brevity. Our retrieve() method looks like this:

   public StorableInterface retrieve(String objectId) throws StorageException {
      StorableInterface[] results = this.search("id:" + objectId);
      if(results.length == 0)
         return null;
      return results[0];
   }

The store() method handles object persistence:

  public void store(StorableInterface object) throws StorageException {
      delete(object.getObjectId());
      addDocument(object);
   }

Our find() methods are equally simple:

 

   public StorableInterface[] find(String query) throws StorageException {
      return search(query);
   }

   public StorableInterface[] find(String key, String value) throws StorageException {
      return find(key + ":" + value);
   }

   public StorableInterface[] find(Class clazz) throws StorageException {
      return find("Class:" + clazz.getName());
   }
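
Since every find() variant ends up as a Lucene query string, a small helper for composing such strings may be convenient. The sketch below assumes Lucene's standard query parser syntax, in which field:value clauses can be joined with AND and a multiword value is written as a quoted phrase. QueryStrings is a hypothetical helper that is not part of the article's library, and it makes no attempt to escape special characters inside values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class QueryStrings {
    // Builds one "key:value" clause; values containing spaces are quoted
    // so the query parser treats them as a single phrase.
    static String term(String key, String value) {
        boolean needsQuotes = value.indexOf(' ') >= 0;
        return key + ":" + (needsQuotes ? "\"" + value + "\"" : value);
    }

    // Joins several clauses with AND so that all criteria must match.
    static String allOf(Map<String, String> criteria) {
        StringJoiner joiner = new StringJoiner(" AND ");
        for (Map.Entry<String, String> e : criteria.entrySet()) {
            joiner.add(term(e.getKey(), e.getValue()));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> criteria = new LinkedHashMap<>();
        criteria.put("firstName", "Joe");
        criteria.put("lastName", "Van Dyke");
        // prints: firstName:Joe AND lastName:"Van Dyke"
        System.out.println(allOf(criteria));
    }
}
```

The resulting string could be passed to the free-form find(String query) method, which is how a query such as "firstName" = "Joe" and "lastName" = "Smith" would be expressed.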

And here is delete() by objectId:

  public void delete(String objectId) throws StorageException {
      reader.delete(new Term("id", objectId));
      closeReader();
   }

Or delete several objects at once by matching a property name and value:

  public void delete(String key, String value) throws StorageException {
      reader.delete(new Term(key, value));            
      closeReader();
   }

This concludes the relevant methods in our StorageServiceImpl class. You can find the complete code for download in Resources, including a JUnit test to test-drive our creation.
