The Lucene search engine: Powerful, flexible, and free

Easily add searching to your application with Lucene

Search

Search.java provides an example of how to search the index. While the com.lucene.Query package contains many classes for building sophisticated queries, here we use the built-in query parser, which handles the most common queries and is less complicated to use. We create a Searcher object, use the QueryParser to create a Query object, and call Searcher.search on the query. The search operation returns a Hits object -- a collection of Document objects, one for each document matched by the query -- and an associated relevance score for each document, sorted by score.

public class Search {
  public static void main(String[] args) throws Exception {
    // args[0]: location of the index; args[1]: the query text
    String indexPath = args[0], queryString = args[1];
    Searcher searcher = new IndexSearcher(indexPath);
    // Parse the query text; terms without a field prefix search "body"
    Query query = QueryParser.parse(queryString, "body",
                              new SimpleAnalyzer());
    Hits hits = searcher.search(query);
    // Print the stored path and relevance score of each hit
    for (int i = 0; i < hits.length(); i++) {
      System.out.println(hits.doc(i).get("path") + "; Score: " +
                         hits.score(i));
    }
  }
}

The built-in query parser supports most queries, but if it is insufficient, you can always fall back on the rich set of query-building constructs provided. The query parser can parse queries like these:

free AND "text search"
    Search for documents containing "free" and the phrase "text search"

+text search
    Search for documents containing "text" and preferentially containing "search"

giants -football
    Search for "giants" but omit documents containing "football"

author:gosling java
    Search for documents containing "gosling" in the author field and "java" in the body
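
To make the required (+) and prohibited (-) prefixes concrete, here is a minimal sketch in plain Java. It is not Lucene's query parser: the PrefixQueryDemo class and its matches method are hypothetical names invented for this illustration, the document is modeled as a bare set of terms, and phrases, AND, field prefixes, and scoring are all left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of +term / -term semantics; not Lucene code.
class PrefixQueryDemo {
  // Returns true if a document containing docTerms satisfies the query.
  static boolean matches(String query, Set<String> docTerms) {
    boolean hasRequired = false;
    boolean anyOptionalHit = false;
    for (String clause : query.trim().split("\\s+")) {
      if (clause.startsWith("+")) {
        // Required term: the document must contain it.
        hasRequired = true;
        if (!docTerms.contains(clause.substring(1))) return false;
      } else if (clause.startsWith("-")) {
        // Prohibited term: the document must not contain it.
        if (docTerms.contains(clause.substring(1))) return false;
      } else if (docTerms.contains(clause)) {
        // Optional term: in a real engine it would raise the score.
        anyOptionalHit = true;
      }
    }
    // Assumption of this sketch: without required terms, at least
    // one optional term has to be present for the document to match.
    return hasRequired || anyOptionalHit;
  }

  public static void main(String[] args) {
    Set<String> doc = new HashSet<>(
        Arrays.asList("new", "york", "giants", "football"));
    System.out.println(matches("giants -football", doc)); // false
    System.out.println(matches("+giants baseball", doc)); // true
  }
}
```

In this sketch an optional term only decides whether a document matches at all; a real search engine would also use it to rank documents that contain it above those that do not.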

Beyond basic text documents

Lucene uses three major abstractions to support building text indexes: Document, Analyzer, and Directory. The Document object represents a single document, modeled as a collection of Field objects (name-value pairs). For each document to be indexed, the application creates a Document object and adds it to the index store. The Analyzer converts the contents of each Field into a sequence of tokens.

A Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stop-word elimination, stemming, filtering, term normalization, or language translation -- has been applied. The application filters undesired tokens, like stop words or portions of the input that do not need to be indexed, through the Analyzer class. It also modifies tokens as they are encountered in the input, to perform stemming or other term normalization. Conveniently, Lucene comes with a set of standard Analyzer objects for handling common transformations like word identification and stop-word elimination, so indexing simple text documents requires no additional work. If these aren't enough, the developer can provide more sophisticated analyzers.
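
The kind of work an analyzer does can be shown without any Lucene classes. The following is a minimal sketch in plain Java, assuming a hypothetical SimpleTokenPipeline class and a stop-word list made up for this example. It is not Lucene's Analyzer API; it only illustrates word identification, term normalization (lowercasing), and stop-word elimination.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is not Lucene's Analyzer API.
class SimpleTokenPipeline {
  // Assumed stop-word list, chosen for this example.
  private static final Set<String> STOP_WORDS =
      new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

  // Splits on runs of non-letters, lowercases, and drops stop words.
  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    for (String word : text.split("[^\\p{L}]+")) {
      String term = word.toLowerCase(Locale.ROOT);
      if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
        tokens.add(term);
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    // Prints [lucene, search, engine, powerful, flexible, free]
    System.out.println(
        tokenize("The Lucene search engine: Powerful, flexible, and free"));
  }
}
```

Stemming or language-specific handling would slot in as further steps between lowercasing and the stop-word check.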

The application provides the document data in the form of a String or InputStream, which the Analyzer converts to a stream of tokens. Because of this, Lucene can index data from any data source, not just files. If the documents are stored in files, use FileInputStream to retrieve them, as illustrated in IndexFiles.java. If they are stored in an Oracle database, provide an InputStream class to retrieve them. If a document is not a plain text file but an HTML or XML file, for example, you can extract its content by eliminating markup such as HTML tags, document headers, or formatting instructions. This can be done with a FilterInputStream that is connected to the InputStream retrieving the document and converts the document stream into a stream containing only the document's content text. So, if we wanted to index a collection of XML documents stored in an Oracle database, the resulting code would be very similar to IndexFiles.java, but it would use an application-provided InputStream class to retrieve the documents from the database (instead of FileInputStream), as well as an application-provided FilterInputStream to parse the XML and extract the desired content.
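
As a rough sketch of the FilterInputStream idea, the class below passes through only the bytes that fall outside <...> tags. TagStrippingInputStream and its strip helper are hypothetical names, not part of Lucene or the JDK, and the approach is deliberately naive: it ignores comments, CDATA sections, character entities, and a ">" inside an attribute value, so real HTML or XML would call for a proper parser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a naive filter that drops everything inside <...>.
class TagStrippingInputStream extends FilterInputStream {
  TagStrippingInputStream(InputStream in) {
    super(in);
  }

  // Returns the next byte outside a tag, or -1 at end of stream.
  @Override
  public int read() throws IOException {
    boolean inTag = false;
    int c;
    while ((c = in.read()) != -1) {
      if (c == '<') {
        inTag = true;
      } else if (c == '>' && inTag) {
        inTag = false;
      } else if (!inTag) {
        return c;
      }
    }
    return -1;
  }

  // Bulk reads must go through the filter as well.
  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    if (len == 0) return 0;
    int n = 0;
    while (n < len) {
      int c = read();
      if (c == -1) break;
      buf[off + n++] = (byte) c;
    }
    return n == 0 ? -1 : n;
  }

  // Convenience wrapper for the demo: strips tags from an in-memory string.
  static String strip(String markup) {
    byte[] raw = markup.getBytes(StandardCharsets.UTF_8);
    try (InputStream text =
        new TagStrippingInputStream(new ByteArrayInputStream(raw))) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      int c;
      while ((c = text.read()) != -1) {
        out.write(c);
      }
      return new String(out.toByteArray(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Prints: Lucene is free
    System.out.println(strip("<p>Lucene is <b>free</b></p>"));
  }
}
```

Because the filter wraps any InputStream, the same class would work whether the underlying bytes come from a file, a database column, or a network connection.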

Just as Lucene allows the application to control the handling of raw document data through the Analyzer and InputStream classes, it also defines an abstract class for reading and writing the index store (Directory). Lucene also provides concrete implementations of Directory for storing indexes in RAM (RAMDirectory) or in files (FSDirectory). If, for instance, you want to store the index data in a document control system or database -- or compress or encrypt the index data -- you can simply provide your own Directory class. Most users will use the provided implementations, usually the file-based implementation. But allowing the application to handle index storage enhances the package's flexibility.
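
The value of a pluggable storage layer is easiest to see in miniature. The sketch below is plain Java and does not use Lucene's Directory API; IndexStore, RamStore, and EncodingStore are hypothetical names invented here. It shows an abstract store, an in-memory implementation, and a wrapper that transforms data on its way in and out. Base64 encoding stands in for the compression or encryption a real application might apply.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of pluggable index storage; not Lucene's Directory API.
class IndexStoreDemo {
  // The abstraction the indexing code would be written against.
  abstract static class IndexStore {
    abstract void write(String name, byte[] data);
    abstract byte[] read(String name); // null if the name is unknown
  }

  // Keeps everything in memory.
  static class RamStore extends IndexStore {
    private final Map<String, byte[]> files = new HashMap<>();

    @Override
    void write(String name, byte[] data) {
      files.put(name, data.clone());
    }

    @Override
    byte[] read(String name) {
      byte[] data = files.get(name);
      return data == null ? null : data.clone();
    }
  }

  // Wraps another store and applies a reversible transform.
  // Base64 is only a stand-in for compression or encryption.
  static class EncodingStore extends IndexStore {
    private final IndexStore delegate;

    EncodingStore(IndexStore delegate) {
      this.delegate = delegate;
    }

    @Override
    void write(String name, byte[] data) {
      delegate.write(name, Base64.getEncoder().encode(data));
    }

    @Override
    byte[] read(String name) {
      byte[] stored = delegate.read(name);
      return stored == null ? null : Base64.getDecoder().decode(stored);
    }
  }

  // Client code sees only IndexStore, whatever sits behind it.
  static String roundTrip(IndexStore store, String text) {
    store.write("segment", text.getBytes(StandardCharsets.UTF_8));
    return new String(store.read("segment"), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    System.out.println(roundTrip(new RamStore(), "lucene"));
    System.out.println(roundTrip(new EncodingStore(new RamStore()), "lucene"));
  }
}
```

Both calls print the same text, which is the point: the code that writes and reads index data does not change when the storage behind it does.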

A case study

When developing Eyebrowse, we examined -- and discarded -- a number of widely used open source search tools. At first glance, Eyebrowse's search and retrieval features seemed quite straightforward, but we were surprised to find that few of the tools we examined were flexible enough for our purposes. Most search engines are designed to index files or Web pages only -- we didn't want to index either. Message metainformation was stored in an SQL database; message bodies and attachments were stored in mailbox files that contained many individual messages. Using a file-oriented search tool would have necessitated an intermediate step in which the mailbox files were exploded into thousands of small files just for indexing purposes, which seemed silly and inefficient.

Because Lucene is a search toolkit, not a monolithic search program, it was much easier to tightly integrate it into our application and control its behavior. Because of Lucene's flexible document model, we were able to construct and index virtual documents, which were a combination of the metadata drawn from the database and the message body drawn from the mailbox file, without having to create any intermediate files. Because it supports efficient incremental indexing, we could add new messages to the index base as they arrived. The built-in query parser supported every query feature we needed, and the search performance was perfectly acceptable. Ultimately, we added the required search features in much less time than we had budgeted, but more importantly, we were very satisfied with the quality of the resulting integration.

What can we learn?

Lucene is a fine example of good object-oriented software design and architecture. A carefully crafted division of labor between the application and the search engine lies beneath its design. This transforms indexing from a monolithic process into a collection of cooperating objects, each performing a single function and operating in a single domain. For example, when indexing a file, the FileInputStream class retrieves the document data; the appropriate Analyzer transforms it into a stream of tokens; the IndexWriter class indexes it; and the FSDirectory class stores the index on disk for later retrieval. Each of these classes performs one function, and each can be easily replaced without affecting the others.

Lucene's factoring leaves the application in charge of functions that it already knows about -- selecting and retrieving documents, storing the index data -- and leaves the search engine to do what it does best. However, good factoring between the component and application domains is only part of what makes a software toolkit easy to use. A useful set of default implementations for the application-domain objects is equally important. Instead of just dumping the application-domain problems in the developer's lap, Lucene provides a set of tools for solving the most common application-domain problems. This supports the design principle of commensurate effort -- the user does not have to learn much about the architecture to implement its basic functionality, but can access more advanced functionality with additional effort. The result: developers can often integrate Lucene's searching capabilities with their projects in just a few hours.

We can all learn something from Lucene's design. While many programs make excellent use of abstraction, not many are able to craft abstractions that a new user can easily and quickly grasp, and few provide all the pieces that allow users to get up and running so quickly. When a software tool demands that its users completely understand everything about it before they can benefit from it, it alienates would-be users. Shouldn't we all make our software inviting, rather than intimidating, to our users?

Conclusion

Lucene is the most flexible and convenient open source search toolkit I've ever used. Cutting describes his primary goal for Lucene as "simplicity without loss of power or performance," and this shines through clearly in the result. Lucene's design seems so simple, you might suspect it is just the obvious way to design a search toolkit. We should all be so lucky as to craft such obvious designs for our own software.

Brian Goetz is a professional software developer with over 15 years of experience. He is a principal consultant at Quiotix Corporation, a software development and consulting firm in Los Altos, Calif.
