The Lucene search engine: Powerful, flexible, and free

Easily add searching to your application with Lucene

Don't let the low version number -- 0.04 as of August 2000 -- fool you. The Lucene search engine is a robust, powerful, and flexible search toolkit, ready to tackle many common search problems. And since it's now available under the more flexible LGPL open source license, the price (free!) is right too.

Doug Cutting, an experienced developer of text-search and retrieval tools, created Lucene. Cutting is the primary author of the V-Twin search engine (part of Apple's Copland operating system effort) and is currently a senior architect at Excite. He designed Lucene to make it easy to add indexing and search capability to a broad range of applications, including:

  • Searchable email: An email application could let users search archived messages and add new messages to the index as they arrive.
  • Online documentation search: A documentation reader -- CD-based, Web-based, or embedded within the application -- could let users search online documentation or archived publications.
  • Searchable Webpages: A Web browser or proxy server could build a personal search engine to index every Webpage a user has visited, allowing users to easily revisit pages.
  • Website search: A CGI program could let users search your Website.
  • Content search: An application could let the user search saved documents for specific content; this could be integrated into the Open Document dialog.
  • Version control and content management: A document management system could index documents, or document versions, so they can be easily retrieved.
  • News and wire service feeds: A news server or relay could index articles as they arrive.

Of course, many search engines could perform most of those functions, but few open source search tools offer Lucene's ease of use, rapid implementation, and flexibility.

I first used Lucene when developing Eyebrowse, an open source Java-based tool for cataloguing and browsing mailing lists. (See Resources for a link.) A core requirement for Eyebrowse was flexible message search and retrieval capability. It demanded an indexing and search component that would efficiently update the index base as new messages arrived, allow multiple users to search and update the index base concurrently, and scale to archives containing millions of messages.

Every other open source search engine I evaluated, including Swish-E, Glimpse, iSearch, and libibex, was poorly suited to Eyebrowse's requirements in some way. This would have made integration problematic and/or time-consuming. With Lucene, I added indexing and searching to Eyebrowse in little more than half a day, from initial download to fully working code! This was less than one-tenth of the development time I had budgeted, and yielded a more tightly integrated and feature-rich result than any other search tool I considered.

How search engines work

Creating and maintaining an inverted index is the central problem when building an efficient keyword search engine. To index a document, you must first scan it to produce a list of postings. Postings describe occurrences of a word in a document; they generally include the word, a document ID, and possibly the location(s) or frequency of the word within the document.

If you think of the postings as tuples of the form <word, document-id>, a set of documents will yield a list of postings sorted by document ID. But in order to efficiently find documents that contain specific words, you should instead sort the postings by word (or by both word and document, which will make multiword searches faster). In this sense, building a search index is basically a sorting problem. The search index is a list of postings sorted by word.

An innovative implementation

Most search engines use B-trees to maintain the index; they are relatively stable with respect to insertion and have well-behaved I/O characteristics (lookups and insertions are O(log n) operations). Lucene takes a slightly different approach: rather than maintaining a single index, it builds multiple index segments and merges them periodically. For each new document indexed, Lucene creates a new index segment, but it quickly merges small segments with larger ones -- this keeps the total number of segments small so searches remain fast. To optimize the index for fast searching, Lucene can merge all the segments into one, which is useful for infrequently updated indexes. To prevent conflicts (or locking overhead) between index readers and writers, Lucene never modifies segments in place, it only creates new ones. When merging segments, Lucene writes a new segment and deletes the old ones -- after any active readers have closed it. This approach scales well, offers the developer a high degree of flexibility in trading off indexing speed for searching speed, and has desirable I/O characteristics for both merging and searching.

A Lucene index segment consists of several files:

  • A dictionary index containing one entry for each 100 entries in the dictionary
  • A dictionary containing one entry for each unique word
  • A postings file containing an entry for each posting

Since Lucene never updates segments in place, they can be stored in flat files instead of complicated B-trees. For quick retrieval, the dictionary index contains offsets into the dictionary file, and the dictionary holds offsets into the postings file. Lucene also implements a variety of tricks to compress the dictionary and posting files -- thereby reducing disk I/O -- without incurring substantial CPU overhead.

Evaluating search engines

Other widely used open source search engines include Swish-E, Glimpse, libibex, freeWAIS, and iSearch. Like any software package, each is optimized for use in particular situations; it is often difficult to deploy these tools outside of their intended domains. Consider the following features when evaluating a search engine:

  • Incremental versus batch indexing: Some search engines only support batch indexing; once they create an index for a set of documents, adding new documents becomes difficult without reindexing all the documents. Incremental indexing allows easy adding of documents to an existing index. For some applications, like those that handle live data feeds, incremental indexing is critical. Lucene supports both types of indexing.
  • Data sources: Many search engines can only index files or Webpages. This handicaps applications where indexed data comes from a database, or where multiple virtual documents exist in a single file, such as a ZIP archive. Lucene allows developers to deliver the document to the indexer through a String or an InputStream, permitting the data source to be abstracted from the data. However, with this approach, the developer must supply the appropriate readers for the data.
  • Indexing control: Some search engines can automatically crawl through a directory tree or a Website to find documents to index. While this is convenient if your data is already stored in this manner, crawler-based indexers often provide limited flexibility for applications that require fine-grained control over the indexed documents. Since Lucene operates primarily in incremental mode, it lets the application find and retrieve documents.
  • File formats: Some search engines can only index text or HTML documents; others support a filter mechanism, which offers a simple alternative to indexing word processing documents, SGML documents, and other file formats. Lucene supports such a mechanism.
  • Content tagging: Some search engines treat a document as a single stream of tokens; others allow the specification of multiple data fields within a document, such as "subject," "abstract," "author," and "body." This permits semantically richer queries like "author contains Hamilton AND body contains Constitution." Lucene supports content tagging by treating documents as collections of fields, and supports queries that specify which field(s) to search.
  • Stop-word processing: Common words, such as "a," "and," and "the," add little value to a search index. But since these words are so common, cataloging them will contribute considerably to the indexing time and index size. Most search engines will not index certain words, called stop words. Some use a list of stop words, while others select stop words statistically. Lucene handles stop words with the more general Analyzer mechanism, to be described later, and provides the StopAnalyzer class, which eliminates stop words from the input stream.
  • Stemming: Often, a user desires a query for one word to match other similar words. For example, a query for "jump" should probably also match the words "jumped," "jumper," or "jumps." Reducing a word to its root form is called stemming. Lucene does not yet implement stemming, but you could easily add a stemmer through a more sophisticated Analyzer class.
  • Query features: Search engines support a variety of query features. Some support full Boolean queries; others support only and queries. Some return a "relevance" score with each hit. Some can handle adjacency or proximity queries -- "search followed by engine" or "Knicks near Celtics" -- others can only search on single keywords. Some can search multiple indexes at once and merge the results to give a meaningful relevance score. Lucene supports a wide range of query features, including all of those listed above. However, Lucene does not support the valuable Soundex, or "sounds like," query.
  • Concurrency: Can multiple users search an index at the same time? Can a user search an index while another updates it? Lucene allows users to search an index transactionally, even if another user is simultaneously updating the index.
  • Non-English support: Many search engines implicitly assume that English is the target language; this is evident in areas such as stop-word lists, stemming algorithms, and the use of proximity to match phrase queries. As Lucene preprocesses the input stream through the Analyzer class provided by the developer, it is possible to perform language-specific filtering.

Though by no means exhaustive, the above list offers a starting point for evaluating a search engine for a particular project. Some search tools are poorly suited to certain tasks -- understanding your application's requirements can help you choose the right tool for the job.

Using Lucene

I will illustrate how to use Lucene to create, populate, and search an index. For clarity, import statements and exception handling have been omitted from the sample programs. In these illustrations, I have stored the search index in the filesystem (you can store indexes anywhere, e.g., in memory or in a database). The files being indexed are simple text files. With Lucene, you can also easily index other document formats and documents not stored in files.

Create an index

The simple program CreateIndex.java creates an empty index by generating an IndexWriter object and instructing it to build an empty index. In this example, the name of the directory that will store the index is specified on the command line.

public class CreateIndex { 
  // usage: CreateIndex index-directory
  public static void main(String[] args) throws Exception {
    String indexPath = args[0];
    IndexWriter writer;
    // An index is created by opening an IndexWriter with the 
    // create argument set to true.  
    writer = new IndexWriter(indexPath, null, true);
    writer.close();
  }
}

Index text documents

IndexFile.java shows how to add documents -- the files named on the command line -- to an index. For each file, IndexFiles creates a Document object, then calls IndexWriter.addDocument to add it to the index. From Lucene's point of view, a Document is a collection of fields that are name-value pairs. A Field can obtain its value from a String, for short fields, or an InputStream, for long fields. Using fields allows you to partition a document into separately searchable and indexable sections, and to associate metadata -- such as name, author, or modification date -- with a document. For example, when storing mail messages, you could put a message's subject, author, date, and body in separate fields, then build semantically richer queries like "subject contains Java AND author contains Gosling." In the code below, we store two fields in each Document: path, to identify the original file path so it can be retrieved later, and body, for the file's contents.

public class IndexFiles { 
  // usage: IndexFiles index-path file . . . 
  public static void main(String[] args) throws Exception {
    String indexPath = args[0];
    IndexWriter writer;
    writer = new IndexWriter(indexPath, new SimpleAnalyzer(), false);
    for (int i=1; i<args.length; i++) {
      System.out.println("Indexing file " + args[i]);
      InputStream is = new FileInputStream(args[i]);
      // We create a Document with two Fields, one which contains
      // the file path, and one the file's contents.  
      Document doc = new Document();
      doc.add(Field.UnIndexed("path", args[i]));
      doc.add(Field.Text("body", (Reader) new InputStreamReader(is)));
      writer.addDocument(doc);
      is.close();
    };
    writer.close();
  }
}
1 2 Page
Join the discussion
Be the first to comment on this article. Our Commenting Policies
See more