Introduction to Hibernate Search

Bring the power of Lucene to your database-backed applications

Many Web applications exist to provide access to copious amounts of data stored in a relational database, but what's the easiest way to enable users to search through that data and find what they need? In this article, Dr. Xinyu Liu introduces Hibernate Search, which integrates the sophisticated search capabilities of Lucene with the familiar object-relational mapping framework of Hibernate.

Apache Lucene is a high-performance, extensible full-text search-engine library written in Java. At first, it may not be obvious why you'd need such a thing -- after all, your data is nicely filed away in a decent relational database. While an RDBMS can do a great job of providing transactional CRUD operations on data stored in a relational model, search functions defined in SQL are not always capable of meeting both the functional and non-functional requirements of your projects. There are a number of query types that RDBMSs in general do not support without vendor extensions:

  • Fuzzy queries, in which "fuzzy" and "wuzzy" are considered matches
  • Word stemming queries, which consider "take," "took," and "taken" to be identical
  • Sound-like queries, which consider "cat" and "kat" to be identical
  • Synonym queries, which consider "jump," "hop," and "leap" to be identical
  • Queries on binary BLOB data types, such as PDF documents, Microsoft Word or Excel documents, or HTML and XML documents

More disappointingly, SQL search results are not ranked by match-relevance scores. The SQL standard is simply not intended for full-text querying.

Lucene's search capabilities, on the other hand, go far beyond what SQL offers. Lucene handles all the queries just mentioned, and more; it also allows you to find text documents similar to other documents through its advanced term-vector query. For instance, you could search the content of a number of books to find one with content similar to that of Hibernate in Action. The analyzer architecture in Lucene leverages Java's built-in internationalization and localization capabilities, which makes full-text queries available in many languages worldwide. Lucene delivers outstanding performance through some innovative techniques, such as an inverted index. The Apache Lucene Web site features a list of performance benchmarks that demonstrate how well Lucene performs and scales.
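To give a flavor of the Lucene query syntax, here is a minimal, self-contained sketch -- not part of this article's sample application, and assuming the Lucene 2.x API that was current when this article was written -- that runs a fuzzy query against an existing index and prints relevance-ranked results:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class FuzzySearchExample {
   public static void main(String[] args) throws Exception {
      // Open an existing index directory (the path here is purely illustrative)
      IndexSearcher searcher = new IndexSearcher("./lucene/indexes/demo");
      // "wuzzy~" is Lucene query syntax for a fuzzy match on the summary field
      Query query = new QueryParser("summary", new StandardAnalyzer()).parse("wuzzy~");
      Hits hits = searcher.search(query); // results come back ranked by relevance
      for (int i = 0; i < hits.length(); i++) {
         System.out.println(hits.score(i) + " : " + hits.doc(i).get("summary"));
      }
      searcher.close();
   }
}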

Note that some database vendors do implement full-text search functions in their products as SQL extensions. These proprietary functions are fairly easy to use, but they compromise the portability of your applications at the database level. Besides, their feature sets are no match for what Lucene offers, and under extreme conditions Lucene's performance is superior.

Hibernate and the Java Persistence API

Hibernate is a high-performance, mature object-relational mapping (ORM) library. As a non-intrusive ORM solution, Hibernate provides object query APIs for plain old Java object (POJO) persistence model classes and automatic data bindings between the object and relational representations of persistence data. In essence, it lets you focus on domain model-oriented programming.

The Java Persistence API (JPA) is the standard object-relational mapping and persistence management interface defined as part of Java EE 5, the latest version of the enterprise Java specification. Largely inspired by Hibernate, JPA emerged to replace the controversial entity bean programming model. JPA has an easy-to-use POJO programming style and object query interface (JPAQL); one improvement of JPA over entity beans is that you do not need an EJB 3 container to run applications that use the API, because it supports both standalone (Java SE) and container-managed (Java EE) running modes. Popular JPA providers include Apache OpenJPA and Oracle TopLink, as well as Hibernate itself, which implements the JPA specification through the add-on Hibernate Annotations and Hibernate EntityManager modules. In this article, I'll use JPA/Hibernate as shorthand for the two working together.

This article presents the technology of Hibernate Search to you through a sample application programmed in a POJO style with the latest Spring 2.5 annotations. Before you begin, you should have basic knowledge of Spring, Hibernate/JPA, and Lucene.

Hibernate Search

Several key factors allow Hibernate and Lucene to align well by nature. They both provide CRUD access to the underlying data storage. They both define an elementary operational data unit -- the Entity (persistence model class) in Hibernate/JPA, and the Document in Lucene. And the same programming concepts coexist in Hibernate/JPA and Lucene -- deferred commit, filter, query expression, and query API are examples. To enable batch updates for better performance, Hibernate/JPA has a flush() method defined in its persistence context to synchronize cached data changes with the back-end database. The close() method of the Lucene IndexWriter class essentially works the same way -- it defers data synchronization between memory and storage to reduce I/O and minimize network latency.

Despite these similarities, the differences between Hibernate/JPA and Lucene are also obvious. Hibernate/JPA promotes domain model-oriented programming by encouraging developers to work out a rich domain-object graph that naturally represents the complexities of real-world business through object association, inheritance, polymorphism, composition, and collections. Lucene, in contrast, deals only with a single, built-in data model -- the Document class, which is too simple to describe those complex relationships.

The Hibernate team has recently introduced Hibernate Search as a higher level, universal API that encapsulates the virtues of both Hibernate/JPA and Lucene. Hibernate Search is an independent offering from the Hibernate team, and you must download it from the Hibernate Web site separately from the main Hibernate package. By mapping the application-specific persistence model classes to the Lucene Document class, Hibernate Search brings the power of Lucene full-text search to the persistence domain objects managed by Hibernate/JPA. The same persistence context (Hibernate Session/JPA EntityManager) is used for both domain-object persistence and Lucene indexing. Hibernate Search encloses Lucene indexing processes into the transaction contexts of Hibernate/JPA, and transparently manages the lifecycle of Lucene Document objects through the event handler mechanism of Hibernate Core. When auto-indexing is enabled, indexing processes become completely transparent to developers, and development around Hibernate Search thus becomes very easy. Note that developers are still required to learn the syntax of the Lucene query expression and query API in order to perform full-text searches against the persistence domain objects.

The easiest way to understand how Hibernate Search works in practice is through a sample application. In the following sections, you'll see an application designed on top of the latest Spring 2.5 application framework with annotation-driven configurations.

The sample application

Imagine that a startup IT consulting company has asked you to design and implement an application that maintains software developers' resumes in Microsoft Word format, and provides Web access to keyword search on those resume files. You can download a sample Web project partially implementing the requirements from the Resources section below. In the rest of the article, you'll walk through it and see how it works.

The sample application uses Maven 2 as its build tool, and MySQL as the back-end database. You need to download and install Maven 2 and MySQL to be able to build and test the application. A Maven 2 POM file located under the project root folder declares all the external dependencies of the application, including Hibernate, Hibernate EntityManager, Hibernate Search, Hibernate Annotations, JPA interfaces, Lucene, Spring, and Apache POI. If you are using the Eclipse IDE, Maven 2 provides a nice plugin (mvn eclipse:eclipse) that creates an Eclipse project file, so that the unzipped folder structure can be imported into the IDE as an already configured Web project. The same plugin also triggers the download of the external JAR dependencies into a Maven local repository, which is referenced by the project. Some of the dependencies have to be installed manually into the local repository, as they are not yet available at the POM-specified remote repository.

POJO persistence classes

I prefer to begin application programming with a persistence domain model. Two POJO persistence model classes, Resume.java and User.java, are defined in the sample application, as shown in Listing 1.

Listing 1. Resume.java and User.java

package demo.hibernatesearch.model;

@Entity
@Table(name = "resume")
@FilterDefs( { @FilterDef(name = "rangeFilter", parameters = { 
   @ParamDef(name = "beginDate", type = "date"), 
   @ParamDef(name = "endDate", type = "date") }) })
@Filters( { @Filter(name = "rangeFilter", condition = 
   ":beginDate <= lastUpdated and :endDate >= lastUpdated") })
@Indexed
@Analyzer(impl = org.apache.lucene.analysis.standard.StandardAnalyzer.class)
@FullTextFilterDefs( { @FullTextFilterDef(name = "rangeFilter", 
   impl = demo.hibernatesearch.dao.hibernate.utils.RangeFilter.class, cache = true) })
public class Resume implements Serializable {

   @Id
   @GeneratedValue
   @DocumentId
   private Long id;

   @ManyToOne
   @IndexedEmbedded
   private User applicant;

   @org.hibernate.annotations.Index(name = "summaryIndex")
   @Field(index = Index.TOKENIZED, store = Store.YES)
   private String summary;

   @Lob
   @Field(name = "resume", index = Index.TOKENIZED, store = Store.NO)
   @FieldBridge(impl = demo.hibernatesearch.dao.hibernate.utils.WordDocHandlerBridge.class)
   private byte[] content; // MS Word Doc

   @Temporal(value = TemporalType.DATE)
   @Field(index = Index.UN_TOKENIZED, store = Store.YES)
   @DateBridge(resolution = Resolution.DAY)
   @Boost(2.0f)
   private Date lastUpdated; 

   //...
}

@Entity
@Table(name = "user")
@Indexed
@Analyzer(impl = org.apache.lucene.analysis.standard.StandardAnalyzer.class)
public class User implements Serializable {

   @Id
   @GeneratedValue
   @DocumentId
   private Long id;

   @Field(index = Index.UN_TOKENIZED, store = Store.YES)
   private String firstName;

   @Field(index = Index.UN_TOKENIZED, store = Store.YES)
   private String lastName;

   @Field(index = Index.UN_TOKENIZED, store = Store.YES)
   private String middleName;

   @NaturalId
   @Field(index = Index.UN_TOKENIZED, store = Store.YES)
   private String emailAddress;

   @OneToMany(cascade = { CascadeType.ALL }, mappedBy = "applicant")
   @ContainedIn
   private Set<Resume> resumes;

   //...
}

The two persistence classes are annotated with the JPA @Entity tag, which declares that their nontransient properties will be persisted to a relational database. A Maven 2 Hibernate plugin goal (mvn hibernate3:hbm2ddl) generates a SQL script from the annotated entity Java source files, and creates the corresponding database schema in the MySQL database.

Apart from the JPA annotations, the two entity classes are also marked with the new Hibernate Search annotations. Any JPA entity class marked with the @Indexed annotation is enabled for Lucene indexing, and is mapped to a unique Lucene index. Hibernate Search implicitly matches an entity instance to a Lucene Document object. More specifically, only the bean properties annotated with @Field are indexed as Fields in the Lucene Document objects. As discussed earlier, Lucene Document objects are the unit of data for indexing and search, just as JPA entities are for database persistence. Note that you don't have to index all of your JPA entity classes with Lucene, only those for which full-text search is required.

Even though a Lucene Document by itself doesn't enforce a unique key field, Hibernate Search requires you to specify a document ID field through a @DocumentId annotation. Most of the time, this ID is also a database primary key. Hibernate Search uses that field internally to match a Lucene Document object to an entity instance.

Lucene indexing can't deal with any data type other than text strings; thus, all bean properties to be indexed must be converted to their string representations. A FieldBridge in Hibernate Search works like a data type converter in JSF or Spring: it transforms an arbitrary data type into text. The built-in FieldBridges take care of all the built-in Java data types. However, the byte array property in the Resume entity holding Microsoft Word files requires special care. I developed a custom FieldBridge, WordDocHandlerBridge, as part of the sample application to extract plain text from Word documents; it uses Apache POI, a Java API for accessing Microsoft Office files. The same concept exists in Lucene as DocumentHandler. A wide variety of DocumentHandler implementations are available online for common binary data types, including Word, Excel, PDF, HTML, and XML. It is nearly effortless to build custom FieldBridges out of the existing Lucene DocumentHandlers.
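To illustrate the FieldBridge contract, here is a minimal sketch of a simple custom bridge -- a hypothetical class for illustration only, not part of the sample application -- that zero-pads integers so that Lucene's lexicographic range queries order them correctly:

package demo.hibernatesearch.dao.hibernate.utils;

import org.hibernate.search.bridge.StringBridge;

public class PaddedIntegerBridge implements StringBridge {

   public String objectToString(Object object) {
      if (object == null) {
         return null;
      }
      // Pad to a fixed width so that 2 indexes as "0000000002" and sorts before "10"
      return String.format("%010d", (Integer) object);
   }
}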

Two attributes of the @Field annotation -- index=Index.TOKENIZED and store=Store.NO -- characterize Lucene indexing features, and so does the @Boost annotation. The first attribute ensures that the text will be tokenized by a Lucene Analyzer. Tokenizing is a critical step in indexing that preprocesses the source data before actual indexing takes place. Preprocessing involves removing stop words, reducing words to their stems, and so on. I chose the Lucene StandardAnalyzer for the sample application. Bear in mind that you should not tokenize any primary key or natural key field.

The attribute store=Store.NO indicates that the actual data (that is, the Word documents) will not be stored in the index; therefore, that bean property will be returned through a separate SQL query as part of the Hibernate full-text search that returns the Resume entity objects. In contrast, if I had set store=Store.YES, the original Word documents would have been stored in and retrieved from the index. The boost factor identified by @Boost gives more or less weight to the annotated property in Lucene index searches, and affects the relevance scores of the search results.

You may notice a bidirectional many-to-one relationship between the Resume and User entities. The data model in Lucene's Document class doesn't handle relationships through foreign key references, as a database does. In order to enable full-text search across these relationships, you must denormalize the nested objects as dot-navigated fields in the Lucene Document objects. By marking the applicant property of the Resume entity with an @IndexedEmbedded annotation, I instruct Hibernate Search to index the applicant property as a list of nested fields inside the Resume index. In the section below entitled "Luke: Lucene index toolbox," you'll see what the Resume index looks like.

The @IndexedEmbedded annotation is applicable to both @*ToOne entity relationships and embedded objects (components, in Hibernate terminology). @*ToMany relationships and collections of embedded objects are not supported. As a result, you can't run a full-text search on an entity property of a collection type; instead, the search must be initiated from the singular end of the relationship. Obviously, the Resume index must now be updated whenever a User entity changes. I place a @ContainedIn annotation on the Resume collection of the User entity to make Hibernate Search aware of that dependency. Note that the @ContainedIn annotation is only required on entity relationships, not on embeddable (value type) objects.

A Hibernate search DAO class

All the database persistence, full-text indexing, and query logic is placed in a single POJO-style DAO class: ResumeDaoHibernate.java, shown in Listing 2.

Listing 2. ResumeDaoHibernate.java

package demo.hibernatesearch.dao.hibernate;

@Repository("resumeDao")
public class ResumeDaoHibernate implements ResumeDao {

   protected final Log log = LogFactory.getLog(getClass());

   JpaTemplate jpaTemplate;

   @Autowired
   public ResumeDaoHibernate(EntityManagerFactory entityManagerFactory) {
      this.jpaTemplate = new JpaTemplate(entityManagerFactory);
   }

   //...
}

Two new annotations -- @Repository and @Autowired -- are Spring 2.5 features that mark the bean as a DAO and enable autowiring of its dependencies, respectively. The DAO class is a POJO in that it doesn't extend JpaDaoSupport, as it might have in Spring 1.x- and Spring 2.0-based development, and in that the use of Spring's JpaTemplate class is not mandatory. You can inject an EntityManager instance (instead of the JpaTemplate class) via the JPA @PersistenceContext annotation.
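For comparison, here is a minimal sketch of that plain-EntityManager variant under the same Spring 2.5/JPA setup, with the imports omitted as in the other listings:

@Repository("resumeDao")
public class ResumeDaoHibernate implements ResumeDao {

   // Spring injects the transaction-scoped EntityManager here
   @PersistenceContext
   private EntityManager entityManager;

   public Resume getResume(Long id) {
      return entityManager.find(Resume.class, id);
   }

   //...
}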

Inside the DAO class, I first define a set of CRUD methods, as shown in Listing 3. These are the simplest CRUD methods possible in Spring-assisted JPA. There is no explicit reference to Hibernate Search or Lucene whatsoever in these methods.

Listing 3. CRUD methods

   public void saveApplicant(User applicant) {
      getJpaTemplate().persist(applicant);
   }

   public void updateApplicant(User applicant) {
      // merge() (rather than refresh()) pushes the changed state to the database
      getJpaTemplate().merge(applicant);
   }

   public User getApplicant(Long id) {
      return getJpaTemplate().find(User.class, id);
   }

   public void deleteApplicant(User applicant) {
      getJpaTemplate().remove(applicant);
   }

   public void saveResume(Resume resume) {
      getJpaTemplate().persist(resume);
   }

   public void updateResume(Resume resume) {
      getJpaTemplate().merge(resume);
   }

   public Resume getResume(Long id) {
      return getJpaTemplate().find(Resume.class, id);
   }

   public void deleteResume(Resume resume) {
      getJpaTemplate().remove(resume);
   }

Magic happens behind the scenes when Hibernate Search auto-indexing is turned on. Auto-indexing, which is enabled by default, automatically saves, updates, or deletes the Lucene Document object in the index files whenever a corresponding JPA entity instance is persisted, updated, or deleted from the database through a persistence context (Hibernate Session/JPA EntityManager). This makes the entire indexing process transparent to application developers. Manual indexing in Hibernate Search is only needed when index files are corrupted, or when there is existing data in the database to be indexed.
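For the second case -- rows that already exist in the database but have never been indexed -- a method along the following lines could be added to the DAO. This is a sketch of my own, not part of the sample application; it reuses the createFullTextEntityManager() helper that appears in the later listings:

   @SuppressWarnings("unchecked")
   public void indexExistingResumes() {
      getJpaTemplate().execute(new JpaCallback() {
         public Object doInJpa(EntityManager em) throws PersistenceException {

            FullTextEntityManager fullTextEntityManager = createFullTextEntityManager(em);

            // Load every Resume already in the database and (re)build its Lucene Document
            List<Resume> resumes = em.createQuery("from Resume").getResultList();
            for (Resume resume : resumes) {
               fullTextEntityManager.index(resume);
            }
            return null;
         }
      });
   }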

To differentiate and compare SQL queries with Lucene index search, JPA query methods in the DAO class are prefixed with db, while Lucene full-text search methods are prefixed with se. Listing 4 provides examples.

Listing 4. Finding resumes for a user

@SuppressWarnings("unchecked")
   public List<Resume> dbFindResumesForUser(String emailAddress) {
      return (List<Resume>) getJpaTemplate().find(
         "from Resume resume left join fetch applicant where " +
         "resume.applicant.emailAddress='" 
         + emailAddress + "'");
   }

   @SuppressWarnings("unchecked")
   public List<Resume> seFindResumesForUser(final String emailAddress) {
      Object results = getJpaTemplate().execute(new JpaCallback() {
         public Object doInJpa(EntityManager em) throws PersistenceException {

            FullTextEntityManager fullTextEntityManager = createFullTextEntityManager(em);

            TermQuery tq = new TermQuery(new Term("applicant.emailAddress", emailAddress));

            FullTextQuery fq = fullTextEntityManager.createFullTextQuery(tq, Resume.class);

            return fq.getResultList();
         }
      });
      return (List<Resume>) results;
   }

These two methods find all the resumes for an applicant with a given e-mail address. The first method is a pure JPA query expressed by JPAQL. The second is a Lucene full-text search. The FullTextEntityManager class in Hibernate Search is a decorator of the JPA EntityManager (a similar FullTextSession exists for the Hibernate Session). It executes Lucene queries declared with the Lucene query API or query expression inside a JPA persistence context. The same persistence context may be used simultaneously by JPA for database queries.
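The createFullTextEntityManager(em) call that appears throughout these listings is a small private helper in the DAO class. A minimal sketch of what it presumably does, assuming the Hibernate Search 3.0 JPA API:

   private FullTextEntityManager createFullTextEntityManager(EntityManager em) {
      // Decorate the current persistence context with full-text search capabilities
      return org.hibernate.search.jpa.Search.createFullTextEntityManager(em);
   }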

An advanced use case is to execute a Lucene full-text query on an entity class, followed by a JPA query to fetch relationships not stored in the Lucene index. The Lucene term query in this method searches the applicant.emailAddress field in the Resume index. Hibernate Search implicitly converts the returned Lucene Document objects into Resume entities as the search results. An implicit database query is fired by Hibernate Search to fetch the resume Word documents, which are not stored in the Lucene index, as part of the returned Resume entities. It is very important to note that the returned entities are in the persistent state of the JPA entity lifecycle.

The next two methods, shown in Listing 5, return the match count of the Resume entities for a keyword search on the summary property. A Lucene filter restricts the full-text query to a date range on the lastUpdated field, and a Hibernate filter does the same for the JPA query. I had to wrap the Lucene built-in RangeFilter, because Hibernate Search requires a no-arg constructor for all the Lucene filters declared in the entity classes through annotations. Like Hibernate filters, the Lucene filter must be turned on programmatically in the search method.
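One plausible shape for that wrapper -- a sketch only; the RangeFilter class shipped with the sample project may differ in detail -- follows the Hibernate Search filter-factory contract: parameter setters matching the setParameter() names, an @Factory method that builds the Lucene filter, and, because the filter definition declares cache = true, a @Key method that identifies each parameter combination in the filter cache:

package demo.hibernatesearch.dao.hibernate.utils;

import org.apache.lucene.search.Filter;
import org.hibernate.search.annotations.Factory;
import org.hibernate.search.annotations.Key;
import org.hibernate.search.filter.FilterKey;
import org.hibernate.search.filter.StandardFilterKey;

public class RangeFilter {

   // Populated by Hibernate Search from the FullTextFilter.setParameter() calls
   private String fieldName;
   private String lowerTerm;
   private String upperTerm;
   private Boolean includeLower;
   private Boolean includeUpper;

   public void setFieldName(String fieldName) { this.fieldName = fieldName; }
   public void setLowerTerm(String lowerTerm) { this.lowerTerm = lowerTerm; }
   public void setUpperTerm(String upperTerm) { this.upperTerm = upperTerm; }
   public void setIncludeLower(Boolean includeLower) { this.includeLower = includeLower; }
   public void setIncludeUpper(Boolean includeUpper) { this.includeUpper = includeUpper; }

   @Factory
   public Filter getFilter() {
      // Delegate to the Lucene built-in RangeFilter
      return new org.apache.lucene.search.RangeFilter(
         fieldName, lowerTerm, upperTerm, includeLower, includeUpper);
   }

   @Key
   public FilterKey getKey() {
      // Distinguishes cached filter instances by their parameter values
      StandardFilterKey key = new StandardFilterKey();
      key.addParameter(fieldName);
      key.addParameter(lowerTerm);
      key.addParameter(upperTerm);
      key.addParameter(includeLower);
      key.addParameter(includeUpper);
      return key;
   }
}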

Listing 5. Find match count for keyword search

public int dbFindMatchCount(final Date beginDate, final Date endDate,
         final String... keywordsInSummary) {
      Object results = getJpaTemplate().execute(new JpaCallback() {
         public Object doInJpa(EntityManager em) throws PersistenceException {

            StringBuilder jpaql = new StringBuilder(
               "select count(resume) from Resume resume join resume.applicant where ");

            for (int i = 0; i < keywordsInSummary.length; i++) {
               jpaql.append("resume.summary like '%"
                  + keywordsInSummary[i] + "%' ");
               if (i < keywordsInSummary.length - 1)
                  jpaql.append(" and ");
            }

            Session session = ((Session) em.getDelegate());

            session.enableFilter("rangeFilter").setParameter("beginDate",
               beginDate).setParameter("endDate", endDate); // Hibernate Filter

            return ((Long) getJpaTemplate().find(jpaql.toString())
               .iterator().next()).intValue();
         }
      });
      return (Integer) results;
   }

   public int seFindMatchCount(final Date beginDate, final Date endDate,
         final String... keywordsInSummary) {
      Object results = getJpaTemplate().execute(new JpaCallback() {
         public Object doInJpa(EntityManager em) throws PersistenceException {

            FullTextEntityManager fullTextEntityManager = createFullTextEntityManager(em);

            BooleanQuery bq = new BooleanQuery();

            for (String q : keywordsInSummary) {
               TermQuery tq = new TermQuery(new Term("summary", q));
               bq.add(new BooleanClause(tq, BooleanClause.Occur.MUST));
            }

            FullTextQuery fq = fullTextEntityManager.createFullTextQuery(
               bq, Resume.class);

            FullTextFilter ff = fq.enableFullTextFilter("rangeFilter");
            ff.setParameter("fieldName", "lastUpdated");
            ff.setParameter("lowerTerm", DateTools.dateToString(beginDate,
               DateTools.Resolution.DAY));
            ff.setParameter("upperTerm", DateTools.dateToString(endDate,
               DateTools.Resolution.DAY));
            ff.setParameter("includeLower", true);
            ff.setParameter("includeUpper", true);

            return fq.getResultSize();
         }
      });
      return (Integer) results;
   }

If you expect a query to return a large set of results, you'll need to paginate those results -- that is, break them up over several Web pages. Rather than dumping the complete result set onto one screen, pagination presents Google-like pages of results that your users can step through. Without it, you might also end up with an out-of-memory error in your JVM or an out-of-cursor error from your database. The two methods in Listing 6, dbFindResumesWithPagination() and seFindResumesWithPagination(), show you how easy it is to enable pagination in JPA and Hibernate Search.

Listing 6. Pagination in JPA and Hibernate Search

@SuppressWarnings("unchecked")
   public List<Resume> dbFindResumesWithPagination(final int fetchCursor,
         final int fetchSize, final Date beginDate, final Date endDate,
         final String... keywordsInSummary) {
      Object results = getJpaTemplate().execute(new JpaCallback() {
         public Object doInJpa(EntityManager em) throws PersistenceException {

            StringBuilder jpaql = new StringBuilder(
               "from Resume resume join fetch resume.applicant where ");

            for (int i = 0; i < keywordsInSummary.length; i++) {
               jpaql.append("resume.summary like '%"
                  + keywordsInSummary[i] + "%' ");
               if (i < keywordsInSummary.length - 1)
                  jpaql.append(" and ");
            }

            Session session = ((Session) em.getDelegate());

            session.enableFilter("rangeFilter").setParameter("beginDate",
               beginDate).setParameter("endDate", endDate);

            Query query = em.createQuery(jpaql.toString());

            query.setFirstResult(fetchCursor);
            query.setMaxResults(fetchSize);

            return (List<Resume>) query.getResultList();
         }
      });
      return (List<Resume>) results;
   }

   @SuppressWarnings("unchecked")
   public List<Resume> seFindResumesWithPagination(final int fetchCursor,
         final int fetchSize, final Date beginDate, final Date endDate,
         final String... keywordsInSummary) {
      Object results = getJpaTemplate().execute(new JpaCallback() {
         public Object doInJpa(EntityManager em) throws PersistenceException {

            FullTextEntityManager fullTextEntityManager = createFullTextEntityManager(em);

            BooleanQuery bq = new BooleanQuery();

            for (String q : keywordsInSummary) {
               TermQuery tq = new TermQuery(new Term("summary", q));
               bq.add(new BooleanClause(tq, BooleanClause.Occur.MUST));
            }

            FullTextQuery fq = fullTextEntityManager.createFullTextQuery(
               bq, Resume.class);

            FullTextFilter ff = fq.enableFullTextFilter("rangeFilter");
            ff.setParameter("fieldName", "lastUpdated");
            ff.setParameter("lowerTerm", DateTools.dateToString(beginDate,
               DateTools.Resolution.DAY));
            ff.setParameter("upperTerm", DateTools.dateToString(endDate,
               DateTools.Resolution.DAY));
            ff.setParameter("includeLower", true);
            ff.setParameter("includeUpper", true);

            fq.setFirstResult(fetchCursor);
            fq.setMaxResults(fetchSize);

            return (List<Resume>) fq.getResultList();
         }
      });
      return (List<Resume>) results;
   }

More importantly, the seFindResumesWithDocHandler() method in Listing 7 searches for technical keywords inside the Word-formatted resume files. As you saw earlier, a custom FieldBridge, WordDocHandlerBridge, is declared through the @FieldBridge annotation in the Resume entity class. The custom FieldBridge shown in Listing 8 makes use of Apache POI to extract all the text out of the Microsoft Word documents for indexing purposes. Searching is conducted over this extracted text.

Listing 7. Searching through Word-formatted resume files

@SuppressWarnings("unchecked")
   public List<Resume> seFindResumesWithDocHandler(final Date beginDate,
         final Date endDate, final String... keywordsInWordDoc) {
      Object results = getJpaTemplate().execute(new JpaCallback() {
      public Object doInJpa(EntityManager em) throws PersistenceException {
         FullTextEntityManager fullTextEntityManager = createFullTextEntityManager(em);
         BooleanQuery bq = new BooleanQuery();
         for (String q : keywordsInWordDoc) {
            TermQuery tq = new TermQuery(new Term("resume", q)); //Word Doc
            bq.add(new BooleanClause(tq, BooleanClause.Occur.MUST));
         }

         FullTextQuery fq = fullTextEntityManager.createFullTextQuery(
            bq, Resume.class);
         FullTextFilter ff = fq.enableFullTextFilter("rangeFilter");
         ff.setParameter("fieldName", "lastUpdated");
         ff.setParameter("lowerTerm", DateTools.dateToString(beginDate,
         DateTools.Resolution.DAY));
         ff.setParameter("upperTerm", DateTools.dateToString(endDate,
            DateTools.Resolution.DAY));
         ff.setParameter("includeLower", true);
         ff.setParameter("includeUpper", true);
         return (List<Resume>) fq.getResultList();
         }
      });
      return (List<Resume>) results;
   }

Listing 8. A custom FieldBridge example

package demo.hibernatesearch.dao.hibernate.utils;

public class WordDocHandlerBridge implements StringBridge {

   public String objectToString(Object arg0) {

      StringBuilder _result = new StringBuilder();
      try {
         // The indexed property value is the raw Word document as a byte array
         ByteArrayInputStream bais = new ByteArrayInputStream((byte[]) arg0);
         org.apache.poi.hwpf.HWPFDocument doc = new org.apache.poi.hwpf.HWPFDocument(
            bais);
         // Walk the document paragraph by paragraph and collect the plain text
         Range range = doc.getRange();
         int np = range.numParagraphs();
         for (int i = 0; i < np; i++) {
            _result.append(range.getParagraph(i).text());
            _result.append(" ");
         }
      } catch (IOException ex) {
         ex.printStackTrace();
      }
      return _result.toString();
   }

}

A reporting query returns a subset of the entity beans' properties rather than the full entity objects. This is a cost-effective query approach in that it reduces the overhead of unnecessary data fetching and table joins. Both JPA and Hibernate Search provide projection facilities to support reporting queries. Applying projections properly in Hibernate Search may avoid unnecessary database access, reduce memory usage, and improve the performance of an application.

In the sample application, when search results are displayed as a list on a Web screen, leaving the resume Word documents unfetched can save a significant amount of memory. An applicant profile plus a summary of the Resume should provide enough information for users to make a selection. Once a user selects a row in the result list, the chosen Word document is then fetched from the BLOB database column and returned through a separate request/response.

The seFindResumeProjectionsWithoutDatabaseAccess() method, shown in Listing 9, demonstrates how to employ projections in Hibernate Search to make a reporting query without database access. The method also shows how to get the relevance scores for Lucene search results. Keep in mind that the Resume objects in the results are not managed by the persistence context of JPA; they are detached JPA entities, with all the fields populated from the Lucene search results. Note that you lose a crucial benefit when employing these detached entities -- automatic dirty checking, a key factor that makes Hibernate-backed applications perform well for data update operations.

Listing 9. Projections in a reporting query

@SuppressWarnings("unchecked")
   public Map<Resume, Float> seFindResumeProjectionsWithoutDatabaseAccess(
         final Date beginDate, final Date endDate,
         final String... keywordsInSummary) {
      Object results = getJpaTemplate().execute(new JpaCallback() {
         public Object doInJpa(EntityManager em) throws PersistenceException {

            FullTextEntityManager fullTextEntityManager = createFullTextEntityManager(em);

            BooleanQuery bq = new BooleanQuery();

            for (String q : keywordsInSummary) {
               TermQuery tq = new TermQuery(new Term("summary", q));
               bq.add(new BooleanClause(tq, BooleanClause.Occur.MUST));
            }

            FullTextQuery fq = fullTextEntityManager.createFullTextQuery(
               bq, Resume.class);

            FullTextFilter ff = fq.enableFullTextFilter("rangeFilter");
            ff.setParameter("fieldName", "lastUpdated");
            ff.setParameter("lowerTerm", DateTools.dateToString(beginDate,
               DateTools.Resolution.DAY));
            ff.setParameter("upperTerm", DateTools.dateToString(endDate,
               DateTools.Resolution.DAY));
            ff.setParameter("includeLower", true);
            ff.setParameter("includeUpper", true);

            fq.setProjection(FullTextQuery.SCORE, "id",
               "summary",
               "applicant.id", "applicant.firstName",
               "applicant.lastName", "applicant.middleName",
               "applicant.emailAddress");

            Map<Resume, Float> resumes = new HashMap<Resume, Float>();

            for (Object[] result : (List<Object[]>) fq.getResultList()) {
               Resume resume = new Resume();
               User applicant = new User();
               resume.setApplicant(applicant);
               resume.setId((Long) result[1]);
               resume.setSummary((String) result[2]);
               /** WordDoc content is left blank. */
               applicant.setId((Long) result[3]);
               applicant.setFirstName((String) result[4]);
               applicant.setLastName((String) result[5]);
               applicant.setMiddleName((String) result[6]);
               applicant.setEmailAddress((String) result[7]);
               resumes.put(resume, (Float) result[0]);
            }
            return resumes;
          }
      });
      return (Map<Resume, Float>) results;
   }

Spring service layer

Service classes annotated with @Service in Spring 2.5 are a great place to specify the transaction characteristics of your Web applications. This is especially true when managing XA transactions across multiple data sources: different DAO classes may collaborate to complete an XA transaction defined in a single method of a service class. In the ResumeManagerImpl class, shown in Listing 10, each business method is marked with the Spring @Transactional annotation, with propagation, readOnly, and isolation as attributes. Don't confuse this with the EJB 3 @TransactionAttribute annotation, although both are designed for a similar goal. Hibernate Search by default encapsulates Lucene indexing processes within database transactions; hence, indexes are only updated when database operations are committed.

Listing 10. ResumeManagerImpl.java

package demo.hibernatesearch.service.impl;

@Service("resumeManager")
public class ResumeManagerImpl implements ResumeManager {

   protected final Log log = LogFactory.getLog(getClass());

   @Autowired
   private ResumeDao resumeDao;

   @Transactional(propagation = Propagation.REQUIRED, readOnly = false, isolation = Isolation.READ_COMMITTED)
   public void saveApplicant(User applicant) {
      resumeDao.saveApplicant(applicant);
   }

   //...
}

Spring 2.5 POJO test cases

Another advantage of the newly released Spring 2.5 annotations is that they enable POJO test cases, as illustrated in Listing 11. By default, each test method in a test class runs within a single transaction context, and the transaction rolls back at the end of the method. In the sample application, this default behavior has been suppressed with a @Rollback(false) annotation on each test method, so that the results are committed to the database and the Lucene indexes. Spring's JUnit/TestNG extensions allow you to run your test cases outside the container. All you need to do is execute the built-in Maven 2 lifecycle phase: mvn test.

Listing 11. POJO test cases

package demo.hibernatesearch.dao;

import static junit.framework.Assert.*;

@RunWith(SpringJUnit4ClassRunner.class)
@TestExecutionListeners( { DependencyInjectionTestExecutionListener.class,
   TransactionalTestExecutionListener.class })
@ContextConfiguration(locations = "/WEB-INF/applicationContext*.xml")
@Transactional
public class ResumeDaoTest {

   @Autowired
   private ResumeDao resumeDao;

   //...

   @Test
   @Rollback(false)
   public void testSeFindResumesWithDocHandler() throws Exception {
      List<Resume> resumes = resumeDao.seFindResumesWithDocHandler(
         new GregorianCalendar(2006, 1, 1).getTime(),
         new GregorianCalendar().getTime(), "java", "web");

      assertTrue(resumes.size() == 5);
   }

   //...
}

Configuration

Two sets of configuration files are declared in the sample project -- one for unit testing, and the other packed into a WAR file to be deployed on a Web server. The JPA settings, including entityManagerFactory and transactionManager, are configured in the Spring application context XML files. If you enable bean autowiring in Spring, the XML files become very succinct. The settings for Hibernate Search are specified in the hibernate.cfg.xml file, shown in Listing 12. This file is referenced in the JPA persistence.xml file as a vendor-proprietary property.
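As a rough sketch, the persistence.xml entry might look like the following; the persistence-unit name is purely illustrative, and hibernate.ejb.cfgfile is the Hibernate EntityManager property for loading an external hibernate.cfg.xml:

<persistence xmlns="http://java.sun.com/xml/ns/persistence" version="1.0">
   <persistence-unit name="resumePU" transaction-type="RESOURCE_LOCAL">
      <provider>org.hibernate.ejb.HibernatePersistence</provider>
      <properties>
         <!-- Pull in the Hibernate Search settings from Listing 12 -->
         <property name="hibernate.ejb.cfgfile" value="/hibernate.cfg.xml" />
      </properties>
   </persistence-unit>
</persistence>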

Listing 12. hibernate.cfg.xml

<hibernate-configuration>
   <session-factory>
      <property name="hibernate.search.default.directory_provider">
      org.hibernate.search.store.FSDirectoryProvider
      </property>
      <property name="hibernate.search.default.indexBase">./lucene/indexes</property>
      <property name="hibernate.search.default.batch.merge_factor">10</property>
      <property name="hibernate.search.default.batch.max_buffered_docs">10</property>
      
      <mapping class="demo.hibernatesearch.model.User" />
      <mapping class="demo.hibernatesearch.model.Resume" />
   </session-factory>
</hibernate-configuration>
The first attribute specifies the type of directory for Lucene -- a file system directory, in this case. The second attribute, indexBase, identifies where the index files reside. The merge_factor is a Lucene setting related to disk I/O. The value of 10 is the default. max_buffered_docs controls how many Lucene Document objects can be buffered during indexing. The last two parameters may be used for performance tuning. Normally indexBase is the only thing in this file you'd need to touch.

Where's the Web tier?

The Web tier is not implemented in the sample application. The AppFuse and AppFuse Light projects hosted by Java.net provide you with great project templates covering a variety of Web frameworks. It will be very easy for you to pick a template of your favorite Web technology to integrate with this article's sample application.

Luke: Lucene index toolbox

A number of handy tools exist for browsing, manipulating, and searching Lucene index files; Luke, shown in Figure 1, is a powerful and comprehensive Java Swing application for exactly that purpose. Point Luke at the index base created by the sample application, and you will see two indexes, demo.hibernatesearch.model.Resume and demo.hibernatesearch.model.User. Opening the Resume index, you can gain insight into the Lucene Document structure. There is a list of fields defined inside the Document structure, among which _hibernate_class is created by Hibernate Search to identify the persistence entity class. The Document structure also reflects the nature of entity relationships in Lucene indexes. The relationship from Resume to User is denormalized, and nested in the primary Resume index. Note that embeddable objects (components) are not indexed independently, as entities are.

Figure 1. The document structure of the resume index displayed in Luke

Clustering and more

Hibernate Search provides two solutions for clustered server environments. The easier one is to point the index base to a shared network directory. The more robust, higher-performance approach is to leverage JMS for asynchronous index updates between a master node and the slave nodes where the Web servers run. In a nutshell, indexing operations occurring on each slave node are queued to a JMS destination, and the master node applies them to the master copy of the index files; meanwhile, each slave node periodically refreshes its local copy of the index files from the master copy. A drawback of this approach is that changes to the index are not immediately visible on the slave nodes. More details about clustering, index sharing, manual indexing, and performance tuning may be found in the Hibernate Search Reference Guide.
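As a rough illustration -- the property names follow the Hibernate Search reference documentation, but the values below are placeholders rather than settings from the sample project -- a slave node might combine the JMS backend with a slave directory provider that periodically refreshes its local index copy from the shared master copy:

<!-- Slave node: queue index updates over JMS; refresh the local index copy from the master -->
<property name="hibernate.search.worker.backend">jms</property>
<property name="hibernate.search.worker.jms.connection_factory">ConnectionFactory</property>
<property name="hibernate.search.worker.jms.queue">queue/hibernatesearch</property>
<property name="hibernate.search.default.directory_provider">
   org.hibernate.search.store.FSSlaveDirectoryProvider
</property>
<property name="hibernate.search.default.indexBase">./lucene/indexes</property>
<property name="hibernate.search.default.sourceBase">/mnt/shared/lucene/indexes</property>
<property name="hibernate.search.default.refresh">1800</property> <!-- in seconds -->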

In conclusion

Relational databases and search engines are not mutually exclusive technologies. Hibernate Search brings the power of Lucene full-text searching to Hibernate ORM through a high-level, universal API, without compromising the application's database-level portability. It seamlessly and transparently integrates the Lucene indexing processes with the Hibernate/JPA-managed database operations on the persistence domain objects. Programming cost and time are greatly reduced thanks to Hibernate Search's auto-indexing and the annotation-driven style shared by Hibernate Search, Hibernate/JPA, and the Spring 2.5 application framework.

Dr. Xinyu Liu is a Sun Microsystems certified enterprise architect working in a healthcare corporation.
