Introduction to Hibernate Search

Bring the power of Lucene to your database-backed applications

Page 3 of 7

POJO persistence classes

I prefer to begin application programming with a persistence domain model. Two POJO persistence model classes, Resume.java and User.java, are defined in the sample application, as shown in Listing 1.

Listing 1. Resume.java and User.java

package demo.hibernatesearch.model;

import java.io.Serializable;
import java.util.Date;
import java.util.Set;

import javax.persistence.*;

import org.hibernate.annotations.Filter;
import org.hibernate.annotations.FilterDef;
import org.hibernate.annotations.FilterDefs;
import org.hibernate.annotations.Filters;
import org.hibernate.annotations.NaturalId;
import org.hibernate.annotations.ParamDef;
import org.hibernate.search.annotations.Analyzer;
import org.hibernate.search.annotations.Boost;
import org.hibernate.search.annotations.ContainedIn;
import org.hibernate.search.annotations.DateBridge;
import org.hibernate.search.annotations.DocumentId;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.FieldBridge;
import org.hibernate.search.annotations.FullTextFilterDef;
import org.hibernate.search.annotations.FullTextFilterDefs;
import org.hibernate.search.annotations.Index;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.IndexedEmbedded;
import org.hibernate.search.annotations.Resolution;
import org.hibernate.search.annotations.Store;

@Entity
@Table(name = "resume")
@FilterDefs( { @FilterDef(name = "rangeFilter", parameters = { 
   @ParamDef(name = "beginDate", type = "date"), 
   @ParamDef(name = "endDate", type = "date") }) })
@Filters( { @Filter(name = "rangeFilter", condition = 
   ":beginDate <= lastUpdated and :endDate >= lastUpdated") })
@Indexed
@Analyzer(impl = org.apache.lucene.analysis.standard.StandardAnalyzer.class)
@FullTextFilterDefs( { @FullTextFilterDef(name = "rangeFilter", 
   impl = demo.hibernatesearch.dao.hibernate.utils.RangeFilter.class, cache = true) })
public class Resume implements Serializable {

   @Id
   @GeneratedValue
   @DocumentId
   private Long id;

   @ManyToOne
   @IndexedEmbedded
   private User applicant;

   @org.hibernate.annotations.Index(name = "summaryIndex")
   @Field(index = Index.TOKENIZED, store = Store.YES)
   private String summary;

   @Lob
   @Field(name = "resume", index = Index.TOKENIZED, store = Store.NO)
   @FieldBridge(impl = demo.hibernatesearch.dao.hibernate.utils.WordDocHandlerBridge.class)
   private byte[] content; // MS Word Doc

   @Temporal(value = TemporalType.DATE)
   @Field(index = Index.UN_TOKENIZED, store = Store.YES)
   @DateBridge(resolution = Resolution.DAY)
   @Boost(2.0f)
   private Date lastUpdated; 

   //...
}

@Entity
@Table(name = "user")
@Indexed
@Analyzer(impl = org.apache.lucene.analysis.standard.StandardAnalyzer.class)
public class User implements Serializable {

   @Id
   @GeneratedValue
   @DocumentId
   private Long id;

   @Field(index = Index.UN_TOKENIZED, store = Store.YES)
   private String firstName;

   @Field(index = Index.UN_TOKENIZED, store = Store.YES)
   private String lastName;

   @Field(index = Index.UN_TOKENIZED, store = Store.YES)
   private String middleName;

   @NaturalId
   @Field(index = Index.UN_TOKENIZED, store = Store.YES)
   private String emailAddress;

   @OneToMany(cascade = { CascadeType.ALL }, mappedBy = "applicant")
   @ContainedIn
   private Set<Resume> resumes;

   //...
}

The two persistence classes are annotated with the JPA @Entity annotation, which declares that their nontransient properties will be persisted to a relational database. A Maven 2 Hibernate plugin goal (mvn hibernate3:hbm2ddl) generates a SQL script from the annotated entity source files and creates the corresponding schema in the MySQL database.

Apart from the JPA annotations, the two entity classes are also marked with the new Hibernate Search annotations. Any JPA entity class marked with the @Indexed annotation is enabled for Lucene indexing and is mapped to its own Lucene index. Hibernate Search implicitly maps each entity instance to a Lucene Document object. More specifically, only bean properties annotated with @Field are indexed as Fields in the Lucene Document objects. As discussed earlier, Lucene Document objects are the unit of data for indexing and search, just as JPA entities are the unit of data for database persistence. Note that you don't have to index all of your JPA entity classes with Lucene, only those for which full-text search is required.
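The mapping from annotated properties to document fields can be illustrated with plain reflection. The following is a minimal sketch, not Hibernate Search's implementation: the IndexedProperty annotation, the ResumeSketch class, and the use of a Map in place of a Lucene Document are all assumptions made for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

class FieldMappingSketch {

    // Hypothetical stand-in for Hibernate Search's @Field; not the real annotation.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface IndexedProperty {
    }

    // Simplified entity: only the annotated property should reach the index.
    static class ResumeSketch {
        Long id = 1L;

        @IndexedProperty
        String summary = "Senior Java developer";

        String internalNote = "not searchable";
    }

    // Copies each annotated property into a map that stands in for a Lucene Document.
    static Map<String, String> toDocument(Object entity) {
        Map<String, String> document = new LinkedHashMap<>();
        for (Field property : entity.getClass().getDeclaredFields()) {
            if (property.isAnnotationPresent(IndexedProperty.class)) {
                property.setAccessible(true);
                try {
                    document.put(property.getName(), String.valueOf(property.get(entity)));
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        return document;
    }

    public static void main(String[] args) {
        System.out.println(toDocument(new ResumeSketch())); // prints {summary=Senior Java developer}
    }
}
```

Only the summary property appears in the resulting map; the unannotated id and internalNote properties are left out, which mirrors the rule that unannotated properties are not indexed.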

Even though a Lucene Document by itself doesn't enforce a unique key field, Hibernate Search requires you to specify a document ID field through a @DocumentId annotation. Most of the time, this ID is also a database primary key. Hibernate Search uses that field internally to match a Lucene Document object to an entity instance.

Lucene indexing can't deal with any data type other than text strings; thus, all bean properties to be indexed must be converted to their string representations. A FieldBridge in Hibernate Search works like a data type converter in JSF or Spring: it transforms a value of any data type into text. The built-in FieldBridges take care of all the built-in Java data types. Nevertheless, the byte array property in the Resume entity, which holds Microsoft Word files, requires special care. I developed a custom FieldBridge, WordDocHandlerBridge, as part of the sample application to extract plain text from Word documents; I used Apache POI, a Java API for accessing Microsoft Office files. The same concept exists in Lucene as DocumentHandler. A wide variety of DocumentHandler implementations are available online for common file formats, including Word, Excel, PDF, HTML, and XML. It takes little effort to build custom FieldBridges out of the existing Lucene DocumentHandlers.
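To make the conversion step concrete, here is a minimal sketch of a one-way bridge that turns a Date into a day-resolution string. It is similar in spirit to what @DateBridge(resolution = Resolution.DAY) asks for, but it is not Hibernate Search code: the ValueToText interface and the DayBridge class are hypothetical, and the yyyyMMdd format in GMT is an assumption chosen so that the strings sort in date order.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

class DayResolutionBridgeSketch {

    // Hypothetical one-way bridge contract; not a Hibernate Search type.
    interface ValueToText {
        String objectToString(Object value);
    }

    // Renders a Date as yyyyMMdd in GMT so that values compare correctly as plain strings.
    static class DayBridge implements ValueToText {
        @Override
        public String objectToString(Object value) {
            if (value == null) {
                return null;
            }
            // SimpleDateFormat is not thread-safe, so a new instance is created per call.
            SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd");
            format.setTimeZone(TimeZone.getTimeZone("GMT"));
            return format.format((Date) value);
        }
    }

    public static void main(String[] args) {
        // new Date(0L) is the epoch, 1970-01-01T00:00:00 GMT.
        System.out.println(new DayBridge().objectToString(new Date(0L))); // prints 19700101
    }
}
```

A bridge for binary content such as Word documents would follow the same shape, with the text extraction done by a library such as Apache POI in place of the date formatting.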

Two attributes of the @Field annotation -- index=Index.TOKENIZED and store=Store.NO -- control Lucene indexing behavior, as does the @Boost annotation. The first attribute ensures that the text will be tokenized by a Lucene Analyzer. Tokenizing is a critical step that preprocesses the source data before actual indexing takes place. Preprocessing involves removing stop words, reducing words to their stems, and so on. I chose the Lucene StandardAnalyzer for the sample application. Bear in mind that you should not tokenize any primary key or natural key field, because such a field must match as a single, exact value.
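The effect of tokenizing can be sketched with a few lines of plain Java. This is a deliberately simplified stand-in, not Lucene's StandardAnalyzer: it lowercases the text, splits on anything that is not a letter or digit, and drops words from a tiny stop-word list that is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TokenizeSketch {

    // Tiny illustrative stop-word list; an assumption, not Lucene's actual list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "in"));

    // Lowercases the text, splits on non-alphanumeric characters, and drops stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The design of an Index in Java")); // prints [design, index, java]
        System.out.println(tokenize("john.doe@example.com"));           // prints [john, doe, example, com]
    }
}
```

The second call shows why key fields should stay untokenized: once an e-mail address is split into separate terms, it can no longer be matched as one exact value.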

The attribute store=Store.NO indicates that the actual data (that is, the Word documents) will not be stored in the index; therefore, that bean property will be returned through a separate SQL query as part of the Hibernate full-text search that returns the Resume entity objects. In contrast, if I had set store=Store.YES, the original Word documents would have been stored in and retrieved from the index. The boost factor specified by @Boost gives more or less weight to the annotated property in a Lucene index search, and so affects the relevance scores of the search results.
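The difference between indexing a value and storing it can be shown with a small model. The sketch below is an illustration under simplifying assumptions, not how Lucene lays out its index: the IndexedField class is hypothetical, and whitespace splitting stands in for a real analyzer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class StoreOptionSketch {

    // One indexed field: its terms are always searchable, but the original
    // value is kept only when the field is stored.
    static class IndexedField {
        final Set<String> terms;
        final String storedValue; // null when the value is not stored

        IndexedField(String value, boolean store) {
            String normalized = value.trim().toLowerCase(Locale.ROOT);
            this.terms = new HashSet<>(Arrays.asList(normalized.split("\\s+")));
            this.storedValue = store ? value : null;
        }

        boolean matches(String term) {
            return terms.contains(term.toLowerCase(Locale.ROOT));
        }
    }

    public static void main(String[] args) {
        IndexedField content = new IndexedField("Lucene and Hibernate experience", false);
        System.out.println(content.matches("hibernate")); // prints true
        System.out.println(content.storedValue);          // prints null
    }
}
```

The unstored field still matches a search, but the original text has to come from somewhere else, which in the sample application is the database.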

You may notice a bidirectional many-to-one relationship between the Resume and User entities. Lucene's Document data model doesn't handle relationships through foreign key references, as a database does. To enable full-text search across these relationships, you must denormalize the nested objects as dot-navigated fields in the Lucene Document objects. By marking the applicant property of the Resume entity with an @IndexedEmbedded annotation, I tell Hibernate Search to index the applicant's properties as nested fields inside the Resume index. In the section below entitled "Luke: Lucene index toolbox," you'll see what the Resume index looks like.
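Denormalizing into dot-navigated fields amounts to flattening the object graph into one set of name/value pairs. The sketch below assumes simplified UserSketch and ResumeSketch classes and uses a Map in place of a Lucene Document; the field names follow the property.nestedProperty pattern described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EmbeddedFieldsSketch {

    // Simplified stand-in for the User entity.
    static class UserSketch {
        final String firstName;
        final String lastName;

        UserSketch(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    // Simplified stand-in for the Resume entity.
    static class ResumeSketch {
        final String summary;
        final UserSketch applicant;

        ResumeSketch(String summary, UserSketch applicant) {
            this.summary = summary;
            this.applicant = applicant;
        }
    }

    // Flattens the resume and its applicant into one map standing in for a Lucene Document.
    static Map<String, String> toDocument(ResumeSketch resume) {
        Map<String, String> document = new LinkedHashMap<>();
        document.put("summary", resume.summary);
        document.put("applicant.firstName", resume.applicant.firstName);
        document.put("applicant.lastName", resume.applicant.lastName);
        return document;
    }

    public static void main(String[] args) {
        ResumeSketch resume = new ResumeSketch("Java developer", new UserSketch("Jane", "Doe"));
        // prints {summary=Java developer, applicant.firstName=Jane, applicant.lastName=Doe}
        System.out.println(toDocument(resume));
    }
}
```

Because the applicant's values are copied into the resume's document, a search on applicant.lastName can find resumes without any join.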

The @IndexedEmbedded annotation is applicable to both @*ToOne entity relationships and embedded objects (components, in Hibernate terminology). @*ToMany relationships and collections of embedded objects are not supported. As a result, you can't make use of full-text search on an entity property of a collection type; instead, the search should be initiated from the singular end of the relationship. Because the Resume index now contains User data, it must be updated whenever a User entity changes. I place a @ContainedIn annotation on the Resume collection of the User entity to tell Hibernate Search about that dependency. Note that the @ContainedIn annotation is only required on entity relationships, and not on embeddable (value type) objects.
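The dependency that @ContainedIn declares can be pictured as a back-reference that is followed whenever the embedded object changes. The following sketch is an assumption-laden model rather than Hibernate Search's mechanism: the ResumeIndexSketch class, its map-based index, and the onUserChanged method are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ContainedInSketch {

    static class UserSketch {
        String lastName;
        // Back-reference to the containing resumes; plays the role of the @ContainedIn side.
        final List<ResumeSketch> resumes = new ArrayList<>();

        UserSketch(String lastName) {
            this.lastName = lastName;
        }
    }

    static class ResumeSketch {
        final long id;
        final UserSketch applicant;

        ResumeSketch(long id, UserSketch applicant) {
            this.id = id;
            this.applicant = applicant;
            applicant.resumes.add(this); // keep both ends of the relationship in sync
        }
    }

    // Stand-in for the Resume index: resume id -> denormalized applicant.lastName value.
    static class ResumeIndexSketch {
        final Map<Long, String> applicantLastNameByResumeId = new HashMap<>();

        void index(ResumeSketch resume) {
            applicantLastNameByResumeId.put(resume.id, resume.applicant.lastName);
        }

        // When a user changes, follow the back-reference and reindex every containing resume.
        void onUserChanged(UserSketch user) {
            for (ResumeSketch resume : user.resumes) {
                index(resume);
            }
        }
    }

    public static void main(String[] args) {
        UserSketch user = new UserSketch("Doe");
        ResumeSketch resume = new ResumeSketch(1L, user);
        ResumeIndexSketch index = new ResumeIndexSketch();
        index.index(resume);

        user.lastName = "Smith";
        index.onUserChanged(user);
        System.out.println(index.applicantLastNameByResumeId.get(1L)); // prints Smith
    }
}
```

Without the back-reference, the index would have no way to tell which resume documents hold a stale copy of the user's data.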
