JavaWorld
addict
Reged: 06/20/03
Posts: 482
|
|
Use search engine technology for object persistence
|
Muthu Ramadoss
Unregistered
|
|
Nice idea. I like it. You must patent this.
Out-of-the-box thinking on persistence. Congrats!
|
svaucher
stranger
Reged: 09/08/03
Posts: 7
Loc: Montreal, QC
|
|
So if I understand correctly, you are NOT using Lucene because it's a search engine, but rather because it's got a fast datastore? I'll explain:
You do not mention anything about scoring (order in which the documents appear) which is a measure of a search engine's quality. Also, since Lucene only holds strings, your stored object acts as a simple key/value repository. A relational db would be infinitely more efficient for complex queries like: get me all the objects with values between 0 and 1 000 000, and would allow transactions.
I'm not trying to be difficult, but I don't understand the value of your solution as you are basically using Lucene as a database without db advantages...
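To make the range-query concern concrete, here is a minimal Java sketch. It assumes numeric values are stored as plain strings and compared lexicographically. A `TreeMap` stands in for a sorted term index; this is not Lucene's API, and the names `StringRangeSketch`, `pad`, `build`, and `range` are hypothetical. It shows that a "0 to 1 000 000" range over unpadded strings misses small numbers, while zero-padded keys behave as expected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// Illustrative only: a TreeMap of string keys stands in for a sorted term
// index. This is an assumption for the sketch, not Lucene's API.
final class StringRangeSketch {

    // Hypothetical helper: left-pad a non-negative number with zeros so that
    // the lexicographic order of the keys matches numeric order.
    static String pad(long value) {
        return String.format(Locale.ROOT, "%010d", value);
    }

    // Builds a key -> object-id map, with or without zero-padded keys.
    static TreeMap<String, String> build(long[] values, boolean zeroPad) {
        TreeMap<String, String> index = new TreeMap<>();
        for (long v : values) {
            String key = zeroPad ? pad(v) : Long.toString(v);
            index.put(key, "obj" + v);
        }
        return index;
    }

    // Returns the ids whose key lies in [low, high] under string ordering.
    static List<String> range(TreeMap<String, String> index, String low, String high) {
        return new ArrayList<>(index.subMap(low, true, high, true).values());
    }

    public static void main(String[] args) {
        long[] values = {5, 40, 1_000_000, 2_000_000};

        // Unpadded keys sort as "1000000" < "2000000" < "40" < "5",
        // so 5 and 40 fall outside the string range "0".."1000000".
        System.out.println(range(build(values, false), "0", "1000000"));

        // Padded keys sort numerically, so the range matches 5, 40 and 1000000.
        System.out.println(range(build(values, true), pad(0), pad(1_000_000)));
    }
}
```

A relational database with a numeric column type avoids this encoding step altogether, which is part of the point being made above.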
-------------------- Stephane Vaucher
|
Anonymous
Unregistered
|
|
Take a look at ZOE.
It has used this approach for over three years.
Source code
|
Anonymous
Unregistered
|
|
Without database advantages, but also without database limitations. First of all, there are plenty of environments where a "real" database is not available (and Hypersonic would not scale). Secondly, this is a sort of object database, so you don't have to worry about object-relational mapping or clumsy JDBC code.
|
svaucher
stranger
Reged: 09/08/03
Posts: 7
Loc: Montreal, QC
|
|
Let me nuance my initial statement: I find the idea of indexing classes interesting, as I'm interested in understanding class similarity. However, the examples you use do not show any reason for using an IR engine, as you are treating it as a simplified db.
Quote:
Without database advantages, but also without database limitations. First of all, there are plenty of environments where a "real" database is not available (and Hypersonic would not scale),
Agreed, Lucene offers a scalable file-based datastore. It offers efficient inserts and removes (updates are mentioned below). You must, however, manage your indexes to make sure everyone uses the latest one, as there are no transactions. If you are working in a single-threaded environment, it should be good and efficient.
Quote:
secondly, this is a sort of object database, so you don't have to worry about object-relational mapping
OR mappings offer more than key/value capabilities (they can map collections and objects). If you don't need to map anything but strings, then your solution is elegant.
Quote:
or clumsy JDBC code.
Be it JDBC code or Lucene retrieval code, the idea is basically the same. Elegance is subjective in this case (I prefer a standard, as most developers know SQL, and I've worked with Lucene for the past 2 years). Whether you use SELECT * FROM FOO WHERE NAME='FOO'; or name:"FOO", you still need to express your requirement with boolean statements (I'll assume you don't use vectors and fuzzy queries) and iterate through your result set.
One final note that is probably interesting for most readers: updates are not automagic in Lucene. You need to remove the document and add the updated one. This is not a bug; it is the result of Lucene being designed as an efficient information retrieval engine and not as a database.
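The remove-then-add update pattern can be sketched with a toy inverted index in plain Java. This is a simplified illustration under my own assumptions, not Lucene's API; `ToyIndex` and its methods are hypothetical names. The point is that postings are keyed by term, so changing a document means dropping its old postings and indexing the new text.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index, illustrative only; all names are hypothetical.
final class ToyIndex {
    // term -> ids of the documents containing that term
    private final Map<String, Set<String>> postings = new HashMap<>();
    // id -> terms of that document, kept so the document can be removed later
    private final Map<String, Set<String>> docs = new HashMap<>();

    // Indexes a document. Calling this for an id that is already indexed,
    // without removing it first, would leave stale postings behind.
    void add(String id, String text) {
        Set<String> terms = new TreeSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        docs.put(id, terms);
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
        }
    }

    // Drops every posting that points at the given document.
    void remove(String id) {
        Set<String> terms = docs.remove(id);
        if (terms == null) {
            return;
        }
        for (String term : terms) {
            Set<String> ids = postings.get(term);
            ids.remove(id);
            if (ids.isEmpty()) {
                postings.remove(term);
            }
        }
    }

    // There is no in-place update: an update is a remove followed by an add.
    void update(String id, String newText) {
        remove(id);
        add(id, newText);
    }

    // Returns a copy of the ids of the documents containing the term.
    Set<String> search(String term) {
        String key = term.toLowerCase(Locale.ROOT);
        return new TreeSet<>(postings.getOrDefault(key, Collections.<String>emptySet()));
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("1", "fast datastore");
        index.update("1", "search engine");
        System.out.println(index.search("datastore")); // old term no longer matches
        System.out.println(index.search("engine"));    // new term matches document 1
    }
}
```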
That's all I have to say. I genuinely believe you are using the wrong tool because you don't understand the difference between a search engine and a database. There are cool possibilities in using an IR engine for what you want to achieve, but you haven't mentioned them.
-------------------- Stephane Vaucher
|
Mikhail Garber
Unregistered
|
|
Looks like you were replying to me and not just to the Anonymous poster, so here we go ;-) For the readers' benefit, I have to point out that the MAOS library supports more than "just strings" as object attributes. Any Java object with a constructor taking a single String argument is acceptable, so Integer, Double, etc., or even more complex user-defined types, will work too.
Most importantly, the goal was not to compete with databases but to create a simple datastore that people can extend as they see fit for their projects, for situations where a classical database is not appropriate yet high query performance is still required. By the way, a simple transactional add-on implementation was just contributed by someone else and will be in the official MAOS code shortly.
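For readers wondering how a "constructor taking a single String argument" can be used to restore typed attributes, here is a minimal reflection sketch. It is not taken from the MAOS source; `StringConstructorSketch`, `fromString`, and `Celsius` are hypothetical names. It uses `BigInteger` and `BigDecimal` rather than `Integer` and `Double` because the `String` constructors of the primitive wrapper classes are deprecated in recent JDKs.

```java
import java.lang.reflect.Constructor;
import java.math.BigDecimal;
import java.math.BigInteger;

// Illustrative only; not the MAOS implementation.
final class StringConstructorSketch {

    // Hypothetical user-defined attribute type with a single-String constructor.
    static final class Celsius {
        final double degrees;

        public Celsius(String text) {
            this.degrees = Double.parseDouble(text);
        }
    }

    // Rebuilds a typed value from its stored string form by calling the
    // type's public constructor that takes exactly one String.
    static <T> T fromString(Class<T> type, String stored) {
        try {
            Constructor<T> ctor = type.getConstructor(String.class);
            return ctor.newInstance(stored);
        } catch (ReflectiveOperationException e) {
            // Covers a missing constructor and exceptions thrown by the constructor.
            throw new IllegalArgumentException(
                    "Cannot build " + type.getName() + " from \"" + stored + "\"", e);
        }
    }

    public static void main(String[] args) {
        BigInteger count = fromString(BigInteger.class, "42");
        BigDecimal price = fromString(BigDecimal.class, "19.99");
        Celsius temperature = fromString(Celsius.class, "21.5");
        System.out.println(count + " " + price + " " + temperature.degrees);
    }
}
```

Under this scheme, the stored string must be a form that the constructor can parse back, so each attribute type needs a string representation that round-trips.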
|
svaucher
stranger
Reged: 09/08/03
Posts: 7
Loc: Montreal, QC
|
|
Sorry, a case of mistaken identity.
Good to see that MAOS is getting some simple transaction support on top of Lucene and that you've added support for some basic types. I'm assuming that you've implemented something to support updates in a transparent manner.
Even though I still don't see the usefulness of this approach (using a search engine as a datastore), I know that Lucene's directories are scalable and efficient (except for updates and memory-intensive queries, like range or prefix queries), and I guess they could be used in niche apps.
Good luck,
-------------------- Stephane Vaucher
|
Sione
Unregistered
|
|
The author is trying to demonstrate how Lucene can be adopted for certain kinds of Java application development. He does not mean that Lucene is the be-all and end-all for every module of an application that needs a search engine. I find the article very interesting.
If you are talking about searching for conceptually similar documents, then a literal word search technique such as Lucene's would not work. Here is a brief description of how document similarity can be captured at the level of concepts rather than words.
----------------------------------------------------
Since there are usually many ways to express a given concept (synonymy), the literal terms in a user's query may not match those of a relevant document. In addition, most words have multiple meanings (polysemy), so terms in a user's query will literally match terms in irrelevant documents. A better approach would allow users to retrieve information on the basis of a conceptual topic or meaning of a document.
---------------------------------------------
The description above is of latent semantic indexing (LSI), not of Google's 'PageRank', which ranks pages by their link structure. LSI uses a matrix factorization from linear algebra (the singular value decomposition) to capture document similarity at the concept level, which is something that a literal word search such as Lucene's misses.
cheers, sione
|