List:       slide-dev
Subject:    Re: Lucene and DASL (was Re: Interoperability between webdavfs and Apache Slide)
From:       Erik Hatcher <erik () ehatchersolutions ! com>
Date:       2003-11-30 23:42:48

On Sunday, November 30, 2003, at 02:36  PM, Stefano Mazzocchi wrote:
> Lucene scalability is not impaired by the number of documents. You 
> basically create a document/token matrix and then create a hashtable 
> of the tokens and you get the documents (modulo how ranking is 
> performed, through, I believe, sorting by Euclidean distance in the 
> document vector space between the query and the documents found).
>
> That's nice, has been used for decades in all full-text search engines 
> and can be optimized a lot (and lucene is a nice implementation of 
> those algorithms).
>
> But how do I use this for something that looks a lot like a relational 
> query?

The more we discuss it, the more I am coming to the conclusion that 
Lucene may not be the right approach for properties, but I cannot say 
for sure.  At a minimum, content indexing would have to be separate 
from property indexing: properties are more likely to change than 
content, and to update a document in a Lucene index it must be removed 
and re-added.
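
To make the remove-and-re-add point concrete, here is a rough sketch in 
plain Python.  It is a toy inverted index, not Lucene's API; the class 
name, method names, path and the "status:..." term format are all made 
up for illustration.  It only shows why keeping properties in their own 
index keeps a property change cheap.

```python
from collections import defaultdict


class ToyIndex:
    """Toy inverted index (term -> set of doc ids). Hypothetical, not Lucene's API."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.terms_by_doc = {}            # doc id -> its terms, needed for removal

    def add(self, doc_id, terms):
        terms = set(terms)
        self.terms_by_doc[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def remove(self, doc_id):
        for term in self.terms_by_doc.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def update(self, doc_id, terms):
        # No in-place edit: the whole entry is removed and then re-added.
        self.remove(doc_id)
        self.add(doc_id, terms)

    def search(self, term):
        return sorted(self.postings.get(term, ()))


# Assumption: two separate indexes, one for content and one for properties.
content_index = ToyIndex()
property_index = ToyIndex()

content_index.add("/docs/a.txt", ["lucene", "scales", "well"])
property_index.add("/docs/a.txt", ["status:draft"])

# A property change re-indexes only the small property entry;
# the (potentially large) content entry is left alone.
property_index.update("/docs/a.txt", ["status:published"])
```

With a single combined index, the same property change would force the 
content terms to be removed and re-added as well.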

> My biggest fear is hitting the O(n) complexity: it might still run 
> like a breeze with 100 documents, but could crawl on its knees if you 
> reach 10000... and by the time you realize this, it's where you need 
> the repository the most because your data gets big and unmanageable 
> without a repository!
>
> Erik suggests that there could be ways to index documents and their 
> properties into Lucene and then use DASL on it. What I want to 
> understand is the algorithmic complexity of such an approach.
>
> If it can be made O(1) or even O(log(n)), I'm sold. But if this gets 
> O(f(n)) where n is the number of documents and f(n) grows faster than 
> log(n), well, we have a problem.

Like you said in the first sentence above, though, Lucene scalability 
is not impaired by the number of documents.  It is impaired by the 
number of terms.  And for a property that has only a few values (like 
your workflow example earlier), it merely finds the term being queried 
for and then returns the IDs of all matching documents (not the full 
documents, just their IDs initially).
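
A small sketch of what I mean, again in plain Python rather than 
Lucene itself.  The three workflow states and the 10000-document count 
are assumptions for illustration, and the dict stands in for the 
hashtable of tokens you described; I am not claiming this is how 
Lucene stores its terms internally.

```python
from collections import defaultdict


def build_property_index(doc_states):
    """Map each property value (term) to the list of ids of documents having it."""
    index = defaultdict(list)
    for doc_id, state in enumerate(doc_states):
        index[state].append(doc_id)
    return index


# Hypothetical data: 10000 documents, but only three distinct workflow states.
states = ["draft", "review", "published"]
doc_states = [states[i % 3] for i in range(10000)]

index = build_property_index(doc_states)

# Finding the term is one lookup among 3 terms, however many documents exist.
# What comes back is only the list of matching document ids.
matching_ids = index["review"]
```

Locating the term depends on how many distinct terms there are, not on 
how many documents there are.  Walking the returned id list is of 
course still proportional to the number of matches.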

	Erik


