
List:       lucene-user
Subject:    Re: Per User filtering of public/common documents
From:       Apostolis Xekoukoulotakis <xekoukou () gmail ! com>
Date:       2012-05-22 14:34:06
Message-ID: CAOX4E5FnxgsWg7PYdJuv96wP743+x3tRkmAAzmB+9ukX3=WOXg () mail ! gmail ! com


Thanks, Ian. I am reading the Lucene code to understand how it works.

It seems that Lucene 4 will implement DocValues, which hold one field
value per document in a data structure that is easy to use during the
scoring process. This data structure should also allow for a lot of
updates.
What I'll try to do should be similar. Instead of relying on Lucene for
this data, there should be a way to provide it (through a call to a
database) to the scoring/collecting/searching process.
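
To make the idea concrete, here is a minimal sketch in plain Java (no
Lucene classes). It assumes the search has already produced matching
docids with their relevance scores, and that the database lookup can be
modelled as a function from docid to a value. The names
ExternalValueSketch and scoreVectors are hypothetical, and the HashMap
only stands in for the real database.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntToDoubleFunction;

class ExternalValueSketch {
    // Builds one score vector per matching document:
    // [relevance from the index, value fetched from the external source].
    static double[][] scoreVectors(int[] docs, float[] relevance,
                                   IntToDoubleFunction externalValue) {
        double[][] vectors = new double[docs.length][];
        for (int i = 0; i < docs.length; i++) {
            vectors[i] = new double[] {
                relevance[i],
                externalValue.applyAsDouble(docs[i]) // stand-in for a database call
            };
        }
        return vectors;
    }

    public static void main(String[] args) {
        Map<Integer, Double> database = new HashMap<>(); // hypothetical external store
        database.put(3, 10.0);
        database.put(8, 20.0);
        double[][] v = scoreVectors(new int[] {3, 8}, new float[] {1.5f, 0.5f},
                                    doc -> database.get(doc));
        for (double[] vector : v) {
            System.out.println(vector[0] + " " + vector[1]);
        }
    }
}
```

The point of passing the lookup as a function is only that the collecting
loop does not care where the per-document value comes from.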

The DocValues class can be used to store the keys of the documents. Since
merging segments is currently order preserving, the mapping of docids to
keys will be monotonic if we use the same policy in the database, i.e.
incrementing the key for every new document.

Now, I don't know exactly how things work, but I think that through this
monotonic mapping of docids to keys we could treat external data exactly
like a posting list, assuming our database keeps its keys sorted (like
LevelDB).
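
A rough sketch of why the monotonic mapping matters, in plain Java and
under stated assumptions: keyOfDoc is an array standing in for the
per-document key, it is strictly increasing in docid, and a NavigableSet
stands in for a sorted key store such as LevelDB. The class and method
names are made up for illustration. Because both sides are sorted the same
way, they can be intersected in one forward pass, the same way two posting
lists would be.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class ExternalPostingSketch {
    // postings: matching docids in increasing order.
    // keyOfDoc[docId]: database key of that document, strictly increasing in docId.
    // externalKeys: sorted keys held outside the index.
    static List<Integer> intersect(int[] postings, long[] keyOfDoc,
                                   NavigableSet<Long> externalKeys) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        Long key = externalKeys.isEmpty() ? null : externalKeys.first();
        while (i < postings.length && key != null) {
            long docKey = keyOfDoc[postings[i]];
            if (docKey == key) {
                hits.add(postings[i]);            // present on both sides
                i++;
                key = externalKeys.higher(key);   // next external key, or null
            } else if (docKey < key) {
                i++;                              // advance the posting list
            } else {
                key = externalKeys.ceiling(docKey); // skip ahead on the external side
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableSet<Long> keys = new TreeSet<>(Arrays.asList(11L, 15L, 30L, 40L));
        System.out.println(intersect(new int[] {0, 2, 3, 5},
                                     new long[] {10, 11, 15, 20, 21, 30}, keys));
    }
}
```

If the mapping were not monotonic, neither side could be advanced without
risking a missed match, and the single pass would no longer be valid.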

It seems that when you ask Lucene to fetch the N most relevant documents,
it creates a queue of size N ordered by relevance and then iterates over
the posting lists. The same thing could be done concurrently for the
external data.
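
The bounded queue described above can be sketched in plain Java as
follows. This is an illustration of the general technique, not Lucene's
own collector code; TopNSketch and Hit are hypothetical names, and the
input arrays stand in for whatever iterating over the posting lists
produces.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TopNSketch {
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Keeps only the n best hits seen so far; the head of the heap is the worst of them.
    static List<Hit> topN(int[] docs, float[] scores, int n) {
        Comparator<Hit> byScore = (a, b) -> Float.compare(a.score, b.score);
        PriorityQueue<Hit> queue = new PriorityQueue<>(byScore);
        for (int i = 0; i < docs.length; i++) {
            if (queue.size() < n) {
                queue.add(new Hit(docs[i], scores[i]));
            } else if (n > 0 && scores[i] > queue.peek().score) {
                queue.poll();                          // evict the current minimum
                queue.add(new Hit(docs[i], scores[i]));
            }
        }
        List<Hit> result = new ArrayList<>(queue);
        result.sort(byScore.reversed());               // best first
        return result;
    }

    public static void main(String[] args) {
        for (Hit h : topN(new int[] {1, 4, 7, 9},
                          new float[] {0.3f, 0.9f, 0.1f, 0.5f}, 2)) {
            System.out.println(h.doc + " " + h.score);
        }
    }
}
```

Memory stays proportional to N rather than to the number of matches,
because a hit that cannot beat the current minimum is never stored.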

We will have several queues, one for each ordering we want (i.e. one for
each scoring system we have).

We could then also implement a form of join, in which a document is put
in a queue if its score is better than the minimum score currently
required by that scoring system.
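
One possible reading of the per-ordering queues and the join, sketched in
plain Java. The assumptions here are mine: each scoring system gets its
own bounded queue, a document is offered to all of them, and it is kept
if at least one queue admits it. MultiQueueSketch, offer and docsOf are
hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class MultiQueueSketch {
    static final class Entry {
        final int doc;
        final double score;
        Entry(int doc, double score) { this.doc = doc; this.score = score; }
    }

    private final List<PriorityQueue<Entry>> queues = new ArrayList<>();
    private final int capacity;

    MultiQueueSketch(int scoringSystems, int capacity) {
        if (capacity < 1) throw new IllegalArgumentException("capacity must be >= 1");
        this.capacity = capacity;
        for (int s = 0; s < scoringSystems; s++) {
            // Min-heap: the head is the minimum score required to enter this queue.
            queues.add(new PriorityQueue<Entry>((a, b) -> Double.compare(a.score, b.score)));
        }
    }

    // scores[s] is the document's score under scoring system s.
    // Returns true if at least one queue admitted the document.
    boolean offer(int doc, double[] scores) {
        boolean admitted = false;
        for (int s = 0; s < queues.size(); s++) {
            PriorityQueue<Entry> q = queues.get(s);
            if (q.size() < capacity) {
                q.add(new Entry(doc, scores[s]));
                admitted = true;
            } else if (scores[s] > q.peek().score) {
                q.poll();
                q.add(new Entry(doc, scores[s]));
                admitted = true;
            }
        }
        return admitted;
    }

    // Docids currently held for one scoring system, best first.
    List<Integer> docsOf(int system) {
        List<Entry> entries = new ArrayList<>(queues.get(system));
        entries.sort((a, b) -> Double.compare(b.score, a.score));
        List<Integer> docs = new ArrayList<>();
        for (Entry e : entries) docs.add(e.doc);
        return docs;
    }

    public static void main(String[] args) {
        MultiQueueSketch sketch = new MultiQueueSketch(2, 1);
        sketch.offer(1, new double[] {0.5, 0.2});
        sketch.offer(2, new double[] {0.4, 0.9});
        System.out.println(sketch.docsOf(0) + " " + sketch.docsOf(1));
    }
}
```

A document rejected by every queue is dropped immediately, so nothing
outside the queues has to be retained.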

With this solution, we don't need to store all the documents per query,
as the previous solution did.

Can anyone point me to the abstract class of Lucene that I should
implement to do what I described? I am still confused as to which class
does what.



2012/5/21 Ian Lea <ian.lea@gmail.com>

> Certainly lots of questions, and I can't answer most of them, but a
> couple of comments/opinions.
>
> Collecting all docs will potentially use a lot of memory but isn't
> necessarily excessively slow.  It's generally only doing something
> like reading field values for all docs that can be prohibitively slow.
>
> Adding the username to docs and querying/filtering on that sounds a
> good idea.  If that data doesn't change much you can use a cached
> filter - that is generally very fast.  See CachingWrapperFilter.
>
>
> --
> Ian.
>
>
>
> On Fri, May 18, 2012 at 10:55 PM, Apostolis Xekoukoulotakis
> <xekoukou@gmail.com> wrote:
> > Let us say that we have N users that care about K of the M common
> > documents that exist in a database. What is the best way to filter the
> > documents?
> >
> > The results will then be sorted per properties of the document,properties
> > that are stored in a database.(multidimensional score/sorting). Then the
> > top D^(number of properties) results can be extracted to be shown in the
> > users screen. For this to work, all hits need to collected from Lucene.
> >
> > (One of the properties is ofcourse relevance which is extracted from
> > lucene)
> > (The other 'properties/ranking' of the documents will change a lot
> > despite the document remaining static.)
> >
> > What is the fastest way to do what I want? Can you explain your answer on
> > the algorithmic complexity of  the internals of lucene so as that I
> > understand lucene?
> >
> > I have heard that collecting all documents is time consuming. Why is
> > that? Arent all documents that match the terms of the query sorted by
> > relevance despite the fact that only n of them are selected?
> >
> > Some random thoughts/solutions:
> > In a new field, add to each document the name of the users that want to
> > see it, then pass the name in the query.
> >
> > Create and store a bitmap per user.
> > problem:the bitmap will change a lot since it depends on the properties
> > that change dynamically.
> >
> > Too many questions, sorry for that.
> >
> > --
> >
> >
> > Sincerely yours,
> >
> >     Apostolis Xekoukoulotakis
>


-- 


Sincerely yours,

     Apostolis Xekoukoulotakis

