'[KinoSearch] positional scoring'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       kinosearch
Subject:    [KinoSearch] positional scoring
From:       marvin () rectangular ! com (Marvin Humphrey)
Date:       2007-07-10 2:14:25
Message-ID: B982504C-CE0B-4C97-9D84-5AFC17D15575 () rectangular ! com
[Download RAW message or body]


On Jul 9, 2007, at 4:05 PM, Nathan Kurz wrote:

Nathan, please forgive the brevity of my response... I'm beat...

> Hi Marvin ---
>
> I'm still trying to figure out how to architect the custom scoring  
> apparatus
> I want, and I could use someone to talk a little sense into me.  I
> fear I'm heading down a blind alley.  Since part of what I want could
> be viewed as a generic proximity scorer, it's possible you've already
> thought this through better than I've been able to.
>
> I started with what seemed like the reasonable assumption: I wanted a
> proximity scorer that was able to wrap itself around an existing
> scorer.  It would, for example, have an internal ANDNotScorer, which
> would run to produce matching documents.  The ANDNot scorer would make
> the positions of the matches available to its parent, and the
> ProximityScorer would use these positions to determine a score.  The
> alternative would be to have a more complex ProximityScorer, which
> finds the matching document as well as assessing the positions of the
> matches. It seemed cleaner and more flexible to keep each level
> simpler, but I'm finding it more complex to implement than I expected.

Tally is the vessel that's intended to facilitate this.  Instead of  
just asking for scorer.score() as in Lucene, you ask for Scorer_Tally 
() in KS, which is supposed to be an extendable object for holding a  
score, a doc num, and any other relevant data -- such as arrays of  
matching positions.

The problem is that you need to keep separate positions arrays for  
each field.
>
> It's probably easiest to think about this in terms of the existing
> PhraseScorer, which currently takes the do-it-all approach.  What
> would it take to redesign PhraseScorer to so that rather than finding
> documents that contain all terms and then checking their positions, it
> would wrap an internal ANDScorer and just do the position checking?
> And what would be the benefits and costs of this approach?
>
> The main change would be that the ANDScorer would have to allow the
> parent PhraseScorer access to its children's position records.  In the
> simple case where the ANDScorer's children are TermScorers, this is
> fairly simple: it can just create a Positions array, with one element
> for each of its children[*].  But if the children are potentially
> complex (subphrases, for example), the ANDScorer must create a record
> that includes both the positions and the lengths of all its children's
> matches, otherwise the parent PhraseScorer will misjudge positions.
>
> The benefit of doing this would be that one can simplify some of the
> existing scorers and that it would be easier to add layers of
> encapsulating scorers.  It would make the layers more like building
> blocks that can be combined at will. The disadvantage would be that it
> is expensive to always create this composite position record.

Yes, absolutely true.  However, for individual TermScorers, we can  
use pointers to the original Posting from the PostingList, which has  
all the position data already.  It's only the aggregating scorers  
that need to build the composite position data.l

> One can
> avoid most of this expense by having a 'want_positions' flag that is
> set when the subscorer is created, but this starts adding complexity
> back.

I think it ought to be done with dedicated subclasses.

> One could also make the length records default to 1, so that
> existing position records can be used in situ.

I intentionally designed Posting with this application in mind...

>
> You mentioned you were interested in proximity scoring.  Is this a
> direction that interests you?  I think it has potential if one is
> trying to design a customizable search toolkit.  Or would I do better
> to concentrate on a solution that works well for the exact problem I'm
> trying to solve without trying to generalize it yet?  The generalized
> solution seems like a lot more work, and probably isn't worth doing
> unless you think it would be useful elsewhere.

A position-aware scoring suite would be very very nice.  It would  
improve IR precision, at the cost of some memory and some CPU.  I  
wanted to put something like this in 0.20, but as I mentioned in my  
last note, it's getting moved to a Project To Be Named Later.

Maybe a KSx::Search::PositionQueries should be the first KSx Scorer  
subclass distro. :)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic