'How to make word-N-gram based query and interpolate each N-gram score to obtain final Lucene score'


List:       lucene-user
Subject:    How to make word-N-gram based query and interpolate each N-gram score to obtain final Lucene score
From:       Rajen Chatterjee <rajen.k.chatterjee () gmail ! com>
Date:       2016-01-11 8:43:42
Message-ID: CAC4-+NxFLTZqh59wpk3xFvo5g8LWjjNKKJ01o3SYua=tupKkKg () mail ! gmail ! com


Hello Everyone,

I am looking for a way to build *word-N-gram* based queries.
After some searching, I think I need to define an analyzer as follows:

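// Imports assume the Lucene 3.x package layout, which the
// tokenStream(String, Reader) override below targets.
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;

public static Analyzer wordNgramAnalyzer(final int minShingle, final int maxShingle) {
    return new Analyzer() {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Split on whitespace, then emit shingles of minShingle..maxShingle words.
            // ShingleFilter also emits the unigrams by default; minShingle must be >= 2.
            return new ShingleFilter(new WhitespaceTokenizer(reader), minShingle, maxShingle);
        }
    };
}

This analyzer should produce unigram, bigram, trigram, ... tokens, which I
can use both at indexing time and at query time.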
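
To make clear which tokens I expect, here is a minimal plain-Java sketch
that does not use Lucene at all. The class name WordNgrams and the method
ngrams are made up for illustration; it only mimics the idea of emitting
every word N-gram from minN to maxN words at each position, and it is not
meant to reproduce ShingleFilter's exact behaviour (separators, position
increments, filler tokens and so on).

```java
import java.util.ArrayList;
import java.util.List;

public class WordNgrams {

    // Hypothetical helper: returns every word n-gram with minN <= n <= maxN,
    // grouped by start position (shortest n-gram first at each position).
    public static List<String> ngrams(String text, int minN, int maxN) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out; // no words, no n-grams
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder();
            // Grow the n-gram one word at a time from the current start position.
            for (int n = 1; n <= maxN && start + n <= words.length; n++) {
                if (n > 1) {
                    sb.append(' ');
                }
                sb.append(words[start + n - 1]);
                if (n >= minN) {
                    out.add(sb.toString());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints the unigrams, bigrams and trigrams of the sample sentence.
        System.out.println(ngrams("please divide this sentence", 1, 3));
    }
}
```

With minN = 1 and maxN = 3 this yields unigrams, bigrams and trigrams in one
list, which is the token set I would like to index and query against.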
So, can anyone please tell me:
1) Is this the right approach to indexing and querying word N-grams?
2) Is there any way to assign weights to the N-grams, so that at query time
trigram-based tokens get a higher weight than unigram-based tokens?
(Something like the final Lucene score being an interpolation of the unigram
score, bigram score, trigram score, and so on.)
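
To make question 2 concrete, this is roughly the combination I have in mind,
written as a plain-Java sketch with no Lucene calls. The class name
NgramInterpolation, the method interpolate and the sample numbers are all
hypothetical; it assumes I can somehow obtain one score per N-gram order for
a document, which is exactly the part I do not know how to get from Lucene.

```java
public class NgramInterpolation {

    // Hypothetical helper: weights[i] and scores[i] belong to n-gram order i + 1.
    // Returns the weighted average sum(w_i * s_i) / sum(w_i).
    public static double interpolate(double[] weights, double[] scores) {
        if (weights.length != scores.length) {
            throw new IllegalArgumentException("weights and scores must have the same length");
        }
        double weightSum = 0.0;
        double weighted = 0.0;
        for (int i = 0; i < weights.length; i++) {
            if (weights[i] < 0.0) {
                throw new IllegalArgumentException("weights must be non-negative");
            }
            weightSum += weights[i];
            weighted += weights[i] * scores[i];
        }
        if (weightSum == 0.0) {
            throw new IllegalArgumentException("at least one weight must be positive");
        }
        return weighted / weightSum;
    }

    public static void main(String[] args) {
        // Made-up example: trigram matches count three times as much as unigram matches.
        double[] weights = {1.0, 2.0, 3.0};   // unigram, bigram, trigram
        double[] scores = {0.6, 0.3, 0.9};    // illustrative per-order scores
        System.out.println(interpolate(weights, scores));
    }
}
```

Normalising by the sum of the weights keeps the final score on the same
scale as the per-order scores; whether that is desirable is part of what I
am asking.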

Any help is much appreciated.

Thanks

-- 
-Regards,
 Rajen Chatterjee.

