
List:       lucene-user
Subject:    Clustering Carrot2 vs TermVector Analysis
From:       Andrew Boyd <andrew.boyd () mindspring ! com>
Date:       2005-05-30 15:07:37
Message-ID: 26330038.1117465657707.JavaMail.root () wamui-andean ! atl ! sa ! earthlink ! net

Hi All,
  By using the carrot demo:
http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html

I was able to easily cluster search results based on the fields used by Carrot2 (url,
title, and summary). However, I was wondering if there is a way to do something
similar using term vector analysis and the built-in TermVector / Similarity API.

Please bear with me as I'm just learning about term vector analysis mostly from:
http://www.miislita.com/term-vector/term-vector-1.html

where it discusses the term weight w_i = tf_i * IDF_i (the term's frequency in the
document times its inverse document frequency).

I've ordered the book Information Retrieval: Algorithms and Heuristics, but it has not
shown up yet.

Anyway, here is my question:

After doing a typical Lucene search, how can I get the top 5 "key terms" for each of
the top ten documents? I was thinking that I could sum these and then have a type of
cluster.
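
A rough sketch of that summing idea, in plain Java with no Lucene calls. It assumes the key terms and their weights for each hit are already available as one map per document; the class and method names (KeyTermSummer, sumWeights, docsByTerm) are made up for illustration, and grouping documents by shared key term is only one possible reading of "a type of cluster".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucene or Carrot2.
class KeyTermSummer {

    // Adds up each term's weight over all documents in the result list.
    static Map<String, Double> sumWeights(List<Map<String, Double>> keyTermsPerDoc) {
        Map<String, Double> totals = new HashMap<>();
        for (Map<String, Double> keyTerms : keyTermsPerDoc) {
            for (Map.Entry<String, Double> e : keyTerms.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return totals;
    }

    // Maps each key term to the 0-based hit-list positions of the documents
    // that list it; documents sharing a key term end up in the same group.
    static Map<String, List<Integer>> docsByTerm(List<Map<String, Double>> keyTermsPerDoc) {
        Map<String, List<Integer>> groups = new HashMap<>();
        for (int rank = 0; rank < keyTermsPerDoc.size(); rank++) {
            for (String term : keyTermsPerDoc.get(rank).keySet()) {
                groups.computeIfAbsent(term, t -> new ArrayList<>()).add(rank);
            }
        }
        return groups;
    }
}
```

Terms with the largest summed weight would be candidate cluster labels, and docsByTerm says which of the top hits fall under each label. A document can land in several groups this way.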

When we do a search, we have the query vector that we use to get the similarity used
for ranking. So when we do a query, the query terms are the "key terms". If we don't
have a query vector, is there a way to get the "key terms" from a document? Of course
there is tf, but everything I'm reading says that tf on its own is not ideal. So I
guess my question boils down to:

     Using the Lucene API, how can I get the top 5 terms by w_i = tf_i * IDF_i for a
     given document?
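
To make the computation concrete, here is a minimal sketch of the weighting itself in plain Java, with no Lucene calls. It assumes IDF_i = log(D / df_i), where D is the number of documents in the index and df_i is the number of documents containing term i; other IDF variants exist. The class and method names (TopTfIdfTerms, topTerms, WeightedTerm) are made up for illustration. My guess is that the per-document term frequencies could come from IndexReader.getTermFreqVector(docNumber, field), for fields indexed with term vectors, and the document frequencies from IndexReader.docFreq(term), but that wiring is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucene.
class TopTfIdfTerms {

    // One term together with its weight w_i = tf_i * IDF_i.
    static final class WeightedTerm {
        final String term;
        final double weight;

        WeightedTerm(String term, double weight) {
            this.term = term;
            this.weight = weight;
        }
    }

    // termFreqs: term -> tf within one document.
    // docFreqs:  term -> number of documents in the index containing the term.
    // numDocs:   total number of documents in the index (D).
    // Returns at most k terms, highest weight first.
    static List<WeightedTerm> topTerms(Map<String, Integer> termFreqs,
                                       Map<String, Integer> docFreqs,
                                       int numDocs, int k) {
        List<WeightedTerm> weighted = new ArrayList<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer df = docFreqs.get(e.getKey());
            if (df == null || df <= 0) {
                continue; // no document frequency known, so no IDF
            }
            double idf = Math.log((double) numDocs / df); // assumed IDF form
            weighted.add(new WeightedTerm(e.getKey(), e.getValue() * idf));
        }
        // Sort by descending weight; break ties alphabetically for stable output.
        weighted.sort((a, b) -> {
            int byWeight = Double.compare(b.weight, a.weight);
            return byWeight != 0 ? byWeight : a.term.compareTo(b.term);
        });
        return new ArrayList<>(weighted.subList(0, Math.min(k, weighted.size())));
    }
}
```

Calling topTerms(termFreqs, docFreqs, numDocs, 5) for each of the top ten hits would give the per-document "key terms". A term that occurs in every document gets IDF = log(1) = 0, so very common words drop to the bottom even when their tf is high.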

If you have any suggestions or if I'm off base I'd really appreciate the help.

Thanks,

Andrew

 




