List: lucene-user
Subject: Clustering Carrot2 vs TermVector Analysis
From: Andrew Boyd <andrew.boyd () mindspring ! com>
Date: 2005-05-30 15:07:37
Message-ID: 26330038.1117465657707.JavaMail.root () wamui-andean ! atl ! sa ! earthlink ! net
Hi All,
By using the carrot demo:
http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html
I was able to easily cluster search results based on the fields used by Carrot2
(url, title, and summary). However, I was wondering whether there is a way to do
something similar using term vector analysis and the built-in TermVector /
Similarity API.
Please bear with me as I'm just learning about term vector analysis mostly from:
http://www.miislita.com/term-vector/term-vector-1.html
which discusses the term weight w_i = tf_i * IDF_i.
I've ordered the book Information Retrieval: Algorithms and Heuristics, but it
has not shown up yet.
Anyway, here is my question:
After doing a typical Lucene search, how can I get the top 5 "key terms" for
each of the top ten documents? I was thinking that I could sum these and then
have a type of cluster.
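To make the "sum these" idea concrete, here is a rough sketch in plain Java. It
assumes the top key terms of each hit are already available as lists of strings,
and it simply groups document positions under every key term they share. The
class and method names (KeyTermGroups, groupByTerm) are made up for illustration
and are not part of Lucene or Carrot2.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: groups search hits by the key terms they have in common.
class KeyTermGroups {

    // keyTermsPerDoc.get(i) holds the key terms of the i-th hit.
    // Returns term -> positions of the hits that list that term, sorted by term.
    static Map<String, List<Integer>> groupByTerm(List<List<String>> keyTermsPerDoc) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (int doc = 0; doc < keyTermsPerDoc.size(); doc++) {
            // De-duplicate so a hit is listed at most once per term.
            for (String term : new LinkedHashSet<>(keyTermsPerDoc.get(doc))) {
                groups.computeIfAbsent(term, t -> new ArrayList<>()).add(doc);
            }
        }
        return groups;
    }
}
```

Terms that collect several hits could then serve as crude cluster labels. This
is only a grouping by shared terms, not a real clustering algorithm.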
When we do a search, we have the query vector that is used to compute the
similarity for ranking, so the query terms are in effect the "key terms". If we
don't have a query vector, is there a way to get the "key terms" from a
document? Of course there is tf, but everything I'm reading says that tf is not
ideal. So I guess my question boils down to this:
How, using the Lucene API, can I get the terms with the top 5 weights
w_i = tf_i * IDF_i for a given document?
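For illustration, the computation I have in mind looks roughly like the sketch
below. It is plain Java and makes several assumptions: the per-document term
frequencies and the index-wide document frequencies are passed in as maps, and
IDF is taken as log(N / df), which is only one common form. In Lucene I would
expect these numbers to come from IndexReader.getTermFreqVector(doc, field)
(for a field indexed with term vectors), IndexReader.docFreq(term) and
IndexReader.numDocs(), but I have not verified that wiring. The names
TopTerms, weight and topTerms are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: ranks the terms of one document by tf * idf.
class TopTerms {

    // w_i = tf_i * log(N / df_i); the log(N / df) form of IDF is an assumption.
    static double weight(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    // termFreqs: term -> frequency in the document
    // docFreqs:  term -> number of documents in the index containing the term
    // Returns up to k terms, highest weight first.
    static List<String> topTerms(Map<String, Integer> termFreqs,
                                 Map<String, Integer> docFreqs,
                                 int numDocs, int k) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer df = docFreqs.get(e.getKey());
            if (df == null || df <= 0) {
                continue; // no document frequency known, so no weight
            }
            scored.add(new AbstractMap.SimpleEntry<>(
                    e.getKey(), weight(e.getValue(), df, numDocs)));
        }
        // Sort by descending weight.
        scored.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, scored.size()); i++) {
            top.add(scored.get(i).getKey());
        }
        return top;
    }
}
```

Calling TopTerms.topTerms(termFreqs, docFreqs, numDocs, 5) would give the five
candidate "key terms" for one document. A term that occurs in every document
gets weight 0 under this form of IDF, however often it appears.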
If you have any suggestions or if I'm off base I'd really appreciate the help.
Thanks,
Andrew