List:       lucene-user
Subject:    Finding similar files with Lucene
From:       "Dmitri Mamrukov" <dym () att ! net>
Date:       2003-11-30 1:03:19

I'd like to hear some comments and guidance about this.

My project's requirements are summarized below:

1. The application takes as input the name of a directory (which will contain
some files) and the name of a query file, and returns the two files that are
most similar to the query file, using the TF-IDF scoring mechanism.
2. Now suppose that the files being compared are in HTML. Do the same as
above, with the additional condition that a term between <tag> and </tag>
counts twice as much as usual.

I chose Lucene as my toolkit. My approach to solving the problem is outlined
below:

1. For the preprocessing part, I index the document directory (the one
containing the files to be indexed). This is a straightforward operation: text
files are read in full and indexed, while HTML files are first parsed for
their text content, which is then indexed. (A minimal indexing sketch follows
after this list.)
2. For the searching part, I read the query file (for HTML files, the data is
parsed out first) through a TokenStream (which does some token clean-up) and
construct a (potentially long) query string, which is passed to QueryParser to
obtain a Query object. I then retrieve the search hits, if any. For boosting
specific terms, I use the Lucene boost factor syntax (see the sketch after the
getTokens method below).
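
Roughly, the indexing step looks like the following sketch for plain text
files (the index path, the field names "path" and "contents", and the use of
StandardAnalyzer are only illustrative; the HTML parsing is omitted here):

   import java.io.File;
   import java.io.FileReader;
   import java.io.IOException;

   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.Field;
   import org.apache.lucene.index.IndexWriter;

   // Indexes every file in dirName into a new index at indexPath.
   public static void indexDirectory(String dirName, String indexPath)
      throws IOException
   {
      IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);
      File[] files = new File(dirName).listFiles();
      for (int i = 0; i < files.length; i++)
      {
         Document doc = new Document();
         // The path is stored verbatim; the contents are tokenized and indexed.
         doc.add(Field.Keyword("path", files[i].getPath()));
         doc.add(Field.Text("contents", new FileReader(files[i])));
         writer.addDocument(doc);
      }
      writer.optimize();
      writer.close();
   }

The token-collection helper used in step 2 currently looks like this: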

   import java.io.IOException;
   import java.io.Reader;
   import java.util.HashSet;

   import org.apache.lucene.analysis.Analyzer;
   import org.apache.lucene.analysis.Token;
   import org.apache.lucene.analysis.TokenStream;

   // Gets the distinct tokens of a document to construct a query string
   // (the HashSet drops repeated terms automatically).
   public static HashSet getTokens(Reader reader, Analyzer analyzer)
      throws IOException
   {
      HashSet tokens = new HashSet();
      TokenStream stream = analyzer.tokenStream(null, reader);
      try
      {
         Token token;
         while ((token = stream.next()) != null)
         {
            tokens.add(token.termText());
         }
      }
      finally
      {
         stream.close();
      }
      return tokens;
   }
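
And, roughly, the query construction and search look like this (again, the
"contents" field, the index path, and the 2.0 boost factor are only
illustrative):

   import java.io.IOException;
   import java.util.HashSet;
   import java.util.Iterator;

   import org.apache.lucene.analysis.Analyzer;
   import org.apache.lucene.queryParser.ParseException;
   import org.apache.lucene.queryParser.QueryParser;
   import org.apache.lucene.search.Hits;
   import org.apache.lucene.search.IndexSearcher;
   import org.apache.lucene.search.Query;

   // Builds a whitespace-separated query string from the token set; terms
   // that occurred between the special tags get the "^2.0" boost suffix.
   // NOTE: analyzed terms may still contain QueryParser metacharacters.
   public static String buildQueryString(HashSet tokens, HashSet boostedTokens)
   {
      StringBuffer buffer = new StringBuffer();
      for (Iterator i = tokens.iterator(); i.hasNext();)
      {
         String t = (String) i.next();
         buffer.append(boostedTokens.contains(t) ? t + "^2.0" : t);
         buffer.append(' ');
      }
      return buffer.toString();
   }

   // Parses the generated query string against the "contents" field and
   // returns the hits, best matches first.
   public static Hits search(String indexPath, String queryString, Analyzer analyzer)
      throws IOException, ParseException
   {
      IndexSearcher searcher = new IndexSearcher(indexPath);
      Query query = QueryParser.parse(queryString, "contents", analyzer);
      return searcher.search(query);
   }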


NOTE: I'm aware that the same analyzer has to be used for both the indexing
and searching phases.

I'd like to know if this approach is sufficient. I'm concerned about using
QueryParser, since it is designed for human-entered text rather than for
program-generated text. Is there a way to construct a Query object, other than
going through QueryParser, that still satisfies the project requirements?
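
For example, would building the Query by hand, roughly along the lines below,
be a reasonable alternative? (The "contents" field name and the 2.0 boost are
again only illustrative.)

   import java.util.HashSet;
   import java.util.Iterator;

   import org.apache.lucene.index.Term;
   import org.apache.lucene.search.BooleanQuery;
   import org.apache.lucene.search.Query;
   import org.apache.lucene.search.TermQuery;

   // Builds a BooleanQuery of optional TermQuery clauses directly,
   // bypassing QueryParser; boosted terms get a boost of 2.0f.
   public static Query buildQuery(HashSet tokens, HashSet boostedTokens)
   {
      BooleanQuery query = new BooleanQuery();
      for (Iterator i = tokens.iterator(); i.hasNext();)
      {
         String t = (String) i.next();
         TermQuery termQuery = new TermQuery(new Term("contents", t));
         if (boostedTokens.contains(t))
         {
            termQuery.setBoost(2.0f);
         }
         // required = false, prohibited = false: an optional (OR-like) clause
         query.add(termQuery, false, false);
      }
      return query;
   }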

So far my application has behaved well on long files, albeit with somewhat
high CPU usage.

Thanks in advance,
Dmitri


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
