List:       lucene-user
Subject:    Augmenting an existing index (was: ACLs and Lucene)
From:       Sebastian Marius Kirsch <skirsch () sebastian-kirsch ! org>
Date:       2005-05-30 20:20:52
Message-ID: 20050530202052.GN915 () amok ! local

Hello,

I have a similar problem, for which ParallelReader looks like a good
solution -- except for the problem of creating a set of indices with
matching document numbers.

I want to augment the documents in an existing index with information
that can be extracted from the same index. (Basically, I am indexing a
mailing list archive, and want to add keyword fields to each document
that contain the message ids of its followup messages. That way, I
could quickly link from the original message to its followups.
Unfortunately, I don't know the ids of all followup messages until
after I have indexed the whole archive.)
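
To make the dependency concrete, here is a minimal sketch of the
lookup that can only be built after a full pass over the archive. It
is plain Java, not Lucene code. The Mail class, the followups() method
and the use of the In-Reply-To header to detect followups are
assumptions made for illustration; the real indexer may determine
followups differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FollowupMap {

    /** Hypothetical holder for the two headers that matter here. */
    public static final class Mail {
        final String messageId;
        final String inReplyTo; // null for a message that starts a thread

        public Mail(String messageId, String inReplyTo) {
            this.messageId = messageId;
            this.inReplyTo = inReplyTo;
        }
    }

    /**
     * Maps each message id to the ids of the messages that reply to it.
     * The list for a message is only complete once the whole archive
     * has been seen, which is why a single indexing pass is not enough.
     */
    public static Map<String, List<String>> followups(List<Mail> archive) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Mail mail : archive) {
            if (mail.inReplyTo == null) {
                continue; // thread starter, not a followup to anything
            }
            result.computeIfAbsent(mail.inReplyTo, key -> new ArrayList<>())
                  .add(mail.messageId);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Mail> archive = new ArrayList<>();
        archive.add(new Mail("<a>", null));
        archive.add(new Mail("<b>", "<a>"));
        archive.add(new Mail("<c>", "<a>"));
        System.out.println(followups(archive));
    }
}
```

Each list in the resulting map corresponds to the keyword fields I
would like to attach to the document of the original message.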

I tried to implement a FilterIndexReader that would add the required
information, but couldn't get that to work. (I guess there's more to
extending FilterIndexReader than just overriding the document() method
and tacking a few more keyword fields on to the document before
returning it.) When I add my FilterIndexReader to a new IndexWriter
with the .addIndexes() method, it seems to work, but when I try to
optimize the new index, I get the following error:

merging segments _0 (1900 docs)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 100203040
	at java.util.ArrayList.get(ArrayList.java:326)
	at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
	at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:66)
	at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:237)
	at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:185)
	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:92)
	at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:487)
	at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
	at org.sebastiankirsch.thesis.util.MailFilterIndexReader.main(MailFilterIndexReader.java:210)


If I don't optimize the index, I don't get an error, but Luke cannot
read the new index properly. I guess this has something to do with me
messing with the documents without properly adjusting the index terms
etc.

At the moment, I index the whole archive twice, and use the info from
the first index to add the missing fields to the second index. However,
it would save me a lot of work (and processing power, of course) if I
could just postprocess the index from the first pass without
re-indexing the messages. Furthermore, it would make it possible to
run additional postprocessing passes later. (I'm probably going to
need that soon.)

I presume that a ParallelReader could be merged into a single
index using addIndexes()? So if the problem of keeping the doc numbers
in sync can be solved ...
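
One way the doc numbers might be kept in sync, sketched in plain Java
rather than with Lucene calls: walk the first index in doc number
order and emit exactly one row per document, even when there is
nothing to add for it. This assumes that the first index has no
deleted documents and that a freshly written index numbers its
documents in the order they were added; I have not verified either
assumption against the ParallelReader code. The class and method names
are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class ParallelRows {

    /**
     * idsByDocNumber.get(n) is the message id stored in document n of
     * the first index. The result holds, at position n, the followup
     * ids that would go into document n of the parallel index.
     */
    public static List<List<String>> alignedRows(
            List<String> idsByDocNumber,
            Map<String, List<String>> followups) {
        List<List<String>> rows = new ArrayList<>(idsByDocNumber.size());
        for (String messageId : idsByDocNumber) {
            // Always add a row, possibly empty, so that position n here
            // stays equal to doc number n in the first index.
            rows.add(followups.getOrDefault(messageId,
                                            Collections.<String>emptyList()));
        }
        return rows;
    }
}
```

Each row would then be written as one document of the second index,
in this order, including an empty document for messages that have no
followups.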

Alternatively, I would welcome hints as to how to implement a
FilterIndexReader properly.

Thanks very much for your time,
Sebastian

On Mon, May 30, 2005 at 11:32:13AM -0400, Robichaud, Jean-Philippe wrote:
> What about:
> http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/index/ParallelReader.java?rev=169859&view=markup

-- 
Sebastian Kirsch <skirsch@sebastian-kirsch.org> [http://www.sebastian-kirsch.org/]
