List:       lucene-dev
Subject:    Re: BitSet implementation and large index
From:       Paul Elschot <paul.elschot () xs4all ! nl>
Date:       2005-02-14 19:30:38
Message-ID: 200502142030.38595.paul.elschot () xs4all ! nl

On Monday 14 February 2005 18:31, jian chen wrote:
> Hi,
> 
> In database system implementations, there is a type of index called
> a bitmap index. The bitset implementation could borrow ideas from
> database engine implementations.
> 
> You could collapse each run of 0's and store only its length, which
> might save a lot of memory.
> 
> There are various kinds of algorithms for doing this bitset
> compression. A good book for reference is "Database
> implementations" by Ullman and two other professors at Stanford
> University.
> 
> Cheers,
> 
> Jian
> 
> 
> On Mon, 14 Feb 2005 09:29:26 -0600 (CST), tony@simpleobjects.com
> <tony@simpleobjects.com> wrote:
> > It seems that for a huge index, it might be a good idea to use a different
> > implementation of the BitSet when doing filtering (assuming the
> > non-filtered set is relatively small).  This would really help minimize
> > the memory required for each filter operation.
> > 
> > Since the default implementation of BitSet allocates enough memory for
> > each position in the set, it seems overkill for a set that has a small
> > number of "on" values.
> > 
> > Any thoughts?

Here is a compact sparse filter in RAM:

http://issues.apache.org/bugzilla/show_bug.cgi?id=32921

It is implemented with VInts that encode the differences between successive
document numbers. A VInt is a compressed integer as defined in the Lucene
index file format: each byte carries seven bits of the non-negative value,
and the high bit of the byte says whether another byte follows.
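
As a rough illustration of the idea (not the code attached to the bug
report), document numbers in increasing order could be packed as VInt
differences along these lines. The class and method names here are made up
for the example, and the input is assumed to be sorted and non-negative:

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: stores sorted doc numbers as VInt-encoded gaps.
final class VIntDeltaSketch {

    // Write one non-negative int as a VInt: seven data bits per byte,
    // low-order bits first; the high bit is set on all but the last byte.
    public static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encode sorted doc numbers as the gap from the previous doc number
    // (the first gap is measured from 0).
    public static byte[] encode(int[] sortedDocs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int doc : sortedDocs) {
            writeVInt(out, doc - previous);
            previous = doc;
        }
        return out.toByteArray();
    }

    // Decode by reading gaps front to back and summing them up again.
    public static int[] decode(byte[] data) {
        List<Integer> docs = new ArrayList<>();
        int pos = 0;
        int previous = 0;
        while (pos < data.length) {
            int b = data[pos++];
            int delta = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = data[pos++];
                delta |= (b & 0x7F) << shift;
            }
            previous += delta;
            docs.add(previous);
        }
        int[] result = new int[docs.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = docs.get(i);
        }
        return result;
    }
}
```

Any gap below 128 takes a single byte, which is why closely spaced document
numbers pack so tightly. The data can only be read from the start, in
increasing document order.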

It takes less memory than a BitSet when fewer than roughly 1 in 8 docs
pass the filter. In that case it uses about one byte per passing document,
where a BitSet uses one bit for every document in the index.
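
The 1 in 8 figure follows from simple arithmetic: a bit set costs one bit
per document in the index, while the sparse form costs about one byte per
passing document. A back-of-the-envelope sketch, with made-up names and
under the assumption that every gap fits in a single VInt byte (gaps of 128
or more take extra bytes, so this is only an estimate):

```java
// Hypothetical estimate of filter memory use; not measured numbers.
final class FilterMemorySketch {

    // Bytes for a plain bit set: one bit per document in the index.
    public static long bitSetBytes(long maxDoc) {
        return (maxDoc + 7) / 8;
    }

    // Bytes for the sparse form, assuming one VInt byte per passing doc.
    public static long sparseBytes(long passingDocs) {
        return passingDocs;
    }

    // True when the sparse form is estimated to be the smaller of the two.
    public static boolean sparseIsSmaller(long maxDoc, long passingDocs) {
        return sparseBytes(passingDocs) < bitSetBytes(maxDoc);
    }
}
```

For an index of 8,000,000 docs the bit set needs 1,000,000 bytes, so under
this estimate the sparse form wins as long as fewer than 1,000,000 docs
pass the filter.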

This is a FilteredQuery that can be used with the sparse filter:
http://issues.apache.org/bugzilla/show_bug.cgi?id=32965

To use it with a BooleanQuery, you need the BooleanQuery from the
development version.
The current BooleanScorer can score documents out of order, which is
not compatible with this filter: the stored document number differences
can only be traversed in increasing document order.

Regards,
Paul Elschot

