List:       lucene-dev
Subject:    RE: new version of IndexWriter.java
From:       Doug Cutting <DCutting () grandcentral ! com>
Date:       2002-02-27 21:29:30

It would be good to also know the average size of your documents, the size
of your index, and the amount of RAM required for each benchmark.

Lucene currently indexes using very little memory.  You're making it faster
by using more RAM.  In particular you're able to get a 10% speedup (58
versus 63 seconds) by using on the order of 5x the memory (100 docs in
memory versus 20).  That's not a great tradeoff, but it might be worth it to
some folks.
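
As a quick check of that arithmetic, here is a small sketch. The class and method names are made up for illustration, and using "docs held in memory" as a proxy for RAM use is an assumption, since the thread gives no measured memory figures. The timings are the ones quoted below.

```java
public class TradeoffCheck {
    /** Fraction of run time saved when going from baselineSeconds to newSeconds. */
    static double speedup(double baselineSeconds, double newSeconds) {
        return (baselineSeconds - newSeconds) / baselineSeconds;
    }

    /** Ratio of documents held in memory, used here as a rough proxy for RAM use. */
    static double memoryRatio(int newDocsInRam, int baselineDocsInRam) {
        return (double) newDocsInRam / baselineDocsInRam;
    }

    public static void main(String[] args) {
        // 63 s: IndexWriter with mergeFactor=20; 58 s: IndexWriter2 with maxDocsInRam=100
        System.out.println("time saved (fraction): " + speedup(63, 58));
        System.out.println("docs-in-memory ratio:  " + memoryRatio(100, 20));
    }
}
```

The time saved works out to 5/63, a bit under 8%, which is in the neighbourhood of the 10% mentioned above.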

Doug

> -----Original Message-----
> From: Ivaylo Zlatev [mailto:IZlatev@entigen.com]
> Sent: Wednesday, February 27, 2002 12:29 PM
> To: Lucene Developers List
> Subject: RE: new version of IndexWriter.java
>
>
>
> My benchmarks show that my IndexWriter2.java performs better than the
> original IndexWriter.java,
> and - very important - preserves file system handles.
>
> Here are results of indexing one and the same 11800 records on
> a poor SunBlade machine (1 CPU, 450 MHz, decent IDE hard drive) with
> Solaris 8 OS.
> My "ulimit -n" (i.e. number of available file handles) is set to 1000 in
> all tests.
> I am using JDK 1.3.1 with an upper memory limit of 512 MB.
>
> IndexWriter with mergeFactor=10 : 93 seconds
> IndexWriter with mergeFactor=20 : 63 seconds
> IndexWriter with mergeFactor=50 : 50 seconds
> IndexWriter with mergeFactor=60 : 48 seconds
> IndexWriter with mergeFactor=70 : 48 seconds
> IndexWriter with mergeFactor=80 : Exception when adding ~6300-th
> Document: too many open files
> IndexWriter with mergeFactor=100: Exception when adding ~9900-th
> Document: too many open files
>
> IndexWriter2 with maxDocsInRam=100  , mergeFactor=10 (used only at the
> end of indexing, during optimize() ) : 58 seconds
> IndexWriter2 with maxDocsInRam=300  , mergeFactor=10 (used only at the
> end of indexing, during optimize() ) : 45 seconds
> IndexWriter2 with maxDocsInRam=500  , mergeFactor=10 (used only at the
> end of indexing, during optimize() ) : 42 seconds
> IndexWriter2 with maxDocsInRam=700  , mergeFactor=10 (used only at the
> end of indexing, during optimize() ) : 42 seconds
> IndexWriter2 with maxDocsInRam=1000 , mergeFactor=10 (used only at the
> end of indexing, during optimize() ) : 43 seconds
> IndexWriter2 with maxDocsInRam=2000 , mergeFactor=10 (used only at the
> end of indexing, during optimize() ) : 46 seconds
> IndexWriter2 with maxDocsInRam=5000 , mergeFactor=10 (used only at the
> end of indexing, during optimize() ) : 46 seconds
> IndexWriter2 with maxDocsInRam=10000, mergeFactor=10 (used only at the
> end of indexing, during optimize() ) : 50 seconds
> IndexWriter2 with maxDocsInRam=20000, mergeFactor=10 (used only at the
> end of indexing, during optimize() ) : 56 seconds
>
> As you can see, IndexWriter2 with default settings (mergeFactor=10,
> maxDocsInRam=2000) is twice as fast as
> IndexWriter with default settings (mergeFactor=10): 46 seconds
> compared to 93 seconds.
>
> Maybe you will ask why the indexing time in IndexWriter2 increases
> when we increase maxDocsInRam above 1000 records.
> I assume that the OS maintains some write buffer in memory, which can
> hold up to 5000 of my records. When maxDocsInRam is big, it transfers
> a big segment (of 10000 records, for example) to the file system, which
> fills up that buffer. Probably on Windows these results will look much
> different. If anyone is interested, I can run them on Windows 2000.
>
>
> Now about the PriorityQueue object: The
> org.apache.lucene.util.PriorityQueue uses a cool partial
> ordering of its elements, which only a sick genius can invent. I have
> looked at other PriorityQueue
> objects and they look as plain as you can imagine (I don't know about
> the one in Jakarta's Commons Collections, though).
> The org.apache.lucene.util.PriorityQueue looks like it was quickly
> ported from another language - probably C, but
> was not polished enough. For example, the internal array that the queue
> uses is twice as big as necessary, which is a big waste of memory.
> Anyway, the new PriorityQueue addresses all issues, but someone has to
> incorporate it in Lucene.
>
> Regards, Ivaylo
>
>
>
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: Tuesday, February 26, 2002 11:53 AM
> To: Lucene Developers List
> Subject: Re: new version of IndexWriter.java
>
>
> Ivaylo,
>
> Thanks for the contribution.  It sounds good, although I haven't looked
> at it yet.  Do you have any performance numbers?  I'm curious how it
> compares to the original IndexWriter.
>
> As for your PriorityQueue, it's still sitting flagged in my Lucene
> folder for review.
> I've been meaning to send a reply with the following question, not just
> for you, but for Doug and others as well:
> Is there anything special, anything Lucene-specific in that
> PriorityQueue?  If not, there is a PriorityQueue implementation in
> Jakarta's Commons Collections sub-project which we could (re)use
> instead of having our own.  On the other hand, this requires that we
> include the collections jar in lib.
>
> Just some thoughts.
> In any case, sorry for not replying, the contribution _is_ appreciated.
>
> Otis
>
>
> --- Ivaylo Zlatev <IZlatev@entigen.com> wrote:
> >
> > Yesterday I was inspired by the conversation on the dev. list about
> > indexing in memory, etc
> > and I wrote a new version of IndexWriter.java (it is named
> > IndexWriter2.java). Find the attached file here. The code is stable and
> > worth a try. The following is from the javaDocs for this file:
> >
> > /**
> >  * IndexWriter2 is a modification of the original IndexWriter, coming
> >  * with lucene. It benefits from a RAMDirectory, which IndexWriter has
> >  * as well. The original IndexWriter treats the segments in the
> >  * RAMDirectory no different from the segments in the target directory,
> >  * where the index is being built. For example, it ALWAYS merges
> >  * RAMDirectory segments in the target directory. Here, we optimize the
> >  * usage of RAMDirectory in the following way:<br>
> >  *
> >  * When a new Document is added, a new segment for it is created in
> >  * RAMDirectory. When the RAMDirectory collects 'maxDocsInRam' (this is
> >  * a new important setting, the default is 10000) 1-document segments,
> >  * IndexWriter2 will merge them into one 10000-document segment in
> >  * RAMDirectory (here is a difference from IndexWriter). Then it moves
> >  * this segment from the RAMDirectory to the target directory (usually a
> >  * file system directory). This way, during indexing, IndexWriter2 will
> >  * be writing segments of equal size (equal to maxDocsInRam) to the
> >  * target directory. In other words, during indexing only one
> >  * file-system segment is opened and dealt with, which uses just a few
> >  * file handles. No more "Too many open files" exceptions.<br>
> >  *
> >  * After indexing is finished, it is good to call optimize() to merge
> >  * all created segments into one. The RAMDirectory is out of the picture
> >  * here and is not being used. Here is where we use the mergeFactor
> >  * setting: a total of mergeFactor+1 segments will be merged at once
> >  * into one new segment. This happens in a loop, until only 1 segment is
> >  * left. Here you can get a "Too many open files" exception, if your
> >  * mergeFactor is large. If you set mergeFactor to 1, it will merge only
> >  * 2 segments at a time, which will preserve the file handles, but will
> >  * be a bit slower than a merge with mergeFactor=10, for example.<br>
> >  *
> >  * At the end of mergeSegments() there was originally code that, if a
> >  * segment file can't be deleted (because it's currently opened in
> >  * Windows), stores its name in a file named 'deletable', so that it
> >  * can try to delete it later. I believe there was some bug with not
> >  * closing the merged segments properly, which was the reason for all of
> >  * this. Anyway, now there are no problems with deleting these files on
> >  * Windows and therefore the code reading and writing the 'deletable'
> >  * file is commented out.<br>
> >  *
> >  * @author Ivaylo Zlatev (ivaylo_zlatev@yahoo.com)
> >  */
> >
> >
> > Two weeks ago I sent an improved PriorityQueue, fixing important memory
> > issues and much more. I just wasted my time - no response at all.
> > Hopefully this time my code will be more useful.
> >
> > Regards, Ivaylo
> >  <<IndexWriter2.java>>
> >
>
> > ATTACHMENT part 2 application/octet-stream name=IndexWriter2.java
> > --
> > To unsubscribe, e-mail:
> > <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail:
> <mailto:lucene-dev-help@jakarta.apache.org>
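
The buffering scheme described in the quoted IndexWriter2 javadoc can be illustrated with a toy model. This is only a sketch and not the attached IndexWriter2.java: it uses no Lucene classes, a "segment" is just a document count, and every name other than maxDocsInRam is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model: a segment is represented only by the number of documents in it. */
public class BufferedWriterSketch {
    private final int maxDocsInRam;
    private int docsInRam = 0; // stand-in for the 1-document segments held in RAM
    private final List<Integer> diskSegments = new ArrayList<>();

    public BufferedWriterSketch(int maxDocsInRam) {
        if (maxDocsInRam < 1) {
            throw new IllegalArgumentException("maxDocsInRam must be >= 1");
        }
        this.maxDocsInRam = maxDocsInRam;
    }

    /** Buffer one document; once maxDocsInRam are buffered, write them out as one segment. */
    public void addDocument() {
        docsInRam++;
        if (docsInRam == maxDocsInRam) {
            flush();
        }
    }

    /** Move whatever is buffered to "disk" as a single segment. */
    public void flush() {
        if (docsInRam > 0) {
            diskSegments.add(docsInRam);
            docsInRam = 0;
        }
    }

    /** Document counts of the segments written so far, in write order. */
    public List<Integer> diskSegments() {
        return diskSegments;
    }

    public static void main(String[] args) {
        BufferedWriterSketch writer = new BufferedWriterSketch(2000);
        for (int i = 0; i < 11800; i++) {
            writer.addDocument();
        }
        writer.flush(); // write out the final partial batch
        System.out.println(writer.diskSegments());
    }
}
```

With the 11800 records and maxDocsInRam=2000 from the benchmark, the model ends up with five segments of 2000 documents and one of 1800, and only one on-disk segment is being written at any moment.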

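
The optimize() loop described in the quoted javadoc (merge up to mergeFactor+1 segments into one, repeat until a single segment is left) can be counted with a few lines of code. This is a sketch under that description only; the class and method names are hypothetical, and it counts merges rather than time or file handles.

```java
public class OptimizeSketch {
    /** Number of merges needed to reduce 'segments' to one, merging up to mergeFactor+1 at a time. */
    static int mergePasses(int segments, int mergeFactor) {
        if (segments < 1 || mergeFactor < 1) {
            throw new IllegalArgumentException("segments and mergeFactor must be >= 1");
        }
        int passes = 0;
        while (segments > 1) {
            int merged = Math.min(segments, mergeFactor + 1);
            segments = segments - merged + 1; // the merged inputs are replaced by one output
            passes++;
        }
        return passes;
    }

    /** Largest number of input segments open at once during any single merge. */
    static int maxSegmentsPerMerge(int segments, int mergeFactor) {
        return Math.min(segments, mergeFactor + 1);
    }

    public static void main(String[] args) {
        System.out.println(mergePasses(6, 10)); // six segments, mergeFactor=10
        System.out.println(mergePasses(6, 1));  // six segments, mergeFactor=1
    }
}
```

For six on-disk segments, mergeFactor=10 finishes in a single merge that reads all six at once, while mergeFactor=1 needs five merges of two segments each. That is the trade between open file handles and merge work that the javadoc describes.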
