'Perl progress'


List:       lucene-dev
Subject:    Perl progress
From:       Marvin Humphrey <marvin () rectangular ! com>
Date:       2005-09-27 21:45:07
Message-ID: C598BAF0-739E-4FB3-BBA5-143B841A9771 () rectangular ! com

Greets,

A week ago, a revamped indexer based on my Perl search engine  
library, KinoSearch, successfully built a Lucene-compatible index.   
The corpus was 1,000 documents from Wikipedia.

Better, it did so in a reasonable amount of time:

Time to index 1000 docs on my G4 laptop
=======================================
Plucene 1.25                   270 secs
KinoSearch 0.05_02              20 secs
Java Lucene                      9 secs

There are a number of fundamental architectural differences between  
KinoSearch and Lucene, and by extension between KinoSearch and  
Plucene, which is largely a faithful port of Java Lucene.  The most  
important of these is the merge model, which I plan to address in a  
separate post, but briefly: Lucene builds a miniature inverted index
for each document, then merges them into ever larger indexes on a
schedule determined by mergeFactor.  KinoSearch builds indexes one
segment at a time, and no coherent mini inverted index smaller than a
segment ever exists.
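
To make the contrast concrete, here is a rough Python sketch of the
two shapes.  It is a toy under stated assumptions, not code from
either project: the function names are made up, the "index" is just a
dict of term => doc ids, the level-by-level merging is a simplified
stand-in for Lucene's mergeFactor schedule, and the "collect postings,
sort once" approach is only one plausible reading of building a
segment at a time.

```python
from collections import defaultdict


def invert(doc_id, text):
    """Build a tiny inverted index (term -> [doc_id]) for one document."""
    return {term: [doc_id] for term in set(text.lower().split())}


def merge(indexes):
    """Merge several inverted indexes into one, keeping doc ids sorted."""
    merged = defaultdict(list)
    for index in indexes:
        for term, doc_ids in index.items():
            merged[term].extend(doc_ids)
    return {term: sorted(ids) for term, ids in merged.items()}


def index_by_merging(docs, merge_factor=10):
    """Merge-style (assumed simplification): one mini-index per document,
    merged upward whenever merge_factor indexes pile up at one level."""
    levels = [[]]  # levels[n] holds indexes produced by n rounds of merging
    merge_count = 0
    for doc_id, text in enumerate(docs):
        levels[0].append(invert(doc_id, text))
        n = 0
        while len(levels[n]) >= merge_factor:
            if n + 1 == len(levels):
                levels.append([])
            levels[n + 1].append(merge(levels[n]))
            levels[n] = []
            merge_count += 1
            n += 1
    leftovers = [index for level in levels for index in level]
    return merge(leftovers), merge_count


def index_by_segment(docs):
    """Segment-style (assumed): gather raw (term, doc_id) postings for the
    whole segment, sort them once, and only then build the index."""
    postings = []
    for doc_id, text in enumerate(docs):
        postings.extend((term, doc_id) for term in set(text.lower().split()))
    postings.sort()
    index = defaultdict(list)
    for term, doc_id in postings:
        index[term].append(doc_id)
    return dict(index)
```

Both functions produce the same term => doc ids mapping; the
difference is how many intermediate indexes get built and thrown away
along the way.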

Two other important differences:

1) KinoSearch requires that all fields be defined in advance when  
creating a segment.  The Documents which you add may not contain  
fields which have not been declared, and you cannot update the  
definition of a field once it is set.  Segments with differing field  
defs can be reconciled -- you just can't change up a def in the  
middle of creating a segment. Additionally, KinoSearch will not merge  
fields with the same fieldname -- it will overwrite.  Insisting on  
rigid field definitions means that the KinoSearch equivalents of  
FieldInfos, DocumentWriter, FieldsWriter, FieldInfosWriter,  
TermInfosWriter and such can all be instantiated once per segment,  
rather than once per document; in Perl, with its comparatively  
sluggish OO implementation, that adds up.
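
As an illustration of what "defined in advance" buys, here is a
minimal Python sketch.  SegmentWriter, add_doc and field_defs are
hypothetical names chosen for this example and are not KinoSearch's
actual API; the point is only that the field definitions are frozen
when the writer is created, and the writer itself is created once per
segment rather than once per document.

```python
class SegmentWriter:
    """Hypothetical per-segment writer with field definitions fixed up front."""

    def __init__(self, field_defs):
        # field_defs maps field name -> dict of options.  Copy it so that
        # later changes by the caller cannot alter the definitions.
        self._field_defs = {name: dict(opts) for name, opts in field_defs.items()}
        self._docs = []

    def add_doc(self, doc):
        """Accept a doc (field name -> value); reject undeclared fields."""
        unknown = set(doc) - set(self._field_defs)
        if unknown:
            raise KeyError(f"undeclared field(s): {sorted(unknown)}")
        # A dict holds one value per field name, so setting the same
        # field twice overwrites rather than merges.
        self._docs.append(dict(doc))

    @property
    def doc_count(self):
        return len(self._docs)
```

Usage would look something like this:

```python
writer = SegmentWriter({"title": {"indexed": True}, "body": {"indexed": True}})
writer.add_doc({"title": "Perl progress", "body": "Greets"})
# writer.add_doc({"author": "someone"}) would raise KeyError
```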

2) Analyzers in KinoSearch deal with batches of tokens rather than  
streams.  The concept of a TokenStream simply does not translate into  
efficient Perl.
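
For anyone unsure what the distinction looks like, here is a small
Python sketch of the same two-stage analysis chain written both ways.
All of the function names are invented for the example.  Python
generators do not carry the per-token method-call cost that makes the
stream style painful in Perl, so this shows only the shape of the two
designs, not their relative speed.

```python
def lowercase_stream(tokens):
    """Stream style: hand over one token at a time, on request."""
    for token in tokens:
        yield token.lower()


def stop_stream(tokens, stoplist):
    """Stream style: pass through each token that is not a stopword."""
    for token in tokens:
        if token not in stoplist:
            yield token


def lowercase_batch(tokens):
    """Batch style: take the whole list of tokens, return a whole list."""
    return [token.lower() for token in tokens]


def stop_batch(tokens, stoplist):
    """Batch style: filter the whole list in one pass."""
    return [token for token in tokens if token not in stoplist]


def analyze_stream(text, stoplist):
    # Each stage pulls tokens one by one from the stage before it.
    return list(stop_stream(lowercase_stream(iter(text.split())), stoplist))


def analyze_batch(text, stoplist):
    # Each stage runs to completion before the next one starts.
    return stop_batch(lowercase_batch(text.split()), stoplist)
```

Both chains return the same tokens; in the batch version each stage
is called once per field instead of once per token.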

It may be possible to squeeze more speed out of this indexer, but  
that's no longer a top priority.  Top priority is: adapt KinoSearch's  
search modules to work with the Lucene file format.  After that, the  
goal will be to implement a limited, maintainably small subset of  
Lucene's functionality.  For instance, I only plan to support  
composite indexes written by Lucene 1.9 (or whatever version starts  
writing valid UTF-8) or later.

The code is still a little messy, but if you'd like to snoop it, you  
have the option of either a tarball or viewcvs from here:

http://www.rectangular.com/kinosearch/

I'll be directing attention towards one particular section of code in  
my post on merge models.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

