'[KinoSearch] the state of the onion'


List:       kinosearch
Subject:    [KinoSearch] the state of the onion
From:       marvin () rectangular ! com (Marvin Humphrey)
Date:       2006-03-08 14:38:00
Message-ID: FF7AE634-6516-4046-AE0A-508215F202DB () rectangular ! com


On Mar 8, 2006, at 4:38 AM, Hundt, Richard wrote:

> KinoSearch looks sexeh, except that we're not too keen
> on tying ourselves into unstable libraries - `alpha' being a bit scary,

The main reason that KinoSearch has an alpha label on it is that the  
Perl5/CPAN model for including libraries makes it impossible to make  
backwards-incompatible changes without causing significant  
disruptions or releasing in a new namespace.  For compile-first  
languages like Java and C, a major version increment introducing an  
incompatible API change may be a pain to integrate, but it's less
likely to trigger catastrophic failure of somebody's live app
immediately upon installation.

KinoSearch isn't a new project; it's been in development off and on  
(mostly on) for a year and a half.  The version number may be 0.06,  
but you could argue we're really approaching 2.0.  Search::Kinosearch  
0.02x really was 1.0, and although it had an alpha label on it too,  
it was a pretty reasonable library for something that ambitious: a  
bit buggy and the index files were too big, but workable.  There's  
actually an unreleased branch of Search::Kinosearch, 0.03, with a  
reworked file format, improved speed, and an Inline::C scoring  
algorithm.  It was a week or two away from release when Larry Wall  
convinced me I should "seek convergence" with Lucene.

My first go at convergence was an attempt to fix Plucene's  
performance problems by hacking in pieces of Search::Kinosearch.   
That odyssey is chronicled in the archives of the Plucene mailing  
list from last August/September.  It ended when I concluded that  
Plucene is saddled with a fundamental architectural issue which  
precludes the needed quantum speed improvements barring a
from-scratch rewrite.

The present KinoSearch started out as that rewrite, using Lucene ~1.9  
as a skeleton.  You can't just translate Lucene faithfully into Perl,  
though; it needs a *lot* of changes.  Fortunately, this isn't the  
first time I've written a search engine library for Perl, so good  
alternatives are known in nearly every case, and if I thought there  
was a better way of doing something, that's what got used.  Case in  
point: Lucene's Highlighter doesn't deal with phrases properly;  
KinoSearch's Highlighter, which was adapted from the  
Search::Kinosearch 0.03 branch, does.

Here's how the various engines stack up for indexing 1000 Wikipedia  
docs on my G4 laptop:

Plucene 1.24                270 secs
Search::Kinosearch 0.021     88 secs
Search::Kinosearch 0.03_02   35 secs
KinoSearch 0.06              17 secs
Java Lucene (~1.9 from svn)   9 secs
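
To put those timings in relative terms, here is a minimal Python sketch
that expresses each engine's time as a multiple of the KinoSearch 0.06
time. The numbers are copied straight from the table above; the names
(timings, baseline, relative_to_kinosearch, ratios) are purely
illustrative and not part of any of these libraries.

```python
# Indexing times in seconds for 1000 Wikipedia docs, copied from the table.
timings = {
    "Plucene 1.24": 270,
    "Search::Kinosearch 0.021": 88,
    "Search::Kinosearch 0.03_02": 35,
    "KinoSearch 0.06": 17,
    "Java Lucene (~1.9 from svn)": 9,
}

# Use the KinoSearch 0.06 time as the reference point.
baseline = timings["KinoSearch 0.06"]


def relative_to_kinosearch(seconds):
    """Return a time as a multiple of the KinoSearch 0.06 time."""
    return seconds / baseline


# Ratio per engine, rounded to two decimal places for display.
ratios = {
    name: round(relative_to_kinosearch(secs), 2)
    for name, secs in timings.items()
}

for name, ratio in ratios.items():
    print(f"{name:<28} {ratio:>6.2f}x the KinoSearch 0.06 time")
```

A ratio above 1 means the engine took longer than KinoSearch 0.06 on
this one test; a ratio below 1 means it was faster.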

> but probably a fact, seeing a few stub'ish looking modules in there.

The stub'ish modules are MultiReader, MultiTermDocs, SegMerger, and  
DelDocs.  All are involved in multi-segment invindexes and  
incremental indexing, which I'm working on right now.

> do you have any idea how long it may take to pull it out of alpha;

I'm planning on drawing it out for a few months longer.  Were it not  
for the Perl5/CPAN versioning dilemma, I'd be tempted to remove the  
alpha label and declare 1.0 within a release or two, then work within  
the normal major/maintenance release framework, saving
backwards-incompatible changes to the API and the file format for major version
increments.  However, since major version increments and CPAN don't  
play well together, the current plan is to wait until a fair number  
of users have battle tested the interface, revising bit by bit as  
needed.  When 1.0 comes out, it should be solid.

Ideally, 1.0 will be solid enough that KinoSearch won't see a
back-compatibility-breaking 2.0 release in the Perl5 era.  That
depends a little bit on how important it is to stay in sync with  
Lucene.  It isn't in sync now, so that's gravy.

The main item still in flux is the file format, which changed between  
0.05 and 0.06 and will change again.  An excellent bit of news: Doug  
Cutting, the original author of Lucene, filed an "improvement"-type  
bug-report a week ago indicating that Lucene is targeting byte-count  
based Strings for version 2.1.  I'm the person who's done the most  
work on that, so it will probably be up to me to make it happen.  If  
it does, then Lucene 2.1 / KinoSearch 1.0 index file format  
compatibility can become a reality... once Lucene 2.1 is released,  
which is ??? (another potential reason to keep the alpha label).

> and do you have a growing user base
> who can understand the point of doing it, when there's Plucene, who
> would keep the project alive if you lose froopy for it, or get a full
> time job which simply sucks up too much time.

We don't have a lot of bodies yet.  People have told me they are  
waiting for two things: incremental indexing, and removal of the  
alpha label.  Once incremental indexing is implemented, I think we'll  
see more activity.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

