List:       kde-core-devel
Subject:    Re: KHelpCenter + htdig
From:       Andreas Pour <pour () mieterra ! com>
Date:       2001-05-16 19:36:51

Stephan Kulow wrote:
> 
> On Wednesday 16 May 2001 03:09, Preston Brown wrote:
> > On Monday 14 May 2001 02:02 pm, Daniel Naber wrote:
> > > On Monday 14 May 2001 19:21, Stephan Kulow wrote:
> > > > I can tell you at least that the htdig code does some nasty assumptions
> > > > at quite some points that assume it's using a web server.
> > >
> > > Actually writing a fast and simple search isn't that difficult. If you
> > > allow me to write one in Perl, I'll do it (I'd adopt Perlfect Search). I
> > > just guess you won't like KDE requiring Perl?
> >
> > http://freshmeat.net/projects/mifluz/
> > http://freshmeat.net/projects/swish/
> OK, I looked at both now and both are not what we need. mifluz is created out
> of parts of htdig and misses the search features htdig has, but is optimized
> on large amount of data, so exactly the opposite to what we have. We do not
> have a mass on data, but we'd need a good way to search the data we have.
> 
> swish is even less suited, as it's centered around english and latin1 that
> makes it about useless and beside that it didn't work in the sample examples
> I tried, so I'd say forget about that choice.

Hi,

I use swish++ (http://homepage.mac.com/pauljlucas/software/swish/) for
the KDE mailing lists, and it is quite fast even though the database is >
300MB.  It does not require a separate database (though it does require
mmap(2) and a decent STL).  Unfortunately, as is it is English-only, but
the FAQ indicates the only things to change for other languages are the
stop words (that is relatively easy; these are just words like "a", "the"
and "in" which should not be indexed) and the "is_ok_word()" function.
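
To make the two language-dependent pieces concrete, here is a rough
sketch in Python.  It is not taken from the swish++ sources: the word
lists, the per-language table and the body of is_ok_word() are all
assumptions for illustration; only the function name comes from the FAQ.

```python
# Hypothetical sketch; the stop-word lists and the acceptance rule
# below are illustrative assumptions, not swish++'s real behaviour.
STOP_WORDS = {
    "en": {"a", "the", "in", "of", "and"},
    "de": {"der", "die", "das", "und", "in"},
}


def is_ok_word(word, min_len=2):
    """Accept words that are long enough, alphanumeric, and contain a letter."""
    return (
        len(word) >= min_len
        and word.isalnum()                      # works for non-ASCII letters too
        and any(ch.isalpha() for ch in word)    # reject pure numbers
    )


def index_terms(text, lang):
    """Return the lower-cased words of `text` that should be indexed."""
    stop = STOP_WORDS.get(lang, set())
    terms = []
    for raw in text.split():
        word = raw.strip('.,;:!?"()').lower()
        if is_ok_word(word) and word not in stop:
            terms.append(word)
    return terms
```

With a table like this, supporting another language would mostly mean
adding one more stop-word set, provided the word test itself does not
assume ASCII.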

One I have been meaning to check out is ASPSeek
(http://www.aspseek.org/), which is also written in C++.  Though I have
not tried it yet, it is advertised as supporting Unicode and multiple
languages per document.  It also features phrase searching, query-word
highlighting and HTML templates for search results.

> 
> I thought again and the indexing can be done very easily taking our little
> amount of data. Problem is more to have a fast way of searching. For simple
> boolean search this is no problem, but I'd like to have something like fuzzy
> search. Problem is to find an algorithm that defines if two words are "about
> the same" _independent of the language_ and it would most likely work on all
> words. So if you do a fuzzy search on "play CD" I'd expect it to find "CD
> player", but that would require to go through all words in our documents (of
> course indexed) and look if it's similar to "play" or to "CD".

The technique used for this is usually called "stemming" (or "word
stemming").  It basically first converts something like "playing" to
"play" and then uses a dictionary (usually ispell/aspell) to find all
derived words (like plays, playing, etc.).  Swish++ supports this; I
don't know about ASPSeek.
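
As a rough illustration of the idea, here is a minimal sketch in Python.
Note that it swaps in a simpler variant: instead of expanding the query
through an ispell/aspell dictionary, it stems both the indexed words and
the query words with a crude suffix stripper and compares the stems.  The
suffix list is an English-only assumption, and none of this is swish++
code.

```python
# Hypothetical sketch: crude suffix stripping, English-only by assumption.
SUFFIXES = ("ing", "ers", "er", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(stem(word), set()).add(doc_id)
    return index


def search(index, query):
    """Return the documents that contain every stemmed query word."""
    hits = [index.get(stem(word), set()) for word in query.split()]
    return set.intersection(*hits) if hits else set()
```

With documents such as "CD player", "playing music" and "burn CDs", a
search for "play CD" in this sketch matches only the first, because
"player" and "play" share a stem.  Since the stems are computed once at
indexing time, the search itself does not have to walk through every
word in the documents.  The suffix list is of course the
language-dependent part.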

Hope there's something useful here.

Dre
