List:       kde-core-devel
Subject:    Re: KHelpCenter + htdig
From:       Holger Schurig <holgerschurig () gmx ! de>
Date:       2001-05-16 20:29:49

> > ... but I'd
> > like to have something like fuzzy search. The problem is to find an
> > algorithm that decides whether two words are "about the same"
> > _independent of the language_, and that would most likely work on
> > all words. So if you do a fuzzy search on "play CD" I'd expect it to
> > find "CD player", but that would require going through all words in
> > our documents (indexed, of course) and checking whether each one is
> > similar to "play" or to "CD".

A language-independent similarity measure is the Levenshtein 
distance, also known as edit distance. It counts the minimum number of 
letter insertions, deletions and replacements needed to turn one word 
into another. Because it works on letters, it doesn't care about the 
language. It's therefore much better than the often-referenced 
Soundex algorithm, which works really badly for languages other than 
English.
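
To make the idea concrete, here is a minimal sketch of the edit distance 
in Python, using the usual dynamic-programming table reduced to two 
rows. The function name edit_distance is just my choice for this 
illustration; it is not taken from any particular tool.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-letter insertions, deletions and
    replacements needed to turn string a into string b."""
    # prev[j] = distance between the letters of a processed so far
    # (initially none) and the first j letters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning the first i letters of a into "" takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # replace ca by cb (or keep it)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("play", "player"))
```

Nothing in it depends on the language, only on the letters: "play" and 
"player" are two insertions apart, whatever language they come from.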

I once used that algorithm to make a "fuzzy receiver matching" tool for 
sendmail 
(http://home.nikocity.de/hschurig/similarreceiver.html).


However, that really would require checking the entered word against 
all existing words in the database (not necessarily against all the 
documents; most search engines have a word index anyway).
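
Such a scan over the word index could look roughly like the following 
Python sketch. Everything here is an assumption for illustration: the 
names fuzzy_lookup, word_index and max_distance are made up, the index 
is just a list of words, and the threshold of 2 is arbitrary. It is not 
how htdig or any other search engine does it.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (two-row version)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def fuzzy_lookup(query, word_index, max_distance=2):
    """Return the indexed words within max_distance edits of query,
    closest first. Comparison ignores case."""
    hits = []
    for word in word_index:
        # Cheap filter: the distance is never smaller than the
        # difference in length, so skip hopeless candidates early.
        if abs(len(word) - len(query)) > max_distance:
            continue
        d = edit_distance(query.lower(), word.lower())
        if d <= max_distance:
            hits.append((d, word))
    return [word for d, word in sorted(hits)]


# Hypothetical word index, as a search engine might keep it.
index = ["player", "CD", "display", "plan", "keyboard"]
for term in "play CD".split():
    print(term, fuzzy_lookup(term, index))
```

Each query term is still compared against every indexed word, so the 
cost grows with the size of the index; the length check only trims the 
obvious misses.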
