'Re: Full Text Indexing for kmail'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       kmail-devel
Subject:    Re: Full Text Indexing for kmail
From:       Luís_Pedro_Coelho <luis () luispedro ! org>
Date:       2005-03-16 15:57:18
Message-ID: 200503161557.19738.luis () luispedro ! org
[Download RAW message or body]

On Wednesday 16 March 2005 07:29, Don Sanders wrote:
> > Longer answer: It can cheat on this. I have a mode where the
> > indexer saves a complete copy of the text. Then if you search for
> > "hot dog", it retrieves all the documents where hot and dog appear
> > and then searches in them for "hot dog." Not really a bad approach,
> > actually, but a tighter integration with kmail might remove the
> > need to keep a copy.
>
> Perhaps in the index you store the id (serial number) of the document
> (email) each term (word) is in. If so another alternative is to store
> the document id and the index of the term (word) within the document.
>
> Does that make sense?

Yes, it does.

I just store the id of each document the word is in. I do not store the index 
inside the document since I believe that would more or less double the index 
size.
Keeping a copy of the text is actually a very workable solution. It is fast 
enough and the space requirements aren't that big of a problem in email (the 
whole index is _so much smaller_ than the original files).

> > > How about, aah, wildcard searches
> > > like "index*"
> >
> > Yup. It now always does search* to allow as-you-type searching
> > (it's almost fast enough)
> >
> > > how about "*index"?
> >
> > Nope. This is much harder with the current index.
>
> I see. That is a bit disappointing also I think. I guess Sam's indexer
> uses up memory because it loads all the terms into memory so that it
> can support this kind of search.

There is a way to do *index, but not *index* (this is much, much harder: I 
don't think any known index based on terms is capable of this. Suffix trees 
can, but they take a lot more space).

Right now all the terms are in an array. There is another array which indexes 
this one according to the lexicographical order, supporting binary search. To 
find "index*", one needs to look at all terms between "index" and "indexZZZ". 
To support *index one could index the strings as if they were reversed (so, 
first all the string that _end_ in A, then the strings that _end_ in B ...) 
and use the same technique.

> Would it be possible for someone to modify your code so that it can
> iterate through all terms that have been indexed and create a list of
> matching terms?

I don't really understand this.

> Ok, if you would like to post some code that would be nice.

I will try to have something working by the weekend and will post it 
somewhere.

ciau,
luispedro
_______________________________________________
KMail developers mailing list
KMail-devel@kde.org
https://mail.kde.org/mailman/listinfo/kmail-devel
[prev in list] [next in list] [prev in thread] [next in thread]