From kmail-devel  Mon Feb 28 06:26:14 2005
From: Don Sanders <sanders () kde ! org>
Date: Mon, 28 Feb 2005 06:26:14 +0000
To: kmail-devel
Subject: Re: Full Text Indexing for kmail
Message-Id: <200502281626.15426.sanders () kde ! org>
X-MARC-Message: https://marc.info/?l=kmail-devel&m=110957231516312

On Friday 25 February 2005 23:19, Luís Pedro Coelho wrote:
> On Friday 25 February 2005 07:21, Don Sanders wrote:
> > Yeah, I felt it used up too much ram. How much ram (e.g. for
> > fifty 100 megabyte folders each containing 25,000 messages) would
> > this new indexer take in ram and disk space?
>
> Right now, it works using mmap() and the disk format is the same as
> the memory format. Therefore it (1) takes almost no time to load
> and (2) doesn't take almost any RAM.

Excellent.
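
For anyone following along, here is a minimal sketch in Python of why
an mmap()ed index whose disk format equals its memory format needs
almost no load time or RAM. It is only an illustration, not Luís's
code: the fixed-size record layout and the names RECORD, write_table
and lookup are assumptions made up for the example.

```python
import mmap
import struct

# Hypothetical record layout: a word padded to 16 bytes, then a 4-byte count.
# struct silently truncates words longer than 16 bytes.
RECORD = struct.Struct("<16sI")

def write_table(path, counts):
    """Write the records sorted by word, so lookups can binary search the file."""
    with open(path, "wb") as f:
        for word in sorted(counts):
            f.write(RECORD.pack(word.encode("ascii"), counts[word]))

def lookup(mm, word):
    """Binary search directly over the mapped bytes; nothing is parsed up front."""
    key = word.encode("ascii")
    lo, hi = 0, len(mm) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        raw, count = RECORD.unpack_from(mm, mid * RECORD.size)
        stored = raw.rstrip(b"\0")
        if stored == key:
            return count
        if stored < key:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Opening the file with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
and calling lookup(mm, "hot") only touches the pages the binary search
visits; the operating system pages in the rest on demand.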

> The size of the whole index is just under the size of the text.

That sounds fine.

> Keeping in mind that in email, I am not looking at either the
> headers, nor the attachments, it is about a third of the original
> mailbox size.
>
> I am working on a version which uses libz to compress and
> decompress the index as-needed. This version would need a bit more
> RAM to keep the decompressed pages, but still not much (haven't
> measured) and probably tunable, but less disk (of course).
>
> I think that the compression scheme I am working on can seriously
> reduce disk space usage since gzip itself compresses the index very
> well.

Ok, that would be a nice plus I guess but having the index be the same 
size as the input text (for a full text index) seems perfectly 
adequate/normal/fine to me.

> > This new indexer can incrementally index documents?
>
> Yes.

Does it support phrase searching? (E.g. searching for "hot dog"
returns documents containing the phrase "hot dog" rather than just
documents containing both "hot" and "dog". I expect a real full text
indexer to be able to do this.) How about wildcard searches like
"index*"? How about "*index"? I ask because that's the way the
current search functionality works (typing "kmfolderind" in the
search dialog will find all documents/emails containing the substring
"kmfolderind").

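To make the difference between these query types concrete, here is a
minimal sketch in Python, assuming a simple positional inverted index.
It is only an illustration of the queries asked about above, not how
Luís's indexer or KMail's current search is implemented; all function
names are made up for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def search_all_words(index, query):
    """Documents containing every query word, in any order."""
    words = query.lower().split()
    if not words:
        return set()
    return set.intersection(*(set(index.get(w, {})) for w in words))

def search_phrase(index, query):
    """Documents containing the query words adjacent and in order."""
    words = query.lower().split()
    hits = set()
    for doc_id in search_all_words(index, query):
        for start in index[words[0]][doc_id]:
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits

def search_words_matching(index, predicate):
    """Documents containing any indexed word accepted by predicate."""
    hits = set()
    for word, postings in index.items():
        if predicate(word):
            hits.update(postings)
    return hits

docs = {
    1: "hot dog stand",
    2: "the dog was hot",
    3: "kmfolderindex rebuilds the index",
}
index = build_index(docs)
both = search_all_words(index, "hot dog")        # "hot" and "dog" anywhere
phrase = search_phrase(index, "hot dog")         # the phrase "hot dog"
prefix = search_words_matching(index, lambda w: w.startswith("index"))   # index*
suffix = search_words_matching(index, lambda w: w.endswith("index"))     # *index
substr = search_words_matching(index, lambda w: "kmfolderind" in w)
```
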
Another problem Sam's indexer had/has is that the first time it was
used it took a long time, and quite a lot of CPU, to index all the
mail. We did work in kmfolderindex.* to prevent the UI from blocking
while indexing is in progress.

Does (or can) your indexer index existing documents? Or does it only 
index new mail as it arrives? If it indexes existing documents then 
is KMail usable while it is indexing?
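
One common way to keep an application usable during a long first
index is to do the work in small batches and hand control back to the
event loop between them. A minimal sketch in Python, assuming messages
arrive as (id, text) pairs; this is an illustration only, not a
description of what kmfolderindex.* does, and index_in_batches is a
made-up name.

```python
def index_in_batches(messages, index, batch_size=100):
    """Index messages into `index`, yielding after every batch.

    Each yield reports how many messages have been indexed so far, and
    gives the caller a chance to process pending UI events.
    """
    done = 0
    for msg_id, text in messages:
        for word in text.lower().split():
            index.setdefault(word, set()).add(msg_id)
        done += 1
        if done % batch_size == 0:
            yield done
    if done % batch_size != 0:
        # Report the final, partially filled batch.
        yield done
```

The caller would loop over the generator and let the UI handle its
events after each yielded value, so no single step blocks for long.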

How robust is it? Does the file on disk get corrupted easily as a
result of a KMail crash? If the index on disk does get corrupted, is
it able to easily throw it away and start again? (Unfortunately some
of our users don't always shut down their machines cleanly and files
do get corrupted.) If there's a crash, how can you know if it missed
indexing some files?
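
To illustrate the kind of robustness being asked about, here is a
minimal sketch in Python, assuming the index is rewritten as a whole
file. It writes to a temporary file and renames it into place, stores
a length and CRC32 so damage can be detected, and discards a damaged
file so it can be rebuilt. The file layout, the MAGIC value and all
function names are assumptions made up for the example, not anything
the indexer under discussion is known to do.

```python
import os
import struct
import zlib

MAGIC = b"IDX1"  # hypothetical file signature

def save_index(path, payload):
    """Write header + payload to a temp file, then atomically rename it."""
    header = MAGIC + struct.pack("<II", len(payload), zlib.crc32(payload))
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(header + payload)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # readers see the old file or the new one, never half

def load_index(path):
    """Return the payload, or None if the file is missing or fails validation."""
    try:
        with open(path, "rb") as f:
            data = f.read()
    except FileNotFoundError:
        return None
    if len(data) < 12 or data[:4] != MAGIC:
        return None
    length, crc = struct.unpack("<II", data[4:12])
    payload = data[12:]
    if len(payload) != length or zlib.crc32(payload) != crc:
        return None
    return payload

def load_or_discard(path):
    """Load the index; delete a corrupt file so a rebuild can start cleanly."""
    payload = load_index(path)
    if payload is None and os.path.exists(path):
        os.remove(path)
    return payload

def missing_messages(indexed_ids, folder_ids):
    """Messages present in the folder but absent from the index."""
    return set(folder_ids) - set(indexed_ids)
```

After a crash, comparing the ids recorded in the index with the ids
currently in the folder (missing_messages) shows which messages still
need indexing.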

> > Maybe look at kmfolderindex.*.
>
> I will, thanks.

No problem, I'm really interested in learning more about your work. If
your code gets committed, will you be around to maintain it (I mean
read bug reports, see if they are valid, and check whether you can
reproduce the problem)?

Again it sounds interesting. There is a desire for a general KDE 
indexer, but I'm not so concerned with that. Having a reliable, 
powerful full text indexer for KMail would fix one of its most 
serious deficiencies.

Don.
_______________________________________________
KMail developers mailing list
KMail-devel@kde.org
https://mail.kde.org/mailman/listinfo/kmail-devel