From kmail-devel Mon Feb 28 06:26:14 2005
From: Don Sanders
Date: Mon, 28 Feb 2005 06:26:14 +0000
To: kmail-devel
Subject: Re: Full Text Indexing for kmail
Message-Id: <200502281626.15426.sanders () kde ! org>
X-MARC-Message: https://marc.info/?l=kmail-devel&m=110957231516312

On Friday 25 February 2005 23:19, Luís Pedro Coelho wrote:
> On Friday 25 February 2005 07:21, Don Sanders wrote:
> > Yeah, I felt it used up too much ram. How much ram (e.g. for
> > fifty 100 megabyte folders each containing 25,000 messages) would
> > this new indexer take in ram and disk space?
>
> Right now, it works using mmap() and the disk format is the same as
> the memory format. Therefore it (1) takes almost no time to load
> and (2) doesn't take almost any RAM.

Excellent.

> The size of the whole index is just under the size of the text.

That sounds fine.

> Keeping in mind that in email, I am not looking at either the
> headers, nor the attachments, it is about a third of the original
> mailbox size.
>
> I am working on a version which uses libz to compress and
> decompress the index as-needed. This version would need a bit more
> RAM to keep the decompressed pages, but still not much (haven't
> measured) and probably tunable, but less disk (of course).
>
> I think that the compression scheme I am working on can seriously
> reduce disk space usage since gzip itself compresses the index very
> well.

Ok, that would be a nice plus I guess but having the index be the
same size as the input text (for a full text index) seems perfectly
adequate/normal/fine to me.

> > This new indexer can incrementally index documents?
>
> Yes.

Does it support phrase searching? (e.g. searching for "hot dog"
returns documents containing the phrase "hot dog" rather than just
documents containing "hot" and "dog", I expect a real full text
indexer to be able to do this). How about, aah, wildcard searches
like "index*" how about "*index"?
I ask because that's the way the current search functionality works
(typing "kmfolderind" in the search dialog will find all
documents/emails containing the substring "kmfolderind").

Another problem Sam's indexer had/has is that the first time it was
used it took a long time to index all the mail, and quite some CPU.
We did work in kmfolderindex.* to prevent the UI from blocking while
indexing is in progress.

Does (or can) your indexer index existing documents? Or does it only
index new mail as it arrives? If it indexes existing documents then
is KMail usable while it is indexing?

How robust is it? Does the file on disk get corrupted easily as a
result of a KMail crash? If the index on disk does get corrupted is
it able to easily throw it away and start again? (Unfortunately some
of our users don't always shut down their machines cleanly and files
do get corrupted). If there's a crash how can you know if it missed
indexing some files?

> > Maybe look at kmfolderindex.*.
>
> I will, thanks.

No problem, I'm really interested in learning more about your work.
If your code gets committed will you be around to maintain the code
(I mean read bug reports and see if they are valid and you can
reproduce the problem)?

Again it sounds interesting. There is a desire for a general KDE
indexer, but I'm not so concerned with that. Having a reliable,
powerful full text indexer for KMail would fix one of its most
serious deficiencies.

Don.
_______________________________________________
KMail developers mailing list
KMail-devel@kde.org
https://mail.kde.org/mailman/listinfo/kmail-devel
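
Illustrative note on the search modes asked about in this message: the
difference between an all-words search, a phrase search ("hot dog"), a
prefix wildcard ("index*") and a plain substring search can be sketched
with a toy positional index. This is only a minimal sketch in Python and
is not the indexer under discussion; the class TinyIndex and all of its
method names are hypothetical, and a real index would be stored on disk
rather than in dictionaries.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy positional inverted index (hypothetical, illustration only)."""

    def __init__(self):
        # word -> {doc_id -> [positions of that word in the document]}
        self.postings = defaultdict(lambda: defaultdict(list))
        # Raw text is kept only so substring search can be compared.
        self.texts = {}

    def add(self, doc_id, text):
        """Index one document; may be called incrementally."""
        self.texts[doc_id] = text
        for pos, word in enumerate(re.findall(r"\w+", text.lower())):
            self.postings[word][doc_id].append(pos)

    def all_words(self, query):
        """Documents containing every query word, in any order."""
        words = re.findall(r"\w+", query.lower())
        if not words:
            return set()
        return set.intersection(*(set(self.postings.get(w, {})) for w in words))

    def phrase(self, query):
        """Documents containing the query words at consecutive positions."""
        words = re.findall(r"\w+", query.lower())
        hits = set()
        for doc_id in self.all_words(query):
            for start in self.postings[words[0]][doc_id]:
                if all(start + i in self.postings[w][doc_id]
                       for i, w in enumerate(words)):
                    hits.add(doc_id)
                    break
        return hits

    def prefix(self, prefix):
        """Word-level "prefix*" search: scan the vocabulary, not the mail."""
        p = prefix.lower()
        return {d for w, docs in self.postings.items()
                if w.startswith(p) for d in docs}

    def substring(self, needle):
        """Brute-force substring scan over the raw text of every document."""
        n = needle.lower()
        return {d for d, t in self.texts.items() if n in t.lower()}


idx = TinyIndex()
idx.add(1, "I ate a hot dog")
idx.add(2, "The dog lay in the hot sun")
idx.add(3, "see kmfolderindex.cpp")

print(idx.all_words("hot dog"))     # {1, 2}
print(idx.phrase("hot dog"))        # {1}
print(idx.prefix("kmfolderind"))    # {3}
print(idx.prefix("folderind"))      # set()
print(idx.substring("folderind"))   # {3}
```

In this sketch a phrase query needs word positions in the index, a
"prefix*" query only needs a scan of the word list, and a "*suffix" or
mid-word match falls back to scanning the text (or would need some
additional structure) because the index is keyed by whole words.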