List:       kmail-devel
Subject:    Re: CIA proposal (was: ClientInterface)
From:       Don Sanders <sanders () kde ! org>
Date:       2003-07-23 4:58:10

On Monday 21 July 2003 21:31, Marc Mutz wrote:
> Hi!
>
> > It's not watching the files that is the main problem. The problem
> > is retaining the integrity of objects that point into files which
> > can be changed by other processes.
>
> I was asked to provide a concept for a server-less solution to the
> CIA (Concurrent Index Access) problem. Here is it. It's based on
> the idea from LinuxTag that's described here as Step 3. It solves
> the memory load problems with index files by having them shared
> between KMail instances running on the same machine.
>
> Step 1: Require Maildir
> Rationale:
>   - Changes in maildir folders can be efficiently monitored with
>     KDirWatch (which can use DNotify/FAM). In particular, changes
>     are monitored per-file. This is not the case for mbox.
>   - In particular, changes to stati result in renaming the file.
>
> Step 2: Observe that the Index file format allows most changes
> (esp. all status updates) to be performed in-place.
>
>
> Step 3: Don't create an in-memory representation of index entries
> ("KMMsgInfo::kd") anymore. Instead, let every entry just be a
> pointer to the beginning of it's mmaped region in the index file.
> For this to be efficient, use Don's idea of storing those offsets
> in a file of their own. Every access to information stored in the
> index accesses the index file data directly.
> Rationale:
>   - This allows most operations on exisiting messages to be
>     transparently shared across all KMail instances; and in an
> extremely efficient way (memory-, cpu- and network-load-wise). This
> holds esp. for the common operations of changing the various stati
> of messages.
>
> Step 4: Define a protocol by which a given Kmail instance can
> obtain write access to a given folder/index file.
>
> Sketch of the CIA/4 protocol:
>
> KMail instance A wants to obtain write access to folder F from B:
>
> 1. A tries to create folder-lockfile L_F
>    Success: Write own DCOP id into the lock file.
>             A has now (exclusive) write access to F.
> 2. A tries to read a DCOP ID of B from L_F
>    Success: A sends B the message
>             "Please release folder F"
>             Wait a random amount of time, then goto 1.
>    Failure: Wait a random amount of time, then goto 1.
>    Repeated failure: Tell user.
>
> Observe how B is not required to close the folder or to unmmap the
> index file. It marks the folder read-only internally (or remaps the
> index to be ro).
>
> An instance holding the write lock is required to
>
> a. Release it as soon as practicable
> b. Release it as soon as possible if the above DCOP message arrives
> c. never alter existing index entries' length
>
> If a change in entry length is necessary, mark the old one as
> deleted, rename the maildir file and add a new entry at the end.
> Other instances will see a delete and add through the dirwatching
> and can react accordingly.
>
> CIA/4 can be extended with a DCOP broadcast call to all KMail
> instances to close F, so that A can perform (index file) compaction
> after all instances replied with a "folder F closed" message.

What I like:
1) "the Index file format allows most changes (esp. all status 
updates) to be performed in-place." Deletion is the only exception, I 
think. I'm all for changing deletion so that an index entry is only 
marked as deleted and a cleanup operation, 'compaction', is performed 
later.

I'd prefer it if a timer were used to perform index file cleanup rather 
than waiting for exit, but that will add extra complexity (I omit 
the details of that complexity).
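
As a rough illustration of the mark-then-compact idea, here is a 
minimal Python sketch. The record layout (serial number, status byte, 
deleted flag) and all function names are invented for the example; 
this is not KMail's actual index format.

```python
import struct

# Hypothetical fixed-size record: 4-byte serial, 1-byte status, 1-byte deleted flag.
ENTRY = struct.Struct("<IBB")
DELETED_OFFSET = 5  # byte offset of the deleted flag inside one record


def mark_deleted(index: bytearray, slot: int) -> None:
    """Flag one entry as deleted in place; no other entry moves or changes size."""
    index[slot * ENTRY.size + DELETED_OFFSET] = 1


def compact(index: bytes) -> bytes:
    """Return a copy of the index without the entries flagged as deleted."""
    out = bytearray()
    for off in range(0, len(index), ENTRY.size):
        record = index[off:off + ENTRY.size]
        if record[DELETED_OFFSET] == 0:
            out += record
    return bytes(out)


def serials(index: bytes) -> list:
    """List the serial numbers of all records, in file order."""
    return [ENTRY.unpack_from(index, off)[0]
            for off in range(0, len(index), ENTRY.size)]
```

Marking an entry keeps every other entry at its old offset, so other 
readers are not disturbed; only compact() moves entries, which is why 
it would need the exclusive access discussed in the proposal.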

2) The KMail instances are talking to each other via DCOP, so in my 
opinion this is a peer-to-peer network of servers rather than a 
client-only approach. So I think there's some realization here that 
IPC is sensible.

3) I think this is a realistic suggestion for handling the problem of 
multiple clients accessing ~/Mail concurrently. Theoretically I think 
it might work, but I'm unsure about whether the mmap will hold up.

Slightly pro:
1) 'Step 3: Don't create an in-memory representation of index entries
     ("KMMsgInfo::kd") anymore. Instead, let every entry just be a  
     pointer to the beginning of it's mmaped region in the index    
     file.'
Well, as every entry already has a pointer to the beginning of its 
mmapped region, I read this as simply: drop the in-memory 
representation.

I'm uncertain of the performance implications of this change; I guess 
they will be implementation specific. Some fields, like the status 
field of a message, are always read. This is because when KMail opens 
a folder it looks at the status of all messages to sync the count of 
unread messages and to find the next unread message. (IIRC 1/3 of 
the time to show a folder is spent searching for the next unread 
message).

So I think it makes sense to always cache the status field. Given 
this, for optimum performance it makes sense to cache it while the 
index file is read in sequentially, rather than caching it later by 
accessing the file in some arbitrary order (I hope that makes sense, 
I'm skipping over some details here).

Or better yet if you are going to index the index file, please 
consider caching the status in the index of the index.
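
To make the suggestion concrete, here is a minimal Python sketch of 
building an offset table that also carries each entry's status in one 
sequential pass. The record layout, the STATUS_UNREAD value and the 
function names are assumptions for illustration only.

```python
import struct

# Hypothetical record layout: 4-byte serial number followed by a 1-byte status.
ENTRY = struct.Struct("<IB")
STATUS_UNREAD = 1  # illustrative value, not KMail's real status encoding


def scan_statuses(index: bytes) -> list:
    """One sequential pass over the index: collect (offset, status) pairs,
    i.e. an 'index of the index' that doubles as a status cache."""
    table = []
    for off in range(0, len(index) - ENTRY.size + 1, ENTRY.size):
        _serial, status = ENTRY.unpack_from(index, off)
        table.append((off, status))
    return table


def unread_count(table: list) -> int:
    """Count unread entries using only the cached statuses."""
    return sum(1 for _off, status in table if status == STATUS_UNREAD)


def next_unread(table: list, start: int = 0):
    """Return the slot of the first unread entry at or after `start`, or None."""
    for slot in range(start, len(table)):
        if table[slot][1] == STATUS_UNREAD:
            return slot
    return None
```

With such a table, both the unread count and the search for the next 
unread message touch only the small table and never the full index 
entries.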

There might also be performance implications to do with sorting and 
the caching of sort keys. Specifically kmheaders might be slow when a 
new message arrives or when the sort order is changed.

But since the index file is now mmap'd, and therefore cached, I think 
it makes sense to experiment with dropping/reducing the in-memory 
representation.

Neutral/Aside:
1) Be careful of the inbox folder; I think there is a design flaw in 
filtering. We currently use an inbox folder as a temporary storage 
location for incoming mail, but I think it would make more sense to 
use a private folder. I say this because a private folder for incoming 
mail could be modified quickly without the need for every client to 
be informed of the changes.

2) I'm unsure how the index file mmap will hold up. If a client has to 
close the index file then I think the mmap will be lost, and I think 
losing the mmap and having no in-memory representation of index 
entries would be critically bad, performance wise. But if you can 
keep the mmap then effectively all the caching of index entries has 
been pushed into the operating system kernel via the mmap. With luck 
the OS is designed in such a way that each client can share this ro 
mmap cache, so that the operating system kernel itself is being used 
as the KMail server, which would be cool.

I do wonder whether, if multiple processes ro mmap the same file, the 
memory used to back the mmap is shared by all those processes.
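
A small experiment along these lines can be sketched in Python with 
the standard mmap module. It maps one temporary file twice, once 
writable and once read-only, and checks that an in-place change is 
visible through the read-only mapping. Both mappings live in one 
process here, so this only hints at the cross-process case; the 16-byte 
file and the function name are assumptions for illustration.

```python
import mmap
import os
import tempfile


def demo_shared_mapping() -> bytes:
    """Map one file twice and change a byte in place through one mapping."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\x00" * 16)  # stand-in for an index file
        writer = mmap.mmap(fd, 0, access=mmap.ACCESS_WRITE)
        reader = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
        writer[3:4] = b"\x01"       # in-place "status" update
        seen = reader[3:4]          # read it back through the ro mapping
        reader.close()
        writer.close()
        return seen
    finally:
        os.close(fd)
        os.unlink(path)
```

If the read-only mapping returns the updated byte, both mappings are 
backed by the same file pages rather than by private copies, which is 
the behaviour the proposal relies on.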

My concerns:
1) This doesn't actually address the issue of external mail clients 
modifying ~/Mail, does it? I mean not even for Maildir, as this 
approach requires KMail to be running to detect changes in ~/Mail.

2) I'm concerned that this approach could be fragile. A client might 
die after deleting/inserting a Maildir message but before updating 
the index, or vice versa. Maybe this can be handled robustly so this 
is just a concern not a concrete criticism. But clearly this method 
does rely on mbox style locking rather than maildir style elimination 
of the need for locking.
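
For reference, the lock file step of the quoted CIA/4 sketch could 
look roughly like the following Python sketch, which relies on 
exclusive file creation (O_CREAT | O_EXCL). The function names and 
the plain-string IDs are hypothetical, and the DCOP "please release" 
message and the random retry delay are left out.

```python
import os


def try_acquire(lock_path: str, my_id: str) -> bool:
    """Create the lock file atomically; True means we now hold the lock."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return False  # somebody else already holds the lock
    with os.fdopen(fd, "w") as f:
        f.write(my_id)
    return True


def current_holder(lock_path: str):
    """Return the ID stored in the lock file, or None if there is no lock."""
    try:
        with open(lock_path) as f:
            return f.read()
    except FileNotFoundError:
        return None


def release(lock_path: str, my_id: str) -> None:
    """Remove the lock file, but only if we are the recorded holder."""
    if current_holder(lock_path) == my_id:
        os.unlink(lock_path)
```

Note that a holder that dies without calling release() leaves the 
lock file behind, which is exactly the kind of fragility I mean; some 
stale lock detection would be needed on top of this.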

3) If this approach works then it does address the criticism of having 
multiple clients modify index files, but there are still the memory 
costs associated with this approach. Each client will need its own 
kmmsgdict. IIRC each entry in the kmmsgdict maps a 
sernum -> (folder, index), which is an int -> (int, int) mapping or 
12 bytes. I have >500K messages currently, so that's >6MB per client 
instance.
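
The arithmetic behind that estimate, as a short Python check. It 
assumes three 32-bit ints per entry and ignores any container 
overhead, so it is a lower bound; the names are made up for the 
example.

```python
import struct

# Assumed entry layout: sernum -> (folder, index), three 32-bit ints.
ENTRY_BYTES = struct.calcsize("=iii")


def dict_bytes(message_count: int) -> int:
    """Lower bound on the dictionary payload, ignoring container overhead."""
    return message_count * ENTRY_BYTES
```

For 500,000 messages this gives 6,000,000 bytes, and every additional 
client instance pays that amount again.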

And then there's the full text index. A full text index is by its 
nature a large object as it's a space/time tradeoff. For me it's > 
24MB.

Besides folder files there are also other files that need to be 
handled. The config file is a problem if it is desirable to have each 
client keep their config dialogs and general configuration info in 
sync.

Another set of files is the list of mails that are kept on the server 
for each account.


In summary:

I like the idea of having all index file operations non-destructive; 
specifically this means making deletion merely mark an index entry as 
deleted and then later performing an index compaction. I also like 
the idea of experimenting with dropping the in-memory caching of 
index entries, and I'd like to see profiling stats on doing that. I 
think both these tasks make sense to do even if a client/server 
approach is ultimately taken.

I think this CIA proposal addresses (or at least attempts to address) 
the key criticism I've made against the client only model, which is 
having multiple clients work with the same index files. But I remain 
skeptical, especially as to whether the mmap of index files can be 
retained. I think dropping the mmap (or frequently re-mmaping) and 
also dropping the in-memory cache would be a bad idea (because of the 
performance implications).

I still think that overall a client/server approach makes the most 
sense. This is because I think there is a demand for KMail to provide 
database like services to other (KDE) apps. Namely I'm thinking of 
services to track messages (KMMsgDict) and search messages 
(KMMsgIndex). Implementing these services efficiently requires using 
substantial amounts of memory, and it makes sense to concentrate the 
cost of that memory use in one process rather than duplicate it 
across multiple processes.

In conclusion, personally I'm ok with step 2 and am interested in the 
results of step 3 of this CIA proposal, and I think they can be 
performed in parallel with the Kernel/GUI separation of the 
client/server approach. Also, perhaps similar to step 3 of the CIA 
proposal, I hope to do more profiling work on my zero-copy display of 
messages idea (now that zero-copy parsing is implemented). I hope 
zero-copy display of messages will address what I expect to be the 
key bottleneck between client and server, which is gross duplication 
of attachment data.

Don.
_______________________________________________
KMail Developers mailing list
kmail@mail.kde.org
http://mail.kde.org/mailman/listinfo/kmail