Hello,

 See http://sources.redhat.com/ml/libc-alpha/2002-02/msg00107.html for 
details. In short, we're using malloc() very extensively, and a noticeable 
part of the execution time is spent handling dynamic allocations. This means 
that malloc() has to be very fast, and if it's not, overall KDE performance 
suffers. The problem is that we're linking against -lpthread, which makes 
malloc() use a mutex for locking (even though most KDE apps aren't actually 
threaded at present), and that locking keeps malloc() from being as fast as 
it needs to be.

 I ran some benchmarks and managed, for example, to reduce the time needed to 
fully render $QTDIR/doc/html/functions.html from 60s to 39s (about 35%) by 
LD_PRELOAD-ing a different malloc() implementation (Doug Lea's malloc), which 
I also tweaked a bit. Real-world cases are a bit difficult to measure, but 
the improvement should be at least 10% everywhere.

 This applies only to glibc < 2.3; I don't know about other systems. Also, 
with the current glibc CVS (i.e. the yet-to-be-released glibc-2.3), malloc() 
already uses a spinlock instead of a mutex, and it has almost the same 
performance as my tuned malloc().

 I'm going to include this malloc() implementation in libkdecore, and I 
already got an OK from Dirk, as long as it has to be explicitly enabled by a 
configure switch. It was already discussed a bit on IRC too. If you have any 
thoughts on this, feel free to comment. I'll describe exactly what I want 
to do.

 There will be a configure option for this, disabled by default (there is not 
enough time to really test it, if nothing else). It will work only with 
glibc, as I have no idea about the situation on non-glibc systems. It also 
requires a spinlock implementation (i.e. some assembler), and right now I 
have only an x86 one.
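
 (For illustration only: a minimal sketch of what such a spinlock could look 
like, written with C11 <stdatomic.h> rather than x86 assembler so that it is 
portable. This is an assumption about the general shape, not the 
implementation discussed in this mail; the names simple_spinlock, spin_lock, 
spin_trylock, spin_unlock and spin_selfcheck are made up.)

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spinlock type: a single atomic flag, set while the lock is held. */
typedef struct {
    atomic_flag locked;
} simple_spinlock;

#define SIMPLE_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try to take the lock once; returns true if the caller now holds it. */
static inline bool spin_trylock(simple_spinlock *l)
{
    /* test_and_set returns the previous value: false means the lock was free. */
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Busy-wait until the lock is taken. No sleeping and no fairness guarantee. */
static inline void spin_lock(simple_spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ; /* spin */
}

/* Release the lock so another caller can take it. */
static inline void spin_unlock(simple_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Single-threaded check of the state transitions (free -> held -> free). */
void spin_selfcheck(void)
{
    simple_spinlock lock = SIMPLE_SPINLOCK_INIT;
    bool first = spin_trylock(&lock);   /* lock was free, should succeed */
    bool second = spin_trylock(&lock);  /* lock is held, should fail */
    spin_unlock(&lock);
    bool third = spin_trylock(&lock);   /* released, should succeed again */
    spin_unlock(&lock);
    assert(first && !second && third);
    (void)first; (void)second; (void)third;
}
```

 An allocator would wrap its critical sections in spin_lock()/spin_unlock(). 
The attraction in the mostly unthreaded case is that an uncontended lock costs 
one atomic operation; the drawback is that waiters burn CPU under contention.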

 However, I'd like to keep it even after glibc-2.3 is released (still only 
as an option). Even compared with malloc() from the current glibc CVS, I get 
about a 5% improvement on the functions.html page with the tuned malloc(). 
Glibc's malloc() still has malloc hooks and is optimized for many threads 
(it's ptmalloc, a threaded version of Doug Lea's malloc), which is something 
we don't need. The only time I needed malloc hooks was for kdesdk/kmtrace, 
which is LD_PRELOAD-ed anyway, so it can work around their absence. As for 
code optimized for many threads, I'd first have to see a KDE application 
where that's needed. Besides, I tried the malloc() implementations with 
several threads running, and the simple spinlock-only variant didn't perform 
worse than the glibc one with 4 threads doing nothing but calling malloc() 
and free() in loops (but I don't have access to an SMP machine, so it might 
be different there).
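
 (A threaded test of this kind could be sketched roughly as below. This is 
not the actual test program, just an assumed minimal shape using POSIX 
threads; run_malloc_threads, worker_arg and MAX_THREADS are hypothetical 
names, and timing is left to the caller.)

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_THREADS 16  /* arbitrary upper bound for this sketch */

/* Per-thread parameters and result. */
struct worker_arg {
    size_t pairs;  /* malloc/free pairs to perform */
    size_t size;   /* bytes per allocation */
    size_t done;   /* pairs actually completed */
};

/* Thread body: nothing but malloc() and free() in a loop. */
static void *worker(void *opaque)
{
    struct worker_arg *arg = opaque;
    for (size_t i = 0; i < arg->pairs; ++i) {
        void *p = malloc(arg->size);
        if (p == NULL)
            break;
        /* Touch the block so the compiler cannot drop the malloc/free pair. */
        *(volatile unsigned char *)p = 0;
        free(p);
        arg->done++;
    }
    return NULL;
}

/* Runs nthreads workers concurrently and returns the total number of
 * completed malloc/free pairs (0 if nthreads is out of range). */
size_t run_malloc_threads(unsigned nthreads, size_t pairs, size_t size)
{
    pthread_t tids[MAX_THREADS];
    struct worker_arg args[MAX_THREADS];
    unsigned started = 0;
    size_t total = 0;

    assert(size > 0);
    if (nthreads == 0 || nthreads > MAX_THREADS)
        return 0;

    for (unsigned i = 0; i < nthreads; ++i) {
        args[i] = (struct worker_arg){ pairs, size, 0 };
        if (pthread_create(&tids[i], NULL, worker, &args[i]) != 0)
            break;  /* stop starting threads, but still join the started ones */
        started++;
    }
    for (unsigned i = 0; i < started; ++i) {
        pthread_join(tids[i], NULL);
        total += args[i].done;
    }
    return total;
}
```

 Calling run_malloc_threads(4, n, 32) from a small main() and timing the 
whole process, once with the default malloc() and once with another 
implementation LD_PRELOAD-ed, gives a comparison of the same kind. On a 
single-CPU machine it says little about real contention.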

 BTW, just to show how damn fast malloc() has to be: I also have a test 
version of malloc() here, which just hands out memory sequentially from a 
large array and whose free() is an empty function (practically unusable, but 
as close to a no-op as possible). The functions.html example then needs 35s 
(vs 39s with the tuned malloc()). If I add 'for(int i=0;i<70;++i);' to both 
this malloc() and free(), it goes back up to 39s (gcc doesn't optimise out 
empty loops). I also tried to write my own malloc(), which did only a few 
bitfield operations and a little pointer arithmetic: not fast enough, 10% 
slower than glibc-2.3 malloc (even though it needs about 5-8% less memory, 
but I doubt anyone is going to trade speed for that).
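
 (A minimal sketch of a "hand out memory from a large array, free() does 
nothing" allocator like the test version mentioned above. The real test code 
is not shown in this mail, so the arena size, the 16-byte alignment and the 
names bump_malloc and bump_free are assumptions made for illustration.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE  (1u << 20)  /* 1 MiB, arbitrary for this sketch */
#define ARENA_ALIGN 16u         /* assumed alignment, must be a power of two */

static _Alignas(16) unsigned char arena[ARENA_SIZE];
static size_t arena_used;       /* bytes handed out so far */

/* Returns the next free chunk of the arena, or NULL when it is exhausted.
 * Not thread-safe: a real replacement would need a lock around this. */
void *bump_malloc(size_t size)
{
    assert((ARENA_ALIGN & (ARENA_ALIGN - 1)) == 0);

    if (size == 0)
        size = 1;               /* give zero-size requests a distinct block */
    if (size > ARENA_SIZE)
        return NULL;            /* also keeps the rounding below from overflowing */

    /* Round the request up to a multiple of ARENA_ALIGN. */
    size_t rounded = (size + ARENA_ALIGN - 1) & ~(size_t)(ARENA_ALIGN - 1);
    if (rounded > ARENA_SIZE - arena_used)
        return NULL;

    void *p = arena + arena_used;
    arena_used += rounded;
    return p;
}

/* Deliberately empty: memory is never reused. */
void bump_free(void *p)
{
    (void)p;
}
```

 This is about the cheapest allocator possible (one addition and a bounds 
check), which is why it serves as a lower bound for the cost of malloc(); it 
is useless in practice because memory is never returned.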

 Having the possibility to use a malloc() tuned for KDE's needs isn't, IMHO, 
something that can break anything. I'm also going to make some improvements 
to kdesdk/kmtrace, so that it will hopefully be possible to find the places 
where we do so many allocations (even though I doubt we can do much about 
that).
 Hmm ... any thoughts?

-- 
 Lubos Lunak
 llunak@suse.cz ; l.lunak@kde.org
 http://dforce.sh.cvut.cz/~seli