On Thursday 03 November 2005 23:42, Bastian, Waldo wrote:
> http://sourceware.org/ml/binutils/2005-10/msg00436.html

Obviously this patch is not really welcome ;-( Of course prelinking is better, but it is often not possible or impractical.

I noticed myself in profiling experiments that symbol lookup often is a real cache burden, both adding latency because of main memory accesses and evicting quite some data from L1/L2, which leads to subsequent cache misses. This stems from the fact that a symbol lookup does a hash table lookup for every shared library loaded (until the symbol is found). One hash table lookup on a P4 loads 64 bytes into L1 (and 128 bytes into L2), but uses only 4 of them. For every such lookup, a string compare additionally has to be started, evicting yet another cache line. Now take the number of shared libraries of a KDE application (or OO) and hash collisions into account, and you probably get around 15 hash lookups per symbol lookup on average; at roughly two 64-byte lines touched per lookup, that sums up to about 2 KB evicted from an 8 KB L1 cache (on a P4). Note that these 2 KB most often have to be loaded from main memory, leading to quite some slowdown because of memory access latency. (A simplified sketch of this lookup loop is appended at the end of this mail.)

It probably would be better to dynamically build up a persistent, fixed-size, mmapable large hash table for the most often used symbol lookups in a system, which would allow a symbol lookup to be satisfied with a single hash table lookup. For this, it probably would be good to make a hash table entry 32 or 64 bytes and fill the remaining space with the symbol name, such that one cache line load is usually enough for both the lookup and the string comparison. The hash table itself will grow in size, but this is not important, as almost every hash table access leads to a main memory access anyway, and it is better to actually use the data which is loaded either way. (A rough sketch of such an entry layout is appended below as well.)

Any takers? To be fair, I cannot predict the speedup reachable with this technique, but it should match the above-mentioned -Bdirect patch, without changing the linker and thus existing binaries. It seems to be a good idea, and we should have real numbers.

If you have read about this boring low-level cache stuff until here :-), here is something for doing your own experiments: Callgrind has an option --cacheuse=yes to enable further cache use statistics. It provides you with the amount of data which was loaded into the cache without ever really being used. I call this metric "spatial loss", abbreviated SpLoss1 for L1 and SpLoss2 for L2. You should compare these values with the total amount of data loaded into L1 or L2 to get an idea of the percentage of transfer bandwidth wasted. E.g. the total amount of data loaded into L2 is the number of L2 cache misses multiplied by the L2 cache line size; in KCachegrind, you would write "64 * L2m" as the formula for a new event type (the simulator uses a 64-byte L2 cache line for a P4). This way, you can directly compare total vs. wasted loads side by side.

Be aware: using "--cacheuse=yes" produces profile data with 13 event types. Unfortunately, KCachegrind from KDE 3.4.x has a hard-coded limit of 10 for the maximum number of event types it can load, and produces a nonsense error message like

  kcachegrind: ERROR: No event line found. Skipping './callgrind.out.22474

when you try to load such files. In KDE 3.5, I changed this limit to 13. For KDE 3.4.x, you need to modify the line

  #define MaxRealIndexValue 10

in kdesdk/kcachegrind/kcachegrind/tracedata.h, class TraceCost, to be able to load these profile data files.
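
To make the cache behavior described above more concrete, here is a much simplified sketch of the per-library lookup loop the dynamic linker conceptually performs. This is not glibc's actual code; the structures and names (struct lib, nameoff, lookup) are invented for illustration, only elf_hash is the well-known SysV ELF hash function.

/* Much simplified sketch of per-library dynamic symbol lookup.
 * NOT glibc's code; structures and names are invented for illustration. */
#include <string.h>

struct lib {
    unsigned        nbucket;
    const unsigned *bucket;    /* hash buckets, 4 bytes each          */
    const unsigned *chain;     /* collision chains, 4 bytes each      */
    const unsigned *nameoff;   /* per-symbol offset into strtab       */
    const char     *strtab;    /* symbol name strings                 */
};

/* The classic SysV ELF hash function. */
static unsigned long elf_hash(const char *name)
{
    unsigned long h = 0, g;
    while (*name) {
        h = (h << 4) + (unsigned char)*name++;
        if ((g = h & 0xf0000000))
            h ^= g >> 24;
        h &= ~g;
    }
    return h;
}

/* Walk the libraries in search order until 'name' is found.  Each
 * bucket/chain access pulls in a whole cache line to use 4 bytes of it,
 * and each candidate triggers a strcmp() into yet another cache line;
 * with many libraries and hash collisions this adds up quickly.        */
int lookup(const char *name, struct lib *const *libs, int nlibs)
{
    unsigned long h = elf_hash(name);
    for (int i = 0; i < nlibs; i++) {
        const struct lib *l = libs[i];
        for (unsigned idx = l->bucket[h % l->nbucket]; idx != 0;
             idx = l->chain[idx])
            if (strcmp(name, l->strtab + l->nameoff[idx]) == 0)
                return i;              /* found in library i */
    }
    return -1;                         /* not found anywhere */
}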
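
And here is a minimal sketch of a cache-line-sized entry for the persistent, mmapable table proposed above. Again, all names and the exact field layout are my own assumptions, not an existing format; the point is only that hash, resolved address and symbol name fit into one 64-byte line, so the probe and the string comparison normally touch a single cache line.

/* Sketch of a 64-byte entry for a persistent, mmapable symbol cache.
 * Purely illustrative; field layout and names are assumptions.         */
#include <stdint.h>
#include <string.h>

#define ENTRY_SIZE 64                             /* one P4 L2 cache line */
#define NAME_BYTES (ENTRY_SIZE - 2 * sizeof(uint64_t))

struct sym_entry {
    uint64_t hash;             /* full hash of the symbol name            */
    uint64_t addr;             /* resolved address (0 = empty slot)       */
    char     name[NAME_BYTES]; /* symbol name inline, NUL-terminated;
                                  longer names would need an overflow area */
};

/* Open-addressed probe into an mmap'ed array of 'nentries' entries.
 * One entry is one cache line, so the hash check and the strcmp()
 * usually hit the same line.  Assumes the table always keeps some
 * empty slots, so the probe terminates.                                  */
uint64_t cache_lookup(const struct sym_entry *tab, size_t nentries,
                      uint64_t h, const char *name)
{
    for (size_t i = h % nentries; tab[i].addr != 0;
         i = (i + 1) % nentries) {
        if (tab[i].hash == h && strcmp(tab[i].name, name) == 0)
            return tab[i].addr;        /* hit: one line load in the common case */
    }
    return 0;                          /* miss: fall back to the normal lookup  */
}

Such a table could be built up dynamically from the most frequently looked-up symbols in a system and then mmap'ed read-only into every process; on a miss, the normal per-library lookup would be the fallback.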
Josef

_______________________________________________
Kde-optimize mailing list
Kde-optimize@kde.org
https://mail.kde.org/mailman/listinfo/kde-optimize