Waldo Bastian's document demonstrates that the current g++
implementation generates lots of expensive run-time relocations.
This translates into the slow startup of large C++ applications
(KDE, StarOffice, etc.).  The attached program "objprelink.c" is
designed to reduce the problem.  Expect startup times 30-50% faster.


1) HOWTO
=========

You must first compile objprelink.c as follows:

    $ gcc -O2 -o objprelink objprelink.c -lbfd -liberty

This program must be run on every object file (.o file) that makes up
the application or shared library.  For the KDE packages, for
instance, the simplest way is to make a regular build first.  The
following commands then fix all object files and relink all
executables and libraries.

    $ find . -name '*.o' -exec objprelink {} \;
    $ find . -name '*.lo' -exec touch {} \;
    $ make

Another approach is to tweak the Makefiles.  That works well for Qt.


2) PRINCIPLE
=============

The name "objprelink" means that the program must be run before
linking shared libraries or executables.  I will explain the idea
using Waldo's little programs "testclassN.cpp".

-----------------------------------------------------------------
testclassN.cpp
-----------------------------------------------------------------
#include <qwidget.h>

template<int T>
class testclass : public QWidget
{
public:
  virtual void setSizeIncrement(int w, int h)
  {
    QWidget::setSizeIncrement(w+T, h+T);
  }
};

// Explicit instantiations: each one emits its own vtable.
template class testclass<1>;
template class testclass<2>;
....   // as many as we want.
template class testclass<N>;
-----------------------------------------------------------------

Let's first compile this program using the regular method.

    $ g++ -c -I$QTDIR/include testclass1.cpp
    $ g++ -shared -o testclass1.so testclass1.o -L$QTDIR/lib -lqt

The resulting object file "testclass1.o" contains several sections.
One section contains the virtual table for the class testclass<1>.
Here are the relocations for this section:

----------------------------------------------------------------
BEFORE (vtable relocs for testclass<1>)
----------------------------------------------------------------
RELOCATION RECORDS FOR [.gnu.linkonce.d.__vt_t9testclass1i1]:
OFFSET   TYPE       VALUE
00000004 R_386_32   __tft9testclass1i1
00000008 R_386_32   _._t9testclass1i1
0000000c R_386_32   event__7QWidgetP6QEvent
00000010 R_386_32   eventFilter__7QObjectP7QObjectP6QEvent
00000014 R_386_32   metaObject__C7QWidget
00000018 R_386_32   className__C7QWidget
0000001c R_386_32   setName__7QWidgetPCc
....
----------------------------------------------------------------

Each of these relocations requires an expensive symbol lookup at run
time.  There will be a relocation to function QWidget::className(..)
in the vtable of every class that inherits QWidget.  The same will
happen for the 70+ virtual functions defined by QWidget.

The "objprelink" program adds one indirection into the vtables.  It
inserts a stub section for each function appearing in vtables and
moves the expensive relocation there:

----------------------------------------------------------------
AFTER (stub for QWidget::className)
----------------------------------------------------------------
DISASSEMBLY OF [.gnu.linkonce.t.stub.className__C7QWidget]:
00000000 <.gnu.linkonce.t.stub.className__C7QWidget>:
   0:   b8 00 00 00 00     mov    $0x0,%eax
                    1: R_386_32   className__C7QWidget
   5:   ff e0              jmp    *%eax
----------------------------------------------------------------

The trick is that there is only one such section per function.  This
section is shared by all the QWidget subclasses defined in this
library.  The vtable relocs are then modified to point to the stub
sections.  These relocs will become R_386_RELATIVE in the shared
object and will not require a symbol lookup.
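
To make the saving concrete, here is a toy C++ model of the idea.  It
is only a sketch and not how objprelink itself is written: the names
ToyLinker, VTable, lookups_without_stubs and lookups_with_stubs are
made up for this illustration, and a "lookup" is just a counted map
search standing in for the dynamic linker's symbol resolution.  The
model also ignores the few slots that are specific to each class
(destructor, type_info function), which still need their own stub.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model only: a vtable is the list of (mangled) names of its slots.
using VTable = std::vector<std::string>;

// Hypothetical stand-in for the dynamic linker's symbol table.
// Every call to resolve() counts as one expensive symbol lookup.
struct ToyLinker {
    std::map<std::string, std::size_t> symbols;  // name -> fake address
    std::size_t lookups = 0;

    std::size_t resolve(const std::string& name) {
        ++lookups;
        auto it = symbols.find(name);
        assert(it != symbols.end() && "undefined symbol in toy model");
        return it->second;
    }
};

// Regular object files: every vtable slot carries its own symbolic
// relocation, so every slot of every class costs one lookup.
std::size_t lookups_without_stubs(ToyLinker& ld,
                                  const std::vector<VTable>& vtables) {
    ld.lookups = 0;
    for (const auto& vt : vtables)
        for (const auto& slot : vt)
            ld.resolve(slot);
    return ld.lookups;
}

// Fixed object files: one stub per distinct function holds the only
// symbolic relocation.  Vtable slots point at the stub, which in the
// shared object is a relative relocation and needs no lookup.
std::size_t lookups_with_stubs(ToyLinker& ld,
                               const std::vector<VTable>& vtables) {
    ld.lookups = 0;
    std::set<std::string> stubs;  // functions that already have a stub
    for (const auto& vt : vtables)
        for (const auto& slot : vt)
            if (stubs.insert(slot).second)  // first time we see it
                ld.resolve(slot);
    return ld.lookups;
}
```

With four classes that share the same three inherited slots, the first
function performs 12 lookups and the second only 3: the cost grows
with the number of distinct functions instead of the number of vtable
slots.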

----------------------------------------------------------------
AFTER (vtable relocs for testclass<1>)
----------------------------------------------------------------
RELOCATION RECORDS FOR [.gnu.linkonce.d.__vt_t9testclass1i1]:
OFFSET   TYPE       VALUE
00000004 R_386_32   .gnu.linkonce.t.stub.__tft9testclass1i1
00000008 R_386_32   .gnu.linkonce.t.stub._._t9testclass1i1
0000000c R_386_32   .gnu.linkonce.t.stub.event__7QWidgetP6QEvent
00000010 R_386_32   .gnu.linkonce.t.stub.eventFilter__7QObjectP7QObjectP6QEvent
00000014 R_386_32   .gnu.linkonce.t.stub.metaObject__C7QWidget
00000018 R_386_32   .gnu.linkonce.t.stub.className__C7QWidget
0000001c R_386_32   .gnu.linkonce.t.stub.setName__7QWidgetPCc
....
----------------------------------------------------------------

One important point is that "objprelink" does not change the symbol
table.  Undefined symbols remain undefined.  Defined symbols remain
defined.  It just changes the relocation records without modifying
the linking semantics.  This is not like option -Bdynamic.


3) RESULTS
===========

The following table compares the numbers of relocations in shared
libraries generated from regular object files (before the slash) and
from fixed object files (after the slash).  Figures are provided for
some testclassN programs and also for the Qt library.

------------------------------------------------------------------------------
                  R_386_32    R_386_GLOB_DAT  R_386_JUMP_SLOT  R_386_RELATIVE
------------------------------------------------------------------------------
testclass1.so      106/105         9/9             8/8             3/108
testclass2.so      212/110        13/13            8/8             3/213
testclass5.so      530/125        25/25            8/8             3/528
testclass10.so    1060/150        45/45            8/8             3/1053
testclass20.so    2120/200        85/85            8/8             3/2103
testclass50.so    5300/350       205/205           8/8             3/5253
------------------------------------------------------------------------------
libqt.so         16915/4563     2690/2690       5039/5039       4933/21669
------------------------------------------------------------------------------

Basically it transforms a large number of expensive R_386_32
relocations into comparatively cheap R_386_RELATIVE relocations.
This is a gain because it reduces the number of symbol lookups
during dynamic loading.

The following table gives the execution time of an empty main
function dynamically linked with the above shared libraries.  The
two header rows say how libqt.so and the testclass library were
built.  Times are in milliseconds, averaged over one hundred runs.

----------------------------------------------------------------
libqt.so          regular   regular   prelink   prelink
testclass*.so     regular   prelink   regular   prelink
----------------------------------------------------------------
testclass1.so        60        61        41        40
testclass2.so        63        62        40        40
testclass5.so        62        63        41        40
testclass10.so       64        63        43        40
testclass20.so       67        64        45        42
testclass50.so       74        68        54        45
----------------------------------------------------------------

This shows an improvement of more than 30% when everything gets
prelinked.  I made a few additional measurements using
LD_DEBUG=statistics.  These indicate even larger improvements.

I am progressively recompiling the C++ libraries on my system.  Last
night I recompiled "libqt.so.2.3.1" and installed it.  Then I
recompiled "libqtcups.so" and observed a dramatic speedup in the
startup time of "qtcups".  These tests provide extensive coverage of
the virtual table modifications.
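
As a check on the improvement quoted for the timing table above, the
following sketch recomputes it from the first column (everything
regular) and the last column (everything prelinked).  The struct and
function names are made up for this illustration; only the numbers
come from the table.

```cpp
#include <cassert>

// One row of the timing table: startup time in milliseconds with
// everything built regularly and with everything prelinked.
struct TimingRow {
    const char* library;
    double regular_ms;
    double prelinked_ms;
};

// Values copied from the first and last columns of the timing table.
static const TimingRow kTimings[] = {
    {"testclass1.so",  60, 40},
    {"testclass2.so",  63, 40},
    {"testclass5.so",  62, 40},
    {"testclass10.so", 64, 40},
    {"testclass20.so", 67, 42},
    {"testclass50.so", 74, 45},
};

// Relative reduction in startup time, as a percentage.
double improvement_percent(double before_ms, double after_ms) {
    assert(before_ms > 0.0);
    return 100.0 * (before_ms - after_ms) / before_ms;
}

// Convenience wrapper for one table row.
double improvement_percent(const TimingRow& row) {
    return improvement_percent(row.regular_ms, row.prelinked_ms);
}
```

Applied to the rows above, this gives about 33% for testclass1.so
(60 ms down to 40 ms) and about 39% for testclass50.so (74 ms down to
45 ms); the other rows fall in between.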

My initial plan was to use an R_386_PLT32 relocation in the stub
sections.  This would buy me lazy symbol binding and even faster
startup times.  This is trickier than it looks because one should
not jump into the PLT without the proper GOT pointer in %ebx.  I have
not been able to achieve this with an acceptable overhead.
Any ideas?