List:       kde-devel
Subject:    Faster startups by fixing C++ object files before linking  ( 30-50% faster startups )
From:       Leon Bottou <leonb () research ! att ! com>
Date:       2001-07-26 22:01:41

Reading Waldo Bastian's text on C++ shared libraries gave me some ideas.  
Eventually I spent a few nights trying them.

The proposed scheme modifies the object files before linking
in a way that reduces the number of expensive relocations.
Startup times are reduced by 30 to 50%.

See the attached files for details.

Regards,

- Leon Bottou


["objprelink.c.gz" (application/x-gzip)]
["README" (text/plain)]

Waldo Bastian's document <http://www.suse.de/~bastian/Export/linking.txt>
demonstrates that the current g++ implementation generates lots of expensive
run-time relocations.  This translates into the slow startup of large C++
applications (KDE, StarOffice, etc.).  

The attached program "objprelink.c" is designed to reduce the problem. 
Expect startup times 30-50% faster.


1) HOWTO
=========

You must first compile objprelink.c as follows:

    $ gcc -O2 -o objprelink objprelink.c -lbfd -liberty

This program must be run on every object file (.o file) that 
composes the application or shared library.   

For the KDE packages, for instance, the simplest way consists of
first making a regular build.  The following commands then fix
all object files, and relink all executables and libraries.

    $ find . -name '*.o' -exec objprelink {} \;
    $ find . -name '*.lo' -exec touch {} \;
    $ make

Another approach consists of tweaking the Makefiles.
That works well for QT.




2) PRINCIPLE
=============

The name "objprelink" means that the program must be run before linking shared
libraries or executables.  I will explain the idea using Waldo's little
programs "testclassN.cpp".

-----------------------------------------------------------------
testclassN.cpp
-----------------------------------------------------------------
#include <qwidget.h>
template<int T> class testclass : public QWidget {
public:
  virtual void setSizeIncrement(int w, int h) 
    { QWidget::setSizeIncrement(w+T, h+T); }
};
template class testclass<1>;
template class testclass<2>;
....                           // as many as we want.
template class testclass<N>;
-----------------------------------------------------------------


Let's first compile this program using the regular method.

    $ g++ -c -I$QTDIR/include testclass1.cpp
    $ g++ -shared -o testclass1.so testclass1.o -L$QTDIR/lib -lqt

The resulting object file "testclass1.o" contains several sections.
One section contains the virtual table for the class testclass<1>.
Here are the relocations for this section:

----------------------------------------------------------------
BEFORE (vtable relocs for testclass<1>)
----------------------------------------------------------------
RELOCATION RECORDS FOR [.gnu.linkonce.d.__vt_t9testclass1i1]:
OFFSET   TYPE              VALUE
00000004 R_386_32 __tft9testclass1i1
00000008 R_386_32 _._t9testclass1i1
0000000c R_386_32 event__7QWidgetP6QEvent
00000010 R_386_32 eventFilter__7QObjectP7QObjectP6QEvent
00000014 R_386_32 metaObject__C7QWidget
00000018 R_386_32 className__C7QWidget
0000001c R_386_32 setName__7QWidgetPCc
....
----------------------------------------------------------------

Each of these relocations requires an expensive symbol lookup at run time.
There will be a relocation to function QWidget::className(..) in the vtable of
every class that inherits from QWidget.  The same will happen for each of the
70+ virtual functions defined by QWidget.

The "objprelink" program adds one indirection into the vtables.  It inserts a
stub section for each function appearing in vtables and moves the expensive
relocation there:

----------------------------------------------------------------
AFTER  (stub for QWidget::className)
----------------------------------------------------------------
DISASSEMBLY OF [.gnu.linkonce.t.stub.className__C7QWidget]:
00000000 <.gnu.linkonce.t.stub.className__C7QWidget>:
   0:   b8 00 00 00 00          mov    $0x0,%eax
                        1: R_386_32     className__C7QWidget
   5:   ff e0                   jmp    *%eax
----------------------------------------------------------------

The trick is that there is only one such section per function.  This
section is shared by all the QWidget subclasses defined in this library.  
The vtable relocs are then modified to point to the stub sections.
These relocs will become R_386_RELATIVE in the shared object and
will not require a symbol lookup.

----------------------------------------------------------------
AFTER (vtable relocs for testclass<1>)
----------------------------------------------------------------
RELOCATION RECORDS FOR [.gnu.linkonce.d.__vt_t9testclass1i1]:
OFFSET   TYPE              VALUE
00000004 R_386_32 .gnu.linkonce.t.stub.__tft9testclass1i1
00000008 R_386_32 .gnu.linkonce.t.stub._._t9testclass1i1
0000000c R_386_32 .gnu.linkonce.t.stub.event__7QWidgetP6QEvent
00000010 R_386_32 .gnu.linkonce.t.stub.eventFilter__7QObjectP7QObjectP6QEvent
00000014 R_386_32 .gnu.linkonce.t.stub.metaObject__C7QWidget
00000018 R_386_32 .gnu.linkonce.t.stub.className__C7QWidget
0000001c R_386_32 .gnu.linkonce.t.stub.setName__7QWidgetPCc
....
----------------------------------------------------------------
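
The effect of this indirection on the number of symbol lookups can be
illustrated with a small C++ model.  This is only a sketch and not part of
objprelink: a vtable is reduced to the list of symbol names its slots refer
to, and the names VTable, make_classes, lookups_regular and lookups_prelinked
are hypothetical, chosen for the illustration.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical model: a vtable is the list of symbol names its slots refer to.
using VTable = std::vector<std::string>;

// Build n_classes vtables.  Each one refers to n_inherited functions shared by
// all classes (think of the QWidget virtuals) plus n_own functions of its own.
std::vector<VTable> make_classes(int n_classes, int n_inherited, int n_own) {
    std::vector<VTable> classes;
    for (int c = 0; c < n_classes; ++c) {
        VTable vt;
        for (int i = 0; i < n_inherited; ++i)
            vt.push_back("base_fn_" + std::to_string(i));
        for (int j = 0; j < n_own; ++j)
            vt.push_back("class" + std::to_string(c) + "_fn_" + std::to_string(j));
        classes.push_back(vt);
    }
    return classes;
}

// Regular object files: every vtable slot carries its own symbolic relocation,
// so the model counts one symbol lookup per slot.
std::size_t lookups_regular(const std::vector<VTable>& classes) {
    std::size_t n = 0;
    for (const VTable& vt : classes) n += vt.size();
    return n;
}

// Fixed object files: each distinct function gets exactly one stub, and only
// the stub needs a symbol lookup.  Vtable slots point to stubs and need none.
std::size_t lookups_prelinked(const std::vector<VTable>& classes) {
    std::set<std::string> stubs;
    for (const VTable& vt : classes) stubs.insert(vt.begin(), vt.end());
    return stubs.size();
}
```

With 4 classes that share 3 inherited functions and define 1 function each,
the model counts 16 lookups for the regular objects and 7 for the fixed ones.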

One important point is that "objprelink" does not change the symbol table.
Undefined symbols remain undefined.  Defined symbols remain defined.  It just
changes the relocation records without modifying the linking semantics.
This is not like option -Bsymbolic.



3) RESULTS
===========


The following table compares the numbers of relocations in shared libraries
generated from regular object files (before the slash) and from fixed object
files (after the slash).  Figures are provided for some testclassN programs
and also for the QT library.

------------------------------------------------------------------------------
                     R_386_32  R_386_GLOB_DAT  R_386_JUMP_SLOT  R_386_RELATIVE
------------------------------------------------------------------------------
testclass1.so         106/105       9/9             8/8            3/108
testclass2.so         212/110      13/13            8/8            3/213
testclass5.so         530/125      25/25            8/8            3/528
testclass10.so       1060/150      45/45            8/8            3/1053
testclass20.so       2120/200      85/85            8/8            3/2103
testclass50.so       5300/350     205/205           8/8            3/5253
------------------------------------------------------------------------------
libqt.so            16915/4563   2690/2690        5039/5039      4933/21669
------------------------------------------------------------------------------

In effect, objprelink converts a large number of expensive R_386_32
relocations into comparatively cheap R_386_RELATIVE relocations.  This is a
gain because it reduces the number of symbol lookups during dynamic loading.
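
The testclassN rows of the relocation table happen to follow simple linear
formulas in N.  The sketch below encodes that fit in C++.  The constants are
read off the table and are only an observation about these measurements; they
are not produced by objprelink and need not hold for other code.  The names
RelocCounts, regular and prelinked are made up for the illustration.

```cpp
// Relocation counts for testclassN.so, as fitted to the table above.
struct RelocCounts {
    int r386_32;        // symbolic relocations (need a symbol lookup)
    int r386_relative;  // relative relocations (no symbol lookup)
};

// Regular object files: 106 R_386_32 per instantiation, 3 R_386_RELATIVE.
constexpr RelocCounts regular(int n) {
    return {106 * n, 3};
}

// Fixed object files: the R_386_32 count grows by only 5 per instantiation,
// while 105 relocations per instantiation become R_386_RELATIVE.
constexpr RelocCounts prelinked(int n) {
    return {100 + 5 * n, 3 + 105 * n};
}

// Spot checks against the table rows for N = 1 and N = 50.
static_assert(regular(1).r386_32 == 106, "testclass1.so, regular");
static_assert(prelinked(1).r386_32 == 105, "testclass1.so, fixed");
static_assert(prelinked(50).r386_relative == 5253, "testclass50.so, fixed");
```

For N = 50 this gives 350 R_386_32 and 5253 R_386_RELATIVE relocations after
fixing, as in the table, against 5300 R_386_32 relocations before.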

The following table gives the execution time of an empty main function
dynamically linked with the above shared libraries.  Units are milliseconds
averaged over one hundred runs.

----------------------------------------------------------------
libqt.so              regular     regular    prelink    prelink   
testclass*.so         regular     prelink    regular    prelink   
----------------------------------------------------------------
testclass1.so           60          61         41         40    
testclass2.so           63          62         40         40    
testclass5.so           62          63         41         40    
testclass10.so          64          63         43         40
testclass20.so          67          64         45         42
testclass50.so          74          68         54         45
----------------------------------------------------------------

This shows a 33 to 39% improvement when everything gets prelinked.  
I made a few additional measurements using LD_DEBUG=statistics.
These indicate even larger improvements.
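
The relative gains can be recomputed from the timing table with a few lines
of C++.  This is a minimal sketch: the figures are copied from the table, and
the names Timing, timings and reduction_percent are hypothetical.

```cpp
#include <string>
#include <vector>

// One row of the timing table, in milliseconds.  Fields follow the table's
// column order, given as (libqt.so, testclass*.so):
//   base      = (regular, regular)    test_only = (regular, prelink)
//   qt_only   = (prelink, regular)    both      = (prelink, prelink)
struct Timing {
    std::string library;
    double base, test_only, qt_only, both;
};

// Percentage reduction in startup time when going from `before` to `after`.
constexpr double reduction_percent(double before, double after) {
    return 100.0 * (before - after) / before;
}

// The six rows of the timing table.
const std::vector<Timing>& timings() {
    static const std::vector<Timing> rows = {
        {"testclass1.so",  60, 61, 41, 40},
        {"testclass2.so",  63, 62, 40, 40},
        {"testclass5.so",  62, 63, 41, 40},
        {"testclass10.so", 64, 63, 43, 40},
        {"testclass20.so", 67, 64, 45, 42},
        {"testclass50.so", 74, 68, 54, 45},
    };
    return rows;
}
```

For testclass50.so, reduction_percent gives about 8% when only the test
library is fixed, about 27% when only libqt.so is fixed, and about 39% when
both are fixed.  In these measurements most of the gain comes from libqt.so.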

I am progressively recompiling the C++ libraries on my system.  Last night
I recompiled "libqt.so.2.3.1" and installed it.  Then I recompiled
"libqtcups.so" and observed a dramatic speedup in the startup time of "qtcups".
These tests provide extensive coverage of the virtual table modifications.


My initial plan was to use an R_386_PLT32 relocation in the stub
sections.  This would buy me lazy symbol binding and even faster startup
times.  This is trickier than it looks because one should not jump into the
PLT without the proper GOT pointer in %ebx.  I have not been able to achieve
this with an acceptable overhead.  Any ideas?


