
List:       eros-arch
Subject:    Complexity isolation/smarter compilers
From:       "Jonathan S. Shapiro" <shap@eros.cis.upenn.edu>
Date:       1996-07-31 16:09:39

[ For the benefit of the eros-arch list, the background was a
discussion on surprising changes in IPC performance under L3/L4
between Pentium and i486 that proved to be due to changes in cache
organization.  Jochen Liedtke had to reorganize both the code and the
critical data structures to compensate. ]

  [ Jay wrote: ]

  That this kind [cache reference] of tuning is required suggests to
  me one of two things: 

  1) Any software system that is this fragile and dependent on this
  level of platform knowledge, will lose to a design that is
  substantially less so.

  2) Or, systems this sensitive need to be written in a language and
  compiled by a compiler more powerful than normal C and normal C
  compilers.  Let the machine do it, not people.

I've let this percolate in my head a bit, and I think these comments
can be refined a little.

Most systems, monolithic or microkernel, have portions that are
critically sensitive to machine architecture (e.g. the bcopy
implementation). The question, then, is not "how dependent is the
software on implementation artifacts", but rather "how well isolated
are the implementation dependencies, and how many are there?"
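
To make the isolation point concrete, here is a minimal sketch (the
file and symbol names are hypothetical, not EROS code) of confining a
machine-dependent block copy behind one portable interface:

    /* bcopy.h -- the portable interface; nothing machine-specific here. */
    #include <stddef.h>

    /* Copy 'len' bytes from 'from' to 'to'; regions must not overlap. */
    void fast_bcopy(const void *from, void *to, size_t len);

    /* bcopy_i486.c -- one of several per-architecture implementations,
     * selected at build time.  All knowledge of line size, unrolling,
     * and write buffers lives in files like this one, nowhere else.
     */
    #include "bcopy.h"

    void fast_bcopy(const void *from, void *to, size_t len)
    {
        const char *s = from;
        char *d = to;

        /* A real version would move word-at-a-time and unroll to suit
         * the machine; the byte loop keeps the sketch short. */
        while (len--)
            *d++ = *s++;
    }

The rest of the system sees only fast_bcopy(); porting it means
rewriting one small file rather than hunting for copy loops scattered
through the kernel.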

The defining characteristic of microkernel design is not size, but
critical dependency isolation and reduction. One reason (of many) that
operating system performance has not kept pace with processor
performance is that monoliths have poorly-isolated implementation
dependencies.  Even the conservative OS implementors agree that
microkernels are easier to maintain.

The point is simply that there is a continuum to look at here.


In reference to smarter compilers, I confess that I am skeptical about
the benefits for IPC but intrigued by the benefits for microkernels
overall.  Here are some thoughts derived from the EROS IPC
implementation process concerning where our performance came from.
All of these changes are good things to do independent of cache design
and TLB particulars, but on some cache/TLB designs they are more
important than on others.  Most of them are *not* source-language
sensitive.  I apologize for the length.

1. Align the IPC path to start at a page boundary and gather all of
   the supporting code into one page (or as few as possible).

2. Specify the common case message payloads in registers.

3. Reblock the code to minimize the number of taken branches.  This
   simultaneously optimizes I-cache footprint and I-cache hit rate.

   While we did this by hand for the EROS path, it has been shown
   (MIPS, HP) that the code generator can do it, given the dynamic
   taken likelihood of each branch.

   In the absence of a smart code generator, one can hand-write the C
   code (or whatever) to have the smart layout to begin with; a sketch
   of what that looks like at the source level follows this list.

4. Design the portion of the process structure that is touched by the
   IPC path with care to cluster fields according to reference
   patterns in the critical path.

   We approximated this by hand. Some promising work in automated
   *data* reorganization was done in the PL.8 compiler, but I don't
   know to what degree this was pursued elsewhere.

5. Gather all the stuff you are going to need to touch into a single
   structure. Pointer indirection hurts a lot.  In UNIX terms,
   everything the IPC touches (aside from payload) should live in the
   moral equivalent of the 'proc' structure.  A sketch of such a
   combined layout also follows this list.

6. Do all your setup before you move the data, since the block move
   must be assumed to trash the cache.
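
For (3), here is a minimal sketch of what the reblocked source looks
like.  The function, types, and flag test are hypothetical, and the
'unlikely' macro assumes a GCC-style __builtin_expect hint; written by
hand, the same effect comes simply from ordering the code this way to
begin with.

    /* Hypothetical fragment: every validity check is arranged so the
     * common case falls straight through and each failure is a
     * rarely-taken branch to out-of-line handling, keeping the hot
     * path contiguous in the I-cache.
     */
    #define unlikely(x)  __builtin_expect((x), 0)

    struct process;                      /* opaque; fields don't matter here */
    extern int deliver(struct process *invokee);
    extern int ipc_error(int code);

    int ipc_fast_path(struct process *invokee, unsigned flags)
    {
        if (unlikely(invokee == 0))
            goto bad_invokee;            /* cold, rarely taken */
        if (unlikely((flags & 0x3) != 0))
            goto bad_flags;              /* cold, rarely taken */

        return deliver(invokee);         /* common case: falls straight through */

    bad_invokee:
        return ipc_error(1);
    bad_flags:
        return ipc_error(2);
    }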

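For (4) and (5), a rough sketch of the sort of structure this leads to,
with purely illustrative field names (this is not the actual EROS
process structure).  The point is that every word the fast path touches
lives in one structure, grouped in the order the path references them,
and the aligned() attribute (a GCC extension) pins the hot group to a
cache line boundary -- 32 bytes matching a Pentium L1 line.

    /* Hypothetical process structure: fast-path fields first, grouped
     * by reference order; everything else follows.  On a 32-bit
     * machine the hot group below is exactly one 32-byte line.
     */
    struct proc {
        /* --- touched on every IPC, roughly in this order --- */
        unsigned long   state;           /* runnable / waiting / ...      */
        struct proc    *ipc_partner;     /* process being invoked         */
        unsigned long   msg_w0;          /* register-carried payload      */
        unsigned long   msg_w1;
        unsigned long   msg_w2;
        void           *recv_buf;        /* destination of the block move */
        unsigned long   recv_len;
        unsigned long   fault_code;

        /* --- cold fields: scheduling, accounting, debugging --- */
        unsigned long   priority;
        unsigned long   cpu_ticks;
        /* ... */
    } __attribute__((aligned(32)));

Because there is no pointer chasing, the fast path's references compile
down to constant offsets from a single base register, and the first
line fetched is entirely useful.
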
I think it's fair to suggest that (1) and (2) are obvious to everyone
at this point.  (3) will be obvious to anyone who has dealt with
optimization and code generation.  (6) is obvious once it is pointed
out, but is mostly a matter of doing things in the right place in the
code path. The real payoff was (4) and (5).

We found that, after building the IPC path this way, the extended
basic blocks were completely data-dependent, that essentially all
branches were data-dependent, and that the majority of branches were
not taken.

We could do a little better given another register or two to work
with, but this result should hold up across machines.  On a different
architecture, the message specification would live in different
registers, and would therefore reside at a different location in the
process structure, but *the data reference pattern doesn't change*
unless the semantics of the invocation changes.
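
To make the register dependence concrete, here is a minimal sketch of a
user-side invocation stub using GNU C register constraints.  The trap
number, register assignments, and payload width are hypothetical, not
the actual EROS ones; the point is that only this stub knows which
registers carry the message on a given machine.

    /* Hypothetical i386 invocation stub.  The payload goes into fixed
     * registers; the kernel then finds it at fixed offsets in the
     * saved register area of the process structure.
     */
    static inline long invoke(long key, long w0, long w1)
    {
        long result;
        __asm__ __volatile__("int $0x31"         /* hypothetical trap   */
                             : "=a" (result)
                             : "a" (key),        /* eax: invoked key    */
                               "b" (w0),         /* ebx: payload word 0 */
                               "S" (w1)          /* esi: payload word 1 */
                             : "memory", "cc");
        return result;
    }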

In practice, the location of the process structure in the cache is not
under OS designer control.  You can worry the alignment issue, but it
is very unlikely that the process structure and the IPC code will
block out nicely in a unified cache.

This all adds up to a curious thing: in the special case of a
completely data-dependent path that does no redundant data accesses
(in practice ours is forced on the Pentium to do 3 or 4 due to
register starvation), the optimal code and data layouts must begin at
proper cache line alignment boundaries, but otherwise are independent
of the cache structure provided that the cache has at least two sets.
Further, failure to align them correctly in such a cache causes at
most 3 marginal cache line references, which is a tiny effect.


We did a lot of tuning to the C path before we bit the bullet and
built the assembler path -- mostly because I'm stubborn.  In the end,
the assembly code wins for 3 reasons, which I expect are independent
of source language:

1. There are no live registers at the start of the code sequence,
   which violates the assumptions of the compiler.

2. There are no live registers at the end of the code sequence, which
   violates the assumptions of the compiler.

3. We have ignored the register conventions because this machine has
   so few registers.  We only just squeak by on the Pentium, but on a
   machine with only 2 or 3 more registers we could honor the calling
   convention at minimal cost.

The register starvation issue is really deadly.


So here's a radical thought:

The advantage to a really tiny system like L4 is that it's small
enough to compile as a single unit.  One could, in principle, collect
statistics on the operation of the entire kernel, feed the results
back into the compiler, and arrive at a kernel that was optimally
restructured for the application at hand.  The key to this is
automated data restructuring.
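
As a minimal sketch of that loop, assuming a GCC-style profile-feedback
mechanism (the flags below are GCC's; the file name is hypothetical,
and I'm ignoring the practical question of how a kernel gets its
counters written out):

    # 1. Build the whole kernel as one unit, instrumented for counts.
    gcc -O2 -fprofile-generate -c kernel.c -o kernel.o

    # 2. Run the workload of interest; the instrumentation records how
    #    often each branch is taken and each routine is entered.

    # 3. Rebuild from the recorded profile, letting the compiler
    #    reblock the code.  Automated *data* restructuring would slot
    #    in here.
    gcc -O2 -fprofile-use -c kernel.c -o kernel.o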


Jonathan
