
List:       gcc
Subject:    Re: Importance of transformations that turn data dependencies into control dependencies?
From:       Richard Biener <richard.guenther () gmail ! com>
Date:       2016-03-01 9:40:28
Message-ID: CAFiYyc1Q=vb0xz2K6UhJ=j=uz6_BvG0wGXdqpLBqnvOf43TwCg () mail ! gmail ! com

On Fri, Feb 26, 2016 at 8:10 PM, Torvald Riegel <triegel@redhat.com> wrote:
> On Fri, 2016-02-26 at 11:49 +0100, Richard Biener wrote:
>> On Thu, Feb 25, 2016 at 6:33 PM, Torvald Riegel <triegel@redhat.com> wrote:
>> > On Wed, 2016-02-24 at 13:14 +0100, Richard Biener wrote:
>> >> On Tue, Feb 23, 2016 at 8:38 PM, Torvald Riegel <triegel@redhat.com> wrote:
>> >> > I'd like to know, based on the GCC experience, how important we consider
>> >> > optimizations that may turn data dependencies of pointers into control
>> >> > dependencies.  I'm thinking about all optimizations or transformations
>> >> > that guess that a pointer might have a specific value, and then create
>> >> > (specialized) code that assumes this value that is only executed if the
>> >> > pointer actually has this value.  For example:
>> >> >
>> >> > int d[2] = {23, compute_something()};
>> >> >
>> >> > int compute(int v) {
>> >> >   if (likely(v == 23)) return 23;
>> >> >   else <lots of stuff>;
>> >> > }
>> >> >
>> >> > int bar() {
>> >> >   int *p = ptr.load(memory_order_consume);
>> >> >   size_t reveal_that_p_is_in_d = p - d;
>> >> >   return compute(*p);
>> >> > }
>> >> >
>> >> > Could be transformed to (after inlining compute(), and specializing for
>> >> > the likely path):
>> >> >
>> >> > int bar() {
>> >> >   int *p = ptr.load(memory_order_consume);
>> >> >   if (p == d) return 23;
>> >> >   else <lots of stuff(*p)>;
>> >> > }
>> >>
>> >> Note that if a user writes
>> >>
>> >>   if (p == d)
>> >>    {
>> >>      ... do lots of stuff via p ...
>> >>    }
>> >>
>> >> GCC might rewrite accesses to p as accesses to d and thus expose
>> >> those opportunities.  Is that a transform that isn't valid then or is
>> >> the code written by the user (establishing the equivalency) to blame?
>> >
>> > In the context of this memory_order_consume proposal, this transform
>> > would be valid because the program has already "reveiled" what value p
>> > has after the branch has been taken.
>> >
>> >> There's a PR where this kind of equivalencies lead to unexpected (wrong?)
>> >> points-to results for example.
>> >>
>> >> > Other potential examples that come to mind are de-virtualization, or
>> >> > feedback-directed optimizations that have observed at runtime that a
>> >> > certain pointer is likely to be always equal to some other pointer (eg.,
>> >> > if p is almost always d[0], and specializing for that).
>> >>
>> >> That's the cases that are quite important in practice.
>> >
>> > Could you quantify this somehow, even if it's a very rough estimate?
> I'm asking because if it's significant and widely used, then this would
>> > require users or compiler implementors to make a difficult trade-off
>> > (ie, do you want mo_consume performance or performance through those
>> > other optimizations?).
>>
>> Probably resolved by your follow-up on how the transform is safe anyway.
>
> Are you saying this because de-virtualization would be the only
> optimization that matters for performance and turns data into control
> dependencies (but can be considered safe)?
>
> What about feedback-directed optimizations, for which you said it would
> be one of the important cases too?  Will these only look for equivalences in
> non-mutable data too (ie, like vtables)?

We do have other value-profiling transforms - for example division/modulo
by a constant:

  if (x == 256)
    z = y / 256;
  else
    z = y / x;

> What about the example I gave above?  Is it unrealistic for compilers to
> ever do something like this, or is it just unlikely to gain much
> performance, or is it just that GCC does not do this today?

GCC does not do this today with the exception of value-profiling.  GCC in
other cases does not establish equivalences but only relations (a < b, etc.)
which are not a problem as far as I can see because they do not allow
changing expressions using a to use b.

>> >> > Also, it would be interesting to me to know how often we may turn data
>> >> > dependencies into control dependencies in cases where this doesn't
>> >> > affect performance significantly.
>> >>
>> >> I suppose we try to avoid that but can we ever know for sure?  Like
>> >> speculative devirtualization does this (with the intent that it _does_ matter,
>> >> of course).
>> >>
>> >> I suppose establishing non-dependence isn't an issue, like with the
>> >> vectorizer adding runtime dependence checks and applying versioning
>> >> to get a vectorized and a not vectorized loop (in case there are dependences)?
>> >
>> > I'm not sure I understand you correctly.  Do you have a brief example,
>> > perhaps?  For mo_consume and its data dependencies, if there might be a
>> > dependence, the compiler would have to preserve it; but I guess that
>> > both a vectorized loop and one that accesses each element separately
>> > would preserve dependences because it's doing those accesses, and they
>> > depend on the input data.
>> > OTOH, perhaps HW vector instructions don't get the ordering guarantees
>> > from data dependences -- Paul, do you know of any such cases?
>>
>> A brief example would be for
>>
>>  void foo (int *a, int *b, int n)
>> {
>>   for (int i = 0; i < n; ++i)
>>    a[i] = b[i];
>> }
>>
>> which we can vectorize like
>>
>>   if (a + n < b || b + n < a)
>>    {
>>       vectorized loop
>>    }
>>   else
>>     {
>>        not vectorized loop
>>     }
>>
>> note how we're not establishing equivalences between pointers but
>> non-dependence vs. possible dependence.
>
> Okay, non-dependence of this kind (ie, disjoint-access parallelism) isn't
> a problem, as long as both the vectorized and non-vectorized loops would
> perform the memory accesses using HW instructions that communicate the
> dependence to the HW (ie, use a / b as base) and for which the HW
> establishes ordering due to the dependences (ie, my question about
> vector ops to Paul).
>
> Note that if the non-vectorized loop would do something like this:
> if (a + n > b && a < b)
>   {
>     int n2 = n - (b-a);
>     for (;a < b;a++, b++)  /* vectorized copy */
>     for (i = n2;i < n; i++) b[i] = b[i+n2];
>   }
>
> We'd have a problem because the second part doesn't communicate the
> dependence on a to the HW.

Huh?  IVOPTs does create such IVs already.  But the actual access
is of course still "original" so I fail to see how the HW does not see
the dependence.

> Maybe that's another class of transforms we'd have to consider?  That
> is, transforms that figure out that some derived pointers (eg, a+n2 / b)
> are equal and then carry forward with using just one of those base
> dependencies in the generated code?

GCC does this today (though it causes some issues inside GCC itself, so we
have tried to avoid it - but certainly not for HW correctness reasons,
which I fail to see).

>> >> > The background for this question is Paul McKenney's recently updated
>> >> > proposal for a different memory_order_consume specification:
>> >> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0190r0.pdf
>> >> >
>> >> > In a nutshell, this requires a compiler to either prove that a pointer
>> >> > value is not carrying a dependency (simplified, its value somehow
>> >> > originates from a memory_order_consume load), or it has to
>> >> > conservatively assume that it does; if it does, the compiler must not
>> >> > turn data dependencies into control dependencies in generated code.
>> >> > (The data dependencies, in contrast to control dependencies, enforce
>> >> > memory ordering on archs such as Power and ARM; these orderings then
>> >> > allow for not having to use an acquire HW barrier in the generated
>> >> > code.)
>> >> >
>> >> > Given that such a proof will likely be hard for a compiler (dependency
>> >> > chains propagate through assignments to variables on the heap and stack,
>> >> > chains are not marked in the code, and points-to analysis can be hard),
>> >> > a compiler faces a trade-off between either:
>> >> > (1) trying to support this memory_order_consume specification and likely
>> >> > disallowing all transformations that change data dependencies into
>> >> > control dependencies, or
>> >> > (2) not support the proposal by simply emitting memory_order_acquire
>> >> > code, but get no new constraints on transformations in return (ie, what
>> >> > we do for memory_order_consume today).
>> >> >
>> >> > A compiler could let users make this choice, but this will be hard for
>> >> > users too, and the compiler would still have to pick a default.
>> >> >
>> >> > Therefore, it would be good to know how important such transformations
>> >> > or optimizations are in practice.  If they are, it either means somewhat
>> >> > slower code everywhere (or at least having to change much in today's
>> >> > compilers), or not supporting the new memory_order_consume (at least not
>> >> > in the default setting of the compiler).
>> >>
>> >> IMHO all the memory order semantics are too complicated already so what's
>> >> the reason to complicate it even more...?
>> >
>> > The hardware memory models of archs like Power and ARM guarantee ordering
>> > for accesses that have a data dependency, thus allowing things like
>> > traversing a concurrently modified list to not have to use acquire
>> > memory barriers (ie, a programmer could use memory_order_consume instead
>> > of memory_order_acquire).  The Linux kernel uses this heavily in RCU,
>> > for example.  It doesn't matter for archs like x86 where acquire /
>> > release barriers are implicit.
>>
>> That's memory commit ordering on a single CPU core, right?  How would it
>> be ever valid to change commit order when there is a data dependence?
>
> No, this is across cores.  One can do this to publish an initialized
> object, for example:
>   obj.data = init();
>   ptr.store(obj, memory_order_release);
> and consume it, being sure that one will see initialized data:
>   p = ptr.load(memory_order_acquire);
>   my_data = p->data;
> This will emit an acquire HW fence (or similar) on archs such as Power.
>
> What mo_consume wants to enable is programmers to write this:
>   p = ptr.load(memory_order_consume);
>   my_data = p->data;
> This would be compiled to a plain atomic load of ptr (no acquire fence,
> so like memory_order_relaxed), followed by a plain memory access to p
> plus the offset of data.  The HW instruction of the latter would
> transform dependences such as the one on ptr into an ordering guarantee
> that makes sure that the load from data does not "happen before" the
> load from ptr, so that even in this case we'd always observe an
> initialized object.

But there is a data dependence between the two instructions so I fail to
see how the HW can ever execute the load from p->data before the
load of p itself (from ptr).  I'm quite sure no HW does value-speculation
of 'p' to be able to load p->data before actually knowing the value of 'p' ;)

Richard.
