[prev in list] [next in list] [prev in thread] [next in thread] 

List:       gcc
Subject:    Re: Importance of transformations that turn data dependencies into control dependencies?
From:       "Paul E. McKenney" <paulmck () linux ! vnet ! ibm ! com>
Date:       2016-02-26 14:42:51
Message-ID: 20160226144251.GB3522 () linux ! vnet ! ibm ! com
[Download RAW message or body]

On Fri, Feb 26, 2016 at 11:49:29AM +0100, Richard Biener wrote:
> On Thu, Feb 25, 2016 at 6:33 PM, Torvald Riegel <triegel@redhat.com> wrote:
> > On Wed, 2016-02-24 at 13:14 +0100, Richard Biener wrote:
> >> On Tue, Feb 23, 2016 at 8:38 PM, Torvald Riegel <triegel@redhat.com> wrote:
> >> > I'd like to know, based on the GCC experience, how important we consider
> >> > optimizations that may turn data dependencies of pointers into control
> >> > dependencies.  I'm thinking about all optimizations or transformations
> >> > that guess that a pointer might have a specific value, and then create
> >> > (specialized) code that assumes this value that is only executed if the
> >> > pointer actually has this value.  For example:
> >> >
> >> > int d[2] = {23, compute_something()};
> >> >
> >> > int compute(int v) {
> >> >   if (likely(v == 23)) return 23;
> >> >   else <lots of stuff>;
> >> > }
> >> >
> >> > int bar() {
> >> >   int *p = ptr.load(memory_order_consume);
> >> >   size_t reveal_that_p_is_in_d = p - d[0];
> >> >   return compute(*p);
> >> > }
> >> >
> >> > Could be transformed to (after inlining compute(), and specializing for
> >> > the likely path):
> >> >
> >> > int bar() {
> >> >   int *p = ptr.load(memory_order_consume);
> >> >   if (p == d) return 23;
> >> >   else <lots of stuff(*p)>;
> >> > }
> >>
> >> Note that if a user writes
> >>
> >>   if (p == d)
> >>    {
> >>      ... do lots of stuff via p ...
> >>    }
> >>
> >> GCC might rewrite accesses to p as accesses to d and thus expose
> >> those opportunities.  Is that a transform that isn't valid then or is
> >> the code written by the user (establishing the equivalency) to blame?
> >
> > In the context of this memory_order_consume proposal, this transform
> > would be valid because the program has already "reveiled" what value p
> > has after the branch has been taken.
> >
> >> There's a PR where this kind of equivalencies lead to unexpected (wrong?)
> >> points-to results for example.
> >>
> >> > Other potential examples that come to mind are de-virtualization, or
> >> > feedback-directed optimizations that has observed at runtime that a
> >> > certain pointer is likely to be always equal to some other pointer (eg.,
> >> > if p is almost always d[0], and specializing for that).
> >>
> >> That's the cases that are quite important in practice.
> >
> > Could you quantify this somehow, even if it's a very rough estimate?
> > I'm asking because it's significant and widely used, then this would
> > require users or compiler implementors to make a difficult trade-off
> > (ie, do you want mo_consume performance or performance through those
> > other optimizations?).
> 
> Probably resoved by your followup on how the transform is safe anyway.
> 
> >> > Also, it would be interesting to me to know how often we may turn data
> >> > dependencies into control dependencies in cases where this doesn't
> >> > affect performance significantly.
> >>
> >> I suppose we try to avoid that but can we ever know for sure?  Like
> >> speculative devirtualization does this (with the intent that it _does_ matter,
> >> of course).
> >>
> >> I suppose establishing non-dependence isn't an issue, like with the
> >> vectorizer adding runtime dependence checks and applying versioning
> >> to get a vectorized and a not vectorized loop (in case there are dependences)?
> >
> > I'm not sure I understand you correctly.  Do you have a brief example,
> > perhaps?  For mo_consume and its data dependencies, if there might be a
> > dependence, the compiler would have to preserve it; but I guess that
> > both a vectorized loop an one that accessses each element separately
> > would preserve dependences because it's doing those accesses, and they
> > depend on the input data.
> > OTOH, peraps HW vector instructions don't get the ordering guarantees
> > from data dependences -- Paul, do you know of any such cases?
> 
> A brief example would be for
> 
>  void foo (int *a, int *b, int n)
> {
>   for (int i = 0; i < n; ++i)
>    a[i] = b[i];
> }
> 
> which we can vectorize like
> 
>   if (a + n < b || b + n < a)
>    {
>       vectorized loop
>    }
>   else
>     {
>        not vectorized loop
>     }
> 
> note how we're not establishing equivalences between pointers but
> non-dependence vs. possible dependence.

Thank you for the clarification, I will check.

If I am not too confused, such a loop might want to used x86 non-temporal
stores, which require special handling even with acquire and release,
but that is presumably already handled.

							Thanx, Paul

> >> > The background for this question is Paul McKenney's recently updated
> >> > proposal for a different memory_order_consume specification:
> >> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0190r0.pdf
> >> >
> >> > In a nutshell, this requires a compiler to either prove that a pointer
> >> > value is not carrying a dependency (simplified, its value somehow
> >> > originates from a memory_order_consume load), or it has to
> >> > conservatively assume that it does; if it does, the compiler must not
> >> > turn data dependencies into control dependencies in generated code.
> >> > (The data dependencies, in contrast to control dependencies, enforce
> >> > memory ordering on archs such as Power and ARM; these orderings than
> >> > allow for not having to use an acquire HW barrier in the generated
> >> > code.)
> >> >
> >> > Given that such a proof will likely be hard for a compiler (dependency
> >> > chains propagate through assignments to variables on the heap and stack,
> >> > chains are not marked in the code, and points-to analysis can be hard),
> >> > a compiler faces a trade-off between either:
> >> > (1) trying to support this memory_order_consume specification and likely
> >> > disallowing all transformations that change data dependencies into
> >> > control dependencies, or
> >> > (2) not support the proposal by simply emitting memory_order_acquire
> >> > code, but get no new constraints on transformations in return (ie, what
> >> > we do for memory_order_consume today).
> >> >
> >> > A compiler could let users make this choice, but this will be hard for
> >> > users too, and the compiler would still have to pick a default.
> >> >
> >> > Therefore, it would be good to know how important such transformations
> >> > or optimizations are in practice.  If they are, it either means somewhat
> >> > slower code everywhere (or at least having to change much in todays
> >> > compilers), or not supporting the new memory_order_consume (at least not
> >> > in the default setting of the compiler).
> >>
> >> IMHO all the memory order semantics are too complicated already so what's
> >> the reason to complicate it even more...?
> >
> > The hardware memory models of archs like Power an ARM guarantee ordering
> > for accesses that have a data dependency, thus allowing things like
> > traversing a concurrently modified list to not have to use acquire
> > memory barriers (ie, a programmer could use memory_order_consume instead
> > of memory_order_acquire).  The Linux kernel uses this heavily in RCU,
> > for example.  It doesn't matter for archs like x86 where acquire /
> > release barriers are implicit.
> 
> That's memory commit ordering on a single CPU core, right?  How would it
> be ever valid to change commit order when there is a data dependence?
> 
> > Paul is looking for a solution that would make both the compiler and the
> > kernel happy; being able to standardize this in C/C++ would be another
> > good outcome from his perspective, I believe.  For GCC, the former might
> > also have benefits because we'd have clearer requirements for something
> > the Linux kernel is using heavily.
> > The Linux kernel guys seem to be opposed to marking the dependency
> > chains a program relies on (except for the initial memory_order_consume
> > load).  I pointed out that this proposal would disallow the
> > transformations I have asked about in this email, and Paul wanted to
> > know how important they are.  So I asked here ... :)
> >
> > If the optimizations that would get disallowed in practice are important
> > for performance, my inclination would be that we might need a different
> > proposal for C++ standardization.  For the kernel use case only, it
> > could nonetheless still be interesting for us to further investigate it;
> > if the kernel doesn't need these optimizations, we'd at least have a
> > better understanding of what the kernel expects and what we deliver.
> >
> >
> 

[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic