[prev in list] [next in list] [prev in thread] [next in thread] 

List:       gcc-patches
Subject:    Re: [PATCH] Replace EVRP in DOM with ranger.
From:       Aldy Hernandez via Gcc-patches <gcc-patches () gcc ! gnu ! org>
Date:       2022-04-29 16:22:11
Message-ID: CAGm3qMX+-MQO0MXaN5_FzStW7mf73X9T80Wo4WdO2=Dhf1kCrg () mail ! gmail ! com
[Download RAW message or body]

On Fri, Apr 29, 2022 at 12:13 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Fri, Apr 29, 2022 at 11:53 AM Aldy Hernandez <aldyh@redhat.com> wrote:
> >
> > On Fri, Apr 29, 2022 at 9:09 AM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > >
> > > On Thu, Apr 28, 2022 at 8:44 PM Jeff Law via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > >
> > > >
> > > > On 4/28/2022 10:30 AM, Aldy Hernandez wrote:
> > > > > [Jeff, this is the same patch I sent you last week with minor tweaks
> > > > > to the commit message.]
> > > > And I just dropped it into the tester.  We should have the vast majority
> > > > of targets tested by this time tomorrow.
> > > >
> > > > >
> > > > > [Despite the verbosity of the message, this is actually a pretty
> > > > > straightforward patch.  It should've gone in last cycle, but there
> > > > > was a nagging regression I couldn't get to until after stage1
> > > > > had closed.]
> > > > You had other, non-work/gcc priorities.  No worries at missing gcc-12.
> > > >
> > > >
> > > > >
> > > > > There are 3 uses of EVRP in DOM that must be converted.
> > > > > Unfortunately, they need to be converted in one go, so further
> > > > > splitting of this patch would be problematic.
> > > > ACK
> > > >
> > > >
> > > > > There's nothing here earth shattering.  It's all pretty obvious in
> > > > > retrospect, but I've added a short description of each use to aid in
> > > > > reviewing:
> > > > >
> > > > > * Convert evrp use in cprop to ranger.
> > > > >
> > > > >    This is easy, as cprop in DOM was converted to the ranger API last
> > > > >    cycle, so this is just a matter of using a ranger instead of an
> > > > >    evrp_range_analyzer.
> > > > Yea.  We may have covered this before, but what does ranger do with a
> > > > conditional equivalence?
> > > >
> > > > ie if we have something like
> > > >
> > > > if (a == b)
> > > >    {
> > > >      blah blah
> > > >    }
> > > >
> > > > Within the true arm we know there's an equivalence.  *But* we don't want
> > > > to just blindly replace a with b or vice-versa.  The equivalence is
> > > > primarily useful for simplification rather than propagation.
> > > >
> > > > In fact, we every much do not want to cprop in these cases.   It rarely
> > > > helps in a meaningful way and there's no good way to know which is the
> > > > better value to use.  Canonicalization on SSA_NAMEs doesn't  really help
> > > > due to SSA_NAME recycling.
> > > >
> > > >
> > > > >
> > > > > * Convert evrp use in threader to ranger.
> > > > >
> > > > >    The idea here is to use the hybrid approach we used for the initial
> > > > >    VRP threader conversion last cycle.  The DOM threader will continue
> > > > >    using the forward threader infrastructure while continuing to query
> > > > >    DOM data structures, and only if the conditional does not relsolve,
> > > > >    using the ranger.  This gives us the best of both worlds, and is a
> > > > >    proven approach.
> > > > >
> > > > >    Furthermore, as frange and prange come live in the next cycle, we
> > > > >    can move away from the forward threader altogether, and just add
> > > > >    another backward threader.  This will not only remove the last use
> > > > >    of the forward threader, but will allow us to remove at least 1 or 2
> > > > >    threader instances.
> > > > Excellent.
> > > >
> > > > >
> > > > > * Convert conditional folding to use the method used by the ranger and
> > > > >    evrp.  Previously DOM was calling into the guts of
> > > > >    simplify_using_ranges::vrp_visit_cond_stmt.  The blessed way now is
> > > > >    using fold_cond() which rewrites the conditional and edges
> > > > >    automatically.
> > > > >
> > > > >    When legacy is removed, simplify_using_ranges will be further
> > > > >    cleaned up, and there will only be one entry point into simplifying
> > > > >    a statement.
> > > > Sure.
> > > >
> > > > >
> > > > > * DOM was setting global ranges determined from unreachable edges as a
> > > > >    side-effect of using the evrp engine.  We must handle these cases
> > > > >    before nuking evrp, and DOM seems like a good fit.  I've just moved
> > > > >    the snippet to DOM, but it could live anywhere else we do a DOM
> > > > >    walk.
> > > >   It was a semi-desirable to record/refine global ranges in DOM, but I'd
> > > > be surprised if too much code depended on that.  Presumably there's a
> > > > testcase in the suite which we don't handle well if we don't let DOM
> > > > refine/record global ranges?
> > > >
> > > > >
> > > > >    For the record, this is the case *vrp handled:
> > > > >
> > > > >       <bb C>:
> > > > >       ...
> > > > >       if (c_5(D) != 5)
> > > > >       goto <bb N>;
> > > > >       else
> > > > >       goto <bb M>;
> > > > >       <bb N>:
> > > > >       __builtin_unreachable ();
> > > > >       <bb M>:
> > > > >
> > > > >    If M dominates all uses of c_5, we can set the global range of c_5
> > > > >    to [5,5].
> > > > And that will only happen if M eventually loops back to the conditional,
> > > > right?  Otherwise all the uses wouldn't be dominated. I suspect this is
> > > > exceedingly rare in practice an it looks really familiar.  This is
> > > > coming from a testcase, right?
> > >
> > > This is how we make use of assert() when it uses __builtin_unreachable.
> > >
> > > Note the difficulty here is that we have to do this at a very specific point
> > > in the pass pipeline (setting the global range), since the first time we'll
> > > fold the if (c_5(D) != 5) it will be folded to false and the IL representation
> > > of the assert() is gone.
> > >
> > > So the idea was that with the IL present value-range analysis can determine
> > > the range omf c_5(D) on the false path but once we remove it (and we do
> > > want to remove it) we want to put a global range on c_5(D).
> > >
> > > That means the best of both worlds would be to detect this pattern at
> > > a specific point in the pass pipeline (somewhen before loop opts for sure)
> > > and do both - remove the IL _and_ set the global range.  The fear of
> > > doing this too early is of course that we lose the range by copy propagation
> > > or similar transforms.  The fear of doing this change at a single place only
> > > is that we miss it because there's a stray use before the conditional.
> > >
> > > Note DOM is also run in pass_ipa_oacc_kernels which may be too early.
> > >
> > > In the end we want to perform dead code elimination here so maybe
> > > DCE is the best fit to do this (but as said we want to avoid doing this too
> > > early).
> >
> > I agree that doing both would be the ideal scenario.  However, I would
> > prefer this be done in a follow-up patch to minimize the amount of
> > stuff being changed in one go.
> >
> > To provide additional background, this was being done in evrp which
> > runs at -O2, but was also being done at -O1 as a side-effect of DOM
> > using the evrp engine and exporting global ranges.  For -O2, I don't
> > know how much DOM was originally (prior to ranger) contributing to
> > these assert global ranges,  since evrp was probably picking them up
> > first.  However, since evrp now runs in ranger mode, this
> > functionality has silently moved to DOM (without us realizing) for all
> > compilation levels.
>
> I see.
>
> > Do we care about this functionality at -O1?  ISTR at least one -O1
> > testcase failing in RTL land if exporting global ranges isn't done in
> > DOM.  I don't remember if it was a global range due to a
> > __builtin_unreachable or otherwise.
>
> Well, at -O1 we definitely didn't set the global range (did we?) but
> eventually pruned paths running into __builtin_unreachable ()
> anyway (would have to check at which point).

I think we always have:

evrp_range_analyzer analyzer (/*update_global_ranges=*/true);

??

DOM has been running as an incognito evrp even at -O1.  I don't know
if this has been by design, or as an unintended side-effect.  I mean,
it has been visiting all conditionals in the IL as well as calling
evrp_range_analyzer::record_ranges_from_stmt() on each statement it
tries to optimize.  And of course, updating global ranges along the
way.

>
> >
> > One more thing, we have batted around the idea of running a fast evrp
> > even at -O1 (ranger without looking at back edges, IIRC).  This would
> > mean that when the DOM threader becomes disentangled from DOM (when we
> > have frange and prange), there's very little DOM will be doing that
> > for instance PRE can't get.  So perhaps, if we remove DOM, the place
> > to put this optimization is in the fast evrp, and fold the conditional
> > as has been suggested?  Anyways... I'm digressing.
>
> The idea is to get rid of DOM once its threading is separated out.

Excellent.  There's no reason why we couldn't separate threading out
in this release.

>
> I think doing O (n * log n) value range analysis at -O1 would be welcome.
> We didn't get around doing that for EVRP which I think had such
> bound - the iterating VRP definitely didn't.  I'm not sure ranger really
> qualifies given gori and all it relies on.  The advantage of the simple
> whole-IL walk way was to reasonably easily guarantee such bound.
> That's unfortunately lost now :/

Andrew was mumbling something about a fast ranger mode for this
release that should be on par with legacy evrp.  IIRC it would be
purely DOM based, won't visit back edges, and there's no caching.  But
he'll have to expand on it when he returns from vacation.  I don't
know the details.

>
> > Doing this optimization in DOM at least keeps with GCC 12, but I'm
> > happy to move it to whatever is recommended.
>
> As said the intent of this trick is to preserve range information from
> asserts even when we removed the IL carrying it.  When we care
> about range info, which means when some VRP pass runs, which
> is why it was done in such a pass.  That is, we probably want to
> avoid altering the global range for -fno-tree-vrp.

Not just -fno-tree-vrp then.  There are other range consumers.  For
example, CCP, threading, etc.  Threading will look at global ranges
even in dumb mode, and CCP uses global ranges in a round about way--
by looking at global nonzero bits which the evrp code diligently sets:

        {
          set_ssa_range_info (vrs[i].first, vrs[i].second);
          maybe_set_nonzero_bits (pred_e, vrs[i].first);
        }

For example, in this particular testcase, DOM will set the global
range, and CCP will remove at least one of the conditionals by looking
at nonzero bits.

>
> > The test case is gcc.dg/builtin-unreachable-6a.c by the way.
>
> But 6.c checks the same but even w/ DOM disabled...

That's what I thought, but they actually test the opposite:

/* { dg-final { scan-tree-dump-times "__builtin_unreachable" 1 "fab1" } } */
/* { dg-final { scan-tree-dump-not "__builtin_unreachable" "fab1" } } */

The non-DOM, non-VRP, version checks that the unreachable is still
there.  Whereas the plain -O2 version checks that we've cleaned it up.

>
> > >
> > > > This is the only part of the patch that makes me go "ewwww", so I would
> > > > like to at least explore if we can rethink the value of that test.
> > > >
> > > > > There is a a regression on Wstringop-overflow-4.C which I'm planning
> > > > > on XFAILing.  It's another variant of the usual middle-end false
> > > > > positives: having no ranges produces no warnings, but slightly refined
> > > > > ranges, or worse-- isolating specific problematic cases in the
> > > > > threader causes flare-ups.
> > > > ACK.
> > > >
> > > >
> > > > >
> > > > > As an aside, as Richi has suggested, I think we should discuss
> > > > > restricting the threader's ability to thread highly unlikely paths.
> > > > > These cause no end of pain for middle-end warnings.  However,
> > > > > I don't know if this would conflict with path isolation for
> > > > > things like null dereferencing.  ISTR you were interested in this.
> > > > I've never though too  much about it.  You're thinking of Richi :-)
> > > >
> > > > It's a balance.  I bet if we stop threading those paths we're going to
> > > > see other issues start popping up like uninitialized warnings.
> > > >
> > > > It may be the case that it's really the constant propagation on an
> > > > isolated path that's more problematical for those warnings.  But I would
> > > > probably contend that what isolation is doing is showing how certain
> > > > constant values can flow into places where they're not expected and
> > > > could cause problems.  So I wouldn't necessary suggest throttling it.
> > > >
> > > > Happy to be proven wrong here!
> > >
> > > I think we need to revisit the whole costing of threading which is of course
> > > best done when we only have the backwards threader left.  I also always
> > > wanted to play with some kind of optimistic code generation & rollback
> > > on things like threading and unrolling, basically value-number the
> > > duplicated stmts on-the-fly and make it "stop" when we reach a defined
> > > threshold (and then throw away the result).  I never went around to
> > > do this for unrolling since there we can end up duplicating diverging
> > > control flow, but that at least doesn't happen with threading(?)
> >
> > WRT stopping after a threshold is reached,
> > back_threader_profitablity::profitable_path_p() bails after a number
> > of statements is reached.  If so, the path is not even put into the
> > queue.  So, no rolling back would be necessary.  Is this what you
> > meant, or something else?  The threaders (all variants) queue things
> > up and modify the IL at the end of the pass.
>
> I meant that as currently we have no idea about followup simplifications
> on the isolated path we conservatively account for all stmts in there
> and thus need an optimistically high threading --param.  We should
> be able to do better by simplifying copied stmts with the some now
> constant PHIs or conditionally derived values.  So we could do
> "unbound" analysis but when we instantiate stop the actual copying
> if simplification doesn't keep us within the desired bounds.
> For example unrolling tries to estimate what's going to be optimized
> away but if we actually did the simplification that would be more precise.

Very cool!

Aldy

>
> > >
> > > And just a note - please wait with installing this change - if approved - until
> > > after GCC 12 is actually released.
> >
> > Indeed.  This is far too invasive for the interim.  I just sent the
> > patch early so Jeff could get a chance to review it before I leave on
> > paternity leave.  But he's also volunteered to kick it over the goal
> > line if we can't get it in before May 9.
> >
> > Aldy
> >
>

[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic