
List:       linux-kernel
Subject:    Re: [PATCH v3] tty: tty_io: remove hung_up_tty_fops
From:       "Paul E. McKenney" <paulmck@kernel.org>
Date:       2024-05-04 22:04:49
Message-ID: 37195203-9a13-46aa-9cc0-5effea3c4b0e@paulmck-laptop

On Sat, May 04, 2024 at 12:11:10PM -0700, Linus Torvalds wrote:
> On Sat, 4 May 2024 at 11:18, Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > Here are my current thoughts on possible optimizations of non-volatile
> > memory_order_relaxed atomics:
> >
> > o       Loads from the same variable that can legitimately be
> >         reordered to be adjacent to one another can be fused
> >         into a single load.
> 
> Let's call this "Rule 1"
> 
> I think you can extend this to also "can be forwarded from a previous store".

Agreed, with constraints based on intervening synchronization.

> I also think you're too strict in saying "fused into a single load".
> Let me show an example below.

I certainly did intend for any errors to be in the direction of being
too strict.

> > o       Stores to the same variable that can legitimately be
> >         reordered to be adjacent to one another can be replaced
> >         by the last store in the series.
> 
> Rule 2.
> 
> Ack, although again, I think you're being a bit too strict in your
> language, and the rule can be relaxed.
> 
> > o       Loads and stores may not be invented.
> 
> Rule 3.
> 
> I think you can and should relax this. You can invent loads all you want.

I might be misunderstanding you, but given my understanding, I disagree.
Consider this example:

	x = __atomic_load(&a, RELAXED);
	r0 = x * x + 2 * x + 1;

It would not be good for a load to be invented as follows:

	x = __atomic_load(&a, RELAXED);
	invented = __atomic_load(&a, RELAXED);
	r0 = x * x + 2 * invented + 1;

In the first example, we know that r0 is a perfect square, at least
assuming that x is small enough to avoid wrapping.  In the second
example, x might not be equal to the value from the invented load,
and r0 might not be a perfect square.

I believe that we really need the compiler to keep basic arithmetic
working.

That said, I agree that this disallows directly applying current
CSE optimizations, which might make some people sad.  But we do need
code to work regardless.

Again, it is quite possible that I am misunderstanding you here.

> > o       The only way that a computation based on the value from
> >         a given load can instead use some other load is if the
> >         two loads are fused into a single load.
> 
> Rule 4.
> 
> I'm not convinced that makes sense, and I don't think it's true as written.
> 
> I think I understand what you are trying to say, but I think you're
> saying it in a way that only confuses a compiler person.
> 
> In particular, the case I do not think is true is very much the
> "spill" case: if you have code like this:
> 
>     a = expression involving '__atomic_load_n(xyz, RELAXED)'
> 
> then it's perfectly fine to spill the result of that load and reload
> the value. So the "computation based on the value" *is* actually based
> on "some other load" (the reload).

As in the result is stored to a compiler temporary and then reloaded
from that temporary?  Agreed, that would be just fine.  In contrast,
spilling and reloading from xyz would not be good at all.

> I really *REALLY* think you need to explain the semantics in concrete
> terms that a compiler writer would understand and agree with.

Experience would indicate that I should not dispute that sentence.  ;-)

> So to explain your rules to an actual compiler person (and relax the
> semantics a bit) I would rewrite your rules as:
> 
>  Rule 1: a strictly dominating load can be replaced by the value of a
> preceding load or store
> 
>  Rule 2: a strictly dominating store can remove preceding stores
> 
>  Rule 3: stores cannot be done speculatively (put another way: a
> subsequent dominating store can only *remove* a store entirely, it
> can't turn the store into one with speculative data)
> 
>  Rule 4: loads cannot be rematerialized (ie a load can be *combined*
> as per Rule 1, but a load cannot be *split* into two loads)

I still believe that synchronization operations need a look-in, and
I am not sure what is being dominated in your Rules 1 and 2 (all
subsequent execution?), but let's proceed.

> Anyway, let's get to the examples of *why* I think your language was
> bad and your rules were too strict.
> 
> Let's start with your Rule 3, where you said:
> 
>  - Loads and stores may not be invented
> 
> and while I think this should be very very true for stores, I think
> inventing loads is not only valid, but a good idea. Example:
> 
>     if (a)
>         b = __atomic_load_n(ptr) + 1;
> 
> can perfectly validly just be written as
> 
>     tmp = __atomic_load_n(ptr);
>     if (a)
>         b = tmp+1;
> 
> which in turn may allow other optimizations (ie depending on how 'b'
> is used, the conditional may go away entirely, and you just end up
> with 'b = tmp+!!a').
> 
> There's nothing wrong with extra loads that aren't used.

From a functional viewpoint, if the value isn't used, then agreed,
inventing the load is harmless.  But there are some code sequences where
I really wouldn't want the extra cache miss.

> And to make that more explicit, let's look at Rule 1:
> 
> Example of Rule 1 (together with the above case):
> 
>     if (a)
>         b = __atomic_load_n(ptr) + 1;
>     else
>         b =  __atomic_load_n(ptr) + 2;
>     c = __atomic_load_n(ptr) + 3;
> 
> then that can perfectly validly re-write this all as
> 
>     tmp = __atomic_load_n(ptr);
>     b = a ? tmp+1 : tmp+2;
>     c = tmp + 3;
> 
> because my version of Rule 1 allows the dominating load used for 'c'
> to be replaced by the value of a preceding load that was used for 'a'
> and 'b'.

OK, I thought that nodes early in the control-flow graph dominated
nodes that are later in that graph, but I am not a compiler expert.

In any case, I agree with this transformation.  This is making three
loads into one load, and there is no intervening synchronization to gum
up the works.

> And to give an example of Rule 2, where you said "reordered to be
> adjacent", I'm saying that all that matters is being strictly
> dominant, so
> 
>     if (a)
>         __atomic_store_n(ptr,1);
>     else
>         __atomic_store_n(ptr,2);
>     __atomic_store_n(ptr,3);
> 
> can be perfectly validly be combined into just
> 
>     __atomic_store_n(ptr,3);
> 
> because the third store completely dominates the two others, even if
> in the intermediate form they are not necessarily ever "adjacent".

I agree with this transformation as well.  But suppose that the code
also contained an smp_mb() right after that "if" statement.  Given that,
it is not hard to construct a larger example in which dropping the first
two stores would be problematic.

> (Your "adjacency" model might still be valid in how you could turn
> first of the first stores to be a fall-through, then remove it, and
> then turn the other to be a fall-through and then remove it, so maybe
> your language isn't _technically_ wrong, but I think the whole
> "dominating store" notion is how a compiler writer would think about it).

I was thinking in terms of first transforming the code as follows:

	if (a) {
		__atomic_store_n(ptr,1);
		__atomic_store_n(ptr,3);
	} else {
		__atomic_store_n(ptr,2);
		__atomic_store_n(ptr,3);
	}

(And no, I would not expect a real compiler to do this!)

Then it is clearly OK to further transform into the following:

	if (a) {
		__atomic_store_n(ptr,3);
	} else {
		__atomic_store_n(ptr,3);
	}

At which point both branches of the "if" statement are doing the
same thing, so:

	__atomic_store_n(ptr,3);

On to your next email!

							Thanx, Paul
