'[cfe-dev] Typed memory (split from byte type)'

List:       cfe-dev
Subject:    [cfe-dev] Typed memory (split from byte type)
From:       Jon Chesterfield via cfe-dev <cfe-dev () lists ! llvm ! org>
Date:       2021-06-14 9:03:40
Message-ID: CAOUYtQAr-JTX-nMwJdc39SKei8e71ApU1SmisrL+pyoh+dWXVw () mail ! gmail ! com

TLDR: we should tag loads/stores to indicate whether they track or
disregard provenance, rather than add a new integer type.

Treading carefully here as this is running close to a religious issue -
I've seen morality invoked as an argument against reinterpret_cast, though
thankfully not on this mailing list.

A comment from the "introducing a byte type" thread:

> Last week Alive2 caught a miscompilation in the Linux kernel, in the
> network stack. The optimization got the pointer arithmetic wrong. Pretty
> scary, and the bug may have security implications.

Typed memory seems a reasonable consequence of the pointer provenance model
C++ is pursuing (hard to judge whether WG14 is going the same way). It is
probably a reasonable model for clang++ (maybe clang) to work with on those
grounds.

Some of the things that typed memory would rule out are mmap data
structures from disk (e.g. ELF, hash tables) and ptrtoint trickery (NaN
boxing, pointer tagging). Losing mmap of an ELF makes toolchains slower.
Treating NaN boxing or pointer tagging as UB means dynamic languages need
to find a new host. Network code is infamous for reading raw bytes off the
wire. Implementing parts of libm involves reading the
mantissa/exponent/sign bit of IEEE floats. Pragmatism will therefore
motivate an escape hatch, like memcpy was on untyped memory.
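
A minimal sketch of two of the patterns mentioned above, as they are
commonly written today. The names (FloatFields, decompose, tag_pointer,
untag_pointer, tag_of) are hypothetical and only for illustration. It
assumes float is a 32-bit IEEE-754 value and that the tagged pointee is
8-byte aligned, so the low three pointer bits are free.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper type, for illustration only.
struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t mantissa;  // 23 bits, implicit leading 1 not stored
};

// libm-style bit inspection. memcpy is the escape hatch: it copies the
// object representation instead of reading a float through an integer
// lvalue. Assumes float is 32-bit IEEE-754 binary32.
inline FloatFields decompose(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return FloatFields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

// Pointer tagging: stash a small tag in the low bits that alignment
// leaves zero. Assumes the pointee is 8-byte aligned.
inline std::uintptr_t tag_pointer(const std::uint64_t* p, unsigned tag) {
    std::uintptr_t raw = reinterpret_cast<std::uintptr_t>(p);
    assert((raw & 0x7u) == 0 && tag < 8);
    return raw | tag;
}

// Recover the tag from the low three bits.
inline unsigned tag_of(std::uintptr_t tagged) {
    return static_cast<unsigned>(tagged & 0x7u);
}

// Clear the tag bits and convert back to a usable pointer. This
// integer-to-pointer round trip is the step a strict provenance model
// has to account for.
inline const std::uint64_t* untag_pointer(std::uintptr_t tagged) {
    return reinterpret_cast<const std::uint64_t*>(tagged & ~std::uintptr_t{7});
}
```

For example, decompose(1.0f) yields sign 0, exponent 127, mantissa 0,
and untag_pointer(tag_pointer(&x, 3)) gives back &x for an 8-byte
aligned x.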

Assume for a moment that LLVM manages to represent typed memory and untyped
memory, in some fashion that remains internally consistent. Clang can then
mostly emit IR using typed memory while the escape hatches (memcpy,
bitcast, I haven't kept up with the implicit object creation proposals)
emit IR using untyped memory.

Languages that want to be machine-like can emit untyped memory IR and ones
that want to be highlevel-like can emit typed memory IR, with ones that try
to do both emitting a mixture.

This sounds like a different load/store instruction to me, not a different
type. It's still an 'i32' once it's in an SSA variable. More likely a tag
on the load/store to indicate whether pointer provenance tracks through it
or not, since it seems likely we can relax the 'typed' version to 'untyped'
before hitting codegen.

There is some prior art here on atomic. From the machine perspective,
'atomic' is obviously a property of the instruction. From the C++
perspective, 'atomic' is defined as a property of the type. We seem to
handle that difference in perspective well enough so we can probably handle
untyped/typed memory at the boundary to memory, which is mostly load/store.
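
To make the atomic comparison concrete, here is a small sketch of the
two perspectives side by side. The function names are hypothetical. The
second half relies on the __atomic_* builtins provided by GCC and Clang,
which are not standard C++; it only illustrates atomicity being chosen
per access on a plain int and says nothing about how a typed/untyped
tag would actually be spelled.

```cpp
#include <atomic>
#include <cassert>

// C++ perspective: atomicity is a property of the type.
inline int bump_typed(std::atomic<int>& counter) {
    return counter.fetch_add(1, std::memory_order_seq_cst) + 1;
}

// Machine-like perspective: a plain int, with atomicity attached to
// each individual operation (GCC/Clang builtin, not standard C++).
inline int bump_plain(int* counter) {
    return __atomic_fetch_add(counter, 1, __ATOMIC_SEQ_CST) + 1;
}

// The same plain int can be read with a different ordering on a
// different access, because the ordering belongs to the operation.
inline int read_relaxed(int* counter) {
    return __atomic_load_n(counter, __ATOMIC_RELAXED);
}

// Usage: both views produce the same observable result.
inline void check_views_agree() {
    std::atomic<int> typed{0};
    int plain = 0;
    assert(bump_typed(typed) == bump_plain(&plain));
    assert(read_relaxed(&plain) == typed.load());
}
```

In both cases the value is an ordinary int once it has been loaded;
only the access carries the extra semantics.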

Thanks all,

Jon

p.s. I'd much rather we throw out pointer provenance and go back to the
good old days where no-strict-aliasing was just how things worked, because
I seem to write code in domains that collide with that edge a lot, mostly
in toolchains.

_______________________________________________
cfe-dev mailing list
cfe-dev@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
