
List:       binutils
Subject:    Re: x86-64: new CET-enabled PLT format proposal
From:       Rui Ueyama via Binutils <binutils@sourceware.org>
Date:       2022-02-28 3:46:08
Message-ID: CACKH++ZC2W8m5wwu-hfBzdpgta3841A9K6htMU_0yZPn=jZYYA@mail.gmail.com

On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils
> <binutils@sourceware.org> wrote:
> >
> > Hello,
> >
> > I'd like to propose an alternative instruction sequence for the Intel
> > CET-enabled PLT section. Compared to the existing one, the new scheme is
> > simpler and more compact (16 bytes vs. 32 bytes for each PLT entry), and it
> > does not require a second PLT section (.plt.sec).
> >
> > Here is the proposed code sequence:
> >
> >   PLT0:
> >
> >   f3 0f 1e fa        // endbr64
> >   41 53              // push %r11
> >   ff 35 00 00 00 00  // push GOT[1]
> >   ff 25 00 00 00 00  // jmp *GOT[2]
> >   0f 1f 40 00        // nop
> >   0f 1f 40 00        // nop
> >   0f 1f 40 00        // nop
> >   66 90              // nop
> >
> >   PLTn:
> >
> >   f3 0f 1e fa        // endbr64
> >   41 bb 00 00 00 00  // mov $namen_reloc_index, %r11d
> >   ff 25 00 00 00 00  // jmp *GOT[namen_index]
>
> All PLT calls will have an extra MOV.

One extra load-immediate mov instruction is executed per function call
through a PLT entry. The cost is so small that I couldn't measure any
difference in real-world apps.

> > GOT[namen_index] is initialized to PLT0 for all PLT entries, so that when a
> > PLT entry is called for the first time, the control is passed to PLT0 to call
> > the resolver function.
> >
> > It uses %r11 as a scratch register. The x86-64 psABI explicitly allows PLT
> > entries to clobber this register (*1), and the resolver function
> > (_dl_runtime_resolve) already clobbers it.
> >
> > (*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to be
> > preserved, nor is it used to pass arguments. Making this register available as
> > scratch register means that code in the PLT need not spill any registers when
> > computing the address to which control needs to be transferred."
> >
> > FYI, this is the current CET-enabled PLT:
> >
> >   PLT0:
> >
> >   ff 35 00 00 00 00    // push GOT[1]
> >   f2 ff 25 e3 2f 00 00 // bnd jmp *GOT[2]
> >   0f 1f 00             // nop
> >
> >   PLTn in .plt:
> >
> >   f3 0f 1e fa          // endbr64
> >   68 00 00 00 00       // push $namen_reloc_index
> >   f2 e9 e1 ff ff ff    // bnd jmpq PLT0
> >   90                   // nop
> >
> >   PLTn in .plt.sec:
> >
> >   f3 0f 1e fa          // endbr64
> >   f2 ff 25 ad 2f 00 00 // bnd jmpq *GOT[namen_index]
> >   0f 1f 44 00 00       // nop
> >
> > In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes. In
> > the existing format, PLT0 is 16 bytes and each entry is 32 bytes. Usually, we
> > have many PLT entries but only one header, so in practice, the proposed
> > format is almost 50% smaller than the existing one.
>
> Does it have any impact on performance?   .plt.sec can be placed
> in a different page from .plt.
>
> > The proposed PLT does not use jump instructions with BND prefix, as Intel MPX
> > has been deprecated.
> >
> > I already implemented the proposed scheme in my linker
> > (https://github.com/rui314/mold), and it appears to work fine.
> >
> > Any thoughts?
>
> I'd like to see visible performance improvements or new features in
> a new PLT layout.

I didn't see any visible performance improvement in real-world apps. I
could probably craft a microbenchmark that hammers PLT entries in some
pattern to show a difference, but I don't think that would be very
meaningful. The size reduction is real, though.

> I cced x86-64 psABI mailing list.
>
>
> --
> H.J.
