'Re: [PATCH v9 bpf-next 3/5] bpf: Add skb dynptrs'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       linux-netdev
Subject:    Re: [PATCH v9 bpf-next 3/5] bpf: Add skb dynptrs
From:       Joanne Koong <joannelkoong () gmail ! com>
Date:       2023-01-31 23:17:08
Message-ID: CAJnrk1bF+g_2SQ8HaNx0Fb+E42HBi3XJa3M=+y71=XHN1_wdrg () mail ! gmail ! com
[Download RAW message or body]

 oOn Mon, Jan 30, 2023 at 9:30 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> 
> On Mon, Jan 30, 2023 at 08:43:47PM -0800, Andrii Nakryiko wrote:
> > On Mon, Jan 30, 2023 at 5:49 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > > 
> > > On 1/30/23 5:04 PM, Andrii Nakryiko wrote:
> > > > On Mon, Jan 30, 2023 at 2:31 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > 
> > > > > On Mon, Jan 30, 2023 at 02:04:08PM -0800, Martin KaFai Lau wrote:
> > > > > > On 1/27/23 11:17 AM, Joanne Koong wrote:
> > > > > > > @@ -8243,6 +8316,28 @@ static int check_helper_call(struct \
> > > > > > > bpf_verifier_env *env, struct bpf_insn *insn mark_reg_known_zero(env, \
> > > > > > > regs, BPF_REG_0); regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > > > > > regs[BPF_REG_0].mem_size = meta.mem_size;
> > > > > > > +           if (func_id == BPF_FUNC_dynptr_data &&
> > > > > > > +               dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > > > > > +                   bool seen_direct_write = env->seen_direct_write;
> > > > > > > +
> > > > > > > +                   regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
> > > > > > > +                   if (!may_access_direct_pkt_data(env, NULL, \
> > > > > > > BPF_WRITE)) +                           regs[BPF_REG_0].type |= \
> > > > > > > MEM_RDONLY; +                   else
> > > > > > > +                           /*
> > > > > > > +                            * Calling may_access_direct_pkt_data() \
> > > > > > > will set +                            * env->seen_direct_write to true \
> > > > > > > if the skb is +                            * writable. As an \
> > > > > > > optimization, we can ignore +                            * setting \
> > > > > > > env->seen_direct_write. +                            *
> > > > > > > +                            * env->seen_direct_write is used by skb
> > > > > > > +                            * programs to determine whether the skb's \
> > > > > > > page +                            * buffers should be cloned. Since \
> > > > > > > data slice +                            * writes would only be to the \
> > > > > > > head, we can skip +                            * this.
> > > > > > > +                            */
> > > > > > > +                           env->seen_direct_write = seen_direct_write;
> > > > > > > +           }
> > > > > > 
> > > > > > [ ... ]
> > > > > > 
> > > > > > > @@ -9263,17 +9361,26 @@ static int check_kfunc_args(struct \
> > > > > > > bpf_verifier_env *env, struct bpf_kfunc_call_ return ret;
> > > > > > > break;
> > > > > > > case KF_ARG_PTR_TO_DYNPTR:
> > > > > > > +           {
> > > > > > > +                   enum bpf_arg_type dynptr_arg_type = \
> > > > > > > ARG_PTR_TO_DYNPTR; +
> > > > > > > if (reg->type != PTR_TO_STACK &&
> > > > > > > reg->type != CONST_PTR_TO_DYNPTR) {
> > > > > > > verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
> > > > > > > return -EINVAL;
> > > > > > > }
> > > > > > > -                   ret = process_dynptr_func(env, regno, insn_idx,
> > > > > > > -                                             ARG_PTR_TO_DYNPTR | \
> > > > > > > MEM_RDONLY); +                   if (meta->func_id == \
> > > > > > > special_kfunc_list[KF_bpf_dynptr_from_skb]) +                           \
> > > > > > > dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB; +                   \
> > > > > > > else +                           dynptr_arg_type |= MEM_RDONLY;
> > > > > > > +
> > > > > > > +                   ret = process_dynptr_func(env, regno, insn_idx, \
> > > > > > > dynptr_arg_type, +                                             \
> > > > > > > meta->func_id); if (ret < 0)
> > > > > > > return ret;
> > > > > > > break;
> > > > > > > +           }
> > > > > > > case KF_ARG_PTR_TO_LIST_HEAD:
> > > > > > > if (reg->type != PTR_TO_MAP_VALUE &&
> > > > > > > reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
> > > > > > > @@ -15857,6 +15964,14 @@ static int fixup_kfunc_call(struct \
> > > > > > > bpf_verifier_env *env, struct bpf_insn *insn, desc->func_id == \
> > > > > > > special_kfunc_list[KF_bpf_rdonly_cast]) { insn_buf[0] = \
> > > > > > >                 BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
> > > > > > > *cnt = 1;
> > > > > > > +   } else if (desc->func_id == \
> > > > > > > special_kfunc_list[KF_bpf_dynptr_from_skb]) { +           bool \
> > > > > > > is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
> > > > > > 
> > > > > > Does it need to restore the env->seen_direct_write here also?
> > > > > > 
> > > > > > It seems this 'seen_direct_write' saving/restoring is needed now because
> > > > > > 'may_access_direct_pkt_data(BPF_WRITE)' is not only called when it is
> > > > > > actually writing the packet. Some refactoring can help to avoid issue \
> > > > > > like this.
> > > > > > 
> > > > > > While at 'seen_direct_write', Alexei has also pointed out that the \
> > > > > > verifier needs to track whether the (packet) 'slice' returned by \
> > > > > > bpf_dynptr_data() has been written. It should be tracked in \
> > > > > > 'seen_direct_write'. Take a look at how reg_is_pkt_pointer() and \
> > > > > > may_access_direct_pkt_data() are done in check_mem_access(). iirc, this \
> > > > > > reg_is_pkt_pointer() part got loss somewhere in v5 (or v4?) when \
> > > > > > bpf_dynptr_data() was changed to return register typed PTR_TO_MEM instead \
> > > > > > of PTR_TO_PACKET.
> > > > > 
> > > > > btw tc progs are using gen_prologue() approach because data/data_end are \
> > > > > not kfuncs (nothing is being called by the bpf prog).
> > > > > In this case we don't need to repeat this approach. If so we don't need to
> > > > > set seen_direct_write.
> > > > > Instead bpf_dynptr_data() can call bpf_skb_pull_data() directly.
> > > > > And technically we don't need to limit it to skb head. It can handle any \
> > > > > off/len. It will work for skb, but there is no equivalent for \
> > > > > xdp_pull_data(). I don't think we can implement xdp_pull_data in all \
> > > > > drivers. That's massive amount of work, but we need to be consistent if we \
> > > > > want dynptr to wrap both skb and xdp.
> > > > > We can say dynptr_data is for head only, but we've seen bugs where people
> > > > > had to switch from data/data_end to load_bytes.
> > > > > 
> > > > > Also bpf_skb_pull_data is quite heavy. For progs that only want to parse
> > > > > the packet calling that in bpf_dynptr_data is a heavy hammer.
> > > > > 
> > > > > It feels that we need to go back to skb_header_pointer-like discussion.
> > > > > Something like:
> > > > > bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void \
> > > > > *buffer) Whether buffer is a part of dynptr or program provided is tbd.
> > > > 
> > > > making it hidden within dynptr would make this approach unreliable
> > > > (memory allocations, which can fail, etc). But if we ask users to pass
> > > > it directly, then it should be relatively easy to use in practice with
> > > > some pre-allocated per-CPU buffer:
> 
> bpf_skb_pull_data() is even more unreliable, since it's a bigger allocation.
> I like preallocated approach more, so we're in agreement here.
> 
> > > > 
> > > > 
> > > > struct {
> > > > __int(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> > > > __int(max_entries, 1);
> > > > __type(key, int);
> > > > __type(value, char[4096]);
> > > > } scratch SEC(".maps");
> > > > 
> > > > 
> > > > ...
> > > > 
> > > > 
> > > > struct dyn_ptr *dp = bpf_dynptr_from_skb(...).
> > > > void *p, *buf;
> > > > int zero = 0;
> > > > 
> > > > buf = bpf_map_lookup_elem(&scratch, &zero);
> > > > if (!buf) return 0; /* can't happen */
> > > > 
> > > > p = bpf_dynptr_slice(dp, off, 16, buf);
> > > > if (p == NULL) {
> > > > /* out of range */
> > > > } else {
> > > > /* work with p directly */
> > > > }
> > > > 
> > > > /* if we wrote something to p and it was copied to buffer, write it back */
> > > > if (p == buf) {
> > > > bpf_dynptr_write(dp, buf, 16);
> > > > }
> > > > 
> > > > 
> > > > We'll just need to teach verifier to make sure that buf is at least 16
> > > > byte long.
> > > 
> > > A fifth __sz arg may do:
> > > bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void
> > > *buffer, u32 buffer__sz);
> > 
> > We'll need to make sure that buffer__sz is >= len (or preferably not
> > require extra size at all). We can check that at runtime, of course,
> > but rejecting too small buffer at verification time would be a better
> > experience.
> 
> I don't follow. Why two equivalent 'len' args ?
> Just to allow 'len' to be a variable instead of constant ?
> It's unusual for the verifier to have 'len' before 'buffer',
> but this is fixable.
> 
> How about adding 'rd_only vs rdwr' flag ?
> Then MEM_RDONLY for ret value of bpf_dynptr_slice can be set by the verifier
> and in run-time bpf_dynptr_slice() wouldn't need to check for skb->cloned.
> if (rd_only) return skb_header_pointer()
> if (rdwr) bpf_try_make_writable(); return skb->data + off;
> and final bpf_dynptr_write() is not needed.
> 
> But that doesn't work for xdp, since there is no pull.
> 
> It's not clear how to deal with BPF_F_RECOMPUTE_CSUM though.
> Expose __skb_postpull_rcsum/__skb_postpush_rcsum as kfuncs?
> But that defeats Andrii's goal to use dynptr as a generic wrapper.
> skb is quite special.

If it's the common case that skbs use the same flag across writes in
their bpf prog, then we can have bpf_dynptr_from_skb take in
BPF_F_RECOMPUTE_CSUM/BPF_F_INVALIDATE_HASH in its flags arg and then
always apply this when the skb does a write to packet data.

> 
> Maybe something like:
> void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len,
> void *buffer, u32 buffer__sz)
> {
> if (skb_cloned()) {
> skb_copy_bits(skb, offset, buffer, len);
> return buffer;
> }
> return skb_header_pointer(...);
> }
> 
> When prog is just parsing the packet it doesn't need to finalize with \
> bpf_dynptr_write. The prog can always write into the pointer followed by if (p == \
> buf) bpf_dynptr_write. No need for rdonly flag, but extra copy is there in case of \
> cloned which could have been avoided with extra rd_only flag.

We're able to track in the verifier whether the slice gets written to
or not, so if it does get written to in the skb case, can't we just
add in a call to bpf_try_make_writable() as a post-processing fixup
that gets called before bpf_dynptr_slice? Then bpf_dynptr_slice() can
just return a directly writable ptr and avoid the extra memcpy

> 
> In case of xdp it will be:
> void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len,
> void *buffer, u32 buffer__sz)
> {
> ptr = bpf_xdp_pointer(xdp, offset, len);
> if (ptr)
> return ptr;
> bpf_xdp_copy_buf(xdp, offset, buffer, len, false); /* copy into buf */
> return buffer;
> }
> 
> bpf_dynptr_write will use bpf_xdp_copy_buf(,true); /* copy into xdp */
> 
> > > 
> > > The bpf prog usually has buffer in the stack for the common small header \
> > > parsing.
> > 
> > sure, that would work for small chunks
> > 
> > > 
> > > One side note is the bpf_dynptr_slice() still needs to check if the skb is
> > > cloned or not even the off/len is within the head range.
[...]


[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic