1. 20 Apr, 2023 4 commits
  2. 19 Apr, 2023 1 commit
  3. 18 Apr, 2023 7 commits
  4. 17 Apr, 2023 4 commits
    • Yonghong Song's avatar
      selftests/bpf: Add a selftest for checking subreg equality · 49859de9
      Yonghong Song authored
      Add a selftest to ensure subreg equality if source register
      upper 32bit is 0. Without previous patch, the test will
      fail verification.
      Acked-by: default avatarEduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20230417222139.360607-1-yhs@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      49859de9
    • Yonghong Song's avatar
      bpf: Improve verifier u32 scalar equality checking · 3be49f79
      Yonghong Song authored
      In [1], I tried to remove bpf-specific codes to prevent certain
      llvm optimizations, and add llvm TTI (target transform info) hooks
      to prevent those optimizations. During this process, I found
      if I enable llvm SimplifyCFG:shouldFoldTwoEntryPHINode
      transformation, I will hit the following verification failure with selftests:
      
        ...
        8: (18) r1 = 0xffffc900001b2230       ; R1_w=map_value(off=560,ks=4,vs=564,imm=0)
        10: (61) r1 = *(u32 *)(r1 +0)         ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
        ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
        11: (79) r2 = *(u64 *)(r6 +152)       ; R2_w=scalar() R6=ctx(off=0,imm=0)
        ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
        12: (55) if r2 != 0xb9fbeef goto pc+10        ; R2_w=195018479
        13: (bc) w2 = w1                      ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
        ; if (test < __NR_TESTS)
        14: (a6) if w1 < 0x9 goto pc+1 16: R0=2 R1_w=scalar(umax=8,var_off=(0x0; 0xf)) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R6=ctx(off=0,imm=0) R10=fp0
        ;
        16: (27) r2 *= 28                     ; R2_w=scalar(umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4)
        17: (18) r3 = 0xffffc900001b2118      ; R3_w=map_value(off=280,ks=4,vs=564,imm=0)
        19: (0f) r3 += r2                     ; R2_w=scalar(umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4) R3_w=map_value(off=280,ks=4,vs=564,umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4)
        20: (61) r2 = *(u32 *)(r3 +0)
        R3 unbounded memory access, make sure to bounds check any such access
        processed 97 insns (limit 1000000) max_states_per_insn 1 total_states 10 peak_states 10 mark_read 6
        -- END PROG LOAD LOG --
        libbpf: prog 'ingress_fwdns_prio100': failed to load: -13
        libbpf: failed to load object 'test_tc_dtime'
        libbpf: failed to load BPF skeleton 'test_tc_dtime': -13
        ...
      
      At insn 14, with condition 'w1 < 9', register r1 is changed from an arbitrary
      u32 value to `scalar(umax=8,var_off=(0x0; 0xf))`. Register r2, however, remains
      as an arbitrary u32 value. Current verifier won't claim r1/r2 equality if
      the previous mov is alu32 ('w2 = w1').
      
      If r1 upper 32bit value is not 0, we indeed cannot clamin r1/r2 equality
      after 'w2 = w1'. But in this particular case, we know r1 upper 32bit value
      is 0, so it is safe to claim r1/r2 equality. This patch exactly did this.
      For a 32bit subreg mov, if the src register upper 32bit is 0,
      it is okay to claim equality between src and dst registers.
      
      With this patch, the above verification sequence becomes
      
        ...
        8: (18) r1 = 0xffffc9000048e230       ; R1_w=map_value(off=560,ks=4,vs=564,imm=0)
        10: (61) r1 = *(u32 *)(r1 +0)         ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
        ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
        11: (79) r2 = *(u64 *)(r6 +152)       ; R2_w=scalar() R6=ctx(off=0,imm=0)
        ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
        12: (55) if r2 != 0xb9fbeef goto pc+10        ; R2_w=195018479
        13: (bc) w2 = w1                      ; R1_w=scalar(id=6,umax=4294967295,var_off=(0x0; 0xffffffff)) R2_w=scalar(id=6,umax=4294967295,var_off=(0x0; 0xffffffff))
        ; if (test < __NR_TESTS)
        14: (a6) if w1 < 0x9 goto pc+1        ; R1_w=scalar(id=6,umin=9,umax=4294967295,var_off=(0x0; 0xffffffff))
        ...
        from 14 to 16: R0=2 R1_w=scalar(id=6,umax=8,var_off=(0x0; 0xf)) R2_w=scalar(id=6,umax=8,var_off=(0x0; 0xf)) R6=ctx(off=0,imm=0) R10=fp0
        16: (27) r2 *= 28                     ; R2_w=scalar(umax=224,var_off=(0x0; 0xfc))
        17: (18) r3 = 0xffffc9000048e118      ; R3_w=map_value(off=280,ks=4,vs=564,imm=0)
        19: (0f) r3 += r2
        20: (61) r2 = *(u32 *)(r3 +0)         ; R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R3_w=map_value(off=280,ks=4,vs=564,umax=224,var_off=(0x0; 0xfc),s32_max=252,u32_max=252)
        ...
      
      and eventually the bpf program can be verified successfully.
      
        [1] https://reviews.llvm.org/D147968Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20230417222134.359714-1-yhs@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      3be49f79
    • Sean Young's avatar
      bpf: lirc program type should not require SYS_CAP_ADMIN · 69a8c792
      Sean Young authored
      Make it possible to load lirc program type with just CAP_BPF. There is
      nothing exceptional about lirc programs that means they require
      SYS_CAP_ADMIN.
      
      In order to attach or detach a lirc program type you need permission to
      open /dev/lirc0; if you have permission to do that, you can alter all
      sorts of lirc receiving options. Changing the IR protocol decoder is no
      different.
      
      Right now on a typical distribution /dev/lirc devices are only
      read/write by root. Ideally we would make them group read/write like
      other devices so that local users can use them without becoming root.
      Signed-off-by: default avatarSean Young <sean@mess.org>
      Link: https://lore.kernel.org/r/ZD0ArKpwnDBJZsrE@gofer.mess.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      69a8c792
    • Daniel Borkmann's avatar
      bpf: Set skb redirect and from_ingress info in __bpf_tx_skb · 59e498a3
      Daniel Borkmann authored
      There are some use-cases where it is desirable to use bpf_redirect()
      in combination with ifb device, which currently is not supported, for
      example, around filtering inbound traffic with BPF to then push it to
      ifb which holds the qdisc for shaping in contrast to doing that on the
      egress device.
      
      Toke mentions the following case related to OpenWrt:
      
         Because there's not always a single egress on the other side. These are
         mainly home routers, which tend to have one or more WiFi devices bridged
         to one or more ethernet ports on the LAN side, and a single upstream WAN
         port. And the objective is to control the total amount of traffic going
         over the WAN link (in both directions), to deal with bufferbloat in the
         ISP network (which is sadly still all too prevalent).
      
         In this setup, the traffic can be split arbitrarily between the links
         on the LAN side, and the only "single bottleneck" is the WAN link. So we
         install both egress and ingress shapers on this, configured to something
         like 95-98% of the true link bandwidth, thus moving the queues into the
         qdisc layer in the router. It's usually necessary to set the ingress
         bandwidth shaper a bit lower than the egress due to being "downstream"
         of the bottleneck link, but it does work surprisingly well.
      
         We usually use something like a matchall filter to put all ingress
         traffic on the ifb, so doing the redirect from BPF has not been an
         immediate requirement thus far. However, it does seem a bit odd that
         this is not possible, and we do have a BPF-based filter that layers on
         top of this kind of setup, which currently uses u32 as the ingress
         filter and so it could presumably be improved to use BPF instead if
         that was available.
      Reported-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Reported-by: default avatarYafang Shao <laoar.shao@gmail.com>
      Reported-by: default avatarTonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarYafang Shao <laoar.shao@gmail.com>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://git.openwrt.org/?p=project/qosify.git;a=blob;f=README
      Link: https://lore.kernel.org/bpf/875y9yzbuy.fsf@toke.dk
      Link: https://lore.kernel.org/r/8cebc8b2b6e967e10cbafe2ffd6795050e74accd.1681739137.git.daniel@iogearbox.netSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      59e498a3
  5. 16 Apr, 2023 14 commits
    • Alexei Starovoitov's avatar
      Merge branch 'Remove KF_KPTR_GET kfunc flag' · d40f4f68
      Alexei Starovoitov authored
      David Vernet says:
      
      ====================
      
      We've managed to improve the UX for kptrs significantly over the last 9
      months. All of the existing use cases which previously had KF_KPTR_GET
      kfuncs (struct bpf_cpumask *, struct task_struct *, and struct cgroup *)
      have all been updated to be synchronized using RCU. In other words,
      their KF_KPTR_GET kfuncs have been removed in favor of KF_RCU |
      KF_ACQUIRE kfuncs, with the pointers themselves also being readable from
      maps in an RCU read region thanks to the types being RCU safe.
      
      While KF_KPTR_GET was a logical starting point for kptrs, it's become
      clear that they're not the correct abstraction. KF_KPTR_GET is a flag
      that essentially does nothing other than enforcing that the argument to
      a function is a pointer to a referenced kptr map value. At first glance,
      that's a useful thing to guarantee to a kfunc. It gives kfuncs the
      ability to try and acquire a reference on that kptr without requiring
      the BPF prog to do something like this:
      
      struct kptr_type *in_map, *new = NULL;
      
      in_map = bpf_kptr_xchg(&map->value, NULL);
      if (in_map) {
      	new = bpf_kptr_type_acquire(in_map);
      	in_map = bpf_kptr_xchg(&map->value, in_map);
      	if (in_map)
      		bpf_kptr_type_release(in_map);
      }
      
      That's clearly a pretty ugly (and racy) UX, and if using KF_KPTR_GET is
      the only alternative, it's better than nothing. However, the problem
      with any KF_KPTR_GET kfunc lies in the fact that it always requires some
      kind of synchronization in order to safely do an opportunistic acquire
      of the kptr in the map. This is because a BPF program running on another
      CPU could do a bpf_kptr_xchg() on that map value, and free the kptr
      after it's been read by the KF_KPTR_GET kfunc. For example, the
      now-removed bpf_task_kptr_get() kfunc did the following:
      
      struct task_struct *bpf_task_kptr_get(struct task_struct **pp)
      {
      	    struct task_struct *p;
      
      	rcu_read_lock();
      	p = READ_ONCE(*pp);
      	/* If p is non-NULL, it could still be freed by another CPU,
       	 * so we have to do an opportunistic refcount_inc_not_zero()
      	 * and return NULL if the task will be freed after the
      	 * current RCU read region.
      	 */
      	|f (p && !refcount_inc_not_zero(&p->rcu_users))
      		p = NULL;
      	rcu_read_unlock();
      
      	return p;
      }
      
      In other words, the kfunc uses RCU to ensure that the task remains valid
      after it's been peeked from the map. However, this is completely
      redundant with just defining a KF_RCU kfunc that itself does a
      refcount_inc_not_zero(), which is exactly what bpf_task_acquire() now
      does.
      
      So, the question of whether KF_KPTR_GET is useful is actually, "Are
      there any synchronization mechanisms / safety flags that are required by
      certain kptrs, but which are not provided by the verifier to kfuncs?"
      The answer to that question today is "No", because every kptr we
      currently care about is RCU protected.
      
      Even if the answer ever became "yes", the proper way to support that
      referenced kptr type would be to add support for whatever
      synchronization mechanism it requires in the verifier, rather than
      giving kfuncs a flag that says, "Here's a pointer to a referenced kptr
      in a map, do whatever you need to do."
      
      With all that said -- so as to allow us to consolidate the kfunc API,
      and simplify the verifier, this patchset removes the KF_KPTR_GET kfunc
      flag.
      ---
      
      This is v2 of this patchset
      
      v1: https://lore.kernel.org/all/20230415103231.236063-1-void@manifault.com/
      
      Changelog:
      ----------
      
      v1 -> v2:
      - Fix KF_RU -> KF_RCU typo in commit summary for patch 2/3, and in cover
        letter (Alexei)
      - In order to reduce churn, don't shift all KF_* flags down by 1. We'll
        just fill the now-empty slot the next time we add a flag (Alexei)
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d40f4f68
    • David Vernet's avatar
      bpf,docs: Remove KF_KPTR_GET from documentation · 530474e6
      David Vernet authored
      A prior patch removed KF_KPTR_GET from the kernel. Now that it's no
      longer accessible to kfunc authors, this patch removes it from the BPF
      kfunc documentation.
      Signed-off-by: default avatarDavid Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20230416084928.326135-4-void@manifault.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      530474e6
    • David Vernet's avatar
      bpf: Remove KF_KPTR_GET kfunc flag · 7b4ddf39
      David Vernet authored
      We've managed to improve the UX for kptrs significantly over the last 9
      months. All of the existing use cases which previously had KF_KPTR_GET
      kfuncs (struct bpf_cpumask *, struct task_struct *, and struct cgroup *)
      have all been updated to be synchronized using RCU. In other words,
      their KF_KPTR_GET kfuncs have been removed in favor of KF_RCU |
      KF_ACQUIRE kfuncs, with the pointers themselves also being readable from
      maps in an RCU read region thanks to the types being RCU safe.
      
      While KF_KPTR_GET was a logical starting point for kptrs, it's become
      clear that they're not the correct abstraction. KF_KPTR_GET is a flag
      that essentially does nothing other than enforcing that the argument to
      a function is a pointer to a referenced kptr map value. At first glance,
      that's a useful thing to guarantee to a kfunc. It gives kfuncs the
      ability to try and acquire a reference on that kptr without requiring
      the BPF prog to do something like this:
      
      struct kptr_type *in_map, *new = NULL;
      
      in_map = bpf_kptr_xchg(&map->value, NULL);
      if (in_map) {
              new = bpf_kptr_type_acquire(in_map);
              in_map = bpf_kptr_xchg(&map->value, in_map);
              if (in_map)
                      bpf_kptr_type_release(in_map);
      }
      
      That's clearly a pretty ugly (and racy) UX, and if using KF_KPTR_GET is
      the only alternative, it's better than nothing. However, the problem
      with any KF_KPTR_GET kfunc lies in the fact that it always requires some
      kind of synchronization in order to safely do an opportunistic acquire
      of the kptr in the map. This is because a BPF program running on another
      CPU could do a bpf_kptr_xchg() on that map value, and free the kptr
      after it's been read by the KF_KPTR_GET kfunc. For example, the
      now-removed bpf_task_kptr_get() kfunc did the following:
      
      struct task_struct *bpf_task_kptr_get(struct task_struct **pp)
      {
                  struct task_struct *p;
      
              rcu_read_lock();
              p = READ_ONCE(*pp);
              /* If p is non-NULL, it could still be freed by another CPU,
               * so we have to do an opportunistic refcount_inc_not_zero()
               * and return NULL if the task will be freed after the
               * current RCU read region.
               */
              |f (p && !refcount_inc_not_zero(&p->rcu_users))
                      p = NULL;
              rcu_read_unlock();
      
              return p;
      }
      
      In other words, the kfunc uses RCU to ensure that the task remains valid
      after it's been peeked from the map. However, this is completely
      redundant with just defining a KF_RCU kfunc that itself does a
      refcount_inc_not_zero(), which is exactly what bpf_task_acquire() now
      does.
      
      So, the question of whether KF_KPTR_GET is useful is actually, "Are
      there any synchronization mechanisms / safety flags that are required by
      certain kptrs, but which are not provided by the verifier to kfuncs?"
      The answer to that question today is "No", because every kptr we
      currently care about is RCU protected.
      
      Even if the answer ever became "yes", the proper way to support that
      referenced kptr type would be to add support for whatever
      synchronization mechanism it requires in the verifier, rather than
      giving kfuncs a flag that says, "Here's a pointer to a referenced kptr
      in a map, do whatever you need to do."
      
      With all that said -- so as to allow us to consolidate the kfunc API,
      and simplify the verifier a bit, this patch removes KF_KPTR_GET, and all
      relevant logic from the verifier.
      Signed-off-by: default avatarDavid Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20230416084928.326135-3-void@manifault.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      7b4ddf39
    • David Vernet's avatar
      bpf: Remove bpf_kfunc_call_test_kptr_get() test kfunc · 09b501d9
      David Vernet authored
      We've managed to improve the UX for kptrs significantly over the last 9
      months. All of the prior main use cases, struct bpf_cpumask *, struct
      task_struct *, and struct cgroup *, have all been updated to be
      synchronized mainly using RCU. In other words, their KF_ACQUIRE kfunc
      calls are all KF_RCU, and the pointers themselves are MEM_RCU and can be
      accessed in an RCU read region in BPF.
      
      In a follow-on change, we'll be removing the KF_KPTR_GET kfunc flag.
      This patch prepares for that by removing the
      bpf_kfunc_call_test_kptr_get() kfunc, and all associated selftests.
      Signed-off-by: default avatarDavid Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20230416084928.326135-2-void@manifault.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      09b501d9
    • Alexei Starovoitov's avatar
      Merge branch 'Shared ownership for local kptrs' · 7a0788fe
      Alexei Starovoitov authored
      Dave Marchevsky says:
      
      ====================
      
      This series adds support for refcounted local kptrs to the verifier. A local
      kptr is 'refcounted' if its type contains a struct bpf_refcount field:
      
        struct refcounted_node {
          long data;
          struct bpf_list_node ll;
          struct bpf_refcount ref;
        };
      
      bpf_refcount is used to implement shared ownership for local kptrs.
      
      Motivating usecase
      ==================
      
      If a struct has two collection node fields, e.g.:
      
        struct node {
          long key;
          long val;
          struct bpf_rb_node rb;
          struct bpf_list_node ll;
        };
      
      It's not currently possible to add a node to both the list and rbtree:
      
        long bpf_prog(void *ctx)
        {
          struct node *n = bpf_obj_new(typeof(*n));
          if (!n) { /* ... */ }
      
          bpf_spin_lock(&lock);
      
          bpf_list_push_back(&head, &n->ll);
          bpf_rbtree_add(&root, &n->rb, less); /* Assume a resonable less() */
          bpf_spin_unlock(&lock);
        }
      
      The above program will fail verification due to current owning / non-owning ref
      logic: after bpf_list_push_back, n is a non-owning reference and thus cannot be
      passed to bpf_rbtree_add. The only way to get an owning reference for the node
      that was added is to bpf_list_pop_{front,back} it.
      
      More generally, verifier ownership semantics expect that a node has one
      owner (program, collection, or stashed in map) with exclusive ownership
      of the node's lifetime. The owner free's the node's underlying memory when it
      itself goes away.
      
      Without a shared ownership concept it's impossible to express many real-world
      usecases such that they pass verification.
      
      Semantic Changes
      ================
      
      Before this series, the verifier could make this statement: "whoever has the
      owning reference has exclusive ownership of the referent's lifetime". As
      demonstrated in the previous section, this implies that a BPF program can't
      have an owning reference to some node if that node is in a collection. If
      such a state were possible, the node would have multiple owners, each thinking
      they have exclusive ownership. In order to support shared ownership it's
      necessary to modify the exclusive ownership semantic.
      
      After this series' changes, an owning reference has ownership of the referent's
      lifetime, but it's not necessarily exclusive. The referent's underlying memory
      is guaranteed to be valid (i.e. not free'd) until the reference is dropped or
      used for collection insert.
      
      This change doesn't affect UX of owning or non-owning references much:
      
        * insert kfuncs (bpf_rbtree_add, bpf_list_push_{front,back}) still require
          an owning reference arg, as ownership still must be passed to the
          collection in a shared-ownership world.
      
        * non-owning references still refer to valid memory without claiming
          any ownership.
      
      One important conclusion that followed from "exclusive ownership" statement
      is no longer valid, though. In exclusive-ownership world, if a BPF prog has
      an owning reference to a node, the verifier can conclude that no collection has
      ownership of it. This conclusion was used to avoid runtime checking in the
      implementations of insert and remove operations (""has the node already been
      {inserted, removed}?").
      
      In a shared-ownership world the aforementioned conclusion is no longer valid,
      which necessitates doing runtime checking in insert and remove operation
      kfuncs, and those functions possibly failing to insert or remove anything.
      
      Luckily the verifier changes necessary to go from exclusive to shared ownership
      were fairly minimal. Patches in this series which do change verifier semantics
      generally have some summary dedicated to explaining why certain usecases
      Just Work for shared ownership without verifier changes.
      
      Implementation
      ==============
      
      The changes in this series can be categorized as follows:
      
        * struct bpf_refcount opaque field + plumbing
        * support for refcounted kptrs in bpf_obj_new and bpf_obj_drop
        * bpf_refcount_acquire kfunc
          * enables shared ownershp by bumping refcount + acquiring owning ref
        * support for possibly-failing collection insertion and removal
          * insertion changes are more complex
      
      If a patch's changes have some nuance to their effect - or lack of effect - on
      verifier behavior, the patch summary talks about it at length.
      
      Patch contents:
        * Patch 1 removes btf_field_offs struct
        * Patch 2 adds struct bpf_refcount and associated plumbing
        * Patch 3 modifies semantics of bpf_obj_drop and bpf_obj_new to handle
          refcounted kptrs
        * Patch 4 adds bpf_refcount_acquire
        * Patches 5-7 add support for possibly-failing collection insert and remove
        * Patch 8 centralizes constructor-like functionality for local kptr types
        * Patch 9 adds tests for new functionality
      
      base-commit: 4a1e885c
      
      Changelog:
      
      v1 -> v2: lore.kernel.org/bpf/20230410190753.2012798-1-davemarchevsky@fb.com
      
      Patch #s used below refer to the patch's position in v1 unless otherwise
      specified.
      
        * General
          * Rebase onto latest bpf-next (base-commit updated above)
      
        * Patch 4 - "bpf: Add bpf_refcount_acquire kfunc"
          * Fix typo in summary (Alexei)
        * Patch 7 - "Migrate bpf_rbtree_remove to possibly fail"
          * Modify a paragraph in patch summary to more clearly state that only
            bpf_rbtree_remove's non-owning ref clobbering behavior is changed by the
            patch (Alexei)
          * refcount_off == -1 -> refcount_off < 0  in "node type w/ both list
            and rb_node fields" check, since any negative value means "no
            bpf_refcount field found", and furthermore refcount_off is never
            explicitly set to -1, but rather -EINVAL. (Alexei)
          * Instead of just changing "btf: list_node and rb_node in same struct" test
            expectation to pass instead of fail, do some refactoring to test both
            "list_node, rb_node, and bpf_refcount" (success) and "list_node, rb_node,
            _no_ bpf_refcount" (failure) cases. This ensures that logic change in
            previous bullet point is correct.
            * v1's "btf: list_node and rb_node in same struct" test changes didn't
              add bpf_refcount, so the fact that btf load succeeded w/ list and
              rb_nodes but no bpf_refcount field is further proof that this logic
              was incorrect in v1.
        * Patch 8 - "bpf: Centralize btf_field-specific initialization logic"
          * Instead of doing __init_field_infer_size in kfuncs when taking
            bpf_list_head type input which might've been 0-initialized in map, go
            back to simple oneliner initialization. Add short comment explaining why
            this is necessary. (Alexei)
        * Patch 9 - "selftests/bpf: Add refcounted_kptr tests"
          * Don't __always_inline helper fns in progs/refcounted_kptr.c (Alexei)
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      7a0788fe
    • Dave Marchevsky's avatar
      selftests/bpf: Add refcounted_kptr tests · 6147f151
      Dave Marchevsky authored
      Test refcounted local kptr functionality added in previous patches in
      the series.
      
      Usecases which pass verification:
      
      * Add refcounted local kptr to both tree and list. Then, read and -
        possibly, depending on test variant - delete from tree, then list.
        * Also test doing read-and-maybe-delete in opposite order
      * Stash a refcounted local kptr in a map_value, then add it to a
        rbtree. Read from both, possibly deleting after tree read.
      * Add refcounted local kptr to both tree and list. Then, try reading and
        deleting twice from one of the collections.
      * bpf_refcount_acquire of just-added non-owning ref should work, as
        should bpf_refcount_acquire of owning ref just out of bpf_obj_new
      
      Usecases which fail verification:
      
      * The simple successful bpf_refcount_acquire cases from above should
        both fail to verify if the newly-acquired owning ref is not dropped
      Signed-off-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-10-davemarchevsky@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      6147f151
    • Dave Marchevsky's avatar
      bpf: Centralize btf_field-specific initialization logic · 3e81740a
      Dave Marchevsky authored
      All btf_fields in an object are 0-initialized by memset in
      bpf_obj_init. This might not be a valid initial state for some field
      types, in which case kfuncs that use the type will properly initialize
      their input if it's been 0-initialized. Some BPF graph collection types
      and kfuncs do this: bpf_list_{head,node} and bpf_rb_node.
      
      An earlier patch in this series added the bpf_refcount field, for which
      the 0 state indicates that the refcounted object should be free'd.
      bpf_obj_init treats this field specially, setting refcount to 1 instead
      of relying on scattered "refcount is 0? Must have just been initialized,
      let's set to 1" logic in kfuncs.
      
      This patch extends this treatment to list and rbtree field types,
      allowing most scattered initialization logic in kfuncs to be removed.
      
      Note that bpf_{list_head,rb_root} may be inside a BPF map, in which case
      they'll be 0-initialized without passing through the newly-added logic,
      so scattered initialization logic must remain for these collection root
      types.
      Signed-off-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-9-davemarchevsky@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      3e81740a
    • Dave Marchevsky's avatar
      bpf: Migrate bpf_rbtree_remove to possibly fail · 404ad75a
      Dave Marchevsky authored
      This patch modifies bpf_rbtree_remove to account for possible failure
      due to the input rb_node already not being in any collection.
      The function can now return NULL, and does when the aforementioned
      scenario occurs. As before, on successful removal an owning reference to
      the removed node is returned.
      
      Adding KF_RET_NULL to bpf_rbtree_remove's kfunc flags - now KF_RET_NULL |
      KF_ACQUIRE - provides the desired verifier semantics:
      
        * retval must be checked for NULL before use
        * if NULL, retval's ref_obj_id is released
        * retval is a "maybe acquired" owning ref, not a non-owning ref,
          so it will live past end of critical section (bpf_spin_unlock), and
          thus can be checked for NULL after the end of the CS
      
      BPF programs must add checks
      ============================
      
      This does change bpf_rbtree_remove's verifier behavior. BPF program
      writers will need to add NULL checks to their programs, but the
      resulting UX looks natural:
      
        bpf_spin_lock(&glock);
      
        n = bpf_rbtree_first(&ghead);
        if (!n) { /* ... */}
        res = bpf_rbtree_remove(&ghead, &n->node);
      
        bpf_spin_unlock(&glock);
      
        if (!res)  /* Newly-added check after this patch */
          return 1;
      
        n = container_of(res, /* ... */);
        /* Do something else with n */
        bpf_obj_drop(n);
        return 0;
      
      The "if (!res)" check above is the only addition necessary for the above
      program to pass verification after this patch.
      
      bpf_rbtree_remove no longer clobbers non-owning refs
      ====================================================
      
      An issue arises when bpf_rbtree_remove fails, though. Consider this
      example:
      
        struct node_data {
          long key;
          struct bpf_list_node l;
          struct bpf_rb_node r;
          struct bpf_refcount ref;
        };
      
        long failed_sum;
      
        void bpf_prog()
        {
          struct node_data *n = bpf_obj_new(/* ... */);
          struct bpf_rb_node *res;
          n->key = 10;
      
          bpf_spin_lock(&glock);
      
          bpf_list_push_back(&some_list, &n->l); /* n is now a non-owning ref */
          res = bpf_rbtree_remove(&some_tree, &n->r, /* ... */);
          if (!res)
            failed_sum += n->key;  /* not possible */
      
          bpf_spin_unlock(&glock);
          /* if (res) { do something useful and drop } ... */
        }
      
      The bpf_rbtree_remove in this example will always fail. Similarly to
      bpf_spin_unlock, bpf_rbtree_remove is a non-owning reference
      invalidation point. The verifier clobbers all non-owning refs after a
      bpf_rbtree_remove call, so the "failed_sum += n->key" line will fail
      verification, and in fact there's no good way to get information about
      the node which failed to add after the invalidation. This patch removes
      non-owning reference invalidation from bpf_rbtree_remove to allow the
      above usecase to pass verification. The logic for why this is now
      possible is as follows:
      
      Before this series, bpf_rbtree_add couldn't fail and thus assumed that
      its input, a non-owning reference, was in the tree. But it's easy to
      construct an example where two non-owning references pointing to the same
      underlying memory are acquired and passed to rbtree_remove one after
      another (see rbtree_api_release_aliasing in
      selftests/bpf/progs/rbtree_fail.c).
      
      So it was necessary to clobber non-owning refs to prevent this
      case and, more generally, to enforce "non-owning ref is definitely
      in some collection" invariant. This series removes that invariant and
      the failure / runtime checking added in this patch provide a clean way
      to deal with the aliasing issue - just fail to remove.
      
      Because the aliasing issue prevented by clobbering non-owning refs is no
      longer an issue, this patch removes the invalidate_non_owning_refs
      call from verifier handling of bpf_rbtree_remove. Note that
      bpf_spin_unlock - the other caller of invalidate_non_owning_refs -
      clobbers non-owning refs for a different reason, so its clobbering
      behavior remains unchanged.
      
      No BPF program changes are necessary for programs to remain valid as a
      result of this clobbering change. A valid program before this patch
      passed verification with its non-owning refs having shorter (or equal)
      lifetimes due to more aggressive clobbering.
      
      Also, update existing tests to check bpf_rbtree_remove retval for NULL
      where necessary, and move rbtree_api_release_aliasing from
      progs/rbtree_fail.c to progs/rbtree.c since it's now expected to pass
      verification.
      Signed-off-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-8-davemarchevsky@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      404ad75a
    • Dave Marchevsky's avatar
      selftests/bpf: Modify linked_list tests to work with macro-ified inserts · de67ba39
      Dave Marchevsky authored
      The linked_list tests use macros and function pointers to reduce code
      duplication. Earlier in the series, bpf_list_push_{front,back} were
      modified to be macros, expanding to invoke actual kfuncs
      bpf_list_push_{front,back}_impl. Due to this change, a code snippet
      like:
      
        void (*p)(void *, void *) = (void *)&bpf_list_##op;
        p(hexpr, nexpr);
      
      meant to do bpf_list_push_{front,back}(hexpr, nexpr), will no longer
      work as it's no longer valid to do &bpf_list_push_{front,back} since
      they're no longer functions.
      
      This patch fixes issues of this type, along with two other minor changes
      - one improvement and one fix - both related to the node argument to
      list_push_{front,back}.
      
        * The fix: migration of list_push tests away from (void *, void *)
          func ptr uncovered that some tests were incorrectly passing pointer
          to node, not pointer to struct bpf_list_node within the node. This
          patch fixes such issues (CHECK(..., f) -> CHECK(..., &f->node))
      
        * The improvement: In linked_list tests, the struct foo type has two
          list_node fields: node and node2, at byte offsets 0 and 40 within
          the struct, respectively. Currently node is used in ~all tests
          involving struct foo and lists. The verifier needs to do some work
          to account for the offset of bpf_list_node within the node type, so
          using node2 instead of node exercises that logic more in the tests.
          This patch migrates linked_list tests to use node2 instead of node.
      Signed-off-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-7-davemarchevsky@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      de67ba39
    • Dave Marchevsky's avatar
      bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail · d2dcc67d
      Dave Marchevsky authored
      Consider this code snippet:
      
        struct node {
          long key;
          bpf_list_node l;
          bpf_rb_node r;
          bpf_refcount ref;
        }
      
        int some_bpf_prog(void *ctx)
        {
          struct node *n = bpf_obj_new(/*...*/), *m;
      
          bpf_spin_lock(&glock);
      
          bpf_rbtree_add(&some_tree, &n->r, /* ... */);
          m = bpf_refcount_acquire(n);
          bpf_rbtree_add(&other_tree, &m->r, /* ... */);
      
          bpf_spin_unlock(&glock);
      
          /* ... */
        }
      
      After bpf_refcount_acquire, n and m point to the same underlying memory,
      and that node's bpf_rb_node field is being used by the some_tree insert,
      so overwriting it as a result of the second insert is an error. In order
      to properly support refcounted nodes, the rbtree and list insert
      functions must be allowed to fail. This patch adds such support.
      
      The kfuncs bpf_rbtree_add, bpf_list_push_{front,back} are modified to
      return an int indicating success/failure, with 0 -> success, nonzero ->
      failure.
      
      bpf_obj_drop on failure
      =======================
      
      Currently the only reason an insert can fail is the example above: the
      bpf_{list,rb}_node is already in use. When such a failure occurs, the
      insert kfuncs will bpf_obj_drop the input node. This allows the insert
      operations to logically fail without changing their verifier owning ref
      behavior, namely the unconditional release_reference of the input
      owning ref.
      
      With insert that always succeeds, ownership of the node is always passed
      to the collection, since the node always ends up in the collection.
      
      With a possibly-failed insert w/ bpf_obj_drop, ownership of the node
      is always passed either to the collection (success), or to bpf_obj_drop
      (failure). Regardless, it's correct to continue unconditionally
      releasing the input owning ref, as something is always taking ownership
      from the calling program on insert.
      
      Keeping owning ref behavior unchanged results in a nice default UX for
      insert functions that can fail. If the program's reaction to a failed
      insert is "fine, just get rid of this owning ref for me and let me go
      on with my business", then there's no reason to check for failure since
      that's default behavior. e.g.:
      
        long important_failures = 0;
      
        int some_bpf_prog(void *ctx)
        {
          struct node *n, *m, *o; /* all bpf_obj_new'd */
      
          bpf_spin_lock(&glock);
          bpf_rbtree_add(&some_tree, &n->node, /* ... */);
          bpf_rbtree_add(&some_tree, &m->node, /* ... */);
          if (bpf_rbtree_add(&some_tree, &o->node, /* ... */)) {
            important_failures++;
          }
          bpf_spin_unlock(&glock);
        }
      
      If we instead chose to pass ownership back to the program on failed
      insert - by returning NULL on success or an owning ref on failure -
      programs would always have to do something with the returned ref on
      failure. The most likely action is probably "I'll just get rid of this
      owning ref and go about my business", which ideally would look like:
      
        if (n = bpf_rbtree_add(&some_tree, &n->node, /* ... */))
          bpf_obj_drop(n);
      
      But bpf_obj_drop isn't allowed in a critical section and inserts must
      occur within one, so in reality error handling would become a
      hard-to-parse mess.
      
      For refcounted nodes, we can replicate the "pass ownership back to
      program on failure" logic with this patch's semantics, albeit in an ugly
      way:
      
        struct node *n = bpf_obj_new(/* ... */), *m;
      
        bpf_spin_lock(&glock);
      
        m = bpf_refcount_acquire(n);
        if (bpf_rbtree_add(&some_tree, &n->node, /* ... */)) {
          /* Do something with m */
        }
      
        bpf_spin_unlock(&glock);
        bpf_obj_drop(m);
      
      bpf_refcount_acquire is used to simulate "return owning ref on failure".
      This should be an uncommon occurrence, though.
      
      Addition of two verifier-fixup'd args to collection inserts
      ===========================================================
      
      The actual bpf_obj_drop kfunc is
      bpf_obj_drop_impl(void *, struct btf_struct_meta *), with bpf_obj_drop
      macro populating the second arg with 0 and the verifier later filling in
      the arg during insn fixup.
      
      Because bpf_rbtree_add and bpf_list_push_{front,back} now might do
      bpf_obj_drop, these kfuncs need a btf_struct_meta parameter that can be
      passed to bpf_obj_drop_impl.
      
      Similarly, because the 'node' param to those insert functions is the
      bpf_{list,rb}_node within the node type, and bpf_obj_drop expects a
      pointer to the beginning of the node, the insert functions need to be
      able to find the beginning of the node struct. A second
      verifier-populated param is necessary: the offset of {list,rb}_node within the
      node type.
      
      These two new params allow the insert kfuncs to correctly call
      __bpf_obj_drop_impl:
      
        beginning_of_node = bpf_rb_node_ptr - offset
        if (already_inserted)
          __bpf_obj_drop_impl(beginning_of_node, btf_struct_meta->record);
      
      Similarly to other kfuncs with "hidden" verifier-populated params, the
      insert functions are renamed with _impl prefix and a macro is provided
      for common usage. For example, bpf_rbtree_add kfunc is now
      bpf_rbtree_add_impl and bpf_rbtree_add is now a macro which sets
      "hidden" args to 0.
      
      Due to the two new args BPF progs will need to be recompiled to work
      with the new _impl kfuncs.
      
      This patch also rewrites the "hidden argument" explanation to more
      directly say why the BPF program writer doesn't need to populate the
      arguments with anything meaningful.
      
      How does this new logic affect non-owning references?
      =====================================================
      
      Currently, non-owning refs are valid until the end of the critical
      section in which they're created. We can make this guarantee because, if
      a non-owning ref exists, the referent was added to some collection. The
      collection will drop() its nodes when it goes away, but it can't go away
      while our program is accessing it, so that's not a problem. If the
      referent is removed from the collection in the same CS that it was added
      in, it can't be bpf_obj_drop'd until after CS end. Those are the only
      two ways to free the referent's memory and neither can happen until
      after the non-owning ref's lifetime ends.
      
      On first glance, having these collection insert functions potentially
      bpf_obj_drop their input seems like it breaks the "can't be
      bpf_obj_drop'd until after CS end" line of reasoning. But we care about
      the memory not being _freed_ until end of CS end, and a previous patch
      in the series modified bpf_obj_drop such that it doesn't free refcounted
      nodes until refcount == 0. So the statement can be more accurately
      rewritten as "can't be free'd until after CS end".
      
      We can prove that this rewritten statement holds for any non-owning
      reference produced by collection insert functions:
      
      * If the input to the insert function is _not_ refcounted
        * We have an owning reference to the input, and can conclude it isn't
          in any collection
          * Inserting a node in a collection turns owning refs into
            non-owning, and since our input type isn't refcounted, there's no
            way to obtain additional owning refs to the same underlying
            memory
        * Because our node isn't in any collection, the insert operation
          cannot fail, so bpf_obj_drop will not execute
        * If bpf_obj_drop is guaranteed not to execute, there's no risk of
          memory being free'd
      
      * Otherwise, the input to the insert function is refcounted
        * If the insert operation fails due to the node's list_head or rb_root
          already being in some collection, there was some previous successful
          insert which passed refcount to the collection
        * We have an owning reference to the input, it must have been
          acquired via bpf_refcount_acquire, which bumped the refcount
        * refcount must be >= 2 since there's a valid owning reference and the
          node is already in a collection
        * Insert triggering bpf_obj_drop will decr refcount to >= 1, never
          resulting in a free
      
      So although we may do bpf_obj_drop during the critical section, this
      will never result in memory being free'd, and no changes to non-owning
      ref logic are needed in this patch.
      Signed-off-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-6-davemarchevsky@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d2dcc67d
    • Dave Marchevsky's avatar
      bpf: Add bpf_refcount_acquire kfunc · 7c50b1cb
      Dave Marchevsky authored
      Currently, BPF programs can interact with the lifetime of refcounted
      local kptrs in the following ways:
      
        bpf_obj_new  - Initialize refcount to 1 as part of new object creation
        bpf_obj_drop - Decrement refcount and free object if it's 0
        collection add - Pass ownership to the collection. No change to
                         refcount but collection is responsible for
      		   bpf_obj_dropping it
      
      In order to be able to add a refcounted local kptr to multiple
      collections we need to be able to increment the refcount and acquire a
      new owning reference. This patch adds a kfunc, bpf_refcount_acquire,
      implementing such an operation.
      
      bpf_refcount_acquire takes a refcounted local kptr and returns a new
      owning reference to the same underlying memory as the input. The input
      can be either owning or non-owning. To reinforce why this is safe,
      consider the following code snippets:
      
        struct node *n = bpf_obj_new(typeof(*n)); // A
        struct node *m = bpf_refcount_acquire(n); // B
      
      In the above snippet, n will be alive with refcount=1 after (A), and
      since nothing changes that state before (B), it's obviously safe. If
      n is instead added to some rbtree, we can still safely refcount_acquire
      it:
      
        struct node *n = bpf_obj_new(typeof(*n));
        struct node *m;
      
        bpf_spin_lock(&glock);
        bpf_rbtree_add(&groot, &n->node, less);   // A
        m = bpf_refcount_acquire(n);              // B
        bpf_spin_unlock(&glock);
      
      In the above snippet, after (A) n is a non-owning reference, and after
      (B) m is an owning reference pointing to the same memory as n. Although
      n has no ownership of that memory's lifetime, it's guaranteed to be
      alive until the end of the critical section, and n would be clobbered if
      we were past the end of the critical section, so it's safe to bump
      refcount.
      
      Implementation details:
      
      * From verifier's perspective, bpf_refcount_acquire handling is similar
        to bpf_obj_new and bpf_obj_drop. Like the former, it returns a new
        owning reference matching input type, although like the latter, type
        can be inferred from concrete kptr input. Verifier changes in
        {check,fixup}_kfunc_call and check_kfunc_args are largely copied from
        aforementioned functions' verifier changes.
      
      * An exception to the above is the new KF_ARG_PTR_TO_REFCOUNTED_KPTR
        arg, indicated by new "__refcounted_kptr" kfunc arg suffix. This is
        necessary in order to handle both owning and non-owning input without
        adding special-casing to "__alloc" arg handling. Also a convenient
        place to confirm that input type has bpf_refcount field.
      
      * The implemented kfunc is actually bpf_refcount_acquire_impl, with
        'hidden' second arg that the verifier sets to the type's struct_meta
        in fixup_kfunc_call.
      Signed-off-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-5-davemarchevsky@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      7c50b1cb
    • Dave Marchevsky's avatar
      bpf: Support refcounted local kptrs in existing semantics · 1512217c
      Dave Marchevsky authored
      A local kptr is considered 'refcounted' when it is of a type that has a
      bpf_refcount field. When such a kptr is created, its refcount should be
      initialized to 1; when destroyed, the object should be free'd only if a
      refcount decr results in 0 refcount.
      
      Existing logic always frees the underlying memory when destroying a
      local kptr, and 0-initializes all btf_record fields. This patch adds
      checks for "is local kptr refcounted?" and new logic for that case in
      the appropriate places.
      
      This patch focuses on changing existing semantics and thus conspicuously
      does _not_ provide a way for BPF programs in increment refcount. That
      follows later in the series.
      
      __bpf_obj_drop_impl is modified to do the right thing when it sees a
      refcounted type. Container types for graph nodes (list, tree, stashed in
      map) are migrated to use __bpf_obj_drop_impl as a destructor for their
      nodes instead of each having custom destruction code in their _free
      paths. Now that "drop" isn't a synonym for "free" when the type is
      refcounted it makes sense to centralize this logic.
      Signed-off-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-4-davemarchevsky@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      1512217c
    • Dave Marchevsky's avatar
      bpf: Introduce opaque bpf_refcount struct and add btf_record plumbing · d54730b5
      Dave Marchevsky authored
      A 'struct bpf_refcount' is added to the set of opaque uapi/bpf.h types
      meant for use in BPF programs. Similarly to other opaque types like
      bpf_spin_lock and bpf_rbtree_node, the verifier needs to know where in
      user-defined struct types a bpf_refcount can be located, so necessary
      btf_record plumbing is added to enable this. bpf_refcount is sized to
      hold a refcount_t.
      
      Similarly to bpf_spin_lock, the offset of a bpf_refcount is cached in
      btf_record as refcount_off in addition to being in the field array.
      Caching refcount_off makes sense for this field because further patches
      in the series will modify functions that take local kptrs (e.g.
      bpf_obj_drop) to change their behavior if the type they're operating on
      is refcounted. So enabling fast "is this type refcounted?" checks is
      desirable.
      
      No such verifier behavior changes are introduced in this patch, just
      logic to recognize 'struct bpf_refcount' in btf_record.
      Signed-off-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-3-davemarchevsky@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d54730b5
    • Dave Marchevsky's avatar
      bpf: Remove btf_field_offs, use btf_record's fields instead · cd2a8079
      Dave Marchevsky authored
      The btf_field_offs struct contains (offset, size) for btf_record fields,
      sorted by offset. btf_field_offs is always used in conjunction with
      btf_record, which has btf_field 'fields' array with (offset, type), the
      latter of which btf_field_offs' size is derived from via
      btf_field_type_size.
      
      This patch adds a size field to struct btf_field and sorts btf_record's
      fields by offset, making it possible to get rid of btf_field_offs. Less
      data duplication and less code complexity results.
      
      Since btf_field_offs' lifetime closely followed the btf_record used to
      populate it, most complexity wins are from removal of initialization
      code like:
      
        if (btf_record_successfully_initialized) {
          foffs = btf_parse_field_offs(rec);
          if (IS_ERR_OR_NULL(foffs))
            // free the btf_record and return err
        }
      
      Other changes in this patch are pretty mechanical:
      
        * foffs->field_off[i] -> rec->fields[i].offset
        * foffs->field_sz[i] -> rec->fields[i].size
        * Sort rec->fields in btf_parse_fields before returning
          * It's possible that this is necessary independently of other
            changes in this patch. btf_record_find in syscall.c expects
            btf_record's fields to be sorted by offset, yet there's no
            explicit sorting of them before this patch, record's fields are
            populated in the order they're read from BTF struct definition.
            BTF docs don't say anything about the sortedness of struct fields.
        * All functions taking struct btf_field_offs * input now instead take
          struct btf_record *. All callsites of these functions already have
          access to the correct btf_record.
      Signed-off-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-2-davemarchevsky@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      cd2a8079
  6. 14 Apr, 2023 5 commits
    • Rong Tao's avatar
      samples/bpf: sampleip: Replace PAGE_OFFSET with _text address · 4a1e885c
      Rong Tao authored
      Macro PAGE_OFFSET(0xffff880000000000) in sampleip_user.c is inaccurate,
      for example, in aarch64 architecture, this value depends on the
      CONFIG_ARM64_VA_BITS compilation configuration, this value defaults to 48,
      the corresponding PAGE_OFFSET is 0xffff800000000000, if we use the value
      defined in sampleip_user.c, then all KSYMs obtained by sampleip are (user)
      
      Symbol error due to PAGE_OFFSET error:
      
          $ sudo ./sampleip 1
          Sampling at 99 Hertz for 1 seconds. Ctrl-C also ends.
          ADDR                KSYM                             COUNT
          0xffff80000810ceb8  (user)                           1
          0xffffb28ec880      (user)                           1
          0xffff8000080c82b8  (user)                           1
          0xffffb23fed24      (user)                           1
          0xffffb28944fc      (user)                           1
          0xffff8000084628bc  (user)                           1
          0xffffb2a935c0      (user)                           1
          0xffff80000844677c  (user)                           1
          0xffff80000857a3a4  (user)                           1
          ...
      
      A few examples of addresses in the CONFIG_ARM64_VA_BITS=48 environment in
      the aarch64 environment:
      
          $ sudo head /proc/kallsyms
          ffff8000080a0000 T _text
          ffff8000080b0000 t gic_handle_irq
          ffff8000080b0000 T _stext
          ffff8000080b0000 T __irqentry_text_start
          ffff8000080b00b0 t gic_handle_irq
          ffff8000080b0230 t gic_handle_irq
          ffff8000080b03b4 T __irqentry_text_end
          ffff8000080b03b8 T __softirqentry_text_start
          ffff8000080b03c0 T __do_softirq
          ffff8000080b0718 T __entry_text_start
      
      We just need to replace the PAGE_OFFSET with the address _text in
      /proc/kallsyms to solve this problem:
      
          $ sudo ./sampleip 1
          Sampling at 99 Hertz for 1 seconds. Ctrl-C also ends.
          ADDR                KSYM                             COUNT
          0xffffb2892ab0      (user)                           1
          0xffffb2b1edfc      (user)                           1
          0xffff800008462834  __arm64_sys_ppoll                1
          0xffff8000084b87f4  eventfd_read                     1
          0xffffb28e6788      (user)                           1
          0xffff8000081e96d8  rcu_all_qs                       1
          0xffffb2ada878      (user)                           1
          ...
      Signed-off-by: default avatarRong Tao <rongtao@cestc.cn>
      Link: https://lore.kernel.org/r/tencent_A0E82E0BEE925285F8156D540731DF805F05@qq.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      4a1e885c
    • Ilya Leoshkevich's avatar
      bpf: Support 64-bit pointers to kfuncs · 1cf3bfc6
      Ilya Leoshkevich authored
      test_ksyms_module fails to emit a kfunc call targeting a module on
      s390x, because the verifier stores the difference between kfunc
      address and __bpf_call_base in bpf_insn.imm, which is s32, and modules
      are roughly (1 << 42) bytes away from the kernel on s390x.
      
      Fix by keeping BTF id in bpf_insn.imm for BPF_PSEUDO_KFUNC_CALLs,
      and storing the absolute address in bpf_kfunc_desc.
      
      Introduce bpf_jit_supports_far_kfunc_call() in order to limit this new
      behavior to the s390x JIT. Otherwise other JITs need to be modified,
      which is not desired.
      
      Introduce bpf_get_kfunc_addr() instead of exposing both
      find_kfunc_desc() and struct bpf_kfunc_desc.
      
      In addition to sorting kfuncs by imm, also sort them by offset, in
      order to handle conflicting imms from different modules. Do this on
      all architectures in order to simplify code.
      
      Factor out resolving specialized kfuncs (XPD and dynptr) from
      fixup_kfunc_call(). This was required in the first place, because
      fixup_kfunc_call() uses find_kfunc_desc(), which returns a const
      pointer, so it's not possible to modify kfunc addr without stripping
      const, which is not nice. It also removes repetition of code like:
      
      	if (bpf_jit_supports_far_kfunc_call())
      		desc->addr = func;
      	else
      		insn->imm = BPF_CALL_IMM(func);
      
      and separates kfunc_desc_tab fixups from kfunc_call fixups.
      Suggested-by: default avatarJiri Olsa <olsajiri@gmail.com>
      Signed-off-by: default avatarIlya Leoshkevich <iii@linux.ibm.com>
      Acked-by: default avatarJiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/r/20230412230632.885985-1-iii@linux.ibm.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      1cf3bfc6
    • Yafang's avatar
      bpf: Add preempt_count_{sub,add} into btf id deny list · c11bd046
      Yafang authored
      The recursion check in __bpf_prog_enter* and __bpf_prog_exit*
      leave preempt_count_{sub,add} unprotected. When attaching trampoline to
      them we get panic as follows,
      
      [  867.843050] BUG: TASK stack guard page was hit at 0000000009d325cf (stack is 0000000046a46a15..00000000537e7b28)
      [  867.843064] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
      [  867.843067] CPU: 8 PID: 11009 Comm: trace Kdump: loaded Not tainted 6.2.0+ #4
      [  867.843100] Call Trace:
      [  867.843101]  <TASK>
      [  867.843104]  asm_exc_int3+0x3a/0x40
      [  867.843108] RIP: 0010:preempt_count_sub+0x1/0xa0
      [  867.843135]  __bpf_prog_enter_recur+0x17/0x90
      [  867.843148]  bpf_trampoline_6442468108_0+0x2e/0x1000
      [  867.843154]  ? preempt_count_sub+0x1/0xa0
      [  867.843157]  preempt_count_sub+0x5/0xa0
      [  867.843159]  ? migrate_enable+0xac/0xf0
      [  867.843164]  __bpf_prog_exit_recur+0x2d/0x40
      [  867.843168]  bpf_trampoline_6442468108_0+0x55/0x1000
      ...
      [  867.843788]  preempt_count_sub+0x5/0xa0
      [  867.843793]  ? migrate_enable+0xac/0xf0
      [  867.843829]  __bpf_prog_exit_recur+0x2d/0x40
      [  867.843837] BUG: IRQ stack guard page was hit at 0000000099bd8228 (stack is 00000000b23e2bc4..000000006d95af35)
      [  867.843841] BUG: IRQ stack guard page was hit at 000000005ae07924 (stack is 00000000ffd69623..0000000014eb594c)
      [  867.843843] BUG: IRQ stack guard page was hit at 00000000028320f0 (stack is 00000000034b6438..0000000078d1bcec)
      [  867.843842]  bpf_trampoline_6442468108_0+0x55/0x1000
      ...
      
      That is because in __bpf_prog_exit_recur, the preempt_count_{sub,add} are
      called after prog->active is decreased.
      
      Fixing this by adding these two functions into btf ids deny list.
      Suggested-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarYafang <laoar.shao@gmail.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Acked-by: default avatarHao Luo <haoluo@google.com>
      Link: https://lore.kernel.org/r/20230413025248.79764-1-laoar.shao@gmail.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      c11bd046
    • Alexei Starovoitov's avatar
      selftests/bpf: Workaround for older vm_sockets.h. · 75860b52
      Alexei Starovoitov authored
      Some distros ship with older vm_sockets.h that doesn't have VMADDR_CID_LOCAL
      which causes selftests build to fail:
      /tmp/work/bpf/bpf/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c:261:18: error: ‘VMADDR_CID_LOCAL’ undeclared (first use in this function); did you mean ‘VMADDR_CID_HOST’?
          261 |  addr->svm_cid = VMADDR_CID_LOCAL;
              |                  ^~~~~~~~~~~~~~~~
              |                  VMADDR_CID_HOST
      
      Workaround this issue by defining it on demand.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      75860b52
    • Alexei Starovoitov's avatar
      selftests/bpf: Fix merge conflict due to SYS() macro change. · c04135ab
      Alexei Starovoitov authored
      Fix merge conflict between bpf/bpf-next trees due to change of arguments in SYS() macro.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      c04135ab
  7. 13 Apr, 2023 5 commits
    • Jakub Kicinski's avatar
      Daniel Borkmann says: · c2865b11
      Jakub Kicinski authored
      ====================
      pull-request: bpf-next 2023-04-13
      
      We've added 260 non-merge commits during the last 36 day(s) which contain
      a total of 356 files changed, 21786 insertions(+), 11275 deletions(-).
      
      The main changes are:
      
      1) Rework BPF verifier log behavior and implement it as a rotating log
         by default with the option to retain old-style fixed log behavior,
         from Andrii Nakryiko.
      
      2) Adds support for using {FOU,GUE} encap with an ipip device operating
         in collect_md mode and add a set of BPF kfuncs for controlling encap
         params, from Christian Ehrig.
      
      3) Allow BPF programs to detect at load time whether a particular kfunc
         exists or not, and also add support for this in light skeleton,
         from Alexei Starovoitov.
      
      4) Optimize hashmap lookups when key size is multiple of 4,
         from Anton Protopopov.
      
      5) Enable RCU semantics for task BPF kptrs and allow referenced kptr
         tasks to be stored in BPF maps, from David Vernet.
      
      6) Add support for stashing local BPF kptr into a map value via
         bpf_kptr_xchg(). This is useful e.g. for rbtree node creation
         for new cgroups, from Dave Marchevsky.
      
      7) Fix BTF handling of is_int_ptr to skip modifiers to work around
         tracing issues where a program cannot be attached, from Feng Zhou.
      
      8) Migrate a big portion of test_verifier unit tests over to
         test_progs -a verifier_* via inline asm to ease {read,debug}ability,
         from Eduard Zingerman.
      
      9) Several updates to the instruction-set.rst documentation
         which is subject to future IETF standardization
         (https://lwn.net/Articles/926882/), from Dave Thaler.
      
      10) Fix BPF verifier in the __reg_bound_offset's 64->32 tnum sub-register
          known bits information propagation, from Daniel Borkmann.
      
      11) Add skb bitfield compaction work related to BPF with the overall goal
          to make more of the sk_buff bits optional, from Jakub Kicinski.
      
      12) BPF selftest cleanups for build id extraction which stand on its own
          from the upcoming integration work of build id into struct file object,
          from Jiri Olsa.
      
      13) Add fixes and optimizations for xsk descriptor validation and several
          selftest improvements for xsk sockets, from Kal Conley.
      
      14) Add BPF links for struct_ops and enable switching implementations
          of BPF TCP cong-ctls under a given name by replacing backing
          struct_ops map, from Kui-Feng Lee.
      
      15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable
          offset stack read as earlier Spectre checks cover this,
          from Luis Gerhorst.
      
      16) Fix issues in copy_from_user_nofault() for BPF and other tracers
          to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner
          and Alexei Starovoitov.
      
      17) Add --json-summary option to test_progs in order for CI tooling to
          ease parsing of test results, from Manu Bretelle.
      
      18) Batch of improvements and refactoring to prep for upcoming
          bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator,
          from Martin KaFai Lau.
      
      19) Improve bpftool's visual program dump which produces the control
          flow graph in a DOT format by adding C source inline annotations,
          from Quentin Monnet.
      
      20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting
          the module name from BTF of the target and searching kallsyms of
          the correct module, from Viktor Malik.
      
      21) Improve BPF verifier handling of '<const> <cond> <non_const>'
          to better detect whether in particular jmp32 branches are taken,
          from Yonghong Song.
      
      22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock.
          A built-in cc or one from a kernel module is already able to write
          to app_limited, from Yixin Shen.
      
      Conflicts:
      
      Documentation/bpf/bpf_devel_QA.rst
        b7abcd9c ("bpf, doc: Link to submitting-patches.rst for general patch submission info")
        0f10f647 ("bpf, docs: Use internal linking for link to netdev subsystem doc")
      https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/
      
      include/net/ip_tunnels.h
        bc9d003d ("ip_tunnel: Preserve pointer const in ip_tunnel_info_opts")
        ac931d4c ("ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices")
      https://lore.kernel.org/all/20230413161235.4093777-1-broonie@kernel.org/
      
      net/bpf/test_run.c
        e5995bc7 ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption")
        294635a8 ("bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES")
      https://lore.kernel.org/all/20230320102619.05b80a98@canb.auug.org.au/
      ====================
      
      Link: https://lore.kernel.org/r/20230413191525.7295-1-daniel@iogearbox.netSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      c2865b11
    • Jakub Kicinski's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 800e68c4
      Jakub Kicinski authored
      Conflicts:
      
      tools/testing/selftests/net/config
        62199e3f ("selftests: net: Add VXLAN MDB test")
        3a0385be ("selftests: add the missing CONFIG_IP_SCTP in net config")
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      800e68c4
    • Linus Torvalds's avatar
      Merge tag 'net-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 829cca4d
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from bpf, and bluetooth.
      
        Not all that quiet given spring celebrations, but "current" fixes are
        thinning out, which is encouraging. One outstanding regression in the
        mlx5 driver when using old FW, not blocking but we're pushing for a
        fix.
      
        Current release - new code bugs:
      
         - eth: enetc: workaround for unresponsive pMAC after receiving
           express traffic
      
        Previous releases - regressions:
      
         - rtnetlink: restore RTM_NEW/DELLINK notification behavior, keep the
           pid/seq fields 0 for backward compatibility
      
        Previous releases - always broken:
      
         - sctp: fix a potential overflow in sctp_ifwdtsn_skip
      
         - mptcp:
            - use mptcp_schedule_work instead of open-coding it and make the
              worker check stricter, to avoid scheduling work on closed
              sockets
            - fix NULL pointer dereference on fastopen early fallback
      
         - skbuff: fix memory corruption due to a race between skb coalescing
           and releasing clones confusing page_pool reference counting
      
         - bonding: fix neighbor solicitation validation on backup slaves
      
         - bpf: tcp: use sock_gen_put instead of sock_put in bpf_iter_tcp
      
         - bpf: arm64: fixed a BTI error on returning to patched function
      
         - openvswitch: fix race on port output leading to inf loop
      
         - sfp: initialize sfp->i2c_block_size at sfp allocation to avoid
           returning a different errno than expected
      
         - phy: nxp-c45-tja11xx: unregister PTP, purge queues on remove
      
         - Bluetooth: fix printing errors if LE Connection times out
      
         - Bluetooth: assorted UaF, deadlock and data race fixes
      
         - eth: macb: fix memory corruption in extended buffer descriptor mode
      
        Misc:
      
         - adjust the XDP Rx flow hash API to also include the protocol layers
           over which the hash was computed"
      
      * tag 'net-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
        selftests/bpf: Adjust bpf_xdp_metadata_rx_hash for new arg
        mlx4: bpf_xdp_metadata_rx_hash add xdp rss hash type
        veth: bpf_xdp_metadata_rx_hash add xdp rss hash type
        mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash type
        xdp: rss hash types representation
        selftests/bpf: xdp_hw_metadata remove bpf_printk and add counters
        skbuff: Fix a race between coalescing and releasing SKBs
        net: macb: fix a memory corruption in extended buffer descriptor mode
        selftests: add the missing CONFIG_IP_SCTP in net config
        udp6: fix potential access to stale information
        selftests: openvswitch: adjust datapath NL message declaration
        selftests: mptcp: userspace pm: uniform verify events
        mptcp: fix NULL pointer dereference on fastopen early fallback
        mptcp: stricter state check in mptcp_worker
        mptcp: use mptcp_schedule_work instead of open-coding it
        net: enetc: workaround for unresponsive pMAC after receiving express traffic
        sctp: fix a potential overflow in sctp_ifwdtsn_skip
        net: qrtr: Fix an uninit variable access bug in qrtr_tx_resume()
        rtnetlink: Restore RTM_NEW/DELLINK notification behavior
        net: ti/cpsw: Add explicit platform_device.h and of_platform.h includes
        ...
      829cca4d
    • Linus Torvalds's avatar
      Merge tag 'devicetree-fixes-for-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux · 4413ad01
      Linus Torvalds authored
      Pull devicetree fixes from Rob Herring:
      
       - Fix interaction between fw_devlink and DT overlays causing devices to
         not be probed
      
       - Fix the compatible string for loongson,cpu-interrupt-controller
      
      * tag 'devicetree-fixes-for-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
        treewide: Fix probing of devices in DT overlays
        dt-bindings: interrupt-controller: loongarch: Fix mismatched compatible
      4413ad01
    • Linus Torvalds's avatar
      Merge tag 'pinctrl-v6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 531f27ad
      Linus Torvalds authored
      Pull pin control fix from Linus Walleij:
       "This is just a revert of the AMD fix, because the fix broke some
        laptops. We are working on a proper solution"
      
      * tag 'pinctrl-v6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
        Revert "pinctrl: amd: Disable and mask interrupts on resume"
      531f27ad