1. 23 Mar, 2023 4 commits
    • bpf: Create links for BPF struct_ops maps. · 68b04864
      Kui-Feng Lee authored
      Make bpf_link support struct_ops.  Previously, a struct_ops was always
      used alone, without any associated link; it was activated automatically
      as soon as its value was updated. Other BPF program types, by contrast,
      require a bpf_link to be created for their instances before they can
      become active. Now you can create an inactive struct_ops and create a
      link later to activate it.
      
      With bpf_links, struct_ops behaves similarly to other BPF program
      types. You can pin/unpin a struct_ops via its link, and the struct_ops
      is deactivated when its link is removed, whereas previously someone had
      to delete the value to deactivate it.
      
      bpf_links are responsible for registering their associated
      struct_ops. You can only use a struct_ops that has the BPF_F_LINK flag
      set to create a bpf_link, while a struct_ops without this flag behaves
      in the same manner as before and is registered upon updating its value.
      
      BPF_LINK_TYPE_STRUCT_OPS serves a dual purpose: it is used not only to
      craft the links for BPF struct_ops programs, but also to create links
      for BPF struct_ops themselves.  Since the links of BPF struct_ops
      programs are only used to create trampolines internally, they are never
      seen in other contexts and can thus be reused for struct_ops themselves.
      
      To maintain a reference to the map backing this link, we add
      bpf_struct_ops_link as an additional type. The map pointer is
      RCU-protected, although that won't be necessary until later in the
      patchset.
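
      As a rough illustration of the intended userspace flow (a minimal
      sketch using libbpf; the skeleton and map names below are made up, and
      the struct_ops map is assumed to be declared with BPF_F_LINK):

         struct bpf_link *link;

         /* The value is loaded, but the struct_ops stays inactive until a
          * link is created for it.
          */
         link = bpf_map__attach_struct_ops(skel->maps.my_ops);
         if (!link)
                 return -1;

         /* The link can be pinned like any other; destroying (or detaching)
          * it deactivates the struct_ops.
          */
         bpf_link__destroy(link);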
      Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
      Link: https://lore.kernel.org/r/20230323032405.3735486-4-kuifeng@meta.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      68b04864
    • net: Update an existing TCP congestion control algorithm. · 8fb1a76a
      Kui-Feng Lee authored
      This feature lets you immediately transition to another congestion
      control algorithm or implementation with the same name.  Once the name
      is updated, new connections will use the new algorithm.

      The purpose is to update a customized algorithm implemented in BPF
      struct_ops with a new version on the fly.  The following is an example
      of using the userspace API implemented in later BPF patches.
      
         link = bpf_map__attach_struct_ops(skel->maps.ca_update_1);
         .......
         err = bpf_link__update_map(link, skel->maps.ca_update_2);
      
      We first load and register an algorithm implemented in BPF struct_ops,
      then swap it out with a new one using the same name. After that, newly
      created connections will apply the updated algorithm, while older ones
      retain the previous version already applied.
      
      This patch also takes this chance to refactor the ca validation into
      the new tcp_validate_congestion_control() function.
      
      Cc: netdev@vger.kernel.org, Eric Dumazet <edumazet@google.com>
      Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
      Link: https://lore.kernel.org/r/20230323032405.3735486-3-kuifeng@meta.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      8fb1a76a
    • bpf: Retire the struct_ops map kvalue->refcnt. · b671c206
      Kui-Feng Lee authored
      We have replaced kvalue->refcnt with synchronize_rcu() to wait for an
      RCU grace period.
      
      Maintaining kvalue->refcnt was a complicated task, as we had to keep
      track of two reference counts simultaneously: kvalue->refcnt itself and
      the reference count of the bpf_map. When kvalue->refcnt reaches zero,
      we also have to decrease the reference count on the bpf_map - yet these
      steps are not performed atomically and require us to be vigilant when
      managing them. By eliminating kvalue->refcnt, maintenance becomes more
      straightforward, as only the refcount of the bpf_map needs to be
      managed now.
      
      To prevent the trampoline image of a struct_ops from being released
      while it is still in use, we wait for an RCU grace period. The
      setsockopt(TCP_CONGESTION, "...") command allows you to change your
      socket's congestion control algorithm and can result in releasing the
      old struct_ops implementation; that on its own is fine. However, this
      function is also exposed through bpf_setsockopt(), so it may be called
      by BPF programs as well. To ensure that the trampoline image belonging
      to a struct_ops can be safely used while one of its methods is running,
      the trampoline safeguards the BPF program with rcu_read_lock(). Doing
      so prevents destruction of the associated images before returning from
      the trampoline, and requires us to wait for an RCU grace period before
      freeing them.
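
      For reference, the user-facing path mentioned above boils down to a
      plain setsockopt() call like the sketch below (the congestion control
      name is just a placeholder):

         #include <netinet/in.h>
         #include <netinet/tcp.h>
         #include <string.h>
         #include <sys/socket.h>

         /* Switching the algorithm can drop the last reference to the old
          * struct_ops-backed implementation, whose trampoline image must
          * therefore only be freed after an RCU grace period.
          */
         static int switch_cc(int fd)
         {
                 const char name[] = "bpf_cubic";

                 return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
                                   name, strlen(name));
         }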
      Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
      Link: https://lore.kernel.org/r/20230323032405.3735486-2-kuifeng@meta.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      b671c206
    • bpf: remember meta->iter info only for initialized iters · b63cbc49
      Andrii Nakryiko authored
      For iter_new() functions, the iterator state's slot might not yet be
      initialized, in which case iter_get_spi() will return -ERANGE. This is
      expected and is handled properly. But for the iter_next() and
      iter_destroy() cases the iter slot is supposed to be initialized and
      correct, so -ERANGE is not possible.

      Move the meta->iter.{spi,frameno} initialization into the
      iter_next/iter_destroy handling branch to make it more explicit that
      valid information will be remembered in the meta->iter block for
      subsequent use in process_iter_next_call(), avoiding a confusing-looking
      -ERANGE assignment to meta->iter.spi.
      Reported-by: Dan Carpenter <error27@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230322232502.836171-1-andrii@kernel.org
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      b63cbc49
  2. 22 Mar, 2023 11 commits
    • selftests/bpf: Check when bounds are not in the 32-bit range · 1a3148fc
      Xu Kuohai authored
      Add cases to check whether bounds are updated correctly when a 64-bit
      value is not in the 32-bit range.
      Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20230322213056.2470-2-daniel@iogearbox.net
      1a3148fc
    • bpf: Fix __reg_bound_offset 64->32 var_off subreg propagation · 7be14c1c
      Daniel Borkmann authored
      Xu reports that after commit 3f50f132 ("bpf: Verifier, do explicit ALU32
      bounds tracking"), the following BPF program is rejected by the verifier:
      
         0: (61) r2 = *(u32 *)(r1 +0)          ; R2_w=pkt(off=0,r=0,imm=0)
         1: (61) r3 = *(u32 *)(r1 +4)          ; R3_w=pkt_end(off=0,imm=0)
         2: (bf) r1 = r2
         3: (07) r1 += 1
         4: (2d) if r1 > r3 goto pc+8
         5: (71) r1 = *(u8 *)(r2 +0)           ; R1_w=scalar(umax=255,var_off=(0x0; 0xff))
         6: (18) r0 = 0x7fffffffffffff10
         8: (0f) r1 += r0                      ; R1_w=scalar(umin=0x7fffffffffffff10,umax=0x800000000000000f)
         9: (18) r0 = 0x8000000000000000
        11: (07) r0 += 1
        12: (ad) if r0 < r1 goto pc-2
        13: (b7) r0 = 0
        14: (95) exit
      
      And the verifier log says:
      
        func#0 @0
        0: R1=ctx(off=0,imm=0) R10=fp0
        0: (61) r2 = *(u32 *)(r1 +0)          ; R1=ctx(off=0,imm=0) R2_w=pkt(off=0,r=0,imm=0)
        1: (61) r3 = *(u32 *)(r1 +4)          ; R1=ctx(off=0,imm=0) R3_w=pkt_end(off=0,imm=0)
        2: (bf) r1 = r2                       ; R1_w=pkt(off=0,r=0,imm=0) R2_w=pkt(off=0,r=0,imm=0)
        3: (07) r1 += 1                       ; R1_w=pkt(off=1,r=0,imm=0)
        4: (2d) if r1 > r3 goto pc+8          ; R1_w=pkt(off=1,r=1,imm=0) R3_w=pkt_end(off=0,imm=0)
        5: (71) r1 = *(u8 *)(r2 +0)           ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) R2_w=pkt(off=0,r=1,imm=0)
        6: (18) r0 = 0x7fffffffffffff10       ; R0_w=9223372036854775568
        8: (0f) r1 += r0                      ; R0_w=9223372036854775568 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775823,s32_min=-240,s32_max=15)
        9: (18) r0 = 0x8000000000000000       ; R0_w=-9223372036854775808
        11: (07) r0 += 1                      ; R0_w=-9223372036854775807
        12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775809)
        13: (b7) r0 = 0                       ; R0_w=0
        14: (95) exit
      
        from 12 to 11: R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0
        11: (07) r0 += 1                      ; R0_w=-9223372036854775806
        12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775806 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775810,var_off=(0x8000000000000000; 0xffffffff))
        13: safe
      
        [...]
      
        from 12 to 11: R0_w=-9223372036854775795 R1=scalar(umin=9223372036854775822,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
        11: (07) r0 += 1                      ; R0_w=-9223372036854775794
        12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775794 R1=scalar(umin=9223372036854775822,umax=9223372036854775822,var_off=(0x8000000000000000; 0xffffffff))
        13: safe
      
        from 12 to 11: R0_w=-9223372036854775794 R1=scalar(umin=9223372036854775823,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
        11: (07) r0 += 1                      ; R0_w=-9223372036854775793
        12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775793 R1=scalar(umin=9223372036854775823,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff))
        13: safe
      
        from 12 to 11: R0_w=-9223372036854775793 R1=scalar(umin=9223372036854775824,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
        11: (07) r0 += 1                      ; R0_w=-9223372036854775792
        12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775792 R1=scalar(umin=9223372036854775824,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff))
        13: safe
      
        [...]
      
      The 64bit umin=9223372036854775810 bound continuously bumps by +1 while
      umax=9223372036854775823 stays as-is until the verifier complexity limit
      is reached and the program is finally rejected. During this simulation,
      umin also eventually surpasses umax. Looking at the first 'from 12
      to 11' output line from the loop, R1 has the following state:
      
        R1_w=scalar(umin=0x8000000000000002 (9223372036854775810),
                    umax=0x800000000000000f (9223372036854775823),
                var_off=(0x8000000000000000;
                                 0xffffffff))
      
      The var_off is technically not in an inconsistent state, but it is very
      imprecise, far surpassing the 64bit umax bounds, whereas the expected
      output with refined known bits in var_off should have been:
      
        R1_w=scalar(umin=0x8000000000000002 (9223372036854775810),
                    umax=0x800000000000000f (9223372036854775823),
                var_off=(0x8000000000000000;
                                        0xf))
      
      In the above log, var_off stays at var_off=(0x8000000000000000; 0xffffffff)
      and does not converge into a narrower mask where more bits become known,
      which would eventually transform R1 into a constant in the
      umin=9223372036854775823, umax=9223372036854775823 case, where the
      verifier would have terminated and let the program pass.
      
      __reg_combine_64_into_32() marks the subregister unknown and propagates
      the 64bit {s,u}min/{s,u}max bounds to their 32bit equivalents iff they
      are within the 32bit universe. The question came up whether
      __reg_combine_64_into_32() should special-case the situation where the
      64bit {s,u}min bounds have the same value as the 64bit {s,u}max bounds,
      and then also assign the latter to the 32bit
      reg->{s,u}32_{min,max}_value. As can be seen from the above example,
      however, that is just /one/ special case and not a /generic/ solution:
      the above example would still not be addressed this way and would remain
      at an imprecise var_off=(0x8000000000000000; 0xffffffff).
      
      The needed improvement is in __reg_bound_offset(): refine var32_off with
      the updated var64_off instead of the prior reg->var_off. The
      reg_bounds_sync() code first refines information about the register's
      min/max bounds via __update_reg_bounds() from the current var_off, then
      in __reg_deduce_bounds() from the sign bit, and with the potentially
      learned bits from the bounds it updates the var_off tnum in
      __reg_bound_offset(). For example, intersecting with the old var_off
      might have improved the bounds slightly, e.g. if umax was 0x7f...f and
      var_off was (0; 0xf...fc), then the new var_off will result in
      (0; 0x7f...fc). The intersected var64_off then holds the universe, which
      is a superset of var32_off. The point for the latter is not to broaden,
      but to further refine known bits based on the intersection of var_off
      with the 32 bit bounds, so that we later construct the final var_off
      from the upper and lower 32 bits. The final __update_reg_bounds() can
      then potentially still slightly refine the bounds if more bits became
      known from the new var_off.
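
      Sketched in code, the refinement amounts to deriving var32_off from the
      freshly intersected var64_off rather than from the stale reg->var_off
      (shown here as an approximation of the resulting helper, not the exact
      diff):

         static void __reg_bound_offset(struct bpf_reg_state *reg)
         {
                 struct tnum var64_off = tnum_intersect(reg->var_off,
                                                        tnum_range(reg->umin_value,
                                                                   reg->umax_value));
                 struct tnum var32_off = tnum_intersect(tnum_subreg(var64_off),
                                                        tnum_range(reg->u32_min_value,
                                                                   reg->u32_max_value));

                 reg->var_off = tnum_or(tnum_clear_subreg(var64_off), var32_off);
         }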
      
      After the improvement, we can see R1 converging successively:
      
        func#0 @0
        0: R1=ctx(off=0,imm=0) R10=fp0
        0: (61) r2 = *(u32 *)(r1 +0)          ; R1=ctx(off=0,imm=0) R2_w=pkt(off=0,r=0,imm=0)
        1: (61) r3 = *(u32 *)(r1 +4)          ; R1=ctx(off=0,imm=0) R3_w=pkt_end(off=0,imm=0)
        2: (bf) r1 = r2                       ; R1_w=pkt(off=0,r=0,imm=0) R2_w=pkt(off=0,r=0,imm=0)
        3: (07) r1 += 1                       ; R1_w=pkt(off=1,r=0,imm=0)
        4: (2d) if r1 > r3 goto pc+8          ; R1_w=pkt(off=1,r=1,imm=0) R3_w=pkt_end(off=0,imm=0)
        5: (71) r1 = *(u8 *)(r2 +0)           ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) R2_w=pkt(off=0,r=1,imm=0)
        6: (18) r0 = 0x7fffffffffffff10       ; R0_w=9223372036854775568
        8: (0f) r1 += r0                      ; R0_w=9223372036854775568 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775823,s32_min=-240,s32_max=15)
        9: (18) r0 = 0x8000000000000000       ; R0_w=-9223372036854775808
        11: (07) r0 += 1                      ; R0_w=-9223372036854775807
        12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775809)
        13: (b7) r0 = 0                       ; R0_w=0
        14: (95) exit
      
        from 12 to 11: R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0
        11: (07) r0 += 1                      ; R0_w=-9223372036854775806
        12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775806 R1_w=-9223372036854775806
        13: safe
      
        from 12 to 11: R0_w=-9223372036854775806 R1_w=scalar(umin=9223372036854775811,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0
        11: (07) r0 += 1                      ; R0_w=-9223372036854775805
        12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775805 R1_w=-9223372036854775805
        13: safe
      
        [...]
      
        from 12 to 11: R0_w=-9223372036854775798 R1=scalar(umin=9223372036854775819,umax=9223372036854775823,var_off=(0x8000000000000008; 0x7),s32_min=8,s32_max=15,u32_min=8,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
        11: (07) r0 += 1                      ; R0_w=-9223372036854775797
        12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775797 R1=-9223372036854775797
        13: safe
      
        from 12 to 11: R0_w=-9223372036854775797 R1=scalar(umin=9223372036854775820,umax=9223372036854775823,var_off=(0x800000000000000c; 0x3),s32_min=12,s32_max=15,u32_min=12,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
        11: (07) r0 += 1                      ; R0_w=-9223372036854775796
        12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775796 R1=-9223372036854775796
        13: safe
      
        from 12 to 11: R0_w=-9223372036854775796 R1=scalar(umin=9223372036854775821,umax=9223372036854775823,var_off=(0x800000000000000c; 0x3),s32_min=12,s32_max=15,u32_min=12,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
        11: (07) r0 += 1                      ; R0_w=-9223372036854775795
        12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775795 R1=-9223372036854775795
        13: safe
      
        from 12 to 11: R0_w=-9223372036854775795 R1=scalar(umin=9223372036854775822,umax=9223372036854775823,var_off=(0x800000000000000e; 0x1),s32_min=14,s32_max=15,u32_min=14,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
        11: (07) r0 += 1                      ; R0_w=-9223372036854775794
        12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775794 R1=-9223372036854775794
        13: safe
      
        from 12 to 11: R0_w=-9223372036854775794 R1=-9223372036854775793 R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
        11: (07) r0 += 1                      ; R0_w=-9223372036854775793
        12: (ad) if r0 < r1 goto pc-2
        last_idx 12 first_idx 12
        parent didn't have regs=1 stack=0 marks: R0_rw=P-9223372036854775801 R1_r=scalar(umin=9223372036854775815,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
        last_idx 11 first_idx 11
        regs=1 stack=0 before 11: (07) r0 += 1
        parent didn't have regs=1 stack=0 marks: R0_rw=P-9223372036854775805 R1_rw=scalar(umin=9223372036854775812,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0
        last_idx 12 first_idx 0
        regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2
        regs=1 stack=0 before 11: (07) r0 += 1
        regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2
        regs=1 stack=0 before 11: (07) r0 += 1
        regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2
        regs=1 stack=0 before 11: (07) r0 += 1
        regs=1 stack=0 before 9: (18) r0 = 0x8000000000000000
        last_idx 12 first_idx 12
        parent didn't have regs=2 stack=0 marks: R0_rw=P-9223372036854775801 R1_r=Pscalar(umin=9223372036854775815,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
        last_idx 11 first_idx 11
        regs=2 stack=0 before 11: (07) r0 += 1
        parent didn't have regs=2 stack=0 marks: R0_rw=P-9223372036854775805 R1_rw=Pscalar(umin=9223372036854775812,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0
        last_idx 12 first_idx 0
        regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2
        regs=2 stack=0 before 11: (07) r0 += 1
        regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2
        regs=2 stack=0 before 11: (07) r0 += 1
        regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2
        regs=2 stack=0 before 11: (07) r0 += 1
        regs=2 stack=0 before 9: (18) r0 = 0x8000000000000000
        regs=2 stack=0 before 8: (0f) r1 += r0
        regs=3 stack=0 before 6: (18) r0 = 0x7fffffffffffff10
        regs=2 stack=0 before 5: (71) r1 = *(u8 *)(r2 +0)
        13: safe
      
        from 4 to 13: safe
        verification time 322 usec
        stack depth 0
        processed 56 insns (limit 1000000) max_states_per_insn 1 total_states 3 peak_states 3 mark_read 1
      
      This also fixes up a test case along with this improvement where we match
      on the verifier log. The updated log now has a refined var_off, too.
      
      Fixes: 3f50f132 ("bpf: Verifier, do explicit ALU32 bounds tracking")
      Reported-by: Xu Kuohai <xukuohai@huaweicloud.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20230314203424.4015351-2-xukuohai@huaweicloud.com
      Link: https://lore.kernel.org/bpf/20230322213056.2470-1-daniel@iogearbox.net
      7be14c1c
    • Merge branch 'error checking where helpers call bpf_map_ops' · 02adf9e9
      Alexei Starovoitov authored
      JP Kobryn says:
      
      ====================
      
      Within bpf programs, the bpf helper functions can make inline calls to
      kernel functions. In this scenario there can be a disconnect between the
      register the kernel function writes a return value to and the register the
      bpf program uses to evaluate that return value.
      
      As an example, this bpf code:
      
      long err = bpf_map_update_elem(...);
      if (err && err != -EEXIST)
      	// got some error other than -EEXIST
      
      ...can result in the bpf assembly:
      
      ; err = bpf_map_update_elem(&mymap, &key, &val, BPF_NOEXIST);
        37:	movabs $0xffff976a10730400,%rdi
        41:	mov    $0x1,%ecx
        46:	call   0xffffffffe103291c	; htab_map_update_elem
      ; if (err && err != -EEXIST) {
        4b:	cmp    $0xffffffffffffffef,%rax ; cmp -EEXIST,%rax
        4f:	je     0x000000000000008e
        51:	test   %rax,%rax
        54:	je     0x000000000000008e
      
      The compare operation here evaluates %rax, while in the preceding call to
      htab_map_update_elem the corresponding assembly returns -EEXIST via %eax
      (the lower 32 bits of %rax):
      
      movl $0xffffffef, %r9d
      ...
      movl %r9d, %eax
      
      ...since it's returning int (32-bit). So the resulting comparison becomes:
      
      cmp $0xffffffffffffffef, $0x00000000ffffffef
      
      ...making it impossible to check for negative errors or specific
      errors, since the sign bit is left at bit 32. This means that in the
      original example, the conditional branch will be entered even when the
      error is -EEXIST, which was not intended.
      
      The selftests added cover these cases for the different bpf_map_ops
      functions. When the second patch is applied, changing the return type of
      those functions to long, the comparison works as intended and the tests
      pass.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      02adf9e9
    • bpf: return long from bpf_map_ops funcs · d7ba4cc9
      JP Kobryn authored
      This patch changes the return types of the bpf_map_ops functions to
      long, where previously int was returned. Using long allows bpf programs
      to maintain the sign bit in the absence of sign extension in situations
      where inlined bpf helper funcs make calls to the bpf_map_ops funcs and a
      negative error is returned.

      The definitions of the helper funcs are generated from comments in the
      bpf uapi header at `include/uapi/linux/bpf.h`. The return type of these
      helpers was previously changed from int to long in commit bdb7b79b. For
      any case where one of the map helpers calls a bpf_map_ops func that
      still returns a 32-bit int, the compiler might not include the sign
      extension instructions needed to properly convert the 32-bit negative
      value to a 64-bit negative value.
      
      For example:
      bpf assembly excerpt of an inlined helper calling a kernel function and
      checking for a specific error:
      
      ; err = bpf_map_update_elem(&mymap, &key, &val, BPF_NOEXIST);
        ...
        46:	call   0xffffffffe103291c	; htab_map_update_elem
      ; if (err && err != -EEXIST) {
        4b:	cmp    $0xffffffffffffffef,%rax ; cmp -EEXIST,%rax
      
      kernel function assembly excerpt of return value from
      `htab_map_update_elem` returning 32-bit int:
      
      movl $0xffffffef, %r9d
      ...
      movl %r9d, %eax
      
      ...results in the comparison:
      cmp $0xffffffffffffffef, $0x00000000ffffffef
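
      With the ops returning long end to end, the BPF-side pattern from the
      cover letter behaves as intended; a minimal sketch:

         long err = bpf_map_update_elem(&mymap, &key, &val, BPF_NOEXIST);

         if (err && err != -EEXIST)
                 bpf_printk("update failed: %ld", err);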
      
      Fixes: bdb7b79b ("bpf: Switch most helper return values from 32-bit int to 64-bit long")
      Tested-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
      Link: https://lore.kernel.org/r/20230322194754.185781-3-inwardvessel@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d7ba4cc9
    • bpf/selftests: coverage for bpf_map_ops errors · 830154cd
      JP Kobryn authored
      These tests expose the issue of being unable to properly check for errors
      returned from inlined bpf map helpers that make calls to the bpf_map_ops
      functions. At best, a check for zero or non-zero can be done but these
      tests show it is not possible to check for a negative value or for a
      specific error value.
      Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
      Tested-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20230322194754.185781-2-inwardvessel@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      830154cd
    • Merge branch 'bpf: Support ksym detection in light skeleton.' · d9d93f3b
      Andrii Nakryiko authored
      Alexei Starovoitov says:
      
      ====================
      
      From: Alexei Starovoitov <ast@kernel.org>
      
      v1->v2: update denylist on s390
      
      Patch 1: Cleanup internal libbpf names.
      Patch 2: Teach the verifier that rdonly_mem != NULL.
      Patch 3: Fix gen_loader to support ksym detection.
      Patch 4: Selftest and update denylist.
      ====================
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      d9d93f3b
    • selftests/bpf: Add light skeleton test for kfunc detection. · 3b2ec214
      Alexei Starovoitov authored
      Add light skeleton test for kfunc detection and denylist it for s390.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20230321203854.3035-5-alexei.starovoitov@gmail.com
      3b2ec214
    • libbpf: Support kfunc detection in light skeleton. · 708cdc57
      Alexei Starovoitov authored
      Teach gen_loader to find {btf_id, btf_obj_fd} of kernel variables and kfuncs
      and populate corresponding ld_imm64 and bpf_call insns.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20230321203854.3035-4-alexei.starovoitov@gmail.com
      708cdc57
    • bpf: Teach the verifier to recognize rdonly_mem as not null. · 1057d299
      Alexei Starovoitov authored
      Teach the verifier to recognize PTR_TO_MEM | MEM_RDONLY as not NULL;
      otherwise, if (!bpf_ksym_exists(known_kfunc)) doesn't go through dead
      code elimination.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/bpf/20230321203854.3035-3-alexei.starovoitov@gmail.com
      1057d299
    • libbpf: Rename RELO_EXTERN_VAR/FUNC. · a18f7214
      Alexei Starovoitov authored
      The RELO_EXTERN_VAR/FUNC names are not correct anymore. RELO_EXTERN_VAR
      represents a ksym symbol in an ld_imm64 insn, which can point to a
      kernel variable or a kfunc. Rename RELO_EXTERN_VAR->RELO_EXTERN_LD64 and
      RELO_EXTERN_FUNC->RELO_EXTERN_CALL to match what they actually represent.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/bpf/20230321203854.3035-2-alexei.starovoitov@gmail.com
      a18f7214
    • selftests/xsk: add xdp populate metadata test · 9a321fd3
      Tushar Vyavahare authored
      Add a new test in copy-mode for testing the copying of metadata from the
      buffer in kernel-space to user-space. This is accomplished by adding a
      new XDP program and using the bss map to store a counter that is written
      to the metadata field. This counter is incremented for every packet so
      that the number becomes unique and should be the same as the payload.
      It is stored in the bss so the value can be reset between runs.
      
      The XDP program populates the metadata and the userspace program checks
      the value stored in the metadata field against the payload using the new
      is_metadata_correct() function. To turn this verification on or off, add
      a new parameter (use_metadata) to the ifobject structure.
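
      A rough sketch of the XDP side of this approach (illustrative only; the
      actual selftest program and variable names differ):

         #include <linux/bpf.h>
         #include <bpf/bpf_helpers.h>

         __u32 count;   /* lives in the bss, reset between runs */

         SEC("xdp")
         int xdp_populate_meta(struct xdp_md *ctx)
         {
                 __u32 *meta;

                 /* Reserve 4 bytes of metadata in front of the packet data. */
                 if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
                         return XDP_DROP;

                 meta = (void *)(long)ctx->data_meta;
                 if ((void *)(meta + 1) > (void *)(long)ctx->data)
                         return XDP_DROP;

                 *meta = count++;   /* unique per packet, matches the payload */
                 return XDP_PASS;
         }

         char _license[] SEC("license") = "GPL";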
      Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20230320102705.306187-1-tushar.vyavahare@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9a321fd3
  3. 21 Mar, 2023 4 commits
  4. 20 Mar, 2023 3 commits
  5. 18 Mar, 2023 1 commit
  6. 17 Mar, 2023 11 commits
    • selftests/bpf: Add --json-summary option to test_progs · 2be7aa76
      Manu Bretelle authored
      Currently, test_progs outputs all stdout/stderr as it runs, and when it
      is done, prints a summary.
      
      It is non-trivial for tooling to parse that output and extract meaningful
      information from it.
      
      This change adds a new option, `--json-summary`/`-J`, that lets the
      caller specify a file where `test_progs{,-no_alu32}` can write a summary
      of the run in JSON format that can later be parsed by tooling.
      
      Currently, it creates a summary section with successes/skipped/failures
      followed by a list of failed tests and subtests.
      
      A test contains the following fields:
      - name: the name of the test
      - number: the number of the test
      - message: the log message that was printed by the test.
      - failed: A boolean indicating whether the test failed or not. Currently
      we only output failed tests, but in the future, successful tests could
      be added.
      - subtests: A list of subtests associated with this test.
      
      A subtest contains the following fields:
      - name: same as above
      - number: same as above
      - message: the log message that was printed by the subtest.
      - failed: same as above but for the subtest
      
      An example run and json content below:
      ```
      $ sudo ./test_progs -a $(grep -v '^#' ./DENYLIST.aarch64 | awk '{print
      $1","}' | tr -d '\n') -j -J /tmp/test_progs.json
      $ jq < /tmp/test_progs.json | head -n 30
      {
        "success": 29,
        "success_subtest": 23,
        "skipped": 3,
        "failed": 28,
        "results": [
          {
            "name": "bpf_cookie",
            "number": 10,
            "message": "test_bpf_cookie:PASS:skel_open 0 nsec\n",
            "failed": true,
            "subtests": [
              {
                "name": "multi_kprobe_link_api",
                "number": 2,
                "message": "kprobe_multi_link_api_subtest:PASS:load_kallsyms 0 nsec\nlibbpf: extern 'bpf_testmod_fentry_test1' (strong): not resolved\nlibbpf: failed to load object 'kprobe_multi'\nlibbpf: failed to load BPF skeleton 'kprobe_multi': -3\nkprobe_multi_link_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3\n",
                "failed": true
              },
              {
                "name": "multi_kprobe_attach_api",
                "number": 3,
                "message": "libbpf: extern 'bpf_testmod_fentry_test1' (strong): not resolved\nlibbpf: failed to load object 'kprobe_multi'\nlibbpf: failed to load BPF skeleton 'kprobe_multi': -3\nkprobe_multi_attach_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3\n",
                "failed": true
              },
              {
                "name": "lsm",
                "number": 8,
                "message": "lsm_subtest:PASS:lsm.link_create 0 nsec\nlsm_subtest:FAIL:stack_mprotect unexpected stack_mprotect: actual 0 != expected -1\n",
                "failed": true
              }
      ```
      
      The file can then be used to print a summary of the test run and list of
      failing tests/subtests:
      
      ```
      $ jq -r < /tmp/test_progs.json '"Success: \(.success)/\(.success_subtest), Skipped: \(.skipped), Failed: \(.failed)"'
      
      Success: 29/23, Skipped: 3, Failed: 28
      $ jq -r < /tmp/test_progs.json '.results | map([
          if .failed then "#\(.number) \(.name)" else empty end,
          (
              . as {name: $tname, number: $tnum} | .subtests | map(
                  if .failed then "#\($tnum)/\(.number) \($tname)/\(.name)" else empty end
              )
          )
      ]) | flatten | .[]' | head -n 20
       #10 bpf_cookie
       #10/2 bpf_cookie/multi_kprobe_link_api
       #10/3 bpf_cookie/multi_kprobe_attach_api
       #10/8 bpf_cookie/lsm
       #15 bpf_mod_race
       #15/1 bpf_mod_race/ksym (used_btfs UAF)
       #15/2 bpf_mod_race/kfunc (kfunc_btf_tab UAF)
       #36 cgroup_hierarchical_stats
       #61 deny_namespace
       #61/1 deny_namespace/unpriv_userns_create_no_bpf
       #73 fexit_stress
       #83 get_func_ip_test
       #99 kfunc_dynptr_param
       #99/1 kfunc_dynptr_param/dynptr_data_null
       #99/4 kfunc_dynptr_param/dynptr_data_null
       #100 kprobe_multi_bench_attach
       #100/1 kprobe_multi_bench_attach/kernel
       #100/2 kprobe_multi_bench_attach/modules
       #101 kprobe_multi_test
       #101/1 kprobe_multi_test/skel_api
      ```
      Signed-off-by: Manu Bretelle <chantr4@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20230317163256.3809328-1-chantr4@gmail.com
      2be7aa76
    • Merge branch 'bpf: Add detection of kfuncs.' · 6cae5a71
      Andrii Nakryiko authored
      Alexei Starovoitov says:
      
      ====================
      
      From: Alexei Starovoitov <ast@kernel.org>
      
      Allow BPF programs to detect at load time whether a particular kfunc
      exists.
      
      Patch 1: Allow ld_imm64 to point to kfunc in the kernel.
      Patch 2: Fix relocation of kfunc in ld_imm64 insn when kfunc is in kernel module.
      Patch 3: Introduce bpf_ksym_exists() macro.
      Patch 4: selftest.
      
      NOTE: detection of kfuncs from light skeleton is not supported yet.
      ====================
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      6cae5a71
    • selftests/bpf: Add test for bpf_ksym_exists(). · 95fdf6e3
      Alexei Starovoitov authored
      Add a load- and run-time test for bpf_ksym_exists() and check that the
      verifier performs dead code elimination for a non-existing kfunc.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20230317201920.62030-5-alexei.starovoitov@gmail.com
      95fdf6e3
    • libbpf: Introduce bpf_ksym_exists() macro. · 5cbd3fe3
      Alexei Starovoitov authored
      Introduce the bpf_ksym_exists() macro that can be used by BPF programs
      to detect at load time whether a particular ksym (either a variable or a
      kfunc) is present in the kernel.
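
      A minimal usage sketch (the kfunc name here is purely illustrative; a
      real program would reference an actual kfunc, declared __ksym __weak):

         #include <linux/bpf.h>
         #include <bpf/bpf_helpers.h>

         extern void some_new_kfunc(void) __ksym __weak;

         SEC("tc")
         int use_if_available(struct __sk_buff *skb)
         {
                 /* Evaluated against the ksym's resolved address, so the
                  * verifier can dead-code-eliminate the call when the kfunc
                  * is absent.
                  */
                 if (bpf_ksym_exists(some_new_kfunc))
                         some_new_kfunc();

                 return 0;
         }

         char _license[] SEC("license") = "GPL";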
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20230317201920.62030-4-alexei.starovoitov@gmail.com
      5cbd3fe3
    • libbpf: Fix relocation of kfunc ksym in ld_imm64 insn. · 5fc13ad5
      Alexei Starovoitov authored
      void *p = kfunc; -> generates ld_imm64 insn.
      kfunc() -> generates bpf_call insn.
      
      libbpf patches the bpf_call insn correctly, but in the former case only
      the btf_id part of the ld_imm64 is set. This means that pointers to
      kfuncs in modules are not patched correctly and the verifier rejects
      loading such programs due to the btf_id being out of range. Fix libbpf
      to patch the ld_imm64 for kfuncs as well.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20230317201920.62030-3-alexei.starovoitov@gmail.com
      5fc13ad5
    • bpf: Allow ld_imm64 instruction to point to kfunc. · 58aa2afb
      Alexei Starovoitov authored
      Allow an ld_imm64 insn with BPF_PSEUDO_BTF_ID to hold the address of a
      kfunc. An ld_imm64 pointing to a valid kfunc will be seen as non-null
      PTR_TO_MEM by the is_branch_taken() logic of the verifier, while libbpf
      will resolve the address of an unknown kfunc as ld_imm64 reg, 0, which
      is also recognized by is_branch_taken(), so the verifier will proceed
      with dead code elimination. BPF programs can use this logic to detect at
      load time whether a kfunc is present in the kernel with the
      bpf_ksym_exists() macro that is introduced in the next patches.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20230317201920.62030-2-alexei.starovoitov@gmail.com
      58aa2afb
    • bpf, docs: Use internal linking for link to netdev subsystem doc · 0f10f647
      Bagas Sanjaya authored
      Commit d56b0c46 ("bpf, docs: Fix link to netdev-FAQ target")
      attempts to fix the linking problem to the undefined "netdev-FAQ" label
      introduced in 287f4fa9 ("docs: Update references to netdev-FAQ")
      by changing the internal cross reference to the netdev subsystem
      documentation (Documentation/process/maintainer-netdev.rst) into an
      external one at docs.kernel.org. However, the linking problem is still
      not resolved, as the generated link points to a non-existent netdev-FAQ
      section of the external doc, which, when clicked, instead goes to the
      top of the doc.

      Revert back to internal linking by simply mentioning the doc path, while
      massaging the leading text of the link, since the netdev subsystem doc
      contains no FAQs but rather general information about the subsystem.
      
      Fixes: d56b0c46 ("bpf, docs: Fix link to netdev-FAQ target")
      Fixes: 287f4fa9 ("docs: Update references to netdev-FAQ")
      Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230314074449.23620-1-bagasdotme@gmail.com
      0f10f647
    • kallsyms, bpf: Move find_kallsyms_symbol_value out of internal header · bd5314f8
      Viktor Malik authored
      Move find_kallsyms_symbol_value from kernel/module/internal.h to
      include/linux/module.h. The reason is that internal.h is not prepared to
      be included when CONFIG_MODULES=n. find_kallsyms_symbol_value is used by
      kernel/bpf/verifier.c, and including internal.h from it (without
      modules) leads to a compilation error:
      
        In file included from ../include/linux/container_of.h:5,
                         from ../include/linux/list.h:5,
                         from ../include/linux/timer.h:5,
                         from ../include/linux/workqueue.h:9,
                         from ../include/linux/bpf.h:10,
                         from ../include/linux/bpf-cgroup.h:5,
                         from ../kernel/bpf/verifier.c:7:
        ../kernel/bpf/../module/internal.h: In function 'mod_find':
        ../include/linux/container_of.h:20:54: error: invalid use of undefined type 'struct module'
           20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
              |                                                      ^~
        [...]
      
      This patch fixes the above error.
      
      Fixes: 31bf1dbc ("bpf: Fix attaching fentry/fexit/fmod_ret/lsm to modules")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Viktor Malik <vmalik@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/oe-kbuild-all/202303161404.OrmfCy09-lkp@intel.com/
      Link: https://lore.kernel.org/bpf/20230317095601.386738-1-vmalik@redhat.com
      bd5314f8
    • Merge branch 'double-fix bpf_test_run + XDP_PASS recycling' · 94bbbdfb
      Alexei Starovoitov authored
      Alexander Lobakin says:
      
      ====================
      
      Enabling skb PP recycling revealed a couple of issues in the
      bpf_test_run code. Recycling broke the assumption that the headroom
      won't ever be touched during the test_run execution: xdp_scrub_frame()
      invalidates the XDP frame at the headroom start, while the neigh xmit
      code overwrites 2 bytes to the left of the Ethernet header. The first
      makes the kernel panic in certain cases, while the second breaks the
      xdp_do_redirect selftest on BE.
      test_run is a limited-scope entity, so let's hope no more corner cases
      will happen here or at least they will be as easy and pleasant to fix
      as those two.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      94bbbdfb
    • selftests/bpf: fix "metadata marker" getting overwritten by the netstack · 5640b6d8
      Alexander Lobakin authored
      Alexei noticed xdp_do_redirect test on BPF CI started failing on
      BE systems after skb PP recycling was enabled:
      
      test_xdp_do_redirect:PASS:prog_run 0 nsec
      test_xdp_do_redirect:PASS:pkt_count_xdp 0 nsec
      test_xdp_do_redirect:PASS:pkt_count_zero 0 nsec
      test_xdp_do_redirect:FAIL:pkt_count_tc unexpected pkt_count_tc: actual
      220 != expected 9998
      test_max_pkt_size:PASS:prog_run_max_size 0 nsec
      test_max_pkt_size:PASS:prog_run_too_big 0 nsec
      close_netns:PASS:setns 0 nsec
       #289 xdp_do_redirect:FAIL
      Summary: 270/1674 PASSED, 30 SKIPPED, 1 FAILED
      
      and it doesn't happen on LE systems.
      Ilya then hunted it down to:
      
       #0  0x0000000000aaeee6 in neigh_hh_output (hh=0x83258df0,
      skb=0x88142200) at linux/include/net/neighbour.h:503
       #1  0x0000000000ab2cda in neigh_output (skip_cache=false,
      skb=0x88142200, n=<optimized out>) at linux/include/net/neighbour.h:544
       #2  ip6_finish_output2 (net=net@entry=0x88edba00, sk=sk@entry=0x0,
      skb=skb@entry=0x88142200) at linux/net/ipv6/ip6_output.c:134
       #3  0x0000000000ab4cbc in __ip6_finish_output (skb=0x88142200, sk=0x0,
      net=0x88edba00) at linux/net/ipv6/ip6_output.c:195
       #4  ip6_finish_output (net=0x88edba00, sk=0x0, skb=0x88142200) at
      linux/net/ipv6/ip6_output.c:206
      
      The xdp_do_redirect test places a u32 marker (0x42) right before the
      Ethernet header, then checks it in the XDP program and returns
      %XDP_ABORTED if it's not there. The neigh xmit code likes to round the
      hard header length up to speed up copying the header, so it overwrites
      two bytes in front of the Eth header. On LE systems, 0x42 is one byte at
      `data - 4`, while on BE it's at `data - 1`, which explains why it
      happens only there.
      It didn't happen previously because %XDP_PASS meant the page would be
      discarded and replaced by a new one, but now it can be recycled as well,
      while the bpf_test_run code doesn't reinitialize the content of recycled
      pages. This marker is limited to this particular test and its setup
      though, so there's no need to predict 1000 different possible cases.
      Just move it 4 bytes to the left, still keeping it 32 bits to match on
      more bytes.
      
      Fixes: 9c94bbf9 ("xdp: recycle Page Pool backed skbs built from XDP frames")
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/CAADnVQ+B_JOU+EpP=DKhbY9yXdN6GiRPnpTTXfEZ9sNkUeb-yQ@mail.gmail.com
      Reported-by: Ilya Leoshkevich <iii@linux.ibm.com> # + debugging
      Link: https://lore.kernel.org/bpf/8341c1d9f935f410438e79d3bd8a9cc50aefe105.camel@linux.ibm.com
      Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Link: https://lore.kernel.org/r/20230316175051.922550-3-aleksander.lobakin@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      5640b6d8
    • bpf, test_run: fix crashes due to XDP frame overwriting/corruption · e5995bc7
      Alexander Lobakin authored
      syzbot and Ilya faced the splats when %XDP_PASS happens for bpf_test_run
      after skb PP recycling was enabled for {__,}xdp_build_skb_from_frame():
      
      BUG: kernel NULL pointer dereference, address: 0000000000000d28
      RIP: 0010:memset_erms+0xd/0x20 arch/x86/lib/memset_64.S:66
      [...]
      Call Trace:
       <TASK>
       __finalize_skb_around net/core/skbuff.c:321 [inline]
       __build_skb_around+0x232/0x3a0 net/core/skbuff.c:379
       build_skb_around+0x32/0x290 net/core/skbuff.c:444
       __xdp_build_skb_from_frame+0x121/0x760 net/core/xdp.c:622
       xdp_recv_frames net/bpf/test_run.c:248 [inline]
       xdp_test_run_batch net/bpf/test_run.c:334 [inline]
       bpf_test_run_xdp_live+0x1289/0x1930 net/bpf/test_run.c:362
       bpf_prog_test_run_xdp+0xa05/0x14e0 net/bpf/test_run.c:1418
      [...]
      
      This happens because xdp_scrub_frame() gets called, which nullifies
      xdpf->data, while the bpf_test_run code doesn't reinit the frame when
      the XDP program doesn't adjust head or tail. Previously, %XDP_PASS meant
      the page would be released from the pool and returned to the MM layer,
      but now it returns to the Pool with the nullified xdpf->data, which then
      doesn't get reinitialized.
      So, in addition to checking whether the head and/or tail have been
      adjusted, also check for potential XDP frame corruption. xdpf->data is
      100% affected, and xdpf->flags is the field closest to the metadata /
      frame start. Checking these two should be enough for non-extreme cases.
      
      Fixes: 9c94bbf9 ("xdp: recycle Page Pool backed skbs built from XDP frames")
      Reported-by: syzbot+e1d1b65f7c32f2a86a9f@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/bpf/000000000000f1985705f6ef2243@google.com
      Reported-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Link: https://lore.kernel.org/bpf/e07dd94022ad5731705891b9487cc9ed66328b94.camel@linux.ibm.com
      Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Link: https://lore.kernel.org/r/20230316175051.922550-2-aleksander.lobakin@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e5995bc7
  7. 16 Mar, 2023 6 commits
    • bpf: Remove misleading spec_v1 check on var-offset stack read · 082cdc69
      Luis Gerhorst authored
      For every BPF_ADD/SUB involving a pointer, adjust_ptr_min_max_vals()
      ensures that the resulting pointer has a constant offset if
      bypass_spec_v1 is false. This is ensured by calling sanitize_check_bounds()
      which in turn calls check_stack_access_for_ptr_arithmetic(). There,
      -EACCESS is returned if the register's offset is not constant, thereby
      rejecting the program.
      
      In summary, an unprivileged user must never be able to create stack
      pointers with a variable offset. This must indeed already be prevented
      at creation time, because a corresponding check in check_stack_write()
      is missing: if users were able to create a variable-offset pointer, they
      could still use it in a stack-write operation to trigger unsafe
      speculative behavior [1].
      
      Because unprivileged users must already be prevented from creating
      variable-offset stack pointers, viable options are to either remove
      this check (replacing it with a clarifying comment), or to turn it
      into a "verifier BUG"-message, also adding a similar check in
      check_stack_write() (for consistency, as a second-level defense).
      This patch implements the first option to reduce verifier bloat.
      
      This check was introduced by commit 01f810ac ("bpf: Allow
      variable-offset stack access") which correctly notes that
      "variable-offset reads and writes are disallowed (they were already
      disallowed for the indirect access case) because the speculative
      execution checking code doesn't support them". However, it does not
      further discuss why the check in check_stack_read() is necessary.
      The code which made this check obsolete was also introduced in this
      commit.
      
      I have compiled ~650 programs from the Linux selftests, Linux samples,
      Cilium, and libbpf/examples projects and confirmed that none of these
      trigger the check in check_stack_read() [2]. Instead, all of these
      programs are, as expected, already rejected when constructing the
      variable-offset pointers. Note that the check in
      check_stack_access_for_ptr_arithmetic() also prints "off=%d" while the
      code removed by this patch does not (the error removed does not appear
      in the "verification_error" values). For reproducibility, the
      repository linked includes the raw data and scripts used to create
      the plot.
      
        [1] https://arxiv.org/pdf/1807.03757.pdf
        [2] https://gitlab.cs.fau.de/un65esoq/bpf-spectre/-/raw/53dc19fcf459c186613b1156a81504b39c8d49db/data/plots/23-02-26_23-56_bpftool/bpftool/0004-errors.pdf?inline=false
      
      Fixes: 01f810ac ("bpf: Allow variable-offset stack access")
      Signed-off-by: Luis Gerhorst <gerhorst@cs.fau.de>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230315165358.23701-1-gerhorst@cs.fau.de
      082cdc69
    • Merge branch 'Make struct bpf_cpumask RCU safe' · deb9fd64
      Alexei Starovoitov authored
      David Vernet says:
      
      ====================
      
      The struct bpf_cpumask type is currently not RCU safe. It uses the
      bpf_mem_cache_{alloc,free}() APIs to allocate and release cpumasks, and
      those allocations may be reused before an RCU grace period has elapsed.
      We want to enable using this pattern in BPF programs:
      
      private(MASK) static struct bpf_cpumask __kptr *global;
      
      int BPF_PROG(prog, ...)
      {
      	struct bpf_cpumask *cpumask;
      
      	bpf_rcu_read_lock();
      	cpumask = global;
      	if (!cpumask) {
      		bpf_rcu_read_unlock();
      		return -1;
      	}
      	bpf_cpumask_setall(cpumask);
      	...
      	bpf_rcu_read_unlock();
      }
      
      In other words, to be able to pass a kptr to KF_RCU bpf_cpumask kfuncs
      without requiring the acquisition and release of refcounts using
      bpf_cpumask_kptr_get(). This patchset enables this by making the struct
      bpf_cpumask type RCU safe, and removing the bpf_cpumask_kptr_get()
      function.
      ---
      v1: https://lore.kernel.org/all/20230316014122.678082-2-void@manifault.com/
      
      Changelog:
      ----------
      v1 -> v2:
      - Add doxygen comment for new @rcu field in struct bpf_cpumask.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      deb9fd64
    • bpf,docs: Remove bpf_cpumask_kptr_get() from documentation · fec2c6d1
      David Vernet authored
      Now that the kfunc no longer exists, we can remove it and instead
      describe how RCU can be used to get a struct bpf_cpumask from a map
      value. This patch updates the BPF documentation accordingly.
      Signed-off-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20230316054028.88924-6-void@manifault.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      fec2c6d1
    • bpf: Remove bpf_cpumask_kptr_get() kfunc · 1b403ce7
      David Vernet authored
      Now that struct bpf_cpumask is RCU safe, there's no need for this kfunc.
      Rather than doing the following:
      
      private(MASK) static struct bpf_cpumask __kptr *global;
      
      int BPF_PROG(prog, s32 cpu, ...)
      {
      	struct bpf_cpumask *cpumask;
      
      	bpf_rcu_read_lock();
      	cpumask = bpf_cpumask_kptr_get(&global);
      	if (!cpumask) {
      		bpf_rcu_read_unlock();
      		return -1;
      	}
      	bpf_cpumask_setall(cpumask);
      	...
      	bpf_cpumask_release(cpumask);
      	bpf_rcu_read_unlock();
      }
      
      Programs can instead simply do (assume same global cpumask):
      
      int BPF_PROG(prog, ...)
      {
      	struct bpf_cpumask *cpumask;
      
      	bpf_rcu_read_lock();
      	cpumask = global;
      	if (!cpumask) {
      		bpf_rcu_read_unlock();
      		return -1;
      	}
      	bpf_cpumask_setall(cpumask);
      	...
      	bpf_rcu_read_unlock();
      }
      
      In other words, no extra atomic acquire / release, and less boilerplate
      code.
      
      This patch removes both the kfunc, as well as its selftests and
      documentation.
      Signed-off-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20230316054028.88924-5-void@manifault.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1b403ce7
    • bpf/selftests: Test using global cpumask kptr with RCU · a5a197df
      David Vernet authored
      Now that struct bpf_cpumask * is considered an RCU-safe type according
      to the verifier, we should add tests that validate its common usages.
      This patch adds those tests to the cpumask test suite. A subsequent
      change will remove bpf_cpumask_kptr_get() and adjust the selftest
      and BPF documentation accordingly.
      Signed-off-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20230316054028.88924-4-void@manifault.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a5a197df
    • bpf: Mark struct bpf_cpumask as rcu protected · 63d2d83d
      David Vernet authored
      struct bpf_cpumask is a BPF-wrapper around the struct cpumask type which
      can be instantiated by a BPF program, and then queried as a cpumask in
      similar fashion to normal kernel code. The previous patch in this series
      makes the type fully RCU safe, so the type can be included in the
      rcu_protected_type BTF ID list.
      
      A subsequent patch will remove bpf_cpumask_kptr_get(), as it's no longer
      useful now that we can just treat the type as RCU safe by default and do
      our own if check.
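
      Loosely speaking, the gist of the change is one more entry in the
      verifier's BTF ID set of RCU-protected types (set name and surrounding
      entries shown approximately, not as an exact diff):

         BTF_SET_START(rcu_protected_types)
         ...
         BTF_ID(struct, bpf_cpumask)
         BTF_SET_END(rcu_protected_types)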
      Signed-off-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20230316054028.88924-3-void@manifault.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      63d2d83d