1. 25 Jan, 2024 3 commits
    • Andrii Nakryiko's avatar
      bpf: Add BPF token delegation mount options to BPF FS · 6fe01d3c
      Andrii Nakryiko authored
      Add few new mount options to BPF FS that allow to specify that a given
      BPF FS instance allows creation of BPF token (added in the next patch),
      and what sort of operations are allowed under BPF token. As such, we get
      4 new mount options, each is a bit mask
        - `delegate_cmds` allow to specify which bpf() syscall commands are
          allowed with BPF token derived from this BPF FS instance;
        - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
          a set of allowable BPF map types that could be created with BPF token;
        - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
          a set of allowable BPF program types that could be loaded with BPF token;
        - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
          a set of allowable BPF program attach types that could be loaded with
          BPF token; delegate_progs and delegate_attachs are meant to be used
          together, as full BPF program type is, in general, determined
          through both program type and program attach type.
      
      Currently, these mount options accept the following forms of values:
        - a special value "any", that enables all possible values of a given
        bit set;
        - numeric value (decimal or hexadecimal, determined by kernel
        automatically) that specifies a bit mask value directly;
        - all the values for a given mount option are combined, if specified
        multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
        delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
        mask.
      
      Ideally, more convenient (for humans) symbolic form derived from
      corresponding UAPI enums would be accepted (e.g., `-o
      delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
      it requires a bunch of UAPI header churn, so I postponed it until this
      feature lands upstream or at least there is a definite consensus that
      this feature is acceptable and is going to make it, just to minimize
      amount of wasted effort and not increase amount of non-essential code to
      be reviewed.
      
      Attentive reader will notice that BPF FS is now marked as
      FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
      user namespace as long as the process has sufficient *namespaced*
      capabilities within that user namespace. But in reality we still
      restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
      init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
      to allow creating BPF FS context object (i.e., fsopen("bpf")) from
      inside unprivileged process inside non-init userns, to capture that
      userns as the owning userns. It will still be required to pass this
      context object back to privileged process to instantiate and mount it.
      
      This manipulation is important, because capturing non-init userns as the
      owning userns of BPF FS instance (super block) allows to use that userns
      to constraint BPF token to that userns later on (see next patch). So
      creating BPF FS with delegation inside unprivileged userns will restrict
      derived BPF token objects to only "work" inside that intended userns,
      making it scoped to a intended "container". Also, setting these
      delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
      process cannot set this up without involvement of a privileged process.
      
      There is a set of selftests at the end of the patch set that simulates
      this sequence of steps and validates that everything works as intended.
      But careful review is requested to make sure there are no missed gaps in
      the implementation and testing.
      
      This somewhat subtle set of aspects is the result of previous
      discussions ([0]) about various user namespace implications and
      interactions with BPF token functionality and is necessary to contain
      BPF token inside intended user namespace.
      
        [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarChristian Brauner <brauner@kernel.org>
      Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
      6fe01d3c
    • Andrii Nakryiko's avatar
      bpf: Align CAP_NET_ADMIN checks with bpf_capable() approach · ed1ad5a7
      Andrii Nakryiko authored
      Within BPF syscall handling code CAP_NET_ADMIN checks stand out a bit
      compared to CAP_BPF and CAP_PERFMON checks. For the latter, CAP_BPF or
      CAP_PERFMON are checked first, but if they are not set, CAP_SYS_ADMIN
      takes over and grants whatever part of BPF syscall is required.
      
      Similar kind of checks that involve CAP_NET_ADMIN are not so consistent.
      One out of four uses does follow CAP_BPF/CAP_PERFMON model: during
      BPF_PROG_LOAD, if the type of BPF program is "network-related" either
      CAP_NET_ADMIN or CAP_SYS_ADMIN is required to proceed.
      
      But in three other cases CAP_NET_ADMIN is required even if CAP_SYS_ADMIN
      is set:
        - when creating DEVMAP/XDKMAP/CPU_MAP maps;
        - when attaching CGROUP_SKB programs;
        - when handling BPF_PROG_QUERY command.
      
      This patch is changing the latter three cases to follow BPF_PROG_LOAD
      model, that is allowing to proceed under either CAP_NET_ADMIN or
      CAP_SYS_ADMIN.
      
      This also makes it cleaner in subsequent BPF token patches to switch
      wholesomely to a generic bpf_token_capable(int cap) check, that always
      falls back to CAP_SYS_ADMIN if requested capability is missing.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarYafang Shao <laoar.shao@gmail.com>
      Link: https://lore.kernel.org/bpf/20240124022127.2379740-2-andrii@kernel.org
      ed1ad5a7
    • Martin KaFai Lau's avatar
      libbpf: Ensure undefined bpf_attr field stays 0 · c9f11556
      Martin KaFai Lau authored
      The commit 9e926acd ("libbpf: Find correct module BTFs for struct_ops maps and progs.")
      sets a newly added field (value_type_btf_obj_fd) to -1 in libbpf when
      the caller of the libbpf's bpf_map_create did not define this field by
      passing a NULL "opts" or passing in a "opts" that does not cover this
      new field. OPT_HAS(opts, field) is used to decide if the field is
      defined or not:
      
      	((opts) && opts->sz >= offsetofend(typeof(*(opts)), field))
      
      Once OPTS_HAS decided the field is not defined, that field should
      be set to 0. For this particular new field (value_type_btf_obj_fd),
      its corresponding map_flags "BPF_F_VTYPE_BTF_OBJ_FD" is not set.
      Thus, the kernel does not treat it as an fd field.
      
      Fixes: 9e926acd ("libbpf: Find correct module BTFs for struct_ops maps and progs.")
      Reported-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Signed-off-by: default avatarMartin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20240124224418.2905133-1-martin.lau@linux.devSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      c9f11556
  2. 24 Jan, 2024 30 commits
  3. 23 Jan, 2024 7 commits
    • Jose E. Marchesi's avatar
      bpf: Use r constraint instead of p constraint in selftests · bbc094b3
      Jose E. Marchesi authored
      Some of the BPF selftests use the "p" constraint in inline assembly
      snippets, for input operands for MOV (rN = rM) instructions.
      
      This is mainly done via the __imm_ptr macro defined in
      tools/testing/selftests/bpf/progs/bpf_misc.h:
      
        #define __imm_ptr(name) [name]"p"(&name)
      
      Example:
      
        int consume_first_item_only(void *ctx)
        {
              struct bpf_iter_num iter;
              asm volatile (
                      /* create iterator */
                      "r1 = %[iter];"
                      [...]
                      :
                      : __imm_ptr(iter)
                      : CLOBBERS);
              [...]
        }
      
      The "p" constraint is a tricky one.  It is documented in the GCC manual
      section "Simple Constraints":
      
        An operand that is a valid memory address is allowed.  This is for
        ``load address'' and ``push address'' instructions.
      
        p in the constraint must be accompanied by address_operand as the
        predicate in the match_operand.  This predicate interprets the mode
        specified in the match_operand as the mode of the memory reference for
        which the address would be valid.
      
      There are two problems:
      
      1. It is questionable whether that constraint was ever intended to be
         used in inline assembly templates, because its behavior really
         depends on compiler internals.  A "memory address" is not the same
         than a "memory operand" or a "memory reference" (constraint "m"), and
         in fact its usage in the template above results in an error in both
         x86_64-linux-gnu and bpf-unkonwn-none:
      
           foo.c: In function ‘bar’:
           foo.c:6:3: error: invalid 'asm': invalid expression as operand
              6 |   asm volatile ("r1 = %[jorl]" : : [jorl]"p"(&jorl));
                |   ^~~
      
         I would assume the same happens with aarch64, riscv, and most/all
         other targets in GCC, that do not accept operands of the form A + B
         that are not wrapped either in a const or in a memory reference.
      
         To avoid that error, the usage of the "p" constraint in internal GCC
         instruction templates is supposed to be complemented by the 'a'
         modifier, like in:
      
           asm volatile ("r1 = %a[jorl]" : : [jorl]"p"(&jorl));
      
         Internally documented (in GCC's final.cc) as:
      
           %aN means expect operand N to be a memory address
              (not a memory reference!) and print a reference
              to that address.
      
         That works because when the modifier 'a' is found, GCC prints an
         "operand address", which is not the same than an "operand".
      
         But...
      
      2. Even if we used the internal 'a' modifier (we shouldn't) the 'rN =
         rM' instruction really requires a register argument.  In cases
         involving automatics, like in the examples above, we easily end with:
      
           bar:
              #APP
                  r1 = r10-4
              #NO_APP
      
         In other cases we could conceibly also end with a 64-bit label that
         may overflow the 32-bit immediate operand of `rN = imm32'
         instructions:
      
              r1 = foo
      
         All of which is clearly wrong.
      
      clang happens to do "the right thing" in the current usage of __imm_ptr
      in the BPF tests, because even with -O2 it seems to "reload" the
      fp-relative address of the automatic to a register like in:
      
        bar:
      	r1 = r10
      	r1 += -4
      	#APP
      	r1 = r1
      	#NO_APP
      
      Which is what GCC would generate with -O0.  Whether this is by chance
      or by design, the compiler shouln't be expected to do that reload
      driven by the "p" constraint.
      
      This patch changes the usage of the "p" constraint in the BPF
      selftests macros to use the "r" constraint instead.  If a register is
      what is required, we should let the compiler know.
      
      Previous discussion in bpf@vger:
      https://lore.kernel.org/bpf/87h6p5ebpb.fsf@oracle.com/T/#ef0df83d6975c34dff20bf0dd52e078f5b8ca2767
      
      Tested in bpf-next master.
      No regressions.
      Signed-off-by: default avatarJose E. Marchesi <jose.marchesi@oracle.com>
      Cc: Yonghong Song <yonghong.song@linux.dev>
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Acked-by: default avatarYonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240123181309.19853-1-jose.marchesi@oracle.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      bbc094b3
    • Jose E. Marchesi's avatar
      bpf: fix constraint in test_tcpbpf_kern.c · 756e34da
      Jose E. Marchesi authored
      GCC emits a warning:
      
        progs/test_tcpbpf_kern.c:60:9: error: ‘op’ is used uninitialized [-Werror=uninitialized]
      
      when an uninialized op is used with a "+r" constraint.  The + modifier
      means a read-write operand, but that operand in the selftest is just
      written to.
      
      This patch changes the selftest to use a "=r" constraint.  This
      pacifies GCC.
      
      Tested in bpf-next master.
      No regressions.
      Signed-off-by: default avatarJose E. Marchesi <jose.marchesi@oracle.com>
      Cc: Yonghong Song <yhs@meta.com>
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Cc: david.faust@oracle.com
      Cc: cupertino.miranda@oracle.com
      Acked-by: default avatarYonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240123205624.14746-1-jose.marchesi@oracle.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      756e34da
    • Jose E. Marchesi's avatar
      bpf: avoid VLAs in progs/test_xdp_dynptr.c · edb79903
      Jose E. Marchesi authored
      VLAs are not supported by either the BPF port of clang nor GCC.  The
      selftest test_xdp_dynptr.c contains the following code:
      
        const size_t tcphdr_sz = sizeof(struct tcphdr);
        const size_t udphdr_sz = sizeof(struct udphdr);
        const size_t ethhdr_sz = sizeof(struct ethhdr);
        const size_t iphdr_sz = sizeof(struct iphdr);
        const size_t ipv6hdr_sz = sizeof(struct ipv6hdr);
      
        [...]
      
        static __always_inline int handle_ipv4(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr)
        {
      	__u8 eth_buffer[ethhdr_sz + iphdr_sz + ethhdr_sz];
      	__u8 iph_buffer_tcp[iphdr_sz + tcphdr_sz];
      	__u8 iph_buffer_udp[iphdr_sz + udphdr_sz];
      	[...]
        }
      
      The eth_buffer, iph_buffer_tcp and other automatics are fixed size
      only if the compiler optimizes away the constant global variables.
      clang does this, but GCC does not, turning these automatics into
      variable length arrays.
      
      This patch removes the global variables and turns these values into
      preprocessor constants.  This makes the selftest to build properly
      with GCC.
      
      Tested in bpf-next master.
      No regressions.
      Signed-off-by: default avatarJose E. Marchesi <jose.marchesi@oracle.com>
      Cc: Yonghong Song <yhs@meta.com>
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Cc: david.faust@oracle.com
      Cc: cupertino.miranda@oracle.com
      Acked-by: default avatarYonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240123201729.16173-1-jose.marchesi@oracle.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      edb79903
    • Andrii Nakryiko's avatar
      libbpf: call dup2() syscall directly · bc308d01
      Andrii Nakryiko authored
      We've ran into issues with using dup2() API in production setting, where
      libbpf is linked into large production environment and ends up calling
      unintended custom implementations of dup2(). These custom implementations
      don't provide atomic FD replacement guarantees of dup2() syscall,
      leading to subtle and hard to debug issues.
      
      To prevent this in the future and guarantee that no libc implementation
      will do their own custom non-atomic dup2() implementation, call dup2()
      syscall directly with syscall(SYS_dup2).
      
      Note that some architectures don't seem to provide dup2 and have dup3
      instead. Try to detect and pick best syscall.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarSong Liu <song@kernel.org>
      Acked-by: default avatarYonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240119210201.1295511-1-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      bc308d01
    • Alexei Starovoitov's avatar
      Merge branch 'enable-the-inline-of-kptr_xchg-for-arm64' · c80c6434
      Alexei Starovoitov authored
      Hou Tao says:
      
      ====================
      Enable the inline of kptr_xchg for arm64
      
      From: Hou Tao <houtao1@huawei.com>
      
      Hi,
      
      The patch set is just a follow-up for "bpf: inline bpf_kptr_xchg()". It
      enables the inline of bpf_kptr_xchg() and kptr_xchg_inline test for
      arm64.
      
      Please see individual patches for more details. And comments are always
      welcome.
      ====================
      
      Link: https://lore.kernel.org/r/20240119102529.99581-1-houtao@huaweicloud.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      c80c6434
    • Hou Tao's avatar
      selftests/bpf: Enable kptr_xchg_inline test for arm64 · 29f86888
      Hou Tao authored
      Now arm64 bpf jit has enable bpf_jit_supports_ptr_xchg(), so enable
      the test for arm64 as well.
      Signed-off-by: default avatarHou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20240119102529.99581-3-houtao@huaweicloud.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      29f86888
    • Hou Tao's avatar
      bpf, arm64: Enable the inline of bpf_kptr_xchg() · 18a45f12
      Hou Tao authored
      ARM64 bpf jit satisfies the following two conditions:
      1) support BPF_XCHG() on pointer-sized word.
      2) the implementation of xchg is the same as atomic_xchg() on
         pointer-sized words. Both of these two functions use arch_xchg() to
         implement the exchange.
      
      So enable the inline of bpf_kptr_xchg() for arm64 bpf jit.
      Signed-off-by: default avatarHou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20240119102529.99581-2-houtao@huaweicloud.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      18a45f12