1. 25 Jan, 2024 9 commits
    • Andrii Nakryiko's avatar
      bpf: Consistently use BPF token throughout BPF verifier logic · d79a3549
      Andrii Nakryiko authored
      Remove remaining direct queries to perfmon_capable() and bpf_capable()
      in BPF verifier logic and instead use BPF token (if available) to make
      decisions about privileges.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20240124022127.2379740-9-andrii@kernel.org
      d79a3549
    • Andrii Nakryiko's avatar
      bpf: Take into account BPF token when fetching helper protos · bbc1d247
      Andrii Nakryiko authored
      Instead of performing unconditional system-wide bpf_capable() and
      perfmon_capable() calls inside bpf_base_func_proto() function (and other
      similar ones) to determine eligibility of a given BPF helper for a given
      program, use previously recorded BPF token during BPF_PROG_LOAD command
      handling to inform the decision.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20240124022127.2379740-8-andrii@kernel.org
      bbc1d247
    • Andrii Nakryiko's avatar
      bpf: Add BPF token support to BPF_PROG_LOAD command · caf8f28e
      Andrii Nakryiko authored
      Add basic support of BPF token to BPF_PROG_LOAD. BPF_F_TOKEN_FD flag
      should be set in prog_flags field when providing prog_token_fd.
      
      Wire through a set of allowed BPF program types and attach types,
      derived from BPF FS at BPF token creation time. Then make sure we
      perform bpf_token_capable() checks everywhere where it's relevant.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20240124022127.2379740-7-andrii@kernel.org
      caf8f28e
    • Andrii Nakryiko's avatar
      bpf: Add BPF token support to BPF_BTF_LOAD command · 9ea7c4bf
      Andrii Nakryiko authored
      Accept BPF token FD in BPF_BTF_LOAD command to allow BTF data loading
      through delegated BPF token. BPF_F_TOKEN_FD flag has to be specified
      when passing BPF token FD. Given BPF_BTF_LOAD command didn't have flags
      field before, we also add btf_flags field.
      
      BTF loading is a pretty straightforward operation, so as long as BPF
      token is created with allow_cmds granting BPF_BTF_LOAD command, kernel
      proceeds to parsing BTF data and creating BTF object.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20240124022127.2379740-6-andrii@kernel.org
      9ea7c4bf
    • Andrii Nakryiko's avatar
      bpf: Add BPF token support to BPF_MAP_CREATE command · a177fc2b
      Andrii Nakryiko authored
      Allow providing token_fd for BPF_MAP_CREATE command to allow controlled
      BPF map creation from unprivileged process through delegated BPF token.
      New BPF_F_TOKEN_FD flag is added to specify together with BPF token FD
      for BPF_MAP_CREATE command.
      
      Wire through a set of allowed BPF map types to BPF token, derived from
      BPF FS at BPF token creation time. This, in combination with allowed_cmds
      allows to create a narrowly-focused BPF token (controlled by privileged
      agent) with a restrictive set of BPF maps that application can attempt
      to create.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20240124022127.2379740-5-andrii@kernel.org
      a177fc2b
    • Andrii Nakryiko's avatar
      bpf: Introduce BPF token object · 35f96de0
      Andrii Nakryiko authored
      Add new kind of BPF kernel object, BPF token. BPF token is meant to
      allow delegating privileged BPF functionality, like loading a BPF
      program or creating a BPF map, from privileged process to a *trusted*
      unprivileged process, all while having a good amount of control over which
      privileged operations could be performed using provided BPF token.
      
      This is achieved through mounting BPF FS instance with extra delegation
      mount options, which determine what operations are delegatable, and also
      constraining it to the owning user namespace (as mentioned in the
      previous patch).
      
      BPF token itself is just a derivative from BPF FS and can be created
      through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
      FS FD, which can be attained through open() API by opening BPF FS mount
      point. Currently, BPF token "inherits" delegated command, map types,
      prog type, and attach type bit sets from BPF FS as is. In the future,
      having an BPF token as a separate object with its own FD, we can allow
      to further restrict BPF token's allowable set of things either at the
      creation time or after the fact, allowing the process to guard itself
      further from unintentionally trying to load undesired kind of BPF
      programs. But for now we keep things simple and just copy bit sets as is.
      
      When BPF token is created from BPF FS mount, we take reference to the
      BPF super block's owning user namespace, and then use that namespace for
      checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
      capabilities that are normally only checked against init userns (using
      capable()), but now we check them using ns_capable() instead (if BPF
      token is provided). See bpf_token_capable() for details.
      
      Such setup means that BPF token in itself is not sufficient to grant BPF
      functionality. User namespaced process has to *also* have necessary
      combination of capabilities inside that user namespace. So while
      previously CAP_BPF was useless when granted within user namespace, now
      it gains a meaning and allows container managers and sys admins to have
      a flexible control over which processes can and need to use BPF
      functionality within the user namespace (i.e., container in practice).
      And BPF FS delegation mount options and derived BPF tokens serve as
      a per-container "flag" to grant overall ability to use bpf() (plus further
      restrict on which parts of bpf() syscalls are treated as namespaced).
      
      Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
      within the BPF FS owning user namespace, rounding up the ns_capable()
      story of BPF token. Also creating BPF token in init user namespace is
      currently not supported, given BPF token doesn't have any effect in init
      user namespace anyways.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarChristian Brauner <brauner@kernel.org>
      Link: https://lore.kernel.org/bpf/20240124022127.2379740-4-andrii@kernel.org
      35f96de0
    • Andrii Nakryiko's avatar
      bpf: Add BPF token delegation mount options to BPF FS · 6fe01d3c
      Andrii Nakryiko authored
      Add few new mount options to BPF FS that allow to specify that a given
      BPF FS instance allows creation of BPF token (added in the next patch),
      and what sort of operations are allowed under BPF token. As such, we get
      4 new mount options, each is a bit mask
        - `delegate_cmds` allow to specify which bpf() syscall commands are
          allowed with BPF token derived from this BPF FS instance;
        - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
          a set of allowable BPF map types that could be created with BPF token;
        - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
          a set of allowable BPF program types that could be loaded with BPF token;
        - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
          a set of allowable BPF program attach types that could be loaded with
          BPF token; delegate_progs and delegate_attachs are meant to be used
          together, as full BPF program type is, in general, determined
          through both program type and program attach type.
      
      Currently, these mount options accept the following forms of values:
        - a special value "any", that enables all possible values of a given
        bit set;
        - numeric value (decimal or hexadecimal, determined by kernel
        automatically) that specifies a bit mask value directly;
        - all the values for a given mount option are combined, if specified
        multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
        delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
        mask.
      
      Ideally, more convenient (for humans) symbolic form derived from
      corresponding UAPI enums would be accepted (e.g., `-o
      delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
      it requires a bunch of UAPI header churn, so I postponed it until this
      feature lands upstream or at least there is a definite consensus that
      this feature is acceptable and is going to make it, just to minimize
      amount of wasted effort and not increase amount of non-essential code to
      be reviewed.
      
      Attentive reader will notice that BPF FS is now marked as
      FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
      user namespace as long as the process has sufficient *namespaced*
      capabilities within that user namespace. But in reality we still
      restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
      init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
      to allow creating BPF FS context object (i.e., fsopen("bpf")) from
      inside unprivileged process inside non-init userns, to capture that
      userns as the owning userns. It will still be required to pass this
      context object back to privileged process to instantiate and mount it.
      
      This manipulation is important, because capturing non-init userns as the
      owning userns of BPF FS instance (super block) allows to use that userns
      to constraint BPF token to that userns later on (see next patch). So
      creating BPF FS with delegation inside unprivileged userns will restrict
      derived BPF token objects to only "work" inside that intended userns,
      making it scoped to a intended "container". Also, setting these
      delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
      process cannot set this up without involvement of a privileged process.
      
      There is a set of selftests at the end of the patch set that simulates
      this sequence of steps and validates that everything works as intended.
      But careful review is requested to make sure there are no missed gaps in
      the implementation and testing.
      
      This somewhat subtle set of aspects is the result of previous
      discussions ([0]) about various user namespace implications and
      interactions with BPF token functionality and is necessary to contain
      BPF token inside intended user namespace.
      
        [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarChristian Brauner <brauner@kernel.org>
      Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
      6fe01d3c
    • Andrii Nakryiko's avatar
      bpf: Align CAP_NET_ADMIN checks with bpf_capable() approach · ed1ad5a7
      Andrii Nakryiko authored
      Within BPF syscall handling code CAP_NET_ADMIN checks stand out a bit
      compared to CAP_BPF and CAP_PERFMON checks. For the latter, CAP_BPF or
      CAP_PERFMON are checked first, but if they are not set, CAP_SYS_ADMIN
      takes over and grants whatever part of BPF syscall is required.
      
      Similar kind of checks that involve CAP_NET_ADMIN are not so consistent.
      One out of four uses does follow CAP_BPF/CAP_PERFMON model: during
      BPF_PROG_LOAD, if the type of BPF program is "network-related" either
      CAP_NET_ADMIN or CAP_SYS_ADMIN is required to proceed.
      
      But in three other cases CAP_NET_ADMIN is required even if CAP_SYS_ADMIN
      is set:
        - when creating DEVMAP/XDKMAP/CPU_MAP maps;
        - when attaching CGROUP_SKB programs;
        - when handling BPF_PROG_QUERY command.
      
      This patch is changing the latter three cases to follow BPF_PROG_LOAD
      model, that is allowing to proceed under either CAP_NET_ADMIN or
      CAP_SYS_ADMIN.
      
      This also makes it cleaner in subsequent BPF token patches to switch
      wholesomely to a generic bpf_token_capable(int cap) check, that always
      falls back to CAP_SYS_ADMIN if requested capability is missing.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarYafang Shao <laoar.shao@gmail.com>
      Link: https://lore.kernel.org/bpf/20240124022127.2379740-2-andrii@kernel.org
      ed1ad5a7
    • Martin KaFai Lau's avatar
      libbpf: Ensure undefined bpf_attr field stays 0 · c9f11556
      Martin KaFai Lau authored
      The commit 9e926acd ("libbpf: Find correct module BTFs for struct_ops maps and progs.")
      sets a newly added field (value_type_btf_obj_fd) to -1 in libbpf when
      the caller of the libbpf's bpf_map_create did not define this field by
      passing a NULL "opts" or passing in a "opts" that does not cover this
      new field. OPT_HAS(opts, field) is used to decide if the field is
      defined or not:
      
      	((opts) && opts->sz >= offsetofend(typeof(*(opts)), field))
      
      Once OPTS_HAS decided the field is not defined, that field should
      be set to 0. For this particular new field (value_type_btf_obj_fd),
      its corresponding map_flags "BPF_F_VTYPE_BTF_OBJ_FD" is not set.
      Thus, the kernel does not treat it as an fd field.
      
      Fixes: 9e926acd ("libbpf: Find correct module BTFs for struct_ops maps and progs.")
      Reported-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Signed-off-by: default avatarMartin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20240124224418.2905133-1-martin.lau@linux.devSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      c9f11556
  2. 24 Jan, 2024 30 commits
  3. 23 Jan, 2024 1 commit
    • Jose E. Marchesi's avatar
      bpf: Use r constraint instead of p constraint in selftests · bbc094b3
      Jose E. Marchesi authored
      Some of the BPF selftests use the "p" constraint in inline assembly
      snippets, for input operands for MOV (rN = rM) instructions.
      
      This is mainly done via the __imm_ptr macro defined in
      tools/testing/selftests/bpf/progs/bpf_misc.h:
      
        #define __imm_ptr(name) [name]"p"(&name)
      
      Example:
      
        int consume_first_item_only(void *ctx)
        {
              struct bpf_iter_num iter;
              asm volatile (
                      /* create iterator */
                      "r1 = %[iter];"
                      [...]
                      :
                      : __imm_ptr(iter)
                      : CLOBBERS);
              [...]
        }
      
      The "p" constraint is a tricky one.  It is documented in the GCC manual
      section "Simple Constraints":
      
        An operand that is a valid memory address is allowed.  This is for
        ``load address'' and ``push address'' instructions.
      
        p in the constraint must be accompanied by address_operand as the
        predicate in the match_operand.  This predicate interprets the mode
        specified in the match_operand as the mode of the memory reference for
        which the address would be valid.
      
      There are two problems:
      
      1. It is questionable whether that constraint was ever intended to be
         used in inline assembly templates, because its behavior really
         depends on compiler internals.  A "memory address" is not the same
         than a "memory operand" or a "memory reference" (constraint "m"), and
         in fact its usage in the template above results in an error in both
         x86_64-linux-gnu and bpf-unkonwn-none:
      
           foo.c: In function ‘bar’:
           foo.c:6:3: error: invalid 'asm': invalid expression as operand
              6 |   asm volatile ("r1 = %[jorl]" : : [jorl]"p"(&jorl));
                |   ^~~
      
         I would assume the same happens with aarch64, riscv, and most/all
         other targets in GCC, that do not accept operands of the form A + B
         that are not wrapped either in a const or in a memory reference.
      
         To avoid that error, the usage of the "p" constraint in internal GCC
         instruction templates is supposed to be complemented by the 'a'
         modifier, like in:
      
           asm volatile ("r1 = %a[jorl]" : : [jorl]"p"(&jorl));
      
         Internally documented (in GCC's final.cc) as:
      
           %aN means expect operand N to be a memory address
              (not a memory reference!) and print a reference
              to that address.
      
         That works because when the modifier 'a' is found, GCC prints an
         "operand address", which is not the same than an "operand".
      
         But...
      
      2. Even if we used the internal 'a' modifier (we shouldn't) the 'rN =
         rM' instruction really requires a register argument.  In cases
         involving automatics, like in the examples above, we easily end with:
      
           bar:
              #APP
                  r1 = r10-4
              #NO_APP
      
         In other cases we could conceibly also end with a 64-bit label that
         may overflow the 32-bit immediate operand of `rN = imm32'
         instructions:
      
              r1 = foo
      
         All of which is clearly wrong.
      
      clang happens to do "the right thing" in the current usage of __imm_ptr
      in the BPF tests, because even with -O2 it seems to "reload" the
      fp-relative address of the automatic to a register like in:
      
        bar:
      	r1 = r10
      	r1 += -4
      	#APP
      	r1 = r1
      	#NO_APP
      
      Which is what GCC would generate with -O0.  Whether this is by chance
      or by design, the compiler shouln't be expected to do that reload
      driven by the "p" constraint.
      
      This patch changes the usage of the "p" constraint in the BPF
      selftests macros to use the "r" constraint instead.  If a register is
      what is required, we should let the compiler know.
      
      Previous discussion in bpf@vger:
      https://lore.kernel.org/bpf/87h6p5ebpb.fsf@oracle.com/T/#ef0df83d6975c34dff20bf0dd52e078f5b8ca2767
      
      Tested in bpf-next master.
      No regressions.
      Signed-off-by: default avatarJose E. Marchesi <jose.marchesi@oracle.com>
      Cc: Yonghong Song <yonghong.song@linux.dev>
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Acked-by: default avatarYonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240123181309.19853-1-jose.marchesi@oracle.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      bbc094b3