1. 06 Dec, 2023 3 commits
    • Andrii Nakryiko's avatar
      bpf: introduce BPF token object · 4527358b
      Andrii Nakryiko authored
      Add new kind of BPF kernel object, BPF token. BPF token is meant to
      allow delegating privileged BPF functionality, like loading a BPF
      program or creating a BPF map, from privileged process to a *trusted*
      unprivileged process, all while having a good amount of control over which
      privileged operations could be performed using provided BPF token.
      
      This is achieved through mounting BPF FS instance with extra delegation
      mount options, which determine what operations are delegatable, and also
      constraining it to the owning user namespace (as mentioned in the
      previous patch).
      
      BPF token itself is just a derivative from BPF FS and can be created
      through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
      FS FD, which can be attained through open() API by opening BPF FS mount
      point. Currently, BPF token "inherits" delegated command, map types,
      prog type, and attach type bit sets from BPF FS as is. In the future,
      having an BPF token as a separate object with its own FD, we can allow
      to further restrict BPF token's allowable set of things either at the
      creation time or after the fact, allowing the process to guard itself
      further from unintentionally trying to load undesired kind of BPF
      programs. But for now we keep things simple and just copy bit sets as is.
      
      When BPF token is created from BPF FS mount, we take reference to the
      BPF super block's owning user namespace, and then use that namespace for
      checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
      capabilities that are normally only checked against init userns (using
      capable()), but now we check them using ns_capable() instead (if BPF
      token is provided). See bpf_token_capable() for details.
      
      Such setup means that BPF token in itself is not sufficient to grant BPF
      functionality. User namespaced process has to *also* have necessary
      combination of capabilities inside that user namespace. So while
      previously CAP_BPF was useless when granted within user namespace, now
      it gains a meaning and allows container managers and sys admins to have
      a flexible control over which processes can and need to use BPF
      functionality within the user namespace (i.e., container in practice).
      And BPF FS delegation mount options and derived BPF tokens serve as
      a per-container "flag" to grant overall ability to use bpf() (plus further
      restrict on which parts of bpf() syscalls are treated as namespaced).
      
      Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
      within the BPF FS owning user namespace, rounding up the ns_capable()
      story of BPF token.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-4-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      4527358b
    • Andrii Nakryiko's avatar
      bpf: add BPF token delegation mount options to BPF FS · 40bba140
      Andrii Nakryiko authored
      Add few new mount options to BPF FS that allow to specify that a given
      BPF FS instance allows creation of BPF token (added in the next patch),
      and what sort of operations are allowed under BPF token. As such, we get
      4 new mount options, each is a bit mask
        - `delegate_cmds` allow to specify which bpf() syscall commands are
          allowed with BPF token derived from this BPF FS instance;
        - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
          a set of allowable BPF map types that could be created with BPF token;
        - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
          a set of allowable BPF program types that could be loaded with BPF token;
        - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
          a set of allowable BPF program attach types that could be loaded with
          BPF token; delegate_progs and delegate_attachs are meant to be used
          together, as full BPF program type is, in general, determined
          through both program type and program attach type.
      
      Currently, these mount options accept the following forms of values:
        - a special value "any", that enables all possible values of a given
        bit set;
        - numeric value (decimal or hexadecimal, determined by kernel
        automatically) that specifies a bit mask value directly;
        - all the values for a given mount option are combined, if specified
        multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
        delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
        mask.
      
      Ideally, more convenient (for humans) symbolic form derived from
      corresponding UAPI enums would be accepted (e.g., `-o
      delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
      it requires a bunch of UAPI header churn, so I postponed it until this
      feature lands upstream or at least there is a definite consensus that
      this feature is acceptable and is going to make it, just to minimize
      amount of wasted effort and not increase amount of non-essential code to
      be reviewed.
      
      Attentive reader will notice that BPF FS is now marked as
      FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
      user namespace as long as the process has sufficient *namespaced*
      capabilities within that user namespace. But in reality we still
      restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
      init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
      to allow creating BPF FS context object (i.e., fsopen("bpf")) from
      inside unprivileged process inside non-init userns, to capture that
      userns as the owning userns. It will still be required to pass this
      context object back to privileged process to instantiate and mount it.
      
      This manipulation is important, because capturing non-init userns as the
      owning userns of BPF FS instance (super block) allows to use that userns
      to constraint BPF token to that userns later on (see next patch). So
      creating BPF FS with delegation inside unprivileged userns will restrict
      derived BPF token objects to only "work" inside that intended userns,
      making it scoped to a intended "container". Also, setting these
      delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
      process cannot set this up without involvement of a privileged process.
      
      There is a set of selftests at the end of the patch set that simulates
      this sequence of steps and validates that everything works as intended.
      But careful review is requested to make sure there are no missed gaps in
      the implementation and testing.
      
      This somewhat subtle set of aspects is the result of previous
      discussions ([0]) about various user namespace implications and
      interactions with BPF token functionality and is necessary to contain
      BPF token inside intended user namespace.
      
        [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/Acked-by: default avatarChristian Brauner <brauner@kernel.org>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-3-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      40bba140
    • Andrii Nakryiko's avatar
      bpf: align CAP_NET_ADMIN checks with bpf_capable() approach · 909fa05d
      Andrii Nakryiko authored
      Within BPF syscall handling code CAP_NET_ADMIN checks stand out a bit
      compared to CAP_BPF and CAP_PERFMON checks. For the latter, CAP_BPF or
      CAP_PERFMON are checked first, but if they are not set, CAP_SYS_ADMIN
      takes over and grants whatever part of BPF syscall is required.
      
      Similar kind of checks that involve CAP_NET_ADMIN are not so consistent.
      One out of four uses does follow CAP_BPF/CAP_PERFMON model: during
      BPF_PROG_LOAD, if the type of BPF program is "network-related" either
      CAP_NET_ADMIN or CAP_SYS_ADMIN is required to proceed.
      
      But in three other cases CAP_NET_ADMIN is required even if CAP_SYS_ADMIN
      is set:
        - when creating DEVMAP/XDKMAP/CPU_MAP maps;
        - when attaching CGROUP_SKB programs;
        - when handling BPF_PROG_QUERY command.
      
      This patch is changing the latter three cases to follow BPF_PROG_LOAD
      model, that is allowing to proceed under either CAP_NET_ADMIN or
      CAP_SYS_ADMIN.
      
      This also makes it cleaner in subsequent BPF token patches to switch
      wholesomely to a generic bpf_token_capable(int cap) check, that always
      falls back to CAP_SYS_ADMIN if requested capability is missing.
      
      Cc: Jakub Kicinski <kuba@kernel.org>
      Acked-by: default avatarYafang Shao <laoar.shao@gmail.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-2-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      909fa05d
  2. 05 Dec, 2023 21 commits
    • Alexei Starovoitov's avatar
      Merge branch 'complete-bpf-verifier-precision-tracking-support-for-register-spills' · 3aee2bf9
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      Complete BPF verifier precision tracking support for register spills
      
      Add support to BPF verifier to track and support register spill/fill to/from
      stack regardless if it was done through read-only R10 register (which is the
      only form supported today), or through a general register after copying R10
      into it, while also potentially modifying offset.
      
      Once we add register this generic spill/fill support to precision
      backtracking, we can take advantage of it to stop doing eager STACK_ZERO
      conversion on register spill. Instead we can rely on (im)precision of spilled
      const zero register to improve verifier state pruning efficiency. This
      situation of using const zero register to initialize stack slots is very
      common with __builtin_memset() usage or just zero-initializing variables on
      the stack, and it causes unnecessary state duplication, as that STACK_ZERO
      knowledge is often not necessary for correctness, as those zero values are
      never used in precise context. Thus, relying on register imprecision helps
      tremendously, especially in real-world BPF programs.
      
      To make spilled const zero register behave completely equivalently to
      STACK_ZERO, we need to improve few other small pieces, which is done in the
      second part of the patch set. See individual patches for details. There are
      also two small bug fixes spotted during STACK_ZERO debugging.
      
      The patch set consists of logically three changes:
        - patch #1 (and corresponding tests in patch #2) is fixing/impoving precision
          propagation for stack spills/fills. This can be landed as a stand-alone
          improvement;
        - patches #3 through #9 is improving verification scalability by utilizing
          register (im)precision instead of eager STACK_ZERO. These changes depend
          on patch #1.
        - patch #10 is a memory efficiency improvement to how instruction/jump
          history is tracked and maintained. It depends on patch #1, but is not
          strictly speaking required, even though I believe it's a good long-term
          solution to have a path-dependent per-instruction information. Kind
          of like a path-dependent counterpart to path-agnostic insn_aux array.
      
      v3->v3:
        - fixed up Fixes tag (Alexei);
        - fixed few more selftests to not use BPF_ST instruction in inline asm
          directly, checked with CI, it was happy (CI);
      v2->v3:
        - BPF_ST instruction workaround (Eduard);
        - force dereference in added tests to catch problems (Eduard);
        - some commit message massaging (Alexei);
      v1->v2:
        - clean ups, WARN_ONCE(), insn_flags helpers added (Eduard);
        - added more selftests for STACK_ZERO/STACK_MISC cases (Eduard);
        - a bit more detailed explanation of effect of avoiding STACK_ZERO in favor
          of register spill in patch #8 commit (Alexei);
        - global shared instruction history refactoring moved to be the last patch
          in the series to make it easier to revert it, if applied (Alexei).
      ====================
      
      Link: https://lore.kernel.org/r/20231205184248.1502704-1-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      3aee2bf9
    • Andrii Nakryiko's avatar
      selftests/bpf: validate precision logic in partial_stack_load_preserves_zeros · 064e0bea
      Andrii Nakryiko authored
      Enhance partial_stack_load_preserves_zeros subtest with detailed
      precision propagation log checks. We know expect fp-16 to be spilled,
      initially imprecise, zero const register, which is later marked as
      precise even when partial stack slot load is performed, even if it's not
      a register fill (!).
      Acked-by: default avatarEduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231205184248.1502704-10-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      064e0bea
    • Andrii Nakryiko's avatar
      bpf: track aligned STACK_ZERO cases as imprecise spilled registers · 18a433b6
      Andrii Nakryiko authored
      Now that precision backtracing is supporting register spill/fill to/from
      stack, there is another oportunity to be exploited here: minimizing
      precise STACK_ZERO cases. With a simple code change we can rely on
      initially imprecise register spill tracking for cases when register
      spilled to stack was a known zero.
      
      This is a very common case for initializing on the stack variables,
      including rather large structures. Often times zero has no special
      meaning for the subsequent BPF program logic and is often overwritten
      with non-zero values soon afterwards. But due to STACK_ZERO vs
      STACK_MISC tracking, such initial zero initialization actually causes
      duplication of verifier states as STACK_ZERO is clearly different than
      STACK_MISC or spilled SCALAR_VALUE register.
      
      The effect of this (now) trivial change is huge, as can be seen below.
      These are differences between BPF selftests, Cilium, and Meta-internal
      BPF object files relative to previous patch in this series. You can see
      improvements ranging from single-digit percentage improvement for
      instructions and states, all the way to 50-60% reduction for some of
      Meta-internal host agent programs, and even some Cilium programs.
      
      For Meta-internal ones I left only the differences for largest BPF
      object files by states/instructions, as there were too many differences
      in the overall output. All the differences were improvements, reducting
      number of states and thus instructions validated.
      
      Note, Meta-internal BPF object file names are not printed below.
      Many copies of balancer_ingress are actually many different
      configurations of Katran, so they are different BPF programs, which
      explains state reduction going from -16% all the way to 31%, depending
      on BPF program logic complexity.
      
      I also tooked a closer look at a few small-ish BPF programs to validate
      the behavior. Let's take bpf_iter_netrlink.bpf.o (first row below).
      While it's just 8 vs 5 states, verifier log is still pretty long to
      include it here. But the reduction in states is due to the following
      piece of C code:
      
              unsigned long ino;
      
      	...
      
              sk = s->sk_socket;
              if (!sk) {
                      ino = 0;
              } else {
                      inode = SOCK_INODE(sk);
                      bpf_probe_read_kernel(&ino, sizeof(ino), &inode->i_ino);
              }
              BPF_SEQ_PRINTF(seq, "%-8u %-8lu\n", s->sk_drops.counter, ino);
      	return 0;
      
      You can see that in some situations `ino` is zero-initialized, while in
      others it's unknown value filled out by bpf_probe_read_kernel(). Before
      this change code after if/else branches have to be validated twice. Once
      with (precise) ino == 0, due to eager STACK_ZERO logic, and then again
      for when ino is just STACK_MISC. But BPF_SEQ_PRINTF() doesn't care about
      precise value of ino, so with the change in this patch verifier is able
      to prune states from after one of the branches, reducing number of total
      states (and instructions) required for successful validation.
      
      Similar principle applies to bigger real-world applications, just at
      a much larger scale.
      
      SELFTESTS
      =========
      File                                     Program                  Insns (A)  Insns (B)  Insns    (DIFF)  States (A)  States (B)  States (DIFF)
      ---------------------------------------  -----------------------  ---------  ---------  ---------------  ----------  ----------  -------------
      bpf_iter_netlink.bpf.linked3.o           dump_netlink                   148        104    -44 (-29.73%)           8           5   -3 (-37.50%)
      bpf_iter_unix.bpf.linked3.o              dump_unix                     8474       8404     -70 (-0.83%)         151         147    -4 (-2.65%)
      bpf_loop.bpf.linked3.o                   stack_check                    560        324   -236 (-42.14%)          42          24  -18 (-42.86%)
      local_storage_bench.bpf.linked3.o        get_local                      120         77    -43 (-35.83%)           9           6   -3 (-33.33%)
      loop6.bpf.linked3.o                      trace_virtqueue_add_sgs      10167       9868    -299 (-2.94%)         226         206   -20 (-8.85%)
      pyperf600_bpf_loop.bpf.linked3.o         on_event                      4872       3423  -1449 (-29.74%)         322         229  -93 (-28.88%)
      strobemeta.bpf.linked3.o                 on_event                    180697     176036   -4661 (-2.58%)        4780        4734   -46 (-0.96%)
      test_cls_redirect.bpf.linked3.o          cls_redirect                 65594      65401    -193 (-0.29%)        4230        4212   -18 (-0.43%)
      test_global_func_args.bpf.linked3.o      test_cls                       145        136      -9 (-6.21%)          10           9   -1 (-10.00%)
      test_l4lb.bpf.linked3.o                  balancer_ingress              4760       2612  -2148 (-45.13%)         113         102   -11 (-9.73%)
      test_l4lb_noinline.bpf.linked3.o         balancer_ingress              4845       4877     +32 (+0.66%)         219         221    +2 (+0.91%)
      test_l4lb_noinline_dynptr.bpf.linked3.o  balancer_ingress              2072       2087     +15 (+0.72%)          97          98    +1 (+1.03%)
      test_seg6_loop.bpf.linked3.o             __add_egr_x                  12440       9975  -2465 (-19.82%)         364         353   -11 (-3.02%)
      test_tcp_hdr_options.bpf.linked3.o       estab                         2558       2572     +14 (+0.55%)         179         180    +1 (+0.56%)
      test_xdp_dynptr.bpf.linked3.o            _xdp_tx_iptunnel               645        596     -49 (-7.60%)          26          24    -2 (-7.69%)
      test_xdp_noinline.bpf.linked3.o          balancer_ingress_v6           3520       3516      -4 (-0.11%)         216         216    +0 (+0.00%)
      xdp_synproxy_kern.bpf.linked3.o          syncookie_tc                 82661      81241   -1420 (-1.72%)        5073        5155   +82 (+1.62%)
      xdp_synproxy_kern.bpf.linked3.o          syncookie_xdp                84964      82297   -2667 (-3.14%)        5130        5157   +27 (+0.53%)
      
      META-INTERNAL
      =============
      Program                                 Insns (A)  Insns (B)  Insns      (DIFF)  States (A)  States (B)  States   (DIFF)
      --------------------------------------  ---------  ---------  -----------------  ----------  ----------  ---------------
      balancer_ingress                            27925      23608    -4317 (-15.46%)        1488        1482      -6 (-0.40%)
      balancer_ingress                            31824      27546    -4278 (-13.44%)        1658        1652      -6 (-0.36%)
      balancer_ingress                            32213      27935    -4278 (-13.28%)        1689        1683      -6 (-0.36%)
      balancer_ingress                            32213      27935    -4278 (-13.28%)        1689        1683      -6 (-0.36%)
      balancer_ingress                            31824      27546    -4278 (-13.44%)        1658        1652      -6 (-0.36%)
      balancer_ingress                            38647      29562    -9085 (-23.51%)        2069        1835   -234 (-11.31%)
      balancer_ingress                            38647      29562    -9085 (-23.51%)        2069        1835   -234 (-11.31%)
      balancer_ingress                            40339      30792    -9547 (-23.67%)        2193        1934   -259 (-11.81%)
      balancer_ingress                            37321      29055    -8266 (-22.15%)        1972        1795    -177 (-8.98%)
      balancer_ingress                            38176      29753    -8423 (-22.06%)        2008        1831    -177 (-8.81%)
      balancer_ingress                            29193      20910    -8283 (-28.37%)        1599        1422   -177 (-11.07%)
      balancer_ingress                            30013      21452    -8561 (-28.52%)        1645        1447   -198 (-12.04%)
      balancer_ingress                            28691      24290    -4401 (-15.34%)        1545        1531     -14 (-0.91%)
      balancer_ingress                            34223      28965    -5258 (-15.36%)        1984        1875    -109 (-5.49%)
      balancer_ingress                            35481      26158    -9323 (-26.28%)        2095        1806   -289 (-13.79%)
      balancer_ingress                            35481      26158    -9323 (-26.28%)        2095        1806   -289 (-13.79%)
      balancer_ingress                            35868      26455    -9413 (-26.24%)        2140        1827   -313 (-14.63%)
      balancer_ingress                            35868      26455    -9413 (-26.24%)        2140        1827   -313 (-14.63%)
      balancer_ingress                            35481      26158    -9323 (-26.28%)        2095        1806   -289 (-13.79%)
      balancer_ingress                            35481      26158    -9323 (-26.28%)        2095        1806   -289 (-13.79%)
      balancer_ingress                            34844      29485    -5359 (-15.38%)        2036        1918    -118 (-5.80%)
      fbflow_egress                                3256       2652     -604 (-18.55%)         218         192    -26 (-11.93%)
      fbflow_ingress                               1026        944       -82 (-7.99%)          70          63     -7 (-10.00%)
      sslwall_tc_egress                            8424       7360    -1064 (-12.63%)         498         458     -40 (-8.03%)
      syar_accept_protect                         15040       9539    -5501 (-36.58%)         364         220   -144 (-39.56%)
      syar_connect_tcp_v6                         15036       9535    -5501 (-36.59%)         360         216   -144 (-40.00%)
      syar_connect_udp_v4                         15039       9538    -5501 (-36.58%)         361         217   -144 (-39.89%)
      syar_connect_connect4_protect4              24805      15833    -8972 (-36.17%)         756         480   -276 (-36.51%)
      syar_lsm_file_open                         167772     151813    -15959 (-9.51%)        1836        1667    -169 (-9.20%)
      syar_namespace_create_new                   14805       9304    -5501 (-37.16%)         353         209   -144 (-40.79%)
      syar_python3_detect                         17531      12030    -5501 (-31.38%)         391         247   -144 (-36.83%)
      syar_ssh_post_fork                          16412      10911    -5501 (-33.52%)         405         261   -144 (-35.56%)
      syar_enter_execve                           14728       9227    -5501 (-37.35%)         345         201   -144 (-41.74%)
      syar_enter_execveat                         14728       9227    -5501 (-37.35%)         345         201   -144 (-41.74%)
      syar_exit_execve                            16622      11121    -5501 (-33.09%)         376         232   -144 (-38.30%)
      syar_exit_execveat                          16622      11121    -5501 (-33.09%)         376         232   -144 (-38.30%)
      syar_syscalls_kill                          15288       9787    -5501 (-35.98%)         398         254   -144 (-36.18%)
      syar_task_enter_pivot_root                  14898       9397    -5501 (-36.92%)         357         213   -144 (-40.34%)
      syar_syscalls_setreuid                      16678      11177    -5501 (-32.98%)         429         285   -144 (-33.57%)
      syar_syscalls_setuid                        16678      11177    -5501 (-32.98%)         429         285   -144 (-33.57%)
      syar_syscalls_process_vm_readv              14959       9458    -5501 (-36.77%)         364         220   -144 (-39.56%)
      syar_syscalls_process_vm_writev             15757      10256    -5501 (-34.91%)         390         246   -144 (-36.92%)
      do_uprobe                                   15519      10018    -5501 (-35.45%)         373         229   -144 (-38.61%)
      edgewall                                   179715      55783  -123932 (-68.96%)       12607        3999  -8608 (-68.28%)
      bictcp_state                                 7570       4131    -3439 (-45.43%)         496         269   -227 (-45.77%)
      cubictcp_state                               7570       4131    -3439 (-45.43%)         496         269   -227 (-45.77%)
      tcp_rate_skb_delivered                        447        272     -175 (-39.15%)          29          18    -11 (-37.93%)
      kprobe__bbr_set_state                        4566       2615    -1951 (-42.73%)         209         124    -85 (-40.67%)
      kprobe__bictcp_state                         4566       2615    -1951 (-42.73%)         209         124    -85 (-40.67%)
      inet_sock_set_state                          1501       1337     -164 (-10.93%)          93          85      -8 (-8.60%)
      tcp_retransmit_skb                           1145        981     -164 (-14.32%)          67          59     -8 (-11.94%)
      tcp_retransmit_synack                        1183        951     -232 (-19.61%)          67          55    -12 (-17.91%)
      bpf_tcptuner                                 1459       1187     -272 (-18.64%)          99          80    -19 (-19.19%)
      tw_egress                                     801        776       -25 (-3.12%)          69          66      -3 (-4.35%)
      tw_ingress                                    795        770       -25 (-3.14%)          69          66      -3 (-4.35%)
      ttls_tc_ingress                             19025      19383      +358 (+1.88%)         470         465      -5 (-1.06%)
      ttls_nat_egress                               490        299     -191 (-38.98%)          33          20    -13 (-39.39%)
      ttls_nat_ingress                              448        285     -163 (-36.38%)          32          21    -11 (-34.38%)
      tw_twfw_egress                             511127     212071  -299056 (-58.51%)       16733        8504  -8229 (-49.18%)
      tw_twfw_ingress                            500095     212069  -288026 (-57.59%)       16223        8504  -7719 (-47.58%)
      tw_twfw_tc_eg                              511113     212064  -299049 (-58.51%)       16732        8504  -8228 (-49.18%)
      tw_twfw_tc_in                              500095     212069  -288026 (-57.59%)       16223        8504  -7719 (-47.58%)
      tw_twfw_egress                              12632      12435      -197 (-1.56%)         276         260     -16 (-5.80%)
      tw_twfw_ingress                             12631      12454      -177 (-1.40%)         278         261     -17 (-6.12%)
      tw_twfw_tc_eg                               12595      12435      -160 (-1.27%)         274         259     -15 (-5.47%)
      tw_twfw_tc_in                               12631      12454      -177 (-1.40%)         278         261     -17 (-6.12%)
      tw_xdp_dump                                   266        209      -57 (-21.43%)           9           8     -1 (-11.11%)
      
      CILIUM
      =========
      File           Program                           Insns (A)  Insns (B)  Insns     (DIFF)  States (A)  States (B)  States  (DIFF)
      -------------  --------------------------------  ---------  ---------  ----------------  ----------  ----------  --------------
      bpf_host.o     cil_to_netdev                          6047       4578   -1469 (-24.29%)         362         249  -113 (-31.22%)
      bpf_host.o     handle_lxc_traffic                     2227       1585    -642 (-28.83%)         156         103   -53 (-33.97%)
      bpf_host.o     tail_handle_ipv4_from_netdev           2244       1458    -786 (-35.03%)         163         106   -57 (-34.97%)
      bpf_host.o     tail_handle_nat_fwd_ipv4              21022      10479  -10543 (-50.15%)        1289         670  -619 (-48.02%)
      bpf_host.o     tail_handle_nat_fwd_ipv6              15433      11375   -4058 (-26.29%)         905         643  -262 (-28.95%)
      bpf_host.o     tail_ipv4_host_policy_ingress          2219       1367    -852 (-38.40%)         161          96   -65 (-40.37%)
      bpf_host.o     tail_nodeport_nat_egress_ipv4         22460      19862   -2598 (-11.57%)        1469        1293  -176 (-11.98%)
      bpf_host.o     tail_nodeport_nat_ingress_ipv4         5526       3534   -1992 (-36.05%)         366         243  -123 (-33.61%)
      bpf_host.o     tail_nodeport_nat_ingress_ipv6         5132       4256    -876 (-17.07%)         241         219    -22 (-9.13%)
      bpf_host.o     tail_nodeport_nat_ipv6_egress          3702       3542     -160 (-4.32%)         215         205    -10 (-4.65%)
      bpf_lxc.o      tail_handle_nat_fwd_ipv4              21022      10479  -10543 (-50.15%)        1289         670  -619 (-48.02%)
      bpf_lxc.o      tail_handle_nat_fwd_ipv6              15433      11375   -4058 (-26.29%)         905         643  -262 (-28.95%)
      bpf_lxc.o      tail_ipv4_ct_egress                    5073       3374   -1699 (-33.49%)         262         172   -90 (-34.35%)
      bpf_lxc.o      tail_ipv4_ct_ingress                   5093       3385   -1708 (-33.54%)         262         172   -90 (-34.35%)
      bpf_lxc.o      tail_ipv4_ct_ingress_policy_only       5093       3385   -1708 (-33.54%)         262         172   -90 (-34.35%)
      bpf_lxc.o      tail_ipv6_ct_egress                    4593       3878    -715 (-15.57%)         194         151   -43 (-22.16%)
      bpf_lxc.o      tail_ipv6_ct_ingress                   4606       3891    -715 (-15.52%)         194         151   -43 (-22.16%)
      bpf_lxc.o      tail_ipv6_ct_ingress_policy_only       4606       3891    -715 (-15.52%)         194         151   -43 (-22.16%)
      bpf_lxc.o      tail_nodeport_nat_ingress_ipv4         5526       3534   -1992 (-36.05%)         366         243  -123 (-33.61%)
      bpf_lxc.o      tail_nodeport_nat_ingress_ipv6         5132       4256    -876 (-17.07%)         241         219    -22 (-9.13%)
      bpf_overlay.o  tail_handle_nat_fwd_ipv4              20524      10114  -10410 (-50.72%)        1271         638  -633 (-49.80%)
      bpf_overlay.o  tail_nodeport_nat_egress_ipv4         22718      19490   -3228 (-14.21%)        1475        1275  -200 (-13.56%)
      bpf_overlay.o  tail_nodeport_nat_ingress_ipv4         5526       3534   -1992 (-36.05%)         366         243  -123 (-33.61%)
      bpf_overlay.o  tail_nodeport_nat_ingress_ipv6         5132       4256    -876 (-17.07%)         241         219    -22 (-9.13%)
      bpf_overlay.o  tail_nodeport_nat_ipv6_egress          3638       3548      -90 (-2.47%)         209         203     -6 (-2.87%)
      bpf_overlay.o  tail_rev_nodeport_lb4                  4368       3820    -548 (-12.55%)         248         215   -33 (-13.31%)
      bpf_overlay.o  tail_rev_nodeport_lb6                  2867       2428    -439 (-15.31%)         167         140   -27 (-16.17%)
      bpf_sock.o     cil_sock6_connect                      1718       1703      -15 (-0.87%)         100          99     -1 (-1.00%)
      bpf_xdp.o      tail_handle_nat_fwd_ipv4              12917      12443     -474 (-3.67%)         875         849    -26 (-2.97%)
      bpf_xdp.o      tail_handle_nat_fwd_ipv6              13515      13264     -251 (-1.86%)         715         702    -13 (-1.82%)
      bpf_xdp.o      tail_lb_ipv4                          39492      36367    -3125 (-7.91%)        2430        2251   -179 (-7.37%)
      bpf_xdp.o      tail_lb_ipv6                          80441      78058    -2383 (-2.96%)        3647        3523   -124 (-3.40%)
      bpf_xdp.o      tail_nodeport_ipv6_dsr                 1038        901    -137 (-13.20%)          61          55     -6 (-9.84%)
      bpf_xdp.o      tail_nodeport_nat_egress_ipv4         13027      12096     -931 (-7.15%)         868         809    -59 (-6.80%)
      bpf_xdp.o      tail_nodeport_nat_ingress_ipv4         7617       5900   -1717 (-22.54%)         522         413  -109 (-20.88%)
      bpf_xdp.o      tail_nodeport_nat_ingress_ipv6         7575       7395     -180 (-2.38%)         383         374     -9 (-2.35%)
      bpf_xdp.o      tail_rev_nodeport_lb4                  6808       6739      -69 (-1.01%)         403         396     -7 (-1.74%)
      bpf_xdp.o      tail_rev_nodeport_lb6                 16173      15847     -326 (-2.02%)        1010         990    -20 (-1.98%)
      Acked-by: default avatarEduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231205184248.1502704-9-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      18a433b6
    • Andrii Nakryiko's avatar
      selftests/bpf: validate zero preservation for sub-slot loads · add1cd7f
      Andrii Nakryiko authored
      Validate that 1-, 2-, and 4-byte loads from stack slots not aligned on
      8-byte boundary still preserve zero, when loading from all-STACK_ZERO
      sub-slots, or when stack sub-slots are covered by spilled register with
      known constant zero value.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231205184248.1502704-8-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      add1cd7f
    • Andrii Nakryiko's avatar
      bpf: preserve constant zero when doing partial register restore · e322f0bc
      Andrii Nakryiko authored
      Similar to special handling of STACK_ZERO, when reading 1/2/4 bytes from
      stack from slot that has register spilled into it and that register has
      a constant value zero, preserve that zero and mark spilled register as
      precise for that. This makes spilled const zero register and STACK_ZERO
      cases equivalent in their behavior.
      Acked-by: default avatarEduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231205184248.1502704-7-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      e322f0bc
    • Andrii Nakryiko's avatar
      selftests/bpf: validate STACK_ZERO is preserved on subreg spill · b33ceb6a
      Andrii Nakryiko authored
      Add tests validating that STACK_ZERO slots are preserved when slot is
      partially overwritten with subregister spill.
      Acked-by: default avatarEduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231205184248.1502704-6-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      b33ceb6a
    • Andrii Nakryiko's avatar
      bpf: preserve STACK_ZERO slots on partial reg spills · eaf18feb
      Andrii Nakryiko authored
      Instead of always forcing STACK_ZERO slots to STACK_MISC, preserve it in
      situations where this is possible. E.g., when spilling register as
      1/2/4-byte subslots on the stack, all the remaining bytes in the stack
      slot do not automatically become unknown. If we knew they contained
      zeroes, we can preserve those STACK_ZERO markers.
      
      Add a helper mark_stack_slot_misc(), similar to scrub_spilled_slot(),
      but that doesn't overwrite either STACK_INVALID nor STACK_ZERO. Note
      that we need to take into account possibility of being in unprivileged
      mode, in which case STACK_INVALID is forced to STACK_MISC for correctness,
      as treating STACK_INVALID as equivalent STACK_MISC is only enabled in
      privileged mode.
      Acked-by: default avatarEduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231205184248.1502704-5-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      eaf18feb
    • Andrii Nakryiko's avatar
      bpf: fix check for attempt to corrupt spilled pointer · ab125ed3
      Andrii Nakryiko authored
      When register is spilled onto a stack as a 1/2/4-byte register, we set
      slot_type[BPF_REG_SIZE - 1] (plus potentially few more below it,
      depending on actual spill size). So to check if some stack slot has
      spilled register we need to consult slot_type[7], not slot_type[0].
      
      To avoid the need to remember and double-check this in the future, just
      use is_spilled_reg() helper.
      
      Fixes: 27113c59 ("bpf: Check the other end of slot_type for STACK_SPILL")
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231205184248.1502704-4-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      ab125ed3
    • Andrii Nakryiko's avatar
      selftests/bpf: add stack access precision test · 87630188
      Andrii Nakryiko authored
      Add a new selftests that validates precision tracking for stack access
      instruction, using both r10-based and non-r10-based accesses. For
      non-r10 ones we also make sure to have non-zero var_off to validate that
      final stack offset is tracked properly in instruction history
      information inside verifier.
      Acked-by: default avatarEduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231205184248.1502704-3-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      87630188
    • Andrii Nakryiko's avatar
      bpf: support non-r10 register spill/fill to/from stack in precision tracking · 41f6f64e
      Andrii Nakryiko authored
      Use instruction (jump) history to record instructions that performed
      register spill/fill to/from stack, regardless if this was done through
      read-only r10 register, or any other register after copying r10 into it
      *and* potentially adjusting offset.
      
      To make this work reliably, we push extra per-instruction flags into
      instruction history, encoding stack slot index (spi) and stack frame
      number in extra 10 bit flags we take away from prev_idx in instruction
      history. We don't touch idx field for maximum performance, as it's
      checked most frequently during backtracking.
      
      This change removes basically the last remaining practical limitation of
      precision backtracking logic in BPF verifier. It fixes known
      deficiencies, but also opens up new opportunities to reduce number of
      verified states, explored in the subsequent patches.
      
      There are only three differences in selftests' BPF object files
      according to veristat, all in the positive direction (less states).
      
      File                                    Program        Insns (A)  Insns (B)  Insns  (DIFF)  States (A)  States (B)  States (DIFF)
      --------------------------------------  -------------  ---------  ---------  -------------  ----------  ----------  -------------
      test_cls_redirect_dynptr.bpf.linked3.o  cls_redirect        2987       2864  -123 (-4.12%)         240         231    -9 (-3.75%)
      xdp_synproxy_kern.bpf.linked3.o         syncookie_tc       82848      82661  -187 (-0.23%)        5107        5073   -34 (-0.67%)
      xdp_synproxy_kern.bpf.linked3.o         syncookie_xdp      85116      84964  -152 (-0.18%)        5162        5130   -32 (-0.62%)
      
      Note, I avoided renaming jmp_history to more generic insn_hist to
      minimize number of lines changed and potential merge conflicts between
      bpf and bpf-next trees.
      
      Notice also cur_hist_entry pointer reset to NULL at the beginning of
      instruction verification loop. This pointer avoids the problem of
      relying on last jump history entry's insn_idx to determine whether we
      already have entry for current instruction or not. It can happen that we
      added jump history entry because current instruction is_jmp_point(), but
      also we need to add instruction flags for stack access. In this case, we
      don't want to entries, so we need to reuse last added entry, if it is
      present.
      
      Relying on insn_idx comparison has the same ambiguity problem as the one
      that was fixed recently in [0], so we avoid that.
      
        [0] https://patchwork.kernel.org/project/netdevbpf/patch/20231110002638.4168352-3-andrii@kernel.org/Acked-by: default avatarEduard Zingerman <eddyz87@gmail.com>
      Reported-by: default avatarTao Lyu <tao.lyu@epfl.ch>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231205184248.1502704-2-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      41f6f64e
    • Stanislav Fomichev's avatar
      selftests/bpf: Make sure we trigger metadata kfuncs for dst 8080 · 5ffb260f
      Stanislav Fomichev authored
      xdp_metadata test is flaky sometimes:
      
        verify_xsk_metadata:FAIL:rx_hash_type unexpected rx_hash_type: actual 8 != expected 0
      
      Where 8 means XDP_RSS_TYPE_L4_ANY and is exported from veth driver only when
      'skb->l4_hash' condition is met. This makes me think that the program is
      triggering again for some other packet.
      
      Let's have a filter, similar to xdp_hw_metadata, where we trigger XDP kfuncs
      only for UDP packets destined to port 8080.
      Signed-off-by: default avatarStanislav Fomichev <sdf@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20231204174423.3460052-1-sdf@google.com
      5ffb260f
    • Stanislav Fomichev's avatar
      xsk: Add missing SPDX to AF_XDP TX metadata documentation · 5c399ae0
      Stanislav Fomichev authored
      Not sure how I missed that. I even acknowledged it explicitly
      in the changelog [0]. Add the tag for real now.
      
        [0] https://lore.kernel.org/bpf/20231127190319.1190813-1-sdf@google.com/
      
      Fixes: 11614723 ("xsk: Add option to calculate TX checksum in SW")
      Suggested-by: default avatarSimon Horman <horms@kernel.org>
      Signed-off-by: default avatarStanislav Fomichev <sdf@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20231204174231.3457705-1-sdf@google.com
      5c399ae0
    • Dave Marchevsky's avatar
      selftests/bpf: Test bpf_kptr_xchg stashing of bpf_rb_root · 1b4c7e20
      Dave Marchevsky authored
      There was some confusion amongst Meta sched_ext folks regarding whether
      stashing bpf_rb_root - the tree itself, rather than a single node - was
      supported. This patch adds a small test which demonstrates this
      functionality: a local kptr with rb_root is created, a node is created
      and added to the tree, then the tree is kptr_xchg'd into a mapval.
      Signed-off-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarYonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/bpf/20231204211722.571346-1-davemarchevsky@fb.com
      1b4c7e20
    • Alexei Starovoitov's avatar
      Merge branch 'bpf-fix-the-release-of-inner-map' · ce3c49da
      Alexei Starovoitov authored
      Hou Tao says:
      
      ====================
      bpf: Fix the release of inner map
      
      From: Hou Tao <houtao1@huawei.com>
      
      Hi,
      
      The patchset aims to fix the release of inner map in map array or map
      htab. The release of inner map is different with normal map. For normal
      map, the map is released after the bpf program which uses the map is
      destroyed, because the bpf program tracks the used maps. However bpf
      program can not track the used inner map because these inner map may be
      updated or deleted dynamically, and for now the ref-counter of inner map
      is decreased after the inner map is remove from outer map, so the inner
      map may be freed before the bpf program, which is accessing the inner
      map, exits and there will be use-after-free problem as demonstrated by
      patch #6.
      
      The patchset fixes the problem by deferring the release of inner map.
      The freeing of inner map is deferred according to the sleepable
      attributes of the bpf programs which own the outer map. Patch #1 fixes
      the warning when running the newly-added selftest under interpreter
      mode. Patch #2 adds more parameters to .map_fd_put_ptr() to prepare for
      the fix. Patch #3 fixes the incorrect value of need_defer when freeing
      the fd array. Patch #4 fixes the potential use-after-free problem by
      using call_rcu_tasks_trace() and call_rcu() to wait for one tasks trace
      RCU GP and one RCU GP unconditionally. Patch #5 optimizes the free of
      inner map by removing the unnecessary RCU GP waiting. Patch #6 adds a
      selftest to demonstrate the potential use-after-free problem. Patch #7
      updates a selftest to update outer map in syscall bpf program.
      
      Please see individual patches for more details. And comments are always
      welcome.
      
      Change Log:
      v5:
       * patch #3: rename fd_array_map_delete_elem_with_deferred_free() to
                   __fd_array_map_delete_elem() (Alexei)
       * patch #5: use atomic64_t instead of atomic_t to prevent potential
                   overflow (Alexei)
       * patch #7: use ptr_to_u64() helper instead of force casting to initialize
                   pointers in bpf_attr (Alexei)
      
      v4: https://lore.kernel.org/bpf/20231130140120.1736235-1-houtao@huaweicloud.com
        * patch #2: don't use "deferred", use "need_defer" uniformly
        * patch #3: newly-added, fix the incorrect value of need_defer during
                    fd array free.
        * patch #4: doesn't consider the case in which bpf map is not used by
                    any bpf program and only use sleepable_refcnt to remove
      	      unnecessary tasks trace RCU GP (Alexei)
        * patch #4: remove memory barriers added due to cautiousness (Alexei)
      
      v3: https://lore.kernel.org/bpf/20231124113033.503338-1-houtao@huaweicloud.com
        * multiple variable renamings (Martin)
        * define BPF_MAP_RCU_GP/BPF_MAP_RCU_TT_GP as bit (Martin)
        * use call_rcu() and its variants instead of synchronize_rcu() (Martin)
        * remove unnecessary mask in bpf_map_free_deferred() (Martin)
        * place atomic_or() and the related smp_mb() together (Martin)
        * add patch #6 to demonstrate that updating outer map in syscall
          program is dead-lock free (Alexei)
        * update comments about the memory barrier in bpf_map_fd_put_ptr()
        * update commit message for patch #3 and #4 to describe more details
      
      v2: https://lore.kernel.org/bpf/20231113123324.3914612-1-houtao@huaweicloud.com
        * defer the invocation of ops->map_free() instead of bpf_map_put() (Martin)
        * update selftest to make it being reproducible under JIT mode (Martin)
        * remove unnecessary preparatory patches
      
      v1: https://lore.kernel.org/bpf/20231107140702.1891778-1-houtao@huaweicloud.com
      ====================
      
      Link: https://lore.kernel.org/r/20231204140425.1480317-1-houtao@huaweicloud.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      ce3c49da
    • Hou Tao's avatar
      selftests/bpf: Test outer map update operations in syscall program · e3dd4082
      Hou Tao authored
      Syscall program is running with rcu_read_lock_trace being held, so if
      bpf_map_update_elem() or bpf_map_delete_elem() invokes
      synchronize_rcu_tasks_trace() when operating on an outer map, there will
      be dead-lock, so add a test to guarantee that it is dead-lock free.
      Signed-off-by: default avatarHou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-8-houtao@huaweicloud.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      e3dd4082
    • Hou Tao's avatar
      selftests/bpf: Add test cases for inner map · 1624918b
      Hou Tao authored
      Add test cases to test the race between the destroy of inner map due to
      map-in-map update and the access of inner map in bpf program. The
      following 4 combinations are added:
      (1) array map in map array + bpf program
      (2) array map in map array + sleepable bpf program
      (3) array map in map htab + bpf program
      (4) array map in map htab + sleepable bpf program
      
      Before applying the fixes, when running `./test_prog -a map_in_map`, the
      following error was reported:
      
        ==================================================================
        BUG: KASAN: slab-use-after-free in array_map_update_elem+0x48/0x3e0
        Read of size 4 at addr ffff888114f33824 by task test_progs/1858
      
        CPU: 1 PID: 1858 Comm: test_progs Tainted: G           O     6.6.0+ #7
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
        Call Trace:
         <TASK>
         dump_stack_lvl+0x4a/0x90
         print_report+0xd2/0x620
         kasan_report+0xd1/0x110
         __asan_load4+0x81/0xa0
         array_map_update_elem+0x48/0x3e0
         bpf_prog_be94a9f26772f5b7_access_map_in_array+0xe6/0xf6
         trace_call_bpf+0x1aa/0x580
         kprobe_perf_func+0xdd/0x430
         kprobe_dispatcher+0xa0/0xb0
         kprobe_ftrace_handler+0x18b/0x2e0
         0xffffffffc02280f7
        RIP: 0010:__x64_sys_getpgid+0x1/0x30
        ......
         </TASK>
      
        Allocated by task 1857:
         kasan_save_stack+0x26/0x50
         kasan_set_track+0x25/0x40
         kasan_save_alloc_info+0x1e/0x30
         __kasan_kmalloc+0x98/0xa0
         __kmalloc_node+0x6a/0x150
         __bpf_map_area_alloc+0x141/0x170
         bpf_map_area_alloc+0x10/0x20
         array_map_alloc+0x11f/0x310
         map_create+0x28a/0xb40
         __sys_bpf+0x753/0x37c0
         __x64_sys_bpf+0x44/0x60
         do_syscall_64+0x36/0xb0
         entry_SYSCALL_64_after_hwframe+0x6e/0x76
      
        Freed by task 11:
         kasan_save_stack+0x26/0x50
         kasan_set_track+0x25/0x40
         kasan_save_free_info+0x2b/0x50
         __kasan_slab_free+0x113/0x190
         slab_free_freelist_hook+0xd7/0x1e0
         __kmem_cache_free+0x170/0x260
         kfree+0x9b/0x160
         kvfree+0x2d/0x40
         bpf_map_area_free+0xe/0x20
         array_map_free+0x120/0x2c0
         bpf_map_free_deferred+0xd7/0x1e0
         process_one_work+0x462/0x990
         worker_thread+0x370/0x670
         kthread+0x1b0/0x200
         ret_from_fork+0x3a/0x70
         ret_from_fork_asm+0x1b/0x30
      
        Last potentially related work creation:
         kasan_save_stack+0x26/0x50
         __kasan_record_aux_stack+0x94/0xb0
         kasan_record_aux_stack_noalloc+0xb/0x20
         __queue_work+0x331/0x950
         queue_work_on+0x75/0x80
         bpf_map_put+0xfa/0x160
         bpf_map_fd_put_ptr+0xe/0x20
         bpf_fd_array_map_update_elem+0x174/0x1b0
         bpf_map_update_value+0x2b7/0x4a0
         __sys_bpf+0x2551/0x37c0
         __x64_sys_bpf+0x44/0x60
         do_syscall_64+0x36/0xb0
         entry_SYSCALL_64_after_hwframe+0x6e/0x76
      Signed-off-by: default avatarHou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-7-houtao@huaweicloud.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      1624918b
    • Hou Tao's avatar
      bpf: Optimize the free of inner map · af66bfd3
      Hou Tao authored
      When removing the inner map from the outer map, the inner map will be
      freed after one RCU grace period and one RCU tasks trace grace
      period, so it is certain that the bpf program, which may access the
      inner map, has exited before the inner map is freed.
      
      However there is no need to wait for one RCU tasks trace grace period if
      the outer map is only accessed by non-sleepable program. So adding
      sleepable_refcnt in bpf_map and increasing sleepable_refcnt when adding
      the outer map into env->used_maps for sleepable program. Although the
      max number of bpf program is INT_MAX - 1, the number of bpf programs
      which are being loaded may be greater than INT_MAX, so using atomic64_t
      instead of atomic_t for sleepable_refcnt. When removing the inner map
      from the outer map, using sleepable_refcnt to decide whether or not a
      RCU tasks trace grace period is needed before freeing the inner map.
      Signed-off-by: default avatarHou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-6-houtao@huaweicloud.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      af66bfd3
    • Hou Tao's avatar
      bpf: Defer the free of inner map when necessary · 87667336
      Hou Tao authored
      When updating or deleting an inner map in map array or map htab, the map
      may still be accessed by non-sleepable program or sleepable program.
      However bpf_map_fd_put_ptr() decreases the ref-counter of the inner map
      directly through bpf_map_put(), if the ref-counter is the last one
      (which is true for most cases), the inner map will be freed by
      ops->map_free() in a kworker. But for now, most .map_free() callbacks
      don't use synchronize_rcu() or its variants to wait for the elapse of a
      RCU grace period, so after the invocation of ops->map_free completes,
      the bpf program which is accessing the inner map may incur
      use-after-free problem.
      
      Fix the free of inner map by invoking bpf_map_free_deferred() after both
      one RCU grace period and one tasks trace RCU grace period if the inner
      map has been removed from the outer map before. The deferment is
      accomplished by using call_rcu() or call_rcu_tasks_trace() when
      releasing the last ref-counter of bpf map. The newly-added rcu_head
      field in bpf_map shares the same storage space with work field to
      reduce the size of bpf_map.
      
      Fixes: bba1dc0b ("bpf: Remove redundant synchronize_rcu.")
      Fixes: 638e4b82 ("bpf: Allows per-cpu maps and map-in-map in sleepable programs")
      Signed-off-by: default avatarHou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-5-houtao@huaweicloud.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      87667336
    • Hou Tao's avatar
      bpf: Set need_defer as false when clearing fd array during map free · 79d93b3c
      Hou Tao authored
      Both map deletion operation, map release and map free operation use
      fd_array_map_delete_elem() to remove the element from fd array and
      need_defer is always true in fd_array_map_delete_elem(). For the map
      deletion operation and map release operation, need_defer=true is
      necessary, because the bpf program, which accesses the element in fd
      array, may still alive. However for map free operation, it is certain
      that the bpf program which owns the fd array has already been exited, so
      setting need_defer as false is appropriate for map free operation.
      
      So fix it by adding need_defer parameter to bpf_fd_array_map_clear() and
      adding a new helper __fd_array_map_delete_elem() to handle the map
      deletion, map release and map free operations correspondingly.
      Signed-off-by: default avatarHou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-4-houtao@huaweicloud.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      79d93b3c
    • Hou Tao's avatar
      bpf: Add map and need_defer parameters to .map_fd_put_ptr() · 20c20bd1
      Hou Tao authored
      map is the pointer of outer map, and need_defer needs some explanation.
      need_defer tells the implementation to defer the reference release of
      the passed element and ensure that the element is still alive before
      the bpf program, which may manipulate it, exits.
      
      The following three cases will invoke map_fd_put_ptr() and different
      need_defer values will be passed to these callers:
      
      1) release the reference of the old element in the map during map update
         or map deletion. The release must be deferred, otherwise the bpf
         program may incur use-after-free problem, so need_defer needs to be
         true.
      2) release the reference of the to-be-added element in the error path of
         map update. The to-be-added element is not visible to any bpf
         program, so it is OK to pass false for need_defer parameter.
      3) release the references of all elements in the map during map release.
         Any bpf program which has access to the map must have been exited and
         released, so need_defer=false will be OK.
      
      These two parameters will be used by the following patches to fix the
      potential use-after-free problem for map-in-map.
      Signed-off-by: default avatarHou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-3-houtao@huaweicloud.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      20c20bd1
    • Hou Tao's avatar
      bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers · 169410eb
      Hou Tao authored
      These three bpf_map_{lookup,update,delete}_elem() helpers are also
      available for sleepable bpf program, so add the corresponding lock
      assertion for sleepable bpf program, otherwise the following warning
      will be reported when a sleepable bpf program manipulates bpf map under
      interpreter mode (aka bpf_jit_enable=0):
      
        WARNING: CPU: 3 PID: 4985 at kernel/bpf/helpers.c:40 ......
        CPU: 3 PID: 4985 Comm: test_progs Not tainted 6.6.0+ #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
        RIP: 0010:bpf_map_lookup_elem+0x54/0x60
        ......
        Call Trace:
         <TASK>
         ? __warn+0xa5/0x240
         ? bpf_map_lookup_elem+0x54/0x60
         ? report_bug+0x1ba/0x1f0
         ? handle_bug+0x40/0x80
         ? exc_invalid_op+0x18/0x50
         ? asm_exc_invalid_op+0x1b/0x20
         ? __pfx_bpf_map_lookup_elem+0x10/0x10
         ? rcu_lockdep_current_cpu_online+0x65/0xb0
         ? rcu_is_watching+0x23/0x50
         ? bpf_map_lookup_elem+0x54/0x60
         ? __pfx_bpf_map_lookup_elem+0x10/0x10
         ___bpf_prog_run+0x513/0x3b70
         __bpf_prog_run32+0x9d/0xd0
         ? __bpf_prog_enter_sleepable_recur+0xad/0x120
         ? __bpf_prog_enter_sleepable_recur+0x3e/0x120
         bpf_trampoline_6442580665+0x4d/0x1000
         __x64_sys_getpgid+0x5/0x30
         ? do_syscall_64+0x36/0xb0
         entry_SYSCALL_64_after_hwframe+0x6e/0x76
         </TASK>
      Signed-off-by: default avatarHou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-2-houtao@huaweicloud.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      169410eb
  3. 04 Dec, 2023 2 commits
  4. 02 Dec, 2023 14 commits