1. 20 Jul, 2023 17 commits
  2. 19 Jul, 2023 23 commits
    • Jakub Kicinski's avatar
      Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · e93165d5
      Jakub Kicinski authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf-next 2023-07-19
      
      We've added 45 non-merge commits during the last 3 day(s) which contain
      a total of 71 files changed, 7808 insertions(+), 592 deletions(-).
      
      The main changes are:
      
      1) multi-buffer support in AF_XDP, from Maciej Fijalkowski,
         Magnus Karlsson, Tirthendu Sarkar.
      
      2) BPF link support for tc BPF programs, from Daniel Borkmann.
      
      3) Enable bpf_map_sum_elem_count kfunc for all program types,
         from Anton Protopopov.
      
      4) Add 'owner' field to bpf_rb_node to fix races in shared ownership,
         from Dave Marchevsky.
      
      5) Prevent potential skb_header_pointer() misuse, from Alexei Starovoitov.
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits)
        bpf, net: Introduce skb_pointer_if_linear().
        bpf: sync tools/ uapi header with
        selftests/bpf: Add mprog API tests for BPF tcx links
        selftests/bpf: Add mprog API tests for BPF tcx opts
        bpftool: Extend net dump with tcx progs
        libbpf: Add helper macro to clear opts structs
        libbpf: Add link-based API for tcx
        libbpf: Add opts-based attach/detach/query API for tcx
        bpf: Add fd-based tcx multi-prog infra with link support
        bpf: Add generic attach/detach/query API for multi-progs
        selftests/xsk: reset NIC settings to default after running test suite
        selftests/xsk: add test for too many frags
        selftests/xsk: add metadata copy test for multi-buff
        selftests/xsk: add invalid descriptor test for multi-buffer
        selftests/xsk: add unaligned mode test for multi-buffer
        selftests/xsk: add basic multi-buffer test
        selftests/xsk: transmit and receive multi-buffer packets
        xsk: add multi-buffer documentation
        i40e: xsk: add TX multi-buffer support
        ice: xsk: Tx multi-buffer support
        ...
      ====================
      
      Link: https://lore.kernel.org/r/20230719175424.75717-1-alexei.starovoitov@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e93165d5
    • Jakub Kicinski's avatar
      Merge tag 'linux-can-next-for-6.6-20230719' of... · 97083c21
      Jakub Kicinski authored
      Merge tag 'linux-can-next-for-6.6-20230719' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
      
      Marc Kleine-Budde says:
      
      ====================
      pull-request: can-next 2023-07-19
      
      The first 2 patches are by Judith Mendez, target the m_can driver and
      add hrtimer based polling support for TI AM62x SoCs, where the
      interrupt of the MCU domain's m_can cores is not routed to the Cortex
      A53 core.
      
      A patch by Rob Herring converts the grcan driver to use the correct DT
      include files.
      
      Michal Simek and Srinivas Neeli add support for optional reset control
      to the xilinx_can driver.
      
      The next 2 patches are by Jimmy Assarsson and add support for new
      Kvaser pciefd to the kvaser_pciefd driver.
      
      Mao Zhu's patch for the ucan driver removes a repeated word from a
      comment.
      
      * tag 'linux-can-next-for-6.6-20230719' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
        can: ucan: Remove repeated word
        can: kvaser_pciefd: Add support for new Kvaser pciefd devices
        can: kvaser_pciefd: Move hardware specific constants and functions into a driver_data struct
        can: Explicitly include correct DT includes
        can: xilinx_can: Add support for controller reset
        dt-bindings: can: xilinx_can: Add reset description
        can: m_can: Add hrtimer to generate software interrupt
        dt-bindings: net: can: Remove interrupt properties for MCAN
      ====================
      
      Link: https://lore.kernel.org/r/20230719072348.525039-1-mkl@pengutronix.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      97083c21
    • Alexei Starovoitov's avatar
      bpf, net: Introduce skb_pointer_if_linear(). · 6f5a630d
      Alexei Starovoitov authored
      Network drivers always call skb_header_pointer() with non-null buffer.
      Remove !buffer check to prevent accidental misuse of skb_header_pointer().
      Introduce skb_pointer_if_linear() instead.
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20230718234021.43640-1-alexei.starovoitov@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      6f5a630d
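A minimal user-space sketch of the idea behind skb_pointer_if_linear(), for readers without the kernel tree at hand. It assumes the helper hands back a direct pointer only when the requested range lies entirely in the linear area and returns NULL otherwise; `struct demo_skb` and `demo_pointer_if_linear()` are hypothetical stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sk_buff: only the linear area is modeled. */
struct demo_skb {
	unsigned char *data;   /* start of the linear (head) area */
	int headlen;           /* bytes available in the linear area */
};

/*
 * Return a pointer into the linear area when [offset, offset + len) fits
 * there entirely, otherwise NULL. No copy and no fallback buffer is
 * involved, so the caller must handle NULL itself.
 */
static void *demo_pointer_if_linear(const struct demo_skb *skb, int offset, int len)
{
	assert(skb && offset >= 0 && len >= 0);
	if (skb->headlen - offset >= len)
		return skb->data + offset;
	return NULL;
}
```

Keeping the "no buffer" case in its own helper means a caller can no longer pass a NULL buffer to the copying variant by accident.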
    • Alan Maguire's avatar
      bpf: sync tools/ uapi header with · 41ee0145
      Alan Maguire authored
      Seeing the following:
      
      Warning: Kernel ABI header at 'tools/include/uapi/linux/bpf.h' differs from latest version at 'include/uapi/linux/bpf.h'
      
      ...so sync tools version missing some list_node/rb_tree fields.
      
      Fixes: c3c510ce ("bpf: Add 'owner' field to bpf_{list,rb}_node")
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Link: https://lore.kernel.org/r/20230719162257.20818-1-alan.maguire@oracle.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      41ee0145
    • Alexei Starovoitov's avatar
      Merge branch 'bpf-link-support-for-tc-bpf-programs' · 24cc7564
      Alexei Starovoitov authored
      Daniel Borkmann says:
      
      ====================
      BPF link support for tc BPF programs
      
      This series adds BPF link support for tc BPF programs. We initially
      presented the motivation, related work and design at last year's LPC
      conference in the networking & BPF track [0], and a recent update on
      our progress of the rework during this year's LSF/MM/BPF summit [1].
      The main changes are in the first two patches and the last two have an
      extensive batch of test cases we developed along with it, please see
      individual patches for details. We tested this series with tc-testing
      selftest suite as well as BPF CI/selftests. Thanks!
      
      v5 -> v6:
        - Remove export symbol on tcx_inc/dec (Jakub)
        - Treat fd==0 as invalid (Stan, Alexei)
      v4 -> v5:
        - Updated bpftool docs and usage of bpftool net (Quentin)
        - Consistent dump "prog id"/"link id" -> "prog_id"/"link_id" (Quentin)
        - Reworked bpftool flag output handling (Quentin)
        - LIBBPF_OPTS_RESET() macro with varargs for reinit (Andrii)
        - libbpf opts/link bail out on relative_fd && relative_id (Andrii)
        - libbpf improvements for assigning attr.relative_{id,fd} (Andrii)
        - libbpf sorting in libbpf.map (Andrii)
        - libbpf move ifindex to bpf_program__attach_tcx param (Andrii)
        - libbpf move BPF_F_ID flag handling to bpf_link_create (Andrii)
        - bpf_program_attach_fd with tcx instead of tc (Andrii)
        - Reworking kernel-internal bpf_mprog API (Alexei, Andrii)
        - Change "object" notation to "id_or_fd" (Andrii)
        - Remove on stack cpp[BPF_MPROG_MAX] and switch to memmove (Andrii)
        - Simplify bpf_mprog_{insert,delete} and add comment on internals
        - Get rid of BPF_MPROG_* return codes (Alexei, Andrii)
      v3 -> v4:
        - Fix bpftool output to display tcx/{ingress,egress} (Stan)
        - Documentation around API, BPF_MPROG_* return codes and locking
          expectations (Stan, Alexei)
        - Change _after and _before to have the same semantics for return
          value (Alexei)
        - Rework mprog initialization and move allocation/free one layer
          up into tcx to simplify the code (Stan)
        - Add comment on synchronize_rcu and parent->ref (Stan)
        - Add comment on bpf_mprog_pos_() helpers wrt target position (Stan)
      v2 -> v3:
        - Removal of BPF_F_FIRST/BPF_F_LAST from control UAPI (Toke, Stan)
        - Along with that full rework of bpf_mprog internals to simplify
          dependency management, looks much nicer now imho
        - Just single bpf_mprog_cp instead of two (Andrii)
        - atomic64_t for revision counter (Andrii)
        - Evaluate target position and reject on conflicts (Andrii)
        - Keep track of actual count in bpf_mprog_bundle (Andrii)
        - Make combo of REPLACE and BEFORE/AFTER work (Andrii)
        - Moved miniq as first struct member (Jamal)
        - Rework tcx_link_attach with regards to rtnl (Jakub, Andrii)
        - Moved wrappers after bpf_prog_detach_ops (Andrii)
        - Removed union for relative_fd and friends for opts and link in
          libbpf (Andrii)
        - Add doc comments to attach/detach/query libbpf APIs (Andrii)
        - Dropped SEC_ATTACHABLE_OPT (Andrii)
        - Add an OPTS_ZEROED check to bpf_link_create (Andrii)
        - Keep opts as the last argument in bpf_program_attach_fd (Andrii)
        - Rework bpf_program_attach_fd (Andrii)
        - Remove OPTS_GET before we checked OPTS_VALID in
          bpf_program__attach_tcx (Andrii)
        - Add `size_t :0;` to prevent compiler from leaving garbage (Andrii)
        - Add helper macro to clear opts structs which I found useful
          when writing tests
        - Rework of both opts and link test cases to accommodate for changes
      v1 -> v2:
        - Rework of almost entire series to remove prio from UAPI and switch
          to better control directives BPF_F_FIRST/BPF_F_LAST/BPF_F_BEFORE/
          BPF_F_AFTER (Alexei, Toke, Stan, Andrii)
        - Addition of big test suite to cover all corner cases
      
        [0] https://lpc.events/event/16/contributions/1353/
        [1] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
      ====================
      
      Link: https://lore.kernel.org/r/20230719140858.13224-1-daniel@iogearbox.net
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      24cc7564
    • Daniel Borkmann's avatar
      selftests/bpf: Add mprog API tests for BPF tcx links · c6d479b3
      Daniel Borkmann authored
      Add a big batch of test coverage to assert all aspects of the tcx link API:
      
        # ./vmtest.sh -- ./test_progs -t tc_links
        [...]
        #225     tc_links_after:OK
        #226     tc_links_append:OK
        #227     tc_links_basic:OK
        #228     tc_links_before:OK
        #229     tc_links_chain_classic:OK
        #230     tc_links_dev_cleanup:OK
        #231     tc_links_invalid:OK
        #232     tc_links_prepend:OK
        #233     tc_links_replace:OK
        #234     tc_links_revision:OK
        Summary: 10/0 PASSED, 0 SKIPPED, 0 FAILED
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/r/20230719140858.13224-9-daniel@iogearbox.net
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c6d479b3
    • Daniel Borkmann's avatar
      selftests/bpf: Add mprog API tests for BPF tcx opts · cd13c91d
      Daniel Borkmann authored
      Add a big batch of test coverage to assert all aspects of the tcx opts
      attach, detach and query API:
      
        # ./vmtest.sh -- ./test_progs -t tc_opts
        [...]
        #238     tc_opts_after:OK
        #239     tc_opts_append:OK
        #240     tc_opts_basic:OK
        #241     tc_opts_before:OK
        #242     tc_opts_chain_classic:OK
        #243     tc_opts_demixed:OK
        #244     tc_opts_detach:OK
        #245     tc_opts_detach_after:OK
        #246     tc_opts_detach_before:OK
        #247     tc_opts_dev_cleanup:OK
        #248     tc_opts_invalid:OK
        #249     tc_opts_mixed:OK
        #250     tc_opts_prepend:OK
        #251     tc_opts_replace:OK
        #252     tc_opts_revision:OK
        Summary: 15/0 PASSED, 0 SKIPPED, 0 FAILED
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/r/20230719140858.13224-8-daniel@iogearbox.net
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      cd13c91d
    • Daniel Borkmann's avatar
      bpftool: Extend net dump with tcx progs · 57c61da8
      Daniel Borkmann authored
      Add support to dump fd-based attach types via bpftool. This includes both
      the tc BPF link and attach ops programs. Dumped information contains the
      attach location, function entry name, program ID and link ID when applicable.
      
      Example with tc BPF link:
      
        # ./bpftool net
        xdp:
      
        tc:
        bond0(4) tcx/ingress cil_from_netdev prog_id 784 link_id 10
        bond0(4) tcx/egress cil_to_netdev prog_id 804 link_id 11
      
        flow_dissector:
      
        netfilter:
      
      Example with tc BPF attach ops:
      
        # ./bpftool net
        xdp:
      
        tc:
        bond0(4) tcx/ingress cil_from_netdev prog_id 654
        bond0(4) tcx/egress cil_to_netdev prog_id 672
      
        flow_dissector:
      
        netfilter:
      
      Currently, permanent flags are not yet supported, so 'unknown' ones are dumped
      via NET_DUMP_UINT_ONLY() and once we do have permanent ones, we dump them as
      human readable string.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/r/20230719140858.13224-7-daniel@iogearbox.net
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      57c61da8
    • Daniel Borkmann's avatar
      libbpf: Add helper macro to clear opts structs · 4e9c2d9a
      Daniel Borkmann authored
      Add a small and generic LIBBPF_OPTS_RESET() helper macro which clears an
      opts structure and reinitializes its .sz member to place the structure
      size. Additionally, the user can pass option-specific data to reinitialize
      via varargs.
      
      I found this very useful when developing selftests, but it is also generic
      enough as a macro next to the existing LIBBPF_OPTS() which hides the .sz
      initialization, too.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/r/20230719140858.13224-6-daniel@iogearbox.net
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      4e9c2d9a
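The reset pattern described in this commit (clear the struct, set .sz again, then apply caller-supplied fields via varargs) can be sketched in standard C as follows. This is an illustration only: `struct demo_opts` and `DEMO_OPTS_RESET()` are hypothetical, and passing the type explicitly is an assumption of the sketch, not how libbpf spells its macro.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical opts struct; the commit only states that opts carry .sz. */
struct demo_opts {
	size_t sz;        /* size of this struct, set on every (re)init */
	int flags;
	int relative_fd;
};

/*
 * Clear the whole struct, then reinitialize .sz plus any caller-supplied
 * designated initializers. TYPE is passed explicitly to stay in standard C.
 */
#define DEMO_OPTS_RESET(TYPE, NAME, ...)                                \
	do {                                                            \
		memset(&(NAME), 0, sizeof(NAME));                       \
		(NAME) = (TYPE){ .sz = sizeof(TYPE), __VA_ARGS__ };     \
	} while (0)

/* Usage: reuse one opts variable between two attach attempts. */
static int demo_reset_example(void)
{
	struct demo_opts o = { .sz = sizeof(o), .flags = 7, .relative_fd = 3 };

	DEMO_OPTS_RESET(struct demo_opts, o, .flags = 1);
	assert(o.sz == sizeof(o) && o.relative_fd == 0);
	return o.flags;
}
```

Fields not named in the varargs come back as zero, which is what makes the macro handy in tests that reuse a single opts variable.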
    • Daniel Borkmann's avatar
      libbpf: Add link-based API for tcx · 55cc3768
      Daniel Borkmann authored
      Implement tcx BPF link support for libbpf.
      
      The bpf_program__attach_fd() API has been refactored slightly in order to pass
      bpf_link_create_opts pointer as input.
      
      A new bpf_program__attach_tcx() has been added on top of this which allows for
      passing all relevant data via extensible struct bpf_tcx_opts.
      
      The program sections tcx/ingress and tcx/egress correspond to the hook locations
      for tc ingress and egress, respectively.
      
      For concrete usage examples, see the extensive selftests that have been
      developed as part of this series.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230719140858.13224-5-daniel@iogearbox.net
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      55cc3768
    • Daniel Borkmann's avatar
      libbpf: Add opts-based attach/detach/query API for tcx · fe20ce3a
      Daniel Borkmann authored
      Extend libbpf attach opts and add a new detach opts API so this can be used
      to add/remove fd-based tcx BPF programs. The old-style bpf_prog_detach() and
      bpf_prog_detach2() APIs are refactored to reuse the new bpf_prog_detach_opts()
      internally.
      
      The bpf_prog_query_opts() API got extended to be able to handle the new
      link_ids, link_attach_flags and revision fields.
      
      For concrete usage examples, see the extensive selftests that have been
      developed as part of this series.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230719140858.13224-4-daniel@iogearbox.net
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      fe20ce3a
    • Daniel Borkmann's avatar
      bpf: Add fd-based tcx multi-prog infra with link support · e420bed0
      Daniel Borkmann authored
      This work refactors and adds a lightweight extension ("tcx") to the tc BPF
      ingress and egress data path side for allowing BPF program management based
      on fds via bpf() syscall through the newly added generic multi-prog API.
      The main goal behind this work which we also presented at LPC [0] last year
      and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
      BPF link functionality for tc BPF programs, which allows for a model of safe
      ownership and program detachment.
      
      Given the rise in tc BPF users in cloud native environments, this becomes
      necessary to avoid hard to debug incidents either through stale leftover
      programs or 3rd party applications accidentally stepping on each other's toes.
      As a recap, a BPF link represents the attachment of a BPF program to a BPF
      hook point. The BPF link holds a single reference to keep BPF program alive.
      Moreover, hook points do not reference a BPF link, only the application's
      fd or pinning does. A BPF link holds meta-data specific to attachment and
      implements operations for link creation, (atomic) BPF program update,
      detachment and introspection. The motivation for BPF links for tc BPF programs
      is multi-fold, for example:
      
        - From Meta: "It's especially important for applications that are deployed
          fleet-wide and that don't "control" hosts they are deployed to. If such
          application crashes and no one notices and does anything about that, BPF
          program will keep running draining resources or even just, say, dropping
          packets. We at FB had outages due to such permanent BPF attachment
          semantics. With fd-based BPF link we are getting a framework, which allows
          safe, auto-detachable behavior by default, unless application explicitly
          opts in by pinning the BPF link." [1]
      
        - From Cilium-side the tc BPF programs we attach to host-facing veth devices
          and phys devices build the core datapath for Kubernetes Pods, and they
          implement forwarding, load-balancing, policy, EDT-management, etc, within
          BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
          experienced hard-to-debug issues in a user's staging environment where
          another Kubernetes application using tc BPF attached to the same prio/handle
          of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
          it. The goal is to establish a clear/safe ownership model via links which
          cannot accidentally be overridden. [0,2]
      
      BPF links for tc can co-exist with non-link attachments, and the semantics are
      in line also with XDP links: BPF links cannot replace other BPF links, BPF
      links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
      lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
      would solve the mentioned issue of a safe ownership model as 3rd party applications
      would not be able to accidentally wipe Cilium programs, even if they are not
      BPF link aware.
      
      Earlier attempts [4] have tried to integrate BPF links into core tc machinery
      to solve cls_bpf, which has been intrusive to the generic tc kernel API with
      extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
      be wiped from the qdisc also. Locking a tc BPF program in place this way, is
      getting into layering hacks given the two object models are vastly different.
      
      We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
      attach API, so that the BPF link implementation blends in naturally similar to
      other link types which are fd-based and without the need for changing core tc
      internal APIs. BPF programs for tc can then be successively migrated from classic
      cls_bpf to the new tc BPF link without needing to change the program's source
      code; only the BPF loader mechanics for attaching need to change.
      
      For the current tc framework, there is no change in behavior with this change
      and neither does this change touch on tc core kernel APIs. The gist of this
      patch is that the ingress and egress hook have a lightweight, qdisc-less
      extension for BPF to attach its tc BPF programs, in other words, a minimal
      entry point for tc BPF. The name tcx has been suggested from discussion of
      earlier revisions of this work as a good fit, and to more easily distinguish
      between the classic cls_bpf attachment and the fd-based one.
      
      For the ingress and egress tcx points, the device holds a cache-friendly array
      with program pointers which is separated from control plane (slow-path) data.
      Earlier versions of this work used priority to determine ordering and expression
      of dependencies similar as with classic tc, but it was challenged that for
      something more future-proof a better user experience is required. Hence this
      resulted in the design and development of the generic attach/detach/query API
      for multi-progs. See prior patch with its discussion on the API design. tcx is
      the first user and later we plan to integrate also others, for example, one
      candidate is multi-prog support for XDP which would benefit and have the same
      'look and feel' from API perspective.
      
      The goal with tcx is to have maximum compatibility to existing tc BPF programs,
      so they don't need to be rewritten specifically. Compatibility to call into
      classic tcf_classify() is also provided in order to allow successive migration
      or both to cleanly co-exist where needed given it's all one logical tc layer and
      the tcx plus classic tc cls/act build one logical overall processing pipeline.
      
      tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
      to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
      The fd-based API is behind a static key, so that when unused the code is also
      not entered. The struct tcx_entry's program array is currently static, but
      could be made dynamic if necessary at a point in future. The a/b pair swap
      design has been chosen so that for detachment there are no allocations which
      otherwise could fail.
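
The verdict handling described above can be pictured with a small user-space sketch: programs run in array order, TCX_NEXT moves on to the next one, and any other verdict ends the walk. The `demo_` names, the numeric values, and the function-pointer program type are assumptions made for illustration; what the kernel does when every program returns TCX_NEXT is not modeled here.

```c
#include <assert.h>
#include <stddef.h>

/* Verdict names follow the commit message; the numeric values are demo choices. */
enum demo_tcx_action {
	DEMO_TCX_NEXT = -1,     /* non-terminating: go to next program */
	DEMO_TCX_PASS = 0,
	DEMO_TCX_DROP = 2,
	DEMO_TCX_REDIRECT = 7,
};

/* Hypothetical program type: takes an opaque packet, returns a verdict. */
typedef int (*demo_prog_fn)(const void *pkt);

/*
 * Run programs in array order. The first terminating verdict ends the walk;
 * if every program returns NEXT, NEXT is handed back to the caller.
 */
static int demo_tcx_run(const demo_prog_fn *progs, size_t count, const void *pkt)
{
	int ret = DEMO_TCX_NEXT;

	assert(progs || count == 0);
	for (size_t i = 0; i < count; i++) {
		ret = progs[i](pkt);
		if (ret != DEMO_TCX_NEXT)
			break;
	}
	return ret;
}

/* Three trivial demo programs, one per verdict used in the example. */
static int demo_prog_next(const void *pkt) { (void)pkt; return DEMO_TCX_NEXT; }
static int demo_prog_drop(const void *pkt) { (void)pkt; return DEMO_TCX_DROP; }
static int demo_prog_pass(const void *pkt) { (void)pkt; return DEMO_TCX_PASS; }
```

With the chain { next, drop, pass }, the walk stops at the second program and the third never runs.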
      
      The work has been tested with tc-testing selftest suite which all passes, as
      well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.
      
      Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
      of this work.
      
        [0] https://lpc.events/event/16/contributions/1353/
        [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
        [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
        [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
        [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e420bed0
    • Daniel Borkmann's avatar
      bpf: Add generic attach/detach/query API for multi-progs · 053c8e1f
      Daniel Borkmann authored
      This adds a generic layer called bpf_mprog which can be reused by different
      attachment layers to enable multi-program attachment and dependency resolution.
      In-kernel users of the bpf_mprog don't need to care about the dependency
      resolution internals, they can just consume it with few API calls.
      
      The initial idea of having a generic API sparked out of discussion [0] from an
      earlier revision of this work where tc's priority was reused and exposed via
      BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
      as-is for classic tc BPF. The feedback was that priority provides a bad user
      experience and is hard to use [1], e.g.:
      
        I cannot help but feel that priority logic copy-paste from old tc, netfilter
        and friends is done because "that's how things were done in the past". [...]
        Priority gets exposed everywhere in uapi all the way to bpftool when it's
        right there for users to understand. And that's the main problem with it.
      
        The user don't want to and don't need to be aware of it, but uapi forces them
        to pick the priority. [...] Your cover letter [0] example proves that in
        real life different service pick the same priority. They simply don't know
        any better. Priority is an unnecessary magic that apps _have_ to pick, so
        they just copy-paste and everyone ends up using the same.
      
      The course of the discussion showed more and more the need for a generic,
      reusable API where the "same look and feel" can be applied for various other
      program types beyond just tc BPF, for example XDP today does not have multi-
      program support in kernel, but also there was interest around this API for
      improving management of cgroup program types. Such common multi-program
      management concept is useful for BPF management daemons or user space BPF
      applications coordinating internally about their attachments.
      
      Both from Cilium and Meta side [2], we've collected the following requirements
      for a generic attach/detach/query API for multi-progs which has been implemented
      as part of this work:
      
        - Support prog-based attach/detach and link API
        - Dependency directives (can also be combined):
          - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
            - BPF_F_ID flag as {fd,id} toggle; the rationale for id is so that user
              space application does not need CAP_SYS_ADMIN to retrieve foreign fds
              via bpf_*_get_fd_by_id()
            - BPF_F_LINK flag as {prog,link} toggle
            - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
              BPF_F_AFTER will just append for attaching
            - Enforced only at attach time
          - BPF_F_REPLACE with replace_bpf_fd which can be prog, links have their
            own infra for replacing their internal prog
          - If no flags are set, then it's default append behavior for attaching
        - Internal revision counter and optionally being able to pass expected_revision
        - User space application can query current state with revision, and pass it
          along for attachment to assert current state before doing updates
        - Query also gets extension for link_ids array and link_attach_flags:
          - prog_ids are always filled with program IDs
          - link_ids are filled with link IDs when link was used, otherwise 0
          - {prog,link}_attach_flags for holding {prog,link}-specific flags
        - Must be easy to integrate/reuse for in-kernel users
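
The ordering directives and the revision check listed above can be modeled in a few lines of user-space C. This is a simplified sketch, not the kernel's bpf_mprog code: all `demo_` names and flag values are hypothetical, IDs stand in for both fds and ids, combining BEFORE with AFTER is rejected for brevity, and the error codes are the sketch's own choice.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define DEMO_MPROG_MAX 8

/* Demo flags; they mirror the directive names only, not the uapi values. */
enum { DEMO_F_BEFORE = 1 << 0, DEMO_F_AFTER = 1 << 1 };

struct demo_mprog {
	unsigned int ids[DEMO_MPROG_MAX];   /* program IDs in execution order */
	int count;
	unsigned long long revision;        /* bumped on every successful change */
};

static void demo_mprog_init(struct demo_mprog *m)
{
	memset(m, 0, sizeof(*m));
	m->revision = 1;
}

static int demo_mprog_find(const struct demo_mprog *m, unsigned int id)
{
	for (int i = 0; i < m->count; i++)
		if (m->ids[i] == id)
			return i;
	return -1;
}

/*
 * Attach `id`. relative_id == 0 means "none": BEFORE then prepends, AFTER
 * or no flag appends. expected_revision == 0 skips the revision check.
 */
static int demo_mprog_attach(struct demo_mprog *m, unsigned int id,
			     unsigned int flags, unsigned int relative_id,
			     unsigned long long expected_revision)
{
	int pos = m->count;    /* default: append */

	assert(m && id != 0);
	if (expected_revision && expected_revision != m->revision)
		return -ESTALE;
	if ((flags & DEMO_F_BEFORE) && (flags & DEMO_F_AFTER))
		return -EINVAL;
	if (relative_id && !flags)
		return -EINVAL;
	if (m->count == DEMO_MPROG_MAX)
		return -ERANGE;
	if (demo_mprog_find(m, id) >= 0)
		return -EEXIST;
	if (relative_id) {
		int rel = demo_mprog_find(m, relative_id);

		if (rel < 0)
			return -ENOENT;
		pos = (flags & DEMO_F_BEFORE) ? rel : rel + 1;
	} else if (flags & DEMO_F_BEFORE) {
		pos = 0;
	}
	/* Shift the tail by one slot and place the new ID. */
	memmove(&m->ids[pos + 1], &m->ids[pos],
		(size_t)(m->count - pos) * sizeof(m->ids[0]));
	m->ids[pos] = id;
	m->count++;
	m->revision++;
	return 0;
}
```

A caller that queried revision N and passes it back as expected_revision gets an error instead of a surprise if someone else changed the list in between.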
      
      The uapi-side changes needed for supporting bpf_mprog are rather minimal,
      consisting of the additions of the attachment flags, revision counter, and
      expanding existing union with relative_{fd,id} member.
      
      The bpf_mprog framework consists of a bpf_mprog_entry object which holds
      an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path
      structure) is part of bpf_mprog_bundle. Both have been separated, so that
      fast-path gets efficient packing of bpf_prog pointers for maximum cache
      efficiency. Also, array has been chosen instead of linked list or other
      structures to remove unnecessary indirections for a fast point-to-entry in
      tc for BPF.
      
      The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that in case of
      updates the peer bpf_mprog_entry is populated and then just swapped which
      avoids additional allocations that could otherwise fail, for example, in
      detach case. bpf_mprog_{fp,cp} arrays are currently static, but they could
      be converted to dynamic allocation if necessary at a point in future.
      Locking is deferred to the in-kernel user of bpf_mprog, for example, in case
      of tcx which uses this API in the next patch, it piggybacks on rtnl.
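
The a/b pair idea can be illustrated with a single-threaded sketch: both entries live inside the bundle, an update is built in the inactive peer, and only then does the active pointer flip, so detach never has to allocate. The `demo_ab_` names are hypothetical, and the sketch leaves out locking and whatever publication step a concurrent reader would need.

```c
#include <assert.h>
#include <errno.h>

#define DEMO_AB_MAX 8

struct demo_ab_entry {
	unsigned int ids[DEMO_AB_MAX];
	int count;
};

/* Both entries are embedded, so no update ever needs to allocate memory. */
struct demo_ab_bundle {
	struct demo_ab_entry a, b;
	struct demo_ab_entry *active;   /* what the fast path would read */
};

static void demo_ab_init(struct demo_ab_bundle *bundle)
{
	bundle->a.count = 0;
	bundle->b.count = 0;
	bundle->active = &bundle->a;
}

static struct demo_ab_entry *demo_ab_peer(struct demo_ab_bundle *bundle)
{
	return bundle->active == &bundle->a ? &bundle->b : &bundle->a;
}

/* Append an ID: build the new state in the peer, then swap. */
static int demo_ab_attach(struct demo_ab_bundle *bundle, unsigned int id)
{
	struct demo_ab_entry *cur = bundle->active, *peer = demo_ab_peer(bundle);

	assert(id != 0);
	if (cur->count == DEMO_AB_MAX)
		return -ERANGE;
	*peer = *cur;
	peer->ids[peer->count++] = id;
	bundle->active = peer;
	return 0;
}

/* Remove an ID by copying everything except it into the peer, then swap. */
static int demo_ab_detach(struct demo_ab_bundle *bundle, unsigned int id)
{
	struct demo_ab_entry *cur = bundle->active, *peer = demo_ab_peer(bundle);
	int found = 0;

	peer->count = 0;
	for (int i = 0; i < cur->count; i++) {
		if (cur->ids[i] == id && !found) {
			found = 1;
			continue;
		}
		peer->ids[peer->count++] = cur->ids[i];
	}
	if (!found)
		return -ENOENT;     /* active entry was never touched */
	bundle->active = peer;      /* single pointer flip publishes the update */
	return 0;
}
```

Because the old entry stays intact until the pointer flips, a failed or rejected update leaves the active state exactly as it was.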
      
      An extensive test suite for checking all aspects of this API for prog-based
      attach/detach and link API comes as BPF selftests in this series.
      
      Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog
      management.
      
        [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net
        [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
        [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/r/20230719140858.13224-2-daniel@iogearbox.net
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      053c8e1f
    • Alexei Starovoitov's avatar
      Merge branch 'xsk-multi-buffer-support' · 3226e313
      Alexei Starovoitov authored
      Maciej Fijalkowski says:
      
      ====================
      xsk: multi-buffer support
      
      v6->v7:
      - rebase...[Alexei]
      
      v5->v6:
      - update bpf_xdp_query_opts__last_field in patch 10 [Alexei]
      
      v4->v5:
      - align options argument size to match options from xdp_desc [Benjamin]
      - cleanup skb from xdp_sock on socket termination [Toke]
      - introduce new netlink attribute for letting user space know about Tx
        frag limit; this substitutes xdp_features flag previously dedicated
        for setting ZC multi-buffer support [Toke, Jakub]
      - include i40e ZC multi-buffer support
      - enable TOO_MANY_FRAGS for ZC on xskxceiver; this is now possible due
        to netlink attribute mentioned two bullets above
      
      v3->v4:
      - rely on ynl for adding new xdp_features flag [Jakub]
      - move xskb_list to xsk_buff_pool
      
      v2->v3:
      - Fix issue with the next valid packet getting dropped after an invalid
        packet with MAX_SKB_FRAGS + 1 frags [Magnus]
      - query NETDEV_XDP_ACT_ZC_SG flag within xskxceiver and act on it
      - remove redundant include in xsk.c [kernel test robot]
      - s/NETDEV_XDP_ACT_NDO_ZC_SG/NETDEV_XDP_ACT_ZC_SG + kernel doc [Magnus,
        Simon]
      
      v1->v2:
      - fix spelling issues in commit messages [Simon]
      - remove XSK_DESC_MAX_FRAGS, use MAX_SKB_FRAGS instead [Stan, Alexei]
      - add documentation patch
      - fix build error from kernel test robot on patch 10
      
      This series of patches adds multi-buffer support for AF_XDP. XDP and
      various NIC drivers already have support for multi-buffer packets. With
      this patch set, programs using AF_XDP sockets can now also receive and
      transmit multi-buffer packets both in copy as well as zero-copy mode.
      ZC multi-buffer implementation is based on ice driver.
      
      Some definitions to put us all on the same page:
      
      * A packet consists of one or more frames
      
      * A descriptor in one of the AF_XDP rings always refers to a single
        frame. In the case the packet consists of a single frame, the
        descriptor refers to the whole packet.
      
      To represent a packet consisting of multiple frames, we introduce a
      new flag called XDP_PKT_CONTD in the options field of the Rx and Tx
      descriptors. If it is true (1) the packet continues with the next
      descriptor and if it is false (0) it means this is the last descriptor
      of the packet. Why the reverse logic of end-of-packet (eop) flag found
      in many NICs? Just to preserve compatibility with non-multi-buffer
      applications that have this bit set to false for all packets on Rx, and
      the apps set the options field to zero for Tx, as anything else will
      be treated as an invalid descriptor.
      
      These are the semantics for producing packets onto XSK Tx ring
      consisting of multiple frames:
      
      * When an invalid descriptor is found, all the other
        descriptors/frames of this packet are marked as invalid and not
        completed. The next descriptor is treated as the start of a new
        packet, even if this was not the intent (because we cannot guess
        the intent). As before, if your program is producing invalid
        descriptors you have a bug that must be fixed.
      
      * Zero length descriptors are treated as invalid descriptors.
      
      * For copy mode, the maximum supported number of frames in a packet is
        equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
        descriptors accumulated so far are dropped and treated as
        invalid. To produce an application that will work on any system
        regardless of this config setting, limit the number of frags to 18,
        as the minimum value of the config is 17.
      
      * For zero-copy mode, the limit is up to what the NIC HW
        supports. User space can discover this via newly introduced
        NETDEV_A_DEV_XDP_ZC_MAX_SEGS netlink attribute.
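
      The frame limits above can be checked before any descriptor is
      written. What follows is a minimal sketch, not code from this series:
      frags_needed(), pkt_fits() and PORTABLE_MAX_FRAGS are hypothetical
      names, and the value 18 is the portable copy-mode limit quoted above.
      In zero-copy mode the caller would pass the value read from
      NETDEV_A_DEV_XDP_ZC_MAX_SEGS instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Portable copy-mode limit quoted in the text: minimum config (17) + 1. */
#define PORTABLE_MAX_FRAGS 18u

/* Number of frames needed to carry pkt_len bytes (0 for an empty packet). */
static uint32_t frags_needed(uint32_t pkt_len, uint32_t frame_size)
{
	if (frame_size == 0)
		return 0;
	/* Ceiling division written so that it cannot overflow. */
	return pkt_len / frame_size + (pkt_len % frame_size != 0);
}

/* True if the packet can be queued without exceeding max_frags frames.
 * A zero-length packet is rejected, since zero-length descriptors are
 * treated as invalid.
 */
static bool pkt_fits(uint32_t pkt_len, uint32_t frame_size, uint32_t max_frags)
{
	uint32_t n = frags_needed(pkt_len, frame_size);

	return n > 0 && n <= max_frags;
}
```

      With 4096-byte frames, a 9000-byte packet needs three frames, and a
      packet of 18 * 4096 + 1 bytes would be rejected under the portable
      limit.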
      
      Here is an example Tx path pseudo-code (using libxdp interfaces for
      simplicity) ignoring that the umem is finite in size, and that we
      eventually will run out of packets to send. Also assumes pkts.addr
      points to a valid location in the umem.
      
      void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts,
                      int batch_size)
      {
      	u32 idx, i, pkt_nb = 0;
      
      	xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx);
      
      	for (i = 0; i < batch_size;) {
      		u64 addr = pkts[pkt_nb].addr;
      		u32 len = pkts[pkt_nb].size;
      
      		do {
      			struct xdp_desc *tx_desc;
      
      			tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++);
      			tx_desc->addr = addr;
      
      			if (len > xsk_frame_size) {
      				tx_desc->len = xsk_frame_size;
      				/* Plain assignment: the ring slot may hold stale options. */
      				tx_desc->options = XDP_PKT_CONTD;
      			} else {
      				tx_desc->len = len;
      				tx_desc->options = 0;
      				pkt_nb++;
      			}
      			len -= tx_desc->len;
      			addr += xsk_frame_size;
      
      			if (i == batch_size) {
      				/* Remember len, addr, pkt_nb for next
      				 * iteration. Skipped for simplicity.
      				 */
      				break;
      			}
      		} while (len);
      	}
      
      	xsk_ring_prod__submit(&xsk->tx, i);
      }
      
      On the Rx path in copy mode, the xsk core copies the XDP data into
      multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
      detailed before. To avoid the copies, zero-copy mode has to maintain
      a chain of xdp_buff_xsk structs that represents the whole packet.
      This is because what actually is redirected is the xdp_buff, and we
      currently have no equivalent of the mechanism used for copy mode
      (skb_shared_info embedded in xdp_buff) to carry the frags. This means
      xdp_buff_xsk grows in size, but the new members are at the end and
      should not be touched when the data path is not dealing with
      fragmented packets. This solution kept us within the assumed
      performance impact, hence we decided to proceed with it.
      
      When the application gets a descriptor with the
      XDP_PKT_CONTD flag set to one, it means that the packet consists of
      multiple buffers and it continues with the next buffer in the following
      descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
      means that this is the last buffer of the packet. AF_XDP guarantees that
      only a complete packet (all frames in the packet) is sent to the
      application.
      
      If an application reads a batch of descriptors, using for example the libxdp
      interfaces, it is not guaranteed that the batch will end with a full
      packet. It might end in the middle of a packet and the rest of the
      buffers of that packet will arrive at the beginning of the next batch,
      since the libxdp interface does not read the whole ring (unless you
      have an enormous batch size or a very small ring size).
      
      Here is a simple Rx path pseudo-code example (using libxdp interfaces for
      simplicity). Error paths have been excluded for simplicity:
      
      void rx_packets(struct xsk_socket_info *xsk)
      {
      	static bool new_packet = true;
      	u32 idx_rx = 0, idx_fq = 0;
      	static char *pkt;
      
      	int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
      
      	xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
      
      	for (int i = 0; i < rcvd; i++) {
      		struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
      		char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
      		bool eop = !(desc->options & XDP_PKT_CONTD);
      
      		if (new_packet)
      			pkt = frag;
      		else
      			add_frag_to_pkt(pkt, frag);
      
      		if (eop)
      			process_pkt(pkt);
      
      		new_packet = eop;
      
      		*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
      	}
      
      	xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
      	xsk_ring_cons__release(&xsk->rx, rcvd);
      }
      
      We had to introduce a new bind flag (XDP_USE_SG) on the AF_XDP level to
      enable multi-buffer support. The reason we need to differentiate between
      non multi-buffer and multi-buffer is the behaviour when the kernel gets
      a packet that is larger than the frame size. Without multi-buffer, this
      packet is dropped and marked in the stats. With multi-buffer on, we want
      to split it up into multiple frames instead.
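
      As an illustration of where the new flag goes, here is a minimal
      sketch that fills in the bind address for a raw AF_XDP socket. The
      helper name fill_xsk_addr() is hypothetical, and the fallback
      defines are an assumption for building against older headers.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

/* Fallbacks for older headers; assumed to match the upstream uapi values. */
#ifndef AF_XDP
#define AF_XDP 44
#endif
#ifndef XDP_USE_SG
#define XDP_USE_SG (1 << 4)
#endif

/* Prepare the address passed to bind(); multi_buffer selects XDP_USE_SG. */
static void fill_xsk_addr(struct sockaddr_xdp *sxdp, uint32_t ifindex,
			  uint32_t queue_id, int multi_buffer)
{
	memset(sxdp, 0, sizeof(*sxdp));
	sxdp->sxdp_family = AF_XDP;
	sxdp->sxdp_ifindex = ifindex;
	sxdp->sxdp_queue_id = queue_id;
	if (multi_buffer)
		sxdp->sxdp_flags |= XDP_USE_SG;
}
```

      The result would then be handed to bind() on the AF_XDP socket. With
      libxdp, the same flag would instead be set in the bind_flags member
      of struct xsk_socket_config.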
      
      At the start, we thought that riding on the .frags section name of
      the XDP program was a good idea. You do not have to introduce yet
      another flag and all AF_XDP users must load an XDP program anyway
      to get any traffic up to the socket, so why not just say that the XDP
      program decides if the AF_XDP socket should get multi-buffer packets
      or not? The problem is that we can create an AF_XDP socket that is Tx
      only and that works without having to load an XDP program at
      all. Another problem is that the XDP program might change during the
      execution, so we would have to check this for every single packet.
      
      Here is the observed throughput when compared to a codebase without any
      multi-buffer changes and measured with xdpsock for 64B packets.
      ZC Tx apparently takes a hit from the explicit validation of zero
      length descriptors. Overall, in terms of ZC performance, there is room
      for improvement, but for now we think this work is in good shape in
      terms of correctness and functionality. We were targeting up to 5%
      overhead, though. Note that the ZC performance drops come from core
      and driver support combined, whereas copy mode already had driver
      support in place.
      
      Mode     rxdrop       l2fwd       txonly
      ice-zc    -4%          -7%         -6%
      i40e-zc   -7%          -6%         -7%
      drv       -1.2%         0%         +2%
      skb       -0.6%        -1%         +2%
      
      Thank you,
      Tirthendu, Magnus and Maciej
      ====================
      
      Link: https://lore.kernel.org/r/20230719132421.584801-1-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      3226e313
    • Maciej Fijalkowski's avatar
      selftests/xsk: reset NIC settings to default after running test suite · 3666bcca
      Maciej Fijalkowski authored
      Currently, when running the ZC test suite, the following errors are
      observed after the first run of the test suite finishes and xskxceiver
      switches to the busy-poll tests:
      
      libbpf: Kernel error message: ice: MTU is too large for linear frames and XDP prog does not support frags
      1..26
      libbpf: Kernel error message: Native and generic XDP can't be active at the same time
      Error attaching XDP program
      not ok 1 [xskxceiver.c:xsk_reattach_xdp:1568]: ERROR: 17/"File exists"
      
      This is because the test suite ends with a 9k MTU and a native XDP
      program loaded, while the busy-poll tests start with non-multi-buffer
      tests in generic mode. To fix this, introduce a bash function that
      resets the NIC settings to default (e.g. 1500 MTU and no XDP programs
      loaded) so that the test suite can continue without interruption. It
      also means that after the busy-poll tests the NIC will have those
      default settings, whereas right now it is left with a 9k MTU and an
      XDP program loaded in native mode.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20230719132421.584801-25-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      3666bcca
    • Magnus Karlsson's avatar
      selftests/xsk: add test for too many frags · 807bf4da
      Magnus Karlsson authored
      Add a test that exercises the maximum number of supported fragments.
      This number depends on the mode of the test: for SKB and DRV it is 18,
      whereas for ZC it is defined by the value of the
      NETDEV_A_DEV_XDP_ZC_MAX_SEGS netlink attribute.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # made use of new netlink attribute
      Link: https://lore.kernel.org/r/20230719132421.584801-24-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      807bf4da
    • Magnus Karlsson's avatar
      selftests/xsk: add metadata copy test for multi-buff · f80ddbec
      Magnus Karlsson authored
      Enable the already existing metadata copy test to also run in
      multi-buffer mode with 9K packets.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/r/20230719132421.584801-23-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f80ddbec
    • Magnus Karlsson's avatar
      selftests/xsk: add invalid descriptor test for multi-buffer · 69760449
      Magnus Karlsson authored
      Add a test that produces lots of nasty descriptors testing the corner
      cases of the descriptor validation. Some of these descriptors are
      valid and some are not as indicated by the valid flag. For a
      description of all the test combinations, please see the code.
      
      To stress the API, we need to be able to generate combinations of
      descriptors that make little sense. A new verbatim mode is introduced
      for the packet_stream to accomplish this. In this mode, all packets in
      the packet_stream are sent as is. We do not try to chop them up into
      frames that are of the right size that we know are going to work as we
      would normally do. The packets are just written into the Tx ring even
      if we know they make no sense.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # adjusted valid flags for frags
      Link: https://lore.kernel.org/r/20230719132421.584801-22-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      69760449
    • Magnus Karlsson's avatar
      selftests/xsk: add unaligned mode test for multi-buffer · 1005a226
      Magnus Karlsson authored
      Add a test for multi-buffer AF_XDP when using unaligned mode. The test
      sends 4096 9K-buffers.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/r/20230719132421.584801-21-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1005a226
    • Magnus Karlsson's avatar
      selftests/xsk: add basic multi-buffer test · f540d44e
      Magnus Karlsson authored
      Add the first basic multi-buffer test that sends a stream of 9K
      packets and validates that they are received at the other end. In
      order to enable sending and receiving multi-buffer packets, code that
      sets the MTU is introduced as well as modifications to the XDP
      programs so that they signal that they are multi-buffer enabled.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/r/20230719132421.584801-20-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f540d44e
    • Magnus Karlsson's avatar
      selftests/xsk: transmit and receive multi-buffer packets · 17f1034d
      Magnus Karlsson authored
      Add the ability to send and receive packets that are larger than the
      size of a umem frame, using the AF_XDP/XDP multi-buffer
      support. There are three pieces of code that need to be changed to
      achieve this: the Rx path, the Tx path, and the validation logic.
      
      Until now, both the Rx and Tx paths could only deal with a single
      fragment per packet. The Tx path is extended with a new function called
      pkt_nb_frags() that can be used to retrieve the number of fragments a
      packet will consume. We then create this many fragments in a loop and
      fill the N-1 first ones to the max size limit to use the buffer space
      efficiently, and the Nth one with whatever data that is left. This
      goes on until we have filled in at the most BATCH_SIZE worth of
      descriptors and fragments. If we detect that the next packet would
      lead to BATCH_SIZE number of fragments sent being exceeded, we do not
      send this packet and finish the batch. This packet is instead sent in
      the next iteration of BATCH_SIZE fragments.
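
      The batch cut-off described above can be sketched as follows. This is
      an illustration only, not the selftest code: pkts_in_batch() is a
      hypothetical helper, and the per-packet fragment counts are assumed
      to have been computed beforehand (for example by something like
      pkt_nb_frags()).

```c
#include <stddef.h>
#include <stdint.h>

/* Return how many whole packets fit into one batch of batch_size
 * descriptors. nb_frags[i] is the number of fragments packet i needs.
 * The number of descriptors consumed is stored in *descs_used.
 */
static size_t pkts_in_batch(const uint32_t *nb_frags, size_t nb_pkts,
			    uint32_t batch_size, uint32_t *descs_used)
{
	uint32_t used = 0;
	size_t i;

	for (i = 0; i < nb_pkts; i++) {
		/* Stop before a packet that would overflow the batch. */
		if (nb_frags[i] > batch_size - used)
			break;
		used += nb_frags[i];
	}

	*descs_used = used;
	return i;
}
```

      With three packets of three fragments each and a batch size of 8,
      only two packets (six descriptors) are sent now and the third waits
      for the next batch.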
      
      For Rx, we loop over all fragments we receive as usual, but for every
      descriptor that we receive we call a new validation function called
      is_frag_valid() to validate the consistency of this fragment. The code
      then checks if the packet continues in the next frame. If so, it moves
      on to the next fragment and performs the same validation. Once we have
      received the last fragment of the packet we also call the function
      is_pkt_valid() to validate the packet as a whole. If we get to the end
      of the batch and we are not at the end of the current packet, we back
      out the partial packet and end the loop. Once we get into the receive
      loop next time, we start over from the beginning of that packet. This
      keeps the code simpler at the cost of some performance.
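
      The back-out step can be illustrated with a small helper that works
      on the options fields of a received batch. This is a sketch under
      assumptions, not the selftest implementation: complete_descs() is a
      hypothetical name and the fallback define is only for building
      without a recent linux/if_xdp.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the uapi value; only used if the header is absent. */
#ifndef XDP_PKT_CONTD
#define XDP_PKT_CONTD (1 << 0)
#endif

/* Given the options field of each descriptor in a batch, return how many
 * leading descriptors belong to complete packets. Any descriptors after
 * that point form a partial packet that should be backed out.
 */
static size_t complete_descs(const uint32_t *options, size_t rcvd)
{
	size_t done = 0;

	for (size_t i = 0; i < rcvd; i++) {
		/* A cleared XDP_PKT_CONTD marks the last frame of a packet. */
		if (!(options[i] & XDP_PKT_CONTD))
			done = i + 1;
	}

	return done;
}
```

      A caller would process and release only the returned number of
      descriptors and hand the remainder back to the ring, for example
      with xsk_ring_cons__cancel(), so that the partial packet is seen
      again from its first fragment in the next iteration.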
      
      The validation function is_frag_valid() checks that the sequence and
      packet numbers are correct at the start and end of each fragment.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/r/20230719132421.584801-19-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      17f1034d
    • Magnus Karlsson's avatar
      xsk: add multi-buffer documentation · 49ca37d0
      Magnus Karlsson authored
      Add AF_XDP multi-buffer support documentation including two
      pseudo-code samples.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/r/20230719132421.584801-18-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      49ca37d0
    • Tirthendu Sarkar's avatar
      i40e: xsk: add TX multi-buffer support · a92b96c4
      Tirthendu Sarkar authored
      Set the EOP bit in the Tx descriptor command only for the last
      descriptor of a packet, and leave it clear for all preceding descriptors.
      Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
      Link: https://lore.kernel.org/r/20230719132421.584801-17-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a92b96c4