    bpf: Add generic attach/detach/query API for multi-progs · 053c8e1f
    Daniel Borkmann authored
    This adds a generic layer called bpf_mprog which can be reused by different
    attachment layers to enable multi-program attachment and dependency resolution.
    In-kernel users of bpf_mprog don't need to care about the dependency
    resolution internals; they can just consume it with a few API calls.
    
    The initial idea of having a generic API came out of the discussion [0] of an
    earlier revision of this work, where tc's priority was reused and exposed via
    BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
    to how it is done for classic tc BPF. The feedback was that priority provides
    a bad user experience and is hard to use [1], e.g.:
    
      I cannot help but feel that priority logic copy-paste from old tc, netfilter
      and friends is done because "that's how things were done in the past". [...]
      Priority gets exposed everywhere in uapi all the way to bpftool when it's
      right there for users to understand. And that's the main problem with it.
    
      The user don't want to and don't need to be aware of it, but uapi forces them
      to pick the priority. [...] Your cover letter [0] example proves that in
      real life different service pick the same priority. They simply don't know
      any better. Priority is an unnecessary magic that apps _have_ to pick, so
      they just copy-paste and everyone ends up using the same.
    
    The course of the discussion increasingly showed the need for a generic,
    reusable API where the "same look and feel" can be applied to various other
    program types beyond just tc BPF. For example, XDP today does not have
    multi-program support in the kernel, and there was also interest in this API
    for improving management of cgroup program types. Such a common multi-program
    management concept is useful for BPF management daemons or user space BPF
    applications coordinating internally about their attachments.
    
    From both the Cilium and Meta side [2], we've collected the following
    requirements for a generic attach/detach/query API for multi-progs, which
    have been implemented as part of this work:
    
      - Support prog-based attach/detach and link API
      - Dependency directives (can also be combined):
        - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
          - BPF_F_ID flag as {fd,id} toggle; the rationale for id is so that user
            space application does not need CAP_SYS_ADMIN to retrieve foreign fds
            via bpf_*_get_fd_by_id()
          - BPF_F_LINK flag as {prog,link} toggle
          - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
            BPF_F_AFTER will just append for attaching
          - Enforced only at attach time
        - BPF_F_REPLACE with replace_bpf_fd, which can be a prog; links have
          their own infra for replacing their internal prog
        - If no flags are set, then it's default append behavior for attaching
      - Internal revision counter and optionally being able to pass expected_revision
      - User space application can query current state with revision, and pass it
        along for attachment to assert current state before doing updates
      - Query also gets extension for link_ids array and link_attach_flags:
        - prog_ids are always filled with program IDs
        - link_ids are filled with link IDs when link was used, otherwise 0
        - {prog,link}_attach_flags for holding {prog,link}-specific flags
      - Must be easy to integrate/reuse for in-kernel users
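
    The ordering rules in the list above can be illustrated with a small user
    space toy model. This is a sketch under stated assumptions, not the kernel
    implementation: all toy_* names, the flag values, the size limit and the
    chosen error codes are hypothetical, programs are represented by bare IDs
    only, and combining BEFORE with AFTER is not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: names, flag values and limits are illustrative. */
#define TOY_MAX      8
#define TOY_BEFORE   (1u << 0)
#define TOY_AFTER    (1u << 1)
#define TOY_REPLACE  (1u << 2)

struct toy_mprog {
	uint32_t ids[TOY_MAX];   /* program IDs in execution order */
	uint32_t count;
	uint64_t revision;       /* bumped on every successful update */
};

struct toy_opts {
	uint32_t flags;
	uint32_t relative_id;        /* 0 means "none" */
	uint32_t replace_id;         /* only used with TOY_REPLACE */
	uint64_t expected_revision;  /* 0 means "do not check" */
};

static void toy_init(struct toy_mprog *m)
{
	memset(m, 0, sizeof(*m));
	m->revision = 1;
}

/* Return the index of id, or -1 if it is not attached. */
static int toy_find(const struct toy_mprog *m, uint32_t id)
{
	for (uint32_t i = 0; i < m->count; i++)
		if (m->ids[i] == id)
			return (int)i;
	return -1;
}

static int toy_attach(struct toy_mprog *m, uint32_t id,
		      const struct toy_opts *o)
{
	uint32_t pos;
	int rel;

	/* Caller asserts the state it queried earlier is still current. */
	if (o->expected_revision && o->expected_revision != m->revision)
		return -ESTALE;

	if (o->flags & TOY_REPLACE) {
		rel = toy_find(m, o->replace_id);
		if (rel < 0)
			return -ENOENT;
		m->ids[rel] = id;        /* position is kept, only the ID changes */
		m->revision++;
		return 0;
	}

	if ((o->flags & TOY_BEFORE) && (o->flags & TOY_AFTER))
		return -EINVAL;          /* combination not modeled in this toy */
	if (toy_find(m, id) >= 0)
		return -EEXIST;
	if (m->count == TOY_MAX)
		return -ERANGE;

	if (o->relative_id) {
		if (!(o->flags & (TOY_BEFORE | TOY_AFTER)))
			return -EINVAL;
		rel = toy_find(m, o->relative_id);
		if (rel < 0)
			return -ENOENT;
		pos = (o->flags & TOY_BEFORE) ? (uint32_t)rel : (uint32_t)rel + 1;
	} else {
		/* No relative: BEFORE prepends, AFTER or no flags appends. */
		pos = (o->flags & TOY_BEFORE) ? 0 : m->count;
	}

	memmove(&m->ids[pos + 1], &m->ids[pos],
		(m->count - pos) * sizeof(m->ids[0]));
	m->ids[pos] = id;
	m->count++;
	m->revision++;
	return 0;
}
```

    In this model, attaching IDs 10 and 20 without flags and then 30 with
    TOY_BEFORE and no relative yields the order 30, 10, 20. Passing an
    expected_revision that no longer matches fails without changing anything.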
    
    The uapi-side changes needed for supporting bpf_mprog are rather minimal,
    consisting of the addition of the attachment flags and the revision counter,
    and of expanding the existing union with a relative_{fd,id} member.
    
    The bpf_mprog framework consists of a bpf_mprog_entry object which holds
    an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path
    structure) is part of bpf_mprog_bundle. The two have been separated so that
    the fast path gets efficient packing of bpf_prog pointers for maximum cache
    efficiency. Also, an array has been chosen instead of a linked list or other
    structures to remove unnecessary indirections for a fast point-to-entry in
    tc for BPF.
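
    The fast-path/control-path split described above can be sketched in plain
    C as follows. This is a simplified illustration, not the kernel's
    definitions: the sketch_* types, the size limit, the function-pointer
    stand-in for a BPF program and the two example programs are all
    assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX 64   /* illustrative limit */

/* Stand-ins for struct bpf_prog and struct bpf_link. */
struct sketch_prog { void (*run)(int *ctx); };
struct sketch_link { int id; };

/* Fast path: nothing but the program pointer, so slots pack densely. */
struct sketch_fp { struct sketch_prog *prog; };
/* Control path: management state that the fast path never touches. */
struct sketch_cp { struct sketch_link *link; };

struct sketch_entry {
	struct sketch_fp fp_items[SKETCH_MAX];
};

_Static_assert(sizeof(struct sketch_fp) == sizeof(struct sketch_prog *),
	       "one pointer per fast-path slot");

/* Run all programs in array order; a NULL slot terminates the walk. */
static int sketch_run_all(const struct sketch_entry *e, int *ctx)
{
	int n = 0;

	for (size_t i = 0; i < SKETCH_MAX && e->fp_items[i].prog; i++, n++)
		e->fp_items[i].prog->run(ctx);
	return n;
}

/* Two example "programs" whose combined result depends on their order. */
static void sketch_inc(int *ctx)    { *ctx += 1; }
static void sketch_double(int *ctx) { *ctx *= 2; }
```

    Walking the array touches only consecutive pointers, which is the cache
    behavior the separation is meant to give. With ctx starting at 1, running
    sketch_inc then sketch_double leaves 4, while the reverse order leaves 3.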
    
    The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that, in case of
    updates, the peer bpf_mprog_entry is populated and then just swapped. This
    avoids additional allocations that could otherwise fail, for example, in the
    detach case. The bpf_mprog_{fp,cp} arrays are currently static, but they could
    be converted to dynamic allocation if necessary at some point in the future.
    Locking is deferred to the in-kernel user of bpf_mprog; for example, tcx,
    which uses this API in the next patch, piggybacks on rtnl.
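
    The populate-then-swap scheme can be shown with a minimal single-threaded
    sketch. It assumes a hypothetical swap_bundle holding two preallocated
    entries and an active pointer; the names, the tiny size limit and the
    error codes are illustrative only, and a kernel user would additionally
    need its own lock and a properly ordered pointer publish.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define SWAP_MAX 4   /* illustrative limit */

struct swap_entry {
	int ids[SWAP_MAX];
	int count;
};

struct swap_bundle {
	struct swap_entry a, b;      /* both exist for the bundle's lifetime */
	struct swap_entry *active;   /* the entry readers currently use */
};

static void swap_init(struct swap_bundle *bu)
{
	memset(bu, 0, sizeof(*bu));
	bu->active = &bu->a;
}

static struct swap_entry *swap_peer(struct swap_bundle *bu)
{
	return bu->active == &bu->a ? &bu->b : &bu->a;
}

/* Append id: copy the active state into the peer, modify, then swap. */
static int swap_attach(struct swap_bundle *bu, int id)
{
	struct swap_entry *cur = bu->active, *peer = swap_peer(bu);

	if (cur->count == SWAP_MAX)
		return -ERANGE;
	*peer = *cur;
	peer->ids[peer->count++] = id;
	bu->active = peer;
	return 0;
}

/*
 * Remove id: rebuild the peer without it, then swap. No memory is
 * allocated here, so detach cannot fail for lack of memory.
 */
static int swap_detach(struct swap_bundle *bu, int id)
{
	struct swap_entry *cur = bu->active, *peer = swap_peer(bu);
	int found = 0;

	peer->count = 0;
	for (int i = 0; i < cur->count; i++) {
		if (cur->ids[i] == id) {
			found = 1;
			continue;
		}
		peer->ids[peer->count++] = cur->ids[i];
	}
	if (!found)
		return -ENOENT;      /* active entry was never modified */
	bu->active = peer;           /* plain store suffices single-threaded */
	return 0;
}
```

    The entry that was active before an update is left untouched until the
    next update reuses it as the peer, so a reader still holding the old
    pointer sees a consistent array.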
    
    An extensive test suite for checking all aspects of this API for prog-based
    attach/detach and link API comes as BPF selftests in this series.
    
    Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog
    management.
    
      [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net
      [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
      [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf

    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/r/20230719140858.13224-2-daniel@iogearbox.net
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>