• Daniel Borkmann's avatar
    bpf: add support for persistent maps/progs · b2197755
    Daniel Borkmann authored
    This work adds support for "persistent" eBPF maps/programs. The term
    "persistent" is to be understood that maps/programs have a facility
    that lets them survive process termination. This is desired by various
    eBPF subsystem users.
    
    Just to name one example: tc classifier/action. Whenever tc parses
    the ELF object, extracts and loads maps/progs into the kernel, these
    file descriptors will be out of reach after the tc instance exits.
    So a subsequent tc invocation won't be able to access/relocate on this
    resource, and therefore maps cannot easily be shared, f.e. between the
    ingress and egress networking data path.
    
    The current workaround is that Unix domain sockets (UDS) need to be
    instrumented in order to pass the created eBPF map/program file
    descriptors to a third party management daemon through UDS' socket
    passing facility. This makes it a bit complicated to deploy shared
    eBPF maps or programs (programs f.e. for tail calls) among various
    processes.
    
    We've been brainstorming on how we could tackle this issue and various
    approches have been tried out so far, which can be read up further in
    the below reference.
    
    The architecture we eventually ended up with is a minimal file system
    that can hold map/prog objects. The file system is a per mount namespace
    singleton, and the default mount point is /sys/fs/bpf/. Any subsequent
    mounts within a given namespace will point to the same instance. The
    file system allows for creating a user-defined directory structure.
    The objects for maps/progs are created/fetched through bpf(2) with
    two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). I.e. a bpf file descriptor
    along with a pathname is being passed to bpf(2) that in turn creates
    (we call it eBPF object pinning) the file system nodes. Only the pathname
    is being passed to bpf(2) for getting a new BPF file descriptor to an
    existing node. The user can use that to access maps and progs later on,
    through bpf(2). Removal of file system nodes is being managed through
    normal VFS functions such as unlink(2), etc. The file system code is
    kept to a very minimum and can be further extended later on.
    
    The next step I'm working on is to add dump eBPF map/prog commands
    to bpf(2), so that a specification from a given file descriptor can
    be retrieved. This can be used by things like CRIU but also applications
    can inspect the meta data after calling BPF_OBJ_GET.
    
    Big thanks also to Alexei and Hannes who significantly contributed
    in the design discussion that eventually let us end up with this
    architecture here.
    
    Reference: https://lkml.org/lkml/2015/10/15/925Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
    Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
    Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    b2197755
syscall.c 16.6 KB