• Martin KaFai Lau's avatar
    bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS · 85d33df3
    Martin KaFai Lau authored
    The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
    is a kernel struct with its func ptr implemented in bpf prog.
    This new map is the interface to register/unregister/introspect
    a bpf implemented kernel struct.
    
    The kernel struct is actually embedded inside another new struct
    (or called the "value" struct in the code).  For example,
    "struct tcp_congestion_ops" is embbeded in:
    struct bpf_struct_ops_tcp_congestion_ops {
    	refcount_t refcnt;
    	enum bpf_struct_ops_state state;
    	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
    }
    The map value is "struct bpf_struct_ops_tcp_congestion_ops".
    The "bpftool map dump" will then be able to show the
    state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
    number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
    is created automatically by a macro.  Having a separate "value" struct
    will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
    "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
    initialization works before registering the struct_ops to the kernel
    subsystem).  The libbpf will take care of finding and populating the
    "struct bpf_struct_ops_XYZ" from "struct XYZ".
    
    Register a struct_ops to a kernel subsystem:
    1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
    2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
       set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
       running kernel.
       Instead of reusing the attr->btf_value_type_id,
       btf_vmlinux_value_type_id s added such that attr->btf_fd can still be
       used as the "user" btf which could store other useful sysadmin/debug
       info that may be introduced in the furture,
       e.g. creation-date/compiler-details/map-creator...etc.
    3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
       in the running kernel btf.  Populate the value of this object.
       The function ptr should be populated with the prog fds.
    4. Call BPF_MAP_UPDATE with the object created in (3) as
       the map value.  The key is always "0".
    
    During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
    args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
    the specific struct_ops to do some final checks in "st_ops->init_member()"
    (e.g. ensure all mandatory func ptrs are implemented).
    If everything looks good, it will register this kernel struct
    to the kernel subsystem.  The map will not allow further update
    from this point.
    
    Unregister a struct_ops from the kernel subsystem:
    BPF_MAP_DELETE with key "0".
    
    Introspect a struct_ops:
    BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
    have the prog _id_ populated as the func ptr.
    
    The map value state (enum bpf_struct_ops_state) will transit from:
    INIT (map created) =>
    INUSE (map updated, i.e. reg) =>
    TOBEFREE (map value deleted, i.e. unreg)
    
    The kernel subsystem needs to call bpf_struct_ops_get() and
    bpf_struct_ops_put() to manage the "refcnt" in the
    "struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
    for the purose of tracking the subsystem usage.  Another approach
    is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
    the subsystem's usage by doing map->refcnt - map->usercnt to filter out
    the map-fd/pinned-map usage.  However, that will also tie down the
    future semantics of map->refcnt and map->usercnt.
    
    The very first subsystem's refcnt (during reg()) holds one
    count to map->refcnt.  When the very last subsystem's refcnt
    is gone, it will also release the map->refcnt.  All bpf_prog will be
    freed when the map->refcnt reaches 0 (i.e. during map_free()).
    
    Here is how the bpftool map command will look like:
    [root@arch-fb-vm1 bpf]# bpftool map show
    6: struct_ops  name dctcp  flags 0x0
    	key 4B  value 256B  max_entries 1  memlock 4096B
    	btf_id 6
    [root@arch-fb-vm1 bpf]# bpftool map dump id 6
    [{
            "value": {
                "refcnt": {
                    "refs": {
                        "counter": 1
                    }
                },
                "state": 1,
                "data": {
                    "list": {
                        "next": 0,
                        "prev": 0
                    },
                    "key": 0,
                    "flags": 2,
                    "init": 24,
                    "release": 0,
                    "ssthresh": 25,
                    "cong_avoid": 30,
                    "set_state": 27,
                    "cwnd_event": 28,
                    "in_ack_event": 26,
                    "undo_cwnd": 29,
                    "pkts_acked": 0,
                    "min_tso_segs": 0,
                    "sndbuf_expand": 0,
                    "cong_control": 0,
                    "get_info": 0,
                    "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
                    ],
                    "owner": 0
                }
            }
        }
    ]
    
    Misc Notes:
    * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
      It does an inplace update on "*value" instead returning a pointer
      to syscall.c.  Otherwise, it needs a separate copy of "zero" value
      for the BPF_STRUCT_OPS_STATE_INIT to avoid races.
    
    * The bpf_struct_ops_map_delete_elem() is also called without
      preempt_disable() from map_delete_elem().  It is because
      the "->unreg()" may requires sleepable context, e.g.
      the "tcp_unregister_congestion_control()".
    
    * "const" is added to some of the existing "struct btf_func_model *"
      function arg to avoid a compiler warning caused by this patch.
    Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
    Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
    Acked-by: default avatarAndrii Nakryiko <andriin@fb.com>
    Acked-by: default avatarYonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20200109003505.3855919-1-kafai@fb.com
    85d33df3
trampoline.c 7.57 KB