1. 09 Jan, 2020 8 commits
    • Martin KaFai Lau's avatar
      bpf: tcp: Support tcp_congestion_ops in bpf · 0baf26b0
      Martin KaFai Lau authored
      This patch makes "struct tcp_congestion_ops" to be the first user
      of BPF STRUCT_OPS.  It allows implementing a tcp_congestion_ops
      in bpf.
      
      The BPF implemented tcp_congestion_ops can be used like
      regular kernel tcp-cc through sysctl and setsockopt.  e.g.
      [root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
      net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
      net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
      net.ipv4.tcp_congestion_control = bpf_cubic
      
      There has been attempt to move the TCP CC to the user space
      (e.g. CCP in TCP).   The common arguments are faster turn around,
      get away from long-tail kernel versions in production...etc,
      which are legit points.
      
      BPF has been the continuous effort to join both kernel and
      userspace upsides together (e.g. XDP to gain the performance
      advantage without bypassing the kernel).  The recent BPF
      advancements (in particular BTF-aware verifier, BPF trampoline,
      BPF CO-RE...) made implementing kernel struct ops (e.g. tcp cc)
      possible in BPF.  It allows a faster turnaround for testing algorithm
      in the production while leveraging the existing (and continue growing)
      BPF feature/framework instead of building one specifically for
      userspace TCP CC.
      
      This patch allows write access to a few fields in tcp-sock
      (in bpf_tcp_ca_btf_struct_access()).
      
      The optional "get_info" is unsupported now.  It can be added
      later.  One possible way is to output the info with a btf-id
      to describe the content.
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109003508.3856115-1-kafai@fb.com
      0baf26b0
    • Martin KaFai Lau's avatar
      bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS · 85d33df3
      Martin KaFai Lau authored
      The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
      is a kernel struct with its func ptr implemented in bpf prog.
      This new map is the interface to register/unregister/introspect
      a bpf implemented kernel struct.
      
      The kernel struct is actually embedded inside another new struct
      (or called the "value" struct in the code).  For example,
      "struct tcp_congestion_ops" is embbeded in:
      struct bpf_struct_ops_tcp_congestion_ops {
      	refcount_t refcnt;
      	enum bpf_struct_ops_state state;
      	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
      }
      The map value is "struct bpf_struct_ops_tcp_congestion_ops".
      The "bpftool map dump" will then be able to show the
      state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
      number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
      is created automatically by a macro.  Having a separate "value" struct
      will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
      "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
      initialization works before registering the struct_ops to the kernel
      subsystem).  The libbpf will take care of finding and populating the
      "struct bpf_struct_ops_XYZ" from "struct XYZ".
      
      Register a struct_ops to a kernel subsystem:
      1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
      2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
         set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
         running kernel.
         Instead of reusing the attr->btf_value_type_id,
         btf_vmlinux_value_type_id s added such that attr->btf_fd can still be
         used as the "user" btf which could store other useful sysadmin/debug
         info that may be introduced in the furture,
         e.g. creation-date/compiler-details/map-creator...etc.
      3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
         in the running kernel btf.  Populate the value of this object.
         The function ptr should be populated with the prog fds.
      4. Call BPF_MAP_UPDATE with the object created in (3) as
         the map value.  The key is always "0".
      
      During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
      args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
      the specific struct_ops to do some final checks in "st_ops->init_member()"
      (e.g. ensure all mandatory func ptrs are implemented).
      If everything looks good, it will register this kernel struct
      to the kernel subsystem.  The map will not allow further update
      from this point.
      
      Unregister a struct_ops from the kernel subsystem:
      BPF_MAP_DELETE with key "0".
      
      Introspect a struct_ops:
      BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
      have the prog _id_ populated as the func ptr.
      
      The map value state (enum bpf_struct_ops_state) will transit from:
      INIT (map created) =>
      INUSE (map updated, i.e. reg) =>
      TOBEFREE (map value deleted, i.e. unreg)
      
      The kernel subsystem needs to call bpf_struct_ops_get() and
      bpf_struct_ops_put() to manage the "refcnt" in the
      "struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
      for the purose of tracking the subsystem usage.  Another approach
      is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
      the subsystem's usage by doing map->refcnt - map->usercnt to filter out
      the map-fd/pinned-map usage.  However, that will also tie down the
      future semantics of map->refcnt and map->usercnt.
      
      The very first subsystem's refcnt (during reg()) holds one
      count to map->refcnt.  When the very last subsystem's refcnt
      is gone, it will also release the map->refcnt.  All bpf_prog will be
      freed when the map->refcnt reaches 0 (i.e. during map_free()).
      
      Here is how the bpftool map command will look like:
      [root@arch-fb-vm1 bpf]# bpftool map show
      6: struct_ops  name dctcp  flags 0x0
      	key 4B  value 256B  max_entries 1  memlock 4096B
      	btf_id 6
      [root@arch-fb-vm1 bpf]# bpftool map dump id 6
      [{
              "value": {
                  "refcnt": {
                      "refs": {
                          "counter": 1
                      }
                  },
                  "state": 1,
                  "data": {
                      "list": {
                          "next": 0,
                          "prev": 0
                      },
                      "key": 0,
                      "flags": 2,
                      "init": 24,
                      "release": 0,
                      "ssthresh": 25,
                      "cong_avoid": 30,
                      "set_state": 27,
                      "cwnd_event": 28,
                      "in_ack_event": 26,
                      "undo_cwnd": 29,
                      "pkts_acked": 0,
                      "min_tso_segs": 0,
                      "sndbuf_expand": 0,
                      "cong_control": 0,
                      "get_info": 0,
                      "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
                      ],
                      "owner": 0
                  }
              }
          }
      ]
      
      Misc Notes:
      * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
        It does an inplace update on "*value" instead returning a pointer
        to syscall.c.  Otherwise, it needs a separate copy of "zero" value
        for the BPF_STRUCT_OPS_STATE_INIT to avoid races.
      
      * The bpf_struct_ops_map_delete_elem() is also called without
        preempt_disable() from map_delete_elem().  It is because
        the "->unreg()" may requires sleepable context, e.g.
        the "tcp_unregister_congestion_control()".
      
      * "const" is added to some of the existing "struct btf_func_model *"
        function arg to avoid a compiler warning caused by this patch.
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109003505.3855919-1-kafai@fb.com
      85d33df3
    • Martin KaFai Lau's avatar
      bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS · 27ae7997
      Martin KaFai Lau authored
      This patch allows the kernel's struct ops (i.e. func ptr) to be
      implemented in BPF.  The first use case in this series is the
      "struct tcp_congestion_ops" which will be introduced in a
      latter patch.
      
      This patch introduces a new prog type BPF_PROG_TYPE_STRUCT_OPS.
      The BPF_PROG_TYPE_STRUCT_OPS prog is verified against a particular
      func ptr of a kernel struct.  The attr->attach_btf_id is the btf id
      of a kernel struct.  The attr->expected_attach_type is the member
      "index" of that kernel struct.  The first member of a struct starts
      with member index 0.  That will avoid ambiguity when a kernel struct
      has multiple func ptrs with the same func signature.
      
      For example, a BPF_PROG_TYPE_STRUCT_OPS prog is written
      to implement the "init" func ptr of the "struct tcp_congestion_ops".
      The attr->attach_btf_id is the btf id of the "struct tcp_congestion_ops"
      of the _running_ kernel.  The attr->expected_attach_type is 3.
      
      The ctx of BPF_PROG_TYPE_STRUCT_OPS is an array of u64 args saved
      by arch_prepare_bpf_trampoline that will be done in the next
      patch when introducing BPF_MAP_TYPE_STRUCT_OPS.
      
      "struct bpf_struct_ops" is introduced as a common interface for the kernel
      struct that supports BPF_PROG_TYPE_STRUCT_OPS prog.  The supporting kernel
      struct will need to implement an instance of the "struct bpf_struct_ops".
      
      The supporting kernel struct also needs to implement a bpf_verifier_ops.
      During BPF_PROG_LOAD, bpf_struct_ops_find() will find the right
      bpf_verifier_ops by searching the attr->attach_btf_id.
      
      A new "btf_struct_access" is also added to the bpf_verifier_ops such
      that the supporting kernel struct can optionally provide its own specific
      check on accessing the func arg (e.g. provide limited write access).
      
      After btf_vmlinux is parsed, the new bpf_struct_ops_init() is called
      to initialize some values (e.g. the btf id of the supporting kernel
      struct) and it can only be done once the btf_vmlinux is available.
      
      The R0 checks at BPF_EXIT is excluded for the BPF_PROG_TYPE_STRUCT_OPS prog
      if the return type of the prog->aux->attach_func_proto is "void".
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109003503.3855825-1-kafai@fb.com
      27ae7997
    • Martin KaFai Lau's avatar
      bpf: Support bitfield read access in btf_struct_access · 976aba00
      Martin KaFai Lau authored
      This patch allows bitfield access as a scalar.
      
      It checks "off + size > t->size" to avoid accessing bitfield
      end up accessing beyond the struct.  This check is done
      outside of the loop since it is applicable to all access.
      
      It also takes this chance to break early on the "off < moff" case.
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109003501.3855427-1-kafai@fb.com
      976aba00
    • Martin KaFai Lau's avatar
      bpf: Add enum support to btf_ctx_access() · 218b3f65
      Martin KaFai Lau authored
      It allows bpf prog (e.g. tracing) to attach
      to a kernel function that takes enum argument.
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109003459.3855366-1-kafai@fb.com
      218b3f65
    • Martin KaFai Lau's avatar
      bpf: Avoid storing modifier to info->btf_id · 275517ff
      Martin KaFai Lau authored
      info->btf_id expects the btf_id of a struct, so it should
      store the final result after skipping modifiers (if any).
      
      It also takes this chanace to add a missing newline in one of the
      bpf_log() messages.
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109003456.3855176-1-kafai@fb.com
      275517ff
    • Martin KaFai Lau's avatar
      bpf: Save PTR_TO_BTF_ID register state when spilling to stack · 65726b5b
      Martin KaFai Lau authored
      This patch makes the verifier save the PTR_TO_BTF_ID register state when
      spilling to the stack.
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109003454.3854870-1-kafai@fb.com
      65726b5b
    • Stanislav Fomichev's avatar
      selftests/bpf: Restore original comm in test_overhead · e4300224
      Stanislav Fomichev authored
      test_overhead changes task comm in order to estimate BPF trampoline
      overhead but never sets the comm back to the original one.
      We have the tests (like core_reloc.c) that have 'test_progs'
      as hard-coded expected comm, so let's try to preserve the
      original comm.
      
      Currently, everything works because the order of execution is:
      first core_recloc, then test_overhead; but let's make it a bit
      future-proof.
      
      Other related changes: use 'test_overhead' as new comm instead of
      'test' to make it easy to debug and drop '\n' at the end.
      Signed-off-by: default avatarStanislav Fomichev <sdf@google.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarPetar Penkov <ppenkov@google.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200108192132.189221-1-sdf@google.com
      e4300224
  2. 08 Jan, 2020 2 commits
  3. 07 Jan, 2020 23 commits
  4. 06 Jan, 2020 7 commits