1. 13 Mar, 2023 2 commits
    • David Vernet's avatar
      tasks: Extract rcu_users out of union · 22df776a
      David Vernet authored
      In commit 3fbd7ee2 ("tasks: Add a count of task RCU users"), a
      count on the number of RCU users was added to struct task_struct. This
      was done so as to enable the removal of task_rcu_dereference(), and
      allow tasks to be protected by RCU even after exiting and being removed
      from the runqueue. In this commit, the 'refcount_t rcu_users' field that
      keeps track of this refcount was put into a union co-located with
      'struct rcu_head rcu', so as to avoid taking up any extra space in
      task_struct. This was possible to do safely, because the field was only
      ever decremented by a static set of specific callers, and then never
      incremented again.
      
      While this restriction of there only being a small, static set of users
      of this field has worked fine, it prevents us from leveraging the field
      to use RCU to protect tasks in other contexts.
      
      During tracing, for example, it would be useful to be able to collect
      some tasks that performed a certain operation, put them in a map, and
      then periodically summarize who they are, which cgroup they're in, how
      much CPU time they've utilized, etc. While this can currently be done
      with 'usage', it becomes tricky when a task is already in a map, or if a
      reference should only be taken if a task is valid and will not soon be
      reaped. Ideally, we could do something like pass a reference to a map
      value, and then try to acquire a reference to the task in an RCU read
      region by using refcount_inc_not_zero().
      
      Similarly, in sched_ext, schedulers are using integer pids to remember
      tasks, and then looking them up with find_task_by_pid_ns(). This is
      slow, error prone, and adds complexity. It would be more convenient and
      performant if BPF schedulers could instead store tasks directly in maps,
      and then leverage RCU to ensure they can be safely accessed with low
      overhead.
      
      Finally, overloading fields like this is error prone. Someone that wants
      to use 'rcu_users' could easily overlook the fact that once the rcu
      callback is scheduled, the refcount will go back to being nonzero, thus
      precluding the use of refcount_inc_not_zero(). Furthermore, as described
      below, it's possible to extract the fields of the union without changing
      the size of task_struct.
      
      There are several possible ways to enable this:
      
      1. The lightest touch approach is likely the one proposed in this patch,
         which is to simply extract 'rcu_users' and 'rcu' from the union, so
         that scheduling the 'rcu' callback doesn't overwrite the 'rcu_users'
         refcount. If we have a trusted task pointer, this would allow us to
         use refcnt_inc_not_zero() inside of an RCU region to determine if we
         can safely acquire a reference to the task and store it in a map. As
         mentioned below, this can be done without changing the size of
         task_struct, by moving the location of the union to another location
         that has padding gaps we can fill in.
      
      2. Removing 'refcount_t rcu_users', and instead having the entire task
         be freed in an rcu callback. This is likely the most sound overall
         design, though it changes the behavioral semantics exposed to
         callers, who currently expect that a task that's successfully looked
         up in e.g. the pid_list with find_task_by_pid_ns(), can always have a
         'usage' reference acquired on them, as it's guaranteed to be >
         0 until after the next gp. In order for this approach to work, we'd
         have to audit all callers. This approach also slightly changes
         behavior observed by user space by not invoking
         trace_sched_process_free() until the whole task_struct is actually being
         freed, rather than just after it's exited. It also may change
         timings, as memory will be freed in an RCU callback rather than
         immediately when the final 'usage' refcount drops to 0. This also is
         arguably a benefit, as it provides more predictable performance to
         callers who are refcounting tasks.
      
      3. There may be other solutions as well that don't require changing the
         layout of task_struct. For example, we could possibly do something
         complex from the BPF side, such as listen for task exit and remove a
         task from a map when the task is exiting. This would likely require
         significant custom handling for task_struct in the verifier, so a
         more generalizable solution is likely warranted.
      
      As mentioned above, this patch proposes the lightest-touch approach
      which allows callers elsewhere in the kernel to use 'rcu_users' to
      ensure the lifetime of a task, by extracting 'rcu_users' and 'rcu' from
      the union. There is no size change in task_struct with this patch.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: default avatarDavid Vernet <void@manifault.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Link: https://lore.kernel.org/r/20230215233033.889644-1-void@manifault.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      22df776a
    • Andrii Nakryiko's avatar
      bpf: fix precision propagation verbose logging · 34f0677e
      Andrii Nakryiko authored
      Fix wrong order of frame index vs register/slot index in precision
      propagation verbose (level 2) output. It's wrong and very confusing as is.
      
      Fixes: 529409ea ("bpf: propagate precision across all frames, not just the last one")
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230313184017.4083374-1-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      34f0677e
  2. 11 Mar, 2023 4 commits
    • Alexei Starovoitov's avatar
      Merge branch 'Support stashing local kptrs with bpf_kptr_xchg' · 49b5300f
      Alexei Starovoitov authored
      Dave Marchevsky says:
      
      ====================
      
      Local kptrs are kptrs allocated via bpf_obj_new with a type specified in program
      BTF. A BPF program which creates a local kptr has exclusive control of the
      lifetime of the kptr, and, prior to terminating, must:
      
        * free the kptr via bpf_obj_drop
        * If the kptr is a {list,rbtree} node, add the node to a {list, rbtree},
          thereby passing control of the lifetime to the collection
      
      This series adds a third option:
      
        * stash the kptr in a map value using bpf_kptr_xchg
      
      As indicated by the use of "stash" to describe this behavior, the intended use
      of this feature is temporary storage of local kptrs. For example, a sched_ext
      ([0]) scheduler may want to create an rbtree node for each new cgroup on cgroup
      init, but to add that node to the rbtree as part of a separate program which
      runs on enqueue. Stashing the node in a map_value allows its lifetime to outlive
      the execution of the cgroup_init program.
      
      Behavior:
      
      There is no semantic difference between adding a kptr to a graph collection and
      "stashing" it in a map. In both cases exclusive ownership of the kptr's lifetime
      is passed to some containing data structure, which is responsible for
      bpf_obj_drop'ing it when the container goes away.
      
      Since graph collections also expect exclusive ownership of the nodes they
      contain, graph nodes cannot be both stashed in a map_value and contained by
      their corresponding collection.
      
      Implementation:
      
      Two observations simplify the verifier changes for this feature. First, kptrs
      ("referenced kptrs" until a recent renaming) require registration of a
      dtor function as part of their acquire/release semantics, so that a referenced
      kptr which is placed in a map_value is properly released when the map goes away.
      We want this exact behavior for local kptrs, but with bpf_obj_drop as the dtor
      instead of a per-btf_id dtor.
      
      The second observation is that, in terms of identification, "referenced kptr"
      and "local kptr" already don't interfere with one another. Consider the
      following example:
      
        struct node_data {
                long key;
                long data;
                struct bpf_rb_node node;
        };
      
        struct map_value {
                struct node_data __kptr *node;
        };
      
        struct {
                __uint(type, BPF_MAP_TYPE_ARRAY);
                __type(key, int);
                __type(value, struct map_value);
                __uint(max_entries, 1);
        } some_nodes SEC(".maps");
      
        struct map_value *mapval;
        struct node_data *res;
        int key = 0;
      
        res = bpf_obj_new(typeof(*res));
        if (!res) { /* err handling */ }
      
        mapval = bpf_map_lookup_elem(&some_nodes, &key);
        if (!mapval) { /* err handling */ }
      
        res = bpf_kptr_xchg(&mapval->node, res);
        if (res)
                bpf_obj_drop(res);
      
      The __kptr tag identifies map_value's node as a referenced kptr, while the
      PTR_TO_BTF_ID which bpf_obj_new returns - a type in some non-vmlinux,
      non-module BTF - identifies res as a local kptr. Type tag on the pointer
      indicates referenced kptr, while the type of the pointee indicates local kptr.
      So using existing facilities we can tell the verifier about a "referenced kptr"
      pointer to a "local kptr" pointee.
      
      When kptr_xchg'ing a kptr into a map_value, the verifier can recognize local
      kptr types and treat them like referenced kptrs with a properly-typed
      bpf_obj_drop as a dtor.
      
      Other implementation notes:
        * We don't need to do anything special to enforce "graph nodes cannot be
          both stashed in a map_value and contained by their corresponding collection"
          * bpf_kptr_xchg both returns and takes as input a (possibly-null) owning
            reference. It does not accept non-owning references as input by virtue
            of requiring a ref_obj_id. By definition, if a program has an owning
            ref to a node, the node isn't in a collection, so it's safe to pass
            ownership via bpf_kptr_xchg.
      
      Summary of patches:
      
        * Patch 1 modifies BTF plumbing to support using bpf_obj_drop as a dtor
        * Patch 2 adds verifier plumbing to support MEM_ALLOC-flagged param for
          bpf_kptr_xchg
        * Patch 3 adds selftests exercising the new behavior
      
      Changelog:
      
      v1 -> v2: https://lore.kernel.org/bpf/20230309180111.1618459-1-davemarchevsky@fb.com/
      
      Patch #s used below refer to the patch's position in v1 unless otherwise
      specified.
      
      Patches 1-3 were applied and are not included in v2.
      Rebase onto latest bpf-next: "libbpf: Revert poisoning of strlcpy"
      
      Patch 4: "bpf: Support __kptr to local kptrs"
        * Remove !btf_is_kernel(btf) check, WARN_ON_ONCE instead (Alexei)
      
      Patch 6: "selftests/bpf: Add local kptr stashing test"
        * Add test which stashes 2 nodes and later unstashes one of them using a
          separate BPF program (Alexei)
        * Fix incorrect runner subtest name for original test (was
          "rbtree_add_nodes")
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      49b5300f
    • Dave Marchevsky's avatar
      selftests/bpf: Add local kptr stashing test · 5d8d6634
      Dave Marchevsky authored
      Add a new selftest, local_kptr_stash, which uses bpf_kptr_xchg to stash
      a bpf_obj_new-allocated object in a map. Test the following scenarios:
      
        * Stash two rb_nodes in an arraymap, don't unstash them, rely on map
          free to destruct them
        * Stash two rb_nodes in an arraymap, unstash the second one in a
          separate program, rely on map free to destruct first
      Signed-off-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230310230743.2320707-4-davemarchevsky@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      5d8d6634
    • Dave Marchevsky's avatar
      bpf: Allow local kptrs to be exchanged via bpf_kptr_xchg · 738c96d5
      Dave Marchevsky authored
      The previous patch added necessary plumbing for verifier and runtime to
      know what to do with non-kernel PTR_TO_BTF_IDs in map values, but didn't
      provide any way to get such local kptrs into a map value. This patch
      modifies verifier handling of bpf_kptr_xchg to allow MEM_ALLOC kptr
      types.
      
      check_reg_type is modified accept MEM_ALLOC-flagged input to
      bpf_kptr_xchg despite such types not being in btf_ptr_types. This could
      have been done with a MAYBE_MEM_ALLOC equivalent to MAYBE_NULL, but
      bpf_kptr_xchg is the only helper that I can forsee using
      MAYBE_MEM_ALLOC, so keep it special-cased for now.
      
      The verifier tags bpf_kptr_xchg retval MEM_ALLOC if and only if the BTF
      associated with the retval is not kernel BTF.
      Signed-off-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230310230743.2320707-3-davemarchevsky@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      738c96d5
    • Dave Marchevsky's avatar
      bpf: Support __kptr to local kptrs · c8e18754
      Dave Marchevsky authored
      If a PTR_TO_BTF_ID type comes from program BTF - not vmlinux or module
      BTF - it must have been allocated by bpf_obj_new and therefore must be
      free'd with bpf_obj_drop. Such a PTR_TO_BTF_ID is considered a "local
      kptr" and is tagged with MEM_ALLOC type tag by bpf_obj_new.
      
      This patch adds support for treating __kptr-tagged pointers to "local
      kptrs" as having an implicit bpf_obj_drop destructor for referenced kptr
      acquire / release semantics. Consider the following example:
      
        struct node_data {
                long key;
                long data;
                struct bpf_rb_node node;
        };
      
        struct map_value {
                struct node_data __kptr *node;
        };
      
        struct {
                __uint(type, BPF_MAP_TYPE_ARRAY);
                __type(key, int);
                __type(value, struct map_value);
                __uint(max_entries, 1);
        } some_nodes SEC(".maps");
      
      If struct node_data had a matching definition in kernel BTF, the verifier would
      expect a destructor for the type to be registered. Since struct node_data does
      not match any type in kernel BTF, the verifier knows that there is no kfunc
      that provides a PTR_TO_BTF_ID to this type, and that such a PTR_TO_BTF_ID can
      only come from bpf_obj_new. So instead of searching for a registered dtor,
      a bpf_obj_drop dtor can be assumed.
      
      This allows the runtime to properly destruct such kptrs in
      bpf_obj_free_fields, which enables maps to clean up map_vals w/ such
      kptrs when going away.
      
      Implementation notes:
        * "kernel_btf" variable is renamed to "kptr_btf" in btf_parse_kptr.
          Before this patch, the variable would only ever point to vmlinux or
          module BTFs, but now it can point to some program BTF for local kptr
          type. It's later used to populate the (btf, btf_id) pair in kptr btf
          field.
        * It's necessary to btf_get the program BTF when populating btf_field
          for local kptr. btf_record_free later does a btf_put.
        * Behavior for non-local referenced kptrs is not modified, as
          bpf_find_btf_id helper only searches vmlinux and module BTFs for
          matching BTF type. If such a type is found, btf_field_kptr's btf will
          pass btf_is_kernel check, and the associated release function is
          some one-argument dtor. If btf_is_kernel check fails, associated
          release function is two-arg bpf_obj_drop_impl. Before this patch
          only btf_field_kptr's w/ kernel or module BTFs were created.
      Signed-off-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230310230743.2320707-2-davemarchevsky@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      c8e18754
  3. 10 Mar, 2023 27 commits
  4. 09 Mar, 2023 7 commits