1. 26 Mar, 2023 24 commits
  2. 25 Mar, 2023 4 commits
    • Merge branch 'Don't invoke KPTR_REF destructor on NULL xchg' · 496f4f1b
      Alexei Starovoitov authored
      David Vernet says:
      
      ====================
      
      When a map value is being freed, we loop over all of the fields of the
      corresponding BPF object and issue the appropriate cleanup calls
      corresponding to the field's type. If the field is a referenced kptr, we
      atomically xchg the value out of the map, and invoke the kptr's
      destructor on whatever was there before.
      
      Currently, we always invoke the destructor (or bpf_obj_drop() for a
      local kptr) on any kptr, including if no value was xchg'd out of the
      map. This means that any function serving as the kptr's KF_RELEASE
      destructor must always treat the argument as possibly NULL, and we
      invoke unnecessary (and seemingly unsafe) cleanup logic for the local
      kptr path as well.
      
      This is an odd requirement -- KF_RELEASE kfuncs that are invoked by BPF
      programs do not have this restriction, and the verifier will fail to
      load the program if the register containing the to-be-released type has
      any untrusted modifiers (e.g. PTR_UNTRUSTED or PTR_MAYBE_NULL). So as to
      simplify the expectations required for a KF_RELEASE kfunc, this patch
      set updates the KPTR_REF destructor logic to only be invoked when a
      non-NULL value is xchg'd out of the map.
      
      Additionally, the patch set removes now-unnecessary NULL checks from
      several KF_RELEASE kfuncs, and finally, updates the verifier to have KF_RELEASE
      automatically imply KF_TRUSTED_ARGS. This restriction was already
      implicitly happening because of the aforementioned logic in the verifier
      to reject any regs with untrusted modifiers, and to enforce that
      KF_RELEASE args are passed with a 0 offset. This change just updates the
      behavior to match that of other trusted args. This patch is left to the
      end of the series in case it happens to be controversial, as it arguably
      is slightly orthogonal to the purpose of the rest of the series.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Treat KF_RELEASE kfuncs as KF_TRUSTED_ARGS · 6c831c46
      David Vernet authored
      KF_RELEASE kfuncs are not currently treated as having KF_TRUSTED_ARGS,
      even though they have a superset of the requirements of KF_TRUSTED_ARGS.
      Like KF_TRUSTED_ARGS, KF_RELEASE kfuncs require a 0-offset argument, and
      don't allow NULL-able arguments. Unlike KF_TRUSTED_ARGS which require
      _either_ an argument with ref_obj_id > 0, _or_ (ref->type &
      BPF_REG_TRUSTED_MODIFIERS) (and no unsafe modifiers allowed), KF_RELEASE
      only allows for ref_obj_id > 0.  Because KF_RELEASE today doesn't
      automatically imply KF_TRUSTED_ARGS, some of these requirements are
      enforced in different ways that can make the behavior of the verifier
      feel unpredictable. For example, a KF_RELEASE kfunc with a NULL-able
      argument will currently fail in the verifier with a message like, "arg#0
      is ptr_or_null_ expected ptr_ or socket" rather than "Possibly NULL
      pointer passed to trusted arg0". Our intention is the same, but the
      semantics are different due to implementation details that kfunc authors
      and BPF program writers should not need to care about.
      
      Let's make the behavior of the verifier more consistent and intuitive by
      having KF_RELEASE kfuncs imply the presence of KF_TRUSTED_ARGS. Our
      eventual goal is to have all kfuncs assume KF_TRUSTED_ARGS by default
      anyways, so this takes us a step in that direction.
      
      Note that it does not make sense to assume KF_TRUSTED_ARGS for all
      KF_ACQUIRE kfuncs. KF_ACQUIRE kfuncs can have looser semantics than
      KF_RELEASE, with e.g. KF_RCU | KF_RET_NULL. We may want to have
      KF_ACQUIRE imply KF_TRUSTED_ARGS _unless_ KF_RCU is specified, but that
      can be left to another patch set, and there are no such subtleties to
      address for KF_RELEASE.
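
The implication described above can be sketched in user-space C. This is only an illustration: the `SK_`-prefixed flag names and their bit values are hypothetical stand-ins, not the kernel's definitions, and the helper names are invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits; names and values are assumptions for this sketch. */
enum {
	SK_KF_ACQUIRE      = 1u << 0,
	SK_KF_RELEASE      = 1u << 1,
	SK_KF_RET_NULL     = 1u << 2,
	SK_KF_TRUSTED_ARGS = 1u << 3,
	SK_KF_RCU          = 1u << 4,
};

/* True if the kfunc is flagged as releasing a reference. */
static bool sk_is_release(uint32_t flags)
{
	return (flags & SK_KF_RELEASE) != 0;
}

/*
 * A release kfunc is treated as trusted-args even when the explicit
 * trusted-args flag is absent; acquire kfuncs get no such implication.
 */
static bool sk_is_trusted_args(uint32_t flags)
{
	return (flags & SK_KF_TRUSTED_ARGS) != 0 || sk_is_release(flags);
}
```

With this shape, a kfunc flagged only as release takes the same argument checks (and therefore the same error messages) as one flagged as trusted-args, while an acquire kfunc with looser flags is left alone.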
      Signed-off-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20230325213144.486885-4-void@manifault.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Remove now-unnecessary NULL checks for KF_RELEASE kfuncs · fb2211a5
      David Vernet authored
      Now that we're not invoking kfunc destructors when the kptr in a map was
      NULL, we no longer require NULL checks in many of our KF_RELEASE kfuncs.
      This patch removes those NULL checks.
      Signed-off-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20230325213144.486885-3-void@manifault.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Only invoke kptr dtor following non-NULL xchg · 1431d0b5
      David Vernet authored
      When a map value is being freed, we loop over all of the fields of the
      corresponding BPF object and issue the appropriate cleanup calls
      corresponding to the field's type. If the field is a referenced kptr, we
      atomically xchg the value out of the map, and invoke the kptr's
      destructor on whatever was there before (or bpf_obj_drop() it if it was
      a local kptr).
      
      Currently, we always invoke the destructor (either bpf_obj_drop() or the
      kptr's registered destructor) on any KPTR_REF-type field in a map, even
      if there wasn't a value in the map. This means that any function serving
      as the kptr's KF_RELEASE destructor must always treat the argument as
      possibly NULL, as the following can and regularly does happen:
      
      void *xchgd_field;
      
      /* No value was in the map, so xchgd_field is NULL */
      xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
      field->kptr.dtor(xchgd_field);
      
      These are odd semantics to impose on KF_RELEASE kfuncs -- BPF programs
      are prohibited by the verifier from passing NULL pointers to KF_RELEASE
      kfuncs, so it doesn't make sense to require this of BPF programs, but
      not the main kernel destructor path. It's also unnecessary to invoke any
      cleanup logic for local kptrs. If there is no object there, there's
      nothing to drop.
      
      So as to allow KF_RELEASE kfuncs to fully assume that an argument is
      non-NULL, this patch updates a KPTR_REF's destructor to only be invoked
      when a non-NULL value is xchg'd out of the kptr map field.
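
The guarded destructor call can be sketched in user-space C as follows. This is not the kernel code: C11 `atomic_exchange()` stands in for the kernel's `xchg()`, and `struct kptr_field_sketch`, `free_kptr_slot()` and the counting destructor are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a kptr field descriptor; not the kernel's struct. */
struct kptr_field_sketch {
	void (*dtor)(void *obj);	/* KF_RELEASE-style destructor */
};

static int dtor_calls;		/* counts destructor invocations for the demo */

static void counting_dtor(void *obj)
{
	/* With the guard below, the destructor may assume a non-NULL argument. */
	assert(obj != NULL);
	dtor_calls++;
}

/* Atomically take the pointer out of the slot; release it only if one was there. */
static void free_kptr_slot(const struct kptr_field_sketch *field,
			   _Atomic uintptr_t *slot)
{
	void *xchgd_field = (void *)atomic_exchange(slot, (uintptr_t)0);

	if (!xchgd_field)
		return;		/* nothing was stored, so nothing to release */
	field->dtor(xchgd_field);
}

/* Frees one empty slot and one populated slot; returns the dtor call count. */
static int demo_free_slots(void)
{
	static int object;
	struct kptr_field_sketch field = { .dtor = counting_dtor };
	_Atomic uintptr_t empty = 0;
	_Atomic uintptr_t full = (uintptr_t)&object;

	dtor_calls = 0;
	free_kptr_slot(&field, &empty);
	free_kptr_slot(&field, &full);
	free_kptr_slot(&field, &full);	/* already emptied: no second call */
	return dtor_calls;
}
```

In this sketch the destructor runs exactly once, for the slot that actually held a pointer, which is the property that lets release functions drop their own NULL checks.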
      Signed-off-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20230325213144.486885-2-void@manifault.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  3. 24 Mar, 2023 1 commit
  4. 23 Mar, 2023 10 commits
    • Merge branch 'Transit between BPF TCP congestion controls.' · 226bc6ae
      Martin KaFai Lau authored
      Kui-Feng Lee says:
      
      ====================
      
      Major changes:
      
       - Create bpf_links in the kernel for BPF struct_ops to register and
         unregister it.
      
       - Enables switching between implementations of bpf-tcp-cc under a
         name instantly by replacing the backing struct_ops map of a
         bpf_link.
      
      Previously, a BPF struct_ops stayed active even after the user
      program that created it had terminated and even though it was never
      pinned. For instance, the TCP congestion control subsystem indirectly
      maintains a reference count on the struct_ops of any registered
      BPF-implemented algorithm. Thus, the algorithm won't be deactivated
      until someone deliberately unregisters it.  For compatibility with
      other BPF programs, bpf_links have been created to work in
      coordination with struct_ops maps. This ensures that the registration
      and unregistration of these maps are carried out at the start and end
      of the lifetime of the bpf_link.
      
      We also faced complications when attempting to replace an existing TCP
      congestion control algorithm with a new implementation on the fly. A
      struct_ops map was used to register a TCP congestion control algorithm
      with a unique name.  We had to either register the alternative
      implementation with a new name and move over, or unregister the current
      one before being able to re-register with the same name.  To fix
      this problem, we add an option to migrate the registration of the
      algorithm from struct_ops maps to bpf_links. By modifying the backing
      map of a bpf_link, it becomes possible to replace an existing
      TCP congestion control algorithm with ease.
      ---
      
      The major differences from v11:
      
       - Fix incorrectly setting both old_prog_fd and old_map_fd.
      
      The major differences from v10:
      
       - Add old_map_fd as an additional field instead of an union in
         bpf_link_update_opts.
      
      The major differences from v9:
      
       - Add test case for BPF_F_LINK.  Includes adding old_map_fd to struct
         bpf_link_update_opts in patch 6.
      
       - Return -EPERM instead of -EINVAL when the old map fd doesn't match
         with BPF_F_LINK.
      
       - Fix -EBUSY case in bpf_map__attach_struct_ops().
      
      The major differences from v8:
      
       - Check bpf_struct_ops::{validate,update} in
         bpf_struct_ops_map_alloc()
      
      The major differences from v7:
      
       - Use synchronize_rcu_mult(call_rcu, call_rcu_tasks) to replace
         synchronize_rcu() and synchronize_rcu_tasks().
      
       - Call synchronize_rcu() in tcp_update_congestion_control().
      
       - Handle -EBUSY in bpf_map__attach_struct_ops() to allow a struct_ops
         to be used to create links more than once.  Include a test case.
      
       - Add old_map_fd to bpf_attr and handle BPF_F_REPLACE in
         bpf_struct_ops_map_link_update().
      
       - Remove changes in bpf_dummy_struct_ops.c and add a check of .update
         function pointer of bpf_struct_ops.
      
      The major differences from v6:
      
       - Reword commit logs of patches 1, 2, and 8.
      
       - Call synchronize_rcu_tasks() as well in bpf_struct_ops_map_free().
      
       - Refactor bpf_struct_ops_map_free() so that
         bpf_struct_ops_map_alloc() can free a struct_ops without waiting
         for an RCU grace period.
      
      The major differences from v5:
      
       - Add a new step to bpf_object__load() to prepare vdata.
      
       - Accept BPF_F_REPLACE.
      
       - Check section IDs in find_struct_ops_map_by_offset()
      
       - Add a test case to check mixing w/ and w/o link struct_ops.
      
       - Add a test case of using struct_ops w/o link to update a link.
      
       - Improve bpf_link__detach_struct_ops() to handle the w/ link case.
      
      The major differences from v4:
      
       - Rebase.
      
       - Reorder patches and merge part 4 to part 2 of the v4.
      
      The major differences from v3:
      
       - Remove bpf_struct_ops_map_free_rcu(), and use synchronize_rcu().
      
       - Improve the commit log of the part 1.
      
       - Before transitioning to the READY state, we conduct a value check
         to ensure that struct_ops can be successfully utilized and links
         created later.
      
      The major differences from v2:
      
       - Simplify states
      
         - Remove TOBEUNREG.
      
         - Rename UNREG to READY.
      
       - Stop using the refcnt of the kvalue of a struct_ops. Explicitly
         increase and decrease the refcount of struct_ops.
      
       - Prepare kernel vdata during the load phase of libbpf.
      
      The major differences from v1:
      
       - Added bpf_struct_ops_link to replace the previous union-based
         approach.
      
       - Added UNREG and TOBEUNREG to the state of bpf_struct_ops_map.
      
         - bpf_struct_ops_transit_state() maintains state transitions.
      
       - Fixed synchronization issue.
      
       - Prepare kernel vdata of struct_ops during the loading phase of
         bpf_object.
      
       - Merged previous patch 3 to patch 1.
      
      v11: https://lore.kernel.org/all/20230323010409.2265383-1-kuifeng@meta.com/
      v10: https://lore.kernel.org/all/20230321232813.3376064-1-kuifeng@meta.com/
      v9: https://lore.kernel.org/all/20230320195644.1953096-1-kuifeng@meta.com/
      v8: https://lore.kernel.org/all/20230318053144.1180301-1-kuifeng@meta.com/
      v7: https://lore.kernel.org/all/20230316023641.2092778-1-kuifeng@meta.com/
      v6: https://lore.kernel.org/all/20230310043812.3087672-1-kuifeng@meta.com/
      v5: https://lore.kernel.org/all/20230308005050.255859-1-kuifeng@meta.com/
      v4: https://lore.kernel.org/all/20230307232913.576893-1-andrii@kernel.org/
      v3: https://lore.kernel.org/all/20230303012122.852654-1-kuifeng@meta.com/
      v2: https://lore.kernel.org/bpf/20230223011238.12313-1-kuifeng@meta.com/
      v1: https://lore.kernel.org/bpf/20230214221718.503964-1-kuifeng@meta.com/
      ====================
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • selftests/bpf: Test switching TCP Congestion Control algorithms. · 06da9f3b
      Kui-Feng Lee authored
      Create a pair of sockets that utilize the congestion control algorithm
      under a particular name. Then switch up this congestion control
      algorithm to another implementation and check whether newly created
      connections using the same cc name now run the new implementation.
      
      Also, try to update a link with a struct_ops that is without
      BPF_F_LINK or with a wrong or different name.  These cases should fail
      due to the violation of assumptions.  To update a bpf_link of a
      struct_ops, it must be replaced with another struct_ops that is
      identical in type and name and has the BPF_F_LINK flag.
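
The update rule stated above (same type, same name, link flag set) can be sketched as a standalone check. Everything here is hypothetical: `struct st_ops_map_sketch` and `check_link_update()` are invented for illustration, and returning `-EINVAL` for every rejection is an assumption of this sketch rather than a statement about the kernel's error codes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical summary of a struct_ops map; not a kernel or libbpf type. */
struct st_ops_map_sketch {
	int  type_id;		/* which struct_ops type the map implements */
	bool has_link_flag;	/* stands in for BPF_F_LINK */
	char name[16];		/* e.g. the congestion control name */
};

/*
 * Decide whether new_map may replace cur as the backing map of a link.
 * The -EINVAL return value is an assumption made for this sketch.
 */
static int check_link_update(const struct st_ops_map_sketch *cur,
			     const struct st_ops_map_sketch *new_map)
{
	if (!new_map->has_link_flag)
		return -EINVAL;		/* map was not created for use with links */
	if (new_map->type_id != cur->type_id)
		return -EINVAL;		/* different struct_ops type */
	if (strcmp(new_map->name, cur->name) != 0)
		return -EINVAL;		/* wrong or different name */
	return 0;
}
```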
      
      The other test case is to create links from the same struct_ops more
      than once.  It makes sure a struct_ops can be used repeatedly.
      Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
      Link: https://lore.kernel.org/r/20230323032405.3735486-9-kuifeng@meta.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • libbpf: Use .struct_ops.link section to indicate a struct_ops with a link. · 809a69d6
      Kui-Feng Lee authored
      Flag a struct_ops as backing a bpf_link by putting it in the
      ".struct_ops.link" section.  Once it is flagged, the created
      struct_ops can be used to create a bpf_link or update a bpf_link that
      has been backed by another struct_ops.
      Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230323032405.3735486-8-kuifeng@meta.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • libbpf: Update a bpf_link with another struct_ops. · 912dd4b0
      Kui-Feng Lee authored
      Introduce bpf_link__update_map(), which allows atomically updating the
      underlying struct_ops implementation for a given struct_ops BPF link.
      
      Also add old_map_fd to struct bpf_link_update_opts to handle the
      BPF_F_REPLACE feature.
      Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
      Link: https://lore.kernel.org/r/20230323032405.3735486-7-kuifeng@meta.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • bpf: Update the struct_ops of a bpf_link. · aef56f2e
      Kui-Feng Lee authored
      By improving the BPF_LINK_UPDATE command of bpf(), it should allow you
      to conveniently switch between different struct_ops on a single
      bpf_link. This would enable smoother transitions from one struct_ops
      to another.
      
      The struct_ops maps passed along with BPF_LINK_UPDATE should have the
      BPF_F_LINK flag.
      Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230323032405.3735486-6-kuifeng@meta.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • libbpf: Create a bpf_link in bpf_map__attach_struct_ops(). · 8d1608d7
      Kui-Feng Lee authored
      bpf_map__attach_struct_ops() was creating a dummy bpf_link as a
      placeholder, but now it is constructing an authentic one by calling
      bpf_link_create() if the map has the BPF_F_LINK flag.
      
      You can flag a struct_ops map with BPF_F_LINK by calling
      bpf_map__set_map_flags().
      Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230323032405.3735486-5-kuifeng@meta.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • bpf: Create links for BPF struct_ops maps. · 68b04864
      Kui-Feng Lee authored
      Make bpf_link support struct_ops.  Previously, struct_ops were always
      used alone without any associated links. Upon updating its value, a
      struct_ops would be activated automatically. Yet other BPF program
      types are required to create a bpf_link for their instances before they
      can become active. Now, however, you can create an inactive
      struct_ops, and create a link to activate it later.
      
      With bpf_links, struct_ops has a behavior similar to other BPF program
      types. You can pin/unpin their links, and a struct_ops will be
      deactivated when its link is removed, whereas previously someone had
      to delete the value for it to be deactivated.
      
      bpf_links are responsible for registering their associated
      struct_ops. You can only use a struct_ops that has the BPF_F_LINK flag
      set to create a bpf_link, while a struct_ops without this flag behaves in
      the same manner as before and is registered upon updating its value.
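
The two registration paths can be modelled with a small user-space sketch. The struct, its fields and the `sk_` helpers are hypothetical names invented here; the sketch models only who triggers registration, not the kernel's actual state machine or error codes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a struct_ops map; not a kernel type. */
struct st_ops_sketch {
	bool has_link_flag;	/* stands in for BPF_F_LINK */
	bool has_value;		/* the map value has been written */
	bool registered;	/* the struct_ops is active in its subsystem */
};

/* Writing the value registers immediately only for maps without the flag. */
static void sk_update_value(struct st_ops_sketch *m)
{
	m->has_value = true;
	if (!m->has_link_flag)
		m->registered = true;
}

/* Creating a link registers a flagged map; unflagged maps are rejected. */
static int sk_link_create(struct st_ops_sketch *m)
{
	if (!m->has_link_flag || !m->has_value)
		return -EINVAL;	/* error code is an assumption of this sketch */
	m->registered = true;
	return 0;
}

/* Removing the link deactivates the struct_ops. */
static void sk_link_release(struct st_ops_sketch *m)
{
	m->registered = false;
}
```

In this model a flagged map stays inactive after its value is written, becomes active when a link is created, and goes inactive again when the link is released, after which another link can be created from the same map.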
      
      The BPF_LINK_TYPE_STRUCT_OPS serves a dual purpose. Not only is it
      used to craft the links for BPF struct_ops programs, but also to
      create links for BPF struct_ops themselves.  Since the links of BPF
      struct_ops programs are only used to create trampolines internally,
      they are never seen in other contexts. Thus, they can be reused for
      struct_ops themselves.
      
      To maintain a reference to the map supporting this link, we add
      bpf_struct_ops_link as an additional type. The pointer to the map is
      RCU-protected, which won't be necessary until later in the patchset.
      Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
      Link: https://lore.kernel.org/r/20230323032405.3735486-4-kuifeng@meta.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • net: Update an existing TCP congestion control algorithm. · 8fb1a76a
      Kui-Feng Lee authored
      This feature lets you immediately transition to another congestion
      control algorithm or implementation with the same name.  Once a name
      is updated, new connections will apply this new algorithm.
      
      The purpose is to update a customized algorithm implemented in BPF
      struct_ops with a new version on the fly.  The following is an
      example of using the userspace API implemented in later BPF patches.
      
         link = bpf_map__attach_struct_ops(skel->maps.ca_update_1);
         .......
         err = bpf_link__update_map(link, skel->maps.ca_update_2);
      
      We first load and register an algorithm implemented in BPF struct_ops,
      then swap it out with a new one using the same name. After that, newly
      created connections will apply the updated algorithm, while older ones
      retain the previous version already applied.
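
The effect on old and new connections can be sketched with a toy name registry in plain C. This is a single-threaded illustration only: the registry, `struct cc_impl_sketch`, `struct conn_sketch` and all function names are hypothetical, and none of the locking or RCU handling a real implementation needs is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical congestion control implementation record. */
struct cc_impl_sketch {
	const char *name;
	int version;
};

/* A connection keeps whichever implementation it was given when opened. */
struct conn_sketch {
	const struct cc_impl_sketch *cc;
};

#define MAX_CC 4
static const struct cc_impl_sketch *registry[MAX_CC];

static const struct cc_impl_sketch *cc_find(const char *name)
{
	for (size_t i = 0; i < MAX_CC; i++)
		if (registry[i] && strcmp(registry[i]->name, name) == 0)
			return registry[i];
	return NULL;
}

/* Names are unique: registering a second implementation under a name fails. */
static int cc_register(const struct cc_impl_sketch *impl)
{
	if (cc_find(impl->name))
		return -1;
	for (size_t i = 0; i < MAX_CC; i++) {
		if (!registry[i]) {
			registry[i] = impl;
			return 0;
		}
	}
	return -1;
}

/* Swap the registered implementation in place, keeping the same name. */
static int cc_update(const struct cc_impl_sketch *old_impl,
		     const struct cc_impl_sketch *new_impl)
{
	if (strcmp(old_impl->name, new_impl->name) != 0)
		return -1;
	for (size_t i = 0; i < MAX_CC; i++) {
		if (registry[i] == old_impl) {
			registry[i] = new_impl;
			return 0;
		}
	}
	return -1;
}

/* A new connection looks the algorithm up by name at open time. */
static int conn_open(struct conn_sketch *conn, const char *name)
{
	conn->cc = cc_find(name);
	return conn->cc ? 0 : -1;
}
```

A connection opened before `cc_update()` keeps pointing at the old record, while one opened afterwards under the same name picks up the new one, mirroring the behavior described above.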
      
      This patch also takes this chance to refactor the ca validation into
      the new tcp_validate_congestion_control() function.
      
      Cc: netdev@vger.kernel.org, Eric Dumazet <edumazet@google.com>
      Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
      Link: https://lore.kernel.org/r/20230323032405.3735486-3-kuifeng@meta.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • bpf: Retire the struct_ops map kvalue->refcnt. · b671c206
      Kui-Feng Lee authored
      We have replaced kvalue->refcnt with synchronize_rcu() to wait for an
      RCU grace period.
      
      Maintenance of kvalue->refcnt was a complicated task, as we had to
      simultaneously keep track of two reference counts: kvalue->refcnt and
      the reference count of bpf_map. When kvalue->refcnt reaches zero, we
      also have to reduce the reference count on bpf_map - yet these steps
      are not performed in an atomic manner and require us to be vigilant
      when managing them. By eliminating kvalue->refcnt, we can make our
      maintenance more straightforward, as the refcount of bpf_map is now
      the only one to manage.
      
      To prevent the trampoline image of a struct_ops from being released
      while it is still in use, we wait for an RCU grace period. The
      setsockopt(TCP_CONGESTION, "...") command allows you to change your
      socket's congestion control algorithm and can result in releasing the
      old struct_ops implementation. That is fine. However, because this
      function is exposed through bpf_setsockopt(), it may be called by BPF
      programs as well. To ensure that the trampoline image belonging to a
      struct_ops can be safely called while its method is in use, the
      trampoline safeguards the BPF program with rcu_read_lock(). Doing so
      prevents any destruction of the associated images before returning
      from a trampoline and requires us to wait for an RCU grace period.
      Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
      Link: https://lore.kernel.org/r/20230323032405.3735486-2-kuifeng@meta.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • bpf: remember meta->iter info only for initialized iters · b63cbc49
      Andrii Nakryiko authored
      For iter_new() functions, the iterator state's slot might not yet be
      initialized, in which case iter_get_spi() will return -ERANGE. This is
      expected and is handled properly. But for the iter_next() and iter_destroy()
      cases, the iter slot is supposed to be initialized and correct, so -ERANGE is
      not possible.
      
      Move meta->iter.{spi,frameno} initialization into iter_next/iter_destroy
      handling branch to make it more explicit that valid information will be
      remembered in meta->iter block for subsequent use in process_iter_next_call(),
      avoiding a confusing-looking -ERANGE assignment for meta->iter.spi.
      Reported-by: Dan Carpenter <error27@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230322232502.836171-1-andrii@kernel.org
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
  5. 22 Mar, 2023 1 commit