1. 16 Apr, 2023 8 commits
    • bpf: Centralize btf_field-specific initialization logic · 3e81740a
      Dave Marchevsky authored
      All btf_fields in an object are 0-initialized by memset in
      bpf_obj_init. This might not be a valid initial state for some field
      types, in which case kfuncs that use the type will properly initialize
      their input if it's been 0-initialized. Some BPF graph collection types
      and kfuncs do this: bpf_list_{head,node} and bpf_rb_node.
      
      An earlier patch in this series added the bpf_refcount field, for which
      the 0 state indicates that the refcounted object should be free'd.
      bpf_obj_init treats this field specially, setting refcount to 1 instead
      of relying on scattered "refcount is 0? Must have just been initialized,
      let's set to 1" logic in kfuncs.
      
      This patch extends this treatment to list and rbtree field types,
      allowing most scattered initialization logic in kfuncs to be removed.
      
      Note that bpf_{list_head,rb_root} may be inside a BPF map, in which case
      they'll be 0-initialized without passing through the newly-added logic,
      so scattered initialization logic must remain for these collection root
      types.
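
The centralization described above can be pictured with a small userspace model. This is only a sketch under assumptions: the enum, the struct layouts and the name obj_init are hypothetical stand-ins, not the kernel's btf_field or bpf_obj_init definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for kernel types; names and layouts are illustrative. */
enum field_type { FIELD_LIST_NODE, FIELD_RB_NODE, FIELD_REFCOUNT };

struct field {
	enum field_type type;
	size_t offset;		/* byte offset of the field inside the object */
};

struct list_node_model { struct list_node_model *next, *prev; };
struct rb_node_model { struct rb_node_model *parent; };

/* Example object type with one field of each special kind. */
struct demo_obj {
	long key;
	struct list_node_model l;
	struct rb_node_model r;
	int refcount;
};

/* Zero the object, then put each special field into a valid initial state. */
static void obj_init(void *obj, size_t size, const struct field *fields, size_t nr)
{
	memset(obj, 0, size);
	for (size_t i = 0; i < nr; i++) {
		void *p = (char *)obj + fields[i].offset;

		assert(fields[i].offset < size);
		switch (fields[i].type) {
		case FIELD_LIST_NODE: {
			struct list_node_model *n = p;

			n->next = n->prev = n;	/* self-linked: not on any list */
			break;
		}
		case FIELD_RB_NODE:
			/* self-parented: not in any tree */
			((struct rb_node_model *)p)->parent = p;
			break;
		case FIELD_REFCOUNT:
			*(int *)p = 1;		/* the creator holds the only reference */
			break;
		}
	}
}
```

With every field type handled in one place, the functions that later consume these fields no longer need their own "was this just zeroed?" checks.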
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-9-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Migrate bpf_rbtree_remove to possibly fail · 404ad75a
      Dave Marchevsky authored
      This patch modifies bpf_rbtree_remove to account for possible failure
      when the input rb_node is not in any collection. In that case the
      function now returns NULL. As before, on successful removal an owning
      reference to the removed node is returned.
      
      Adding KF_RET_NULL to bpf_rbtree_remove's kfunc flags - now KF_RET_NULL |
      KF_ACQUIRE - provides the desired verifier semantics:
      
        * retval must be checked for NULL before use
        * if NULL, retval's ref_obj_id is released
        * retval is a "maybe acquired" owning ref, not a non-owning ref,
          so it will live past end of critical section (bpf_spin_unlock), and
          thus can be checked for NULL after the end of the CS
      
      BPF programs must add checks
      ============================
      
      This does change bpf_rbtree_remove's verifier behavior. BPF program
      writers will need to add NULL checks to their programs, but the
      resulting UX looks natural:
      
        bpf_spin_lock(&glock);
      
        n = bpf_rbtree_first(&ghead);
        if (!n) { /* ... */}
        res = bpf_rbtree_remove(&ghead, &n->node);
      
        bpf_spin_unlock(&glock);
      
        if (!res)  /* Newly-added check after this patch */
          return 1;
      
        n = container_of(res, /* ... */);
        /* Do something else with n */
        bpf_obj_drop(n);
        return 0;
      
      The "if (!res)" check above is the only addition necessary for the above
      program to pass verification after this patch.
      
      bpf_rbtree_remove no longer clobbers non-owning refs
      ====================================================
      
      An issue arises when bpf_rbtree_remove fails, though. Consider this
      example:
      
        struct node_data {
          long key;
          struct bpf_list_node l;
          struct bpf_rb_node r;
          struct bpf_refcount ref;
        };
      
        long failed_sum;
      
        void bpf_prog()
        {
          struct node_data *n = bpf_obj_new(/* ... */);
          struct bpf_rb_node *res;
          n->key = 10;
      
          bpf_spin_lock(&glock);
      
          bpf_list_push_back(&some_list, &n->l); /* n is now a non-owning ref */
          res = bpf_rbtree_remove(&some_tree, &n->r);
          if (!res)
            failed_sum += n->key;  /* not possible */
      
          bpf_spin_unlock(&glock);
          /* if (res) { do something useful and drop } ... */
        }
      
      The bpf_rbtree_remove in this example will always fail. Similarly to
      bpf_spin_unlock, bpf_rbtree_remove is a non-owning reference
      invalidation point. The verifier clobbers all non-owning refs after a
      bpf_rbtree_remove call, so the "failed_sum += n->key" line will fail
      verification, and in fact there's no good way to get information about
      the node which failed to be removed after the invalidation. This patch removes
      non-owning reference invalidation from bpf_rbtree_remove to allow the
      above usecase to pass verification. The logic for why this is now
      possible is as follows:
      
      Before this series, bpf_rbtree_remove couldn't fail and thus assumed that
      its input, a non-owning reference, was in the tree. But it's easy to
      construct an example where two non-owning references pointing to the same
      underlying memory are acquired and passed to bpf_rbtree_remove one after
      another (see rbtree_api_release_aliasing in
      selftests/bpf/progs/rbtree_fail.c).
      
      So it was necessary to clobber non-owning refs to prevent this
      case and, more generally, to enforce the "non-owning ref is definitely
      in some collection" invariant. This series removes that invariant, and
      the failure / runtime checking added in this patch provides a clean way
      to deal with the aliasing issue: just fail to remove.
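
The "just fail to remove" behavior can be modeled in plain userspace C. The sketch below is an assumption-laden illustration: rb_node_model, tree_remove and remove_twice are hypothetical names, and the in_tree flag stands in for whatever state the real kfunc inspects.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a node only records whether it is linked into a tree. */
struct rb_node_model { int in_tree; };

struct node_data {
	long key;
	struct rb_node_model r;
};

/* Returns the removed node, or NULL when the node is not in a tree. */
static struct rb_node_model *tree_remove(struct rb_node_model *n)
{
	assert(n != NULL);
	if (!n->in_tree)
		return NULL;	/* runtime check instead of a verifier invariant */
	n->in_tree = 0;
	return n;
}

/* Remove the same node through two aliases; returns how many removals succeeded. */
static int remove_twice(struct node_data *n)
{
	struct rb_node_model *alias1 = &n->r, *alias2 = &n->r;
	int ok = 0;

	if (tree_remove(alias1))
		ok++;
	if (tree_remove(alias2))
		ok++;
	return ok;
}
```

Only the first removal succeeds; the second sees a node that is no longer linked and reports failure instead of corrupting the tree.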
      
      Because the aliasing issue prevented by clobbering non-owning refs is no
      longer an issue, this patch removes the invalidate_non_owning_refs
      call from verifier handling of bpf_rbtree_remove. Note that
      bpf_spin_unlock - the other caller of invalidate_non_owning_refs -
      clobbers non-owning refs for a different reason, so its clobbering
      behavior remains unchanged.
      
      No BPF program changes are necessary for programs to remain valid as a
      result of this clobbering change. A valid program before this patch
      passed verification with its non-owning refs having shorter (or equal)
      lifetimes due to more aggressive clobbering.
      
      Also, update existing tests to check bpf_rbtree_remove retval for NULL
      where necessary, and move rbtree_api_release_aliasing from
      progs/rbtree_fail.c to progs/rbtree.c since it's now expected to pass
      verification.
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-8-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Modify linked_list tests to work with macro-ified inserts · de67ba39
      Dave Marchevsky authored
      The linked_list tests use macros and function pointers to reduce code
      duplication. Earlier in the series, bpf_list_push_{front,back} were
      modified to be macros, expanding to invoke actual kfuncs
      bpf_list_push_{front,back}_impl. Due to this change, a code snippet
      like:
      
        void (*p)(void *, void *) = (void *)&bpf_list_##op;
        p(hexpr, nexpr);
      
      meant to do bpf_list_push_{front,back}(hexpr, nexpr), will no longer
      work as it's no longer valid to do &bpf_list_push_{front,back} since
      they're no longer functions.
      
      This patch fixes issues of this type, along with two other minor changes
      - one improvement and one fix - both related to the node argument to
      list_push_{front,back}.
      
        * The fix: migration of list_push tests away from (void *, void *)
          func ptr uncovered that some tests were incorrectly passing pointer
          to node, not pointer to struct bpf_list_node within the node. This
          patch fixes such issues (CHECK(..., f) -> CHECK(..., &f->node))
      
        * The improvement: In linked_list tests, the struct foo type has two
          list_node fields: node and node2, at byte offsets 0 and 40 within
          the struct, respectively. Currently node is used in ~all tests
          involving struct foo and lists. The verifier needs to do some work
          to account for the offset of bpf_list_node within the node type, so
          using node2 instead of node exercises that logic more in the tests.
          This patch migrates linked_list tests to use node2 instead of node.
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-7-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail · d2dcc67d
      Dave Marchevsky authored
      Consider this code snippet:
      
        struct node {
          long key;
          struct bpf_list_node l;
          struct bpf_rb_node r;
          struct bpf_refcount ref;
        };
      
        int some_bpf_prog(void *ctx)
        {
          struct node *n = bpf_obj_new(/*...*/), *m;
      
          bpf_spin_lock(&glock);
      
          bpf_rbtree_add(&some_tree, &n->r, /* ... */);
          m = bpf_refcount_acquire(n);
          bpf_rbtree_add(&other_tree, &m->r, /* ... */);
      
          bpf_spin_unlock(&glock);
      
          /* ... */
        }
      
      After bpf_refcount_acquire, n and m point to the same underlying memory,
      and that node's bpf_rb_node field is being used by the some_tree insert,
      so overwriting it as a result of the second insert is an error. In order
      to properly support refcounted nodes, the rbtree and list insert
      functions must be allowed to fail. This patch adds such support.
      
      The kfuncs bpf_rbtree_add, bpf_list_push_{front,back} are modified to
      return an int indicating success/failure, with 0 -> success, nonzero ->
      failure.
      
      bpf_obj_drop on failure
      =======================
      
      Currently the only reason an insert can fail is the example above: the
      bpf_{list,rb}_node is already in use. When such a failure occurs, the
      insert kfuncs will bpf_obj_drop the input node. This allows the insert
      operations to logically fail without changing their verifier owning ref
      behavior, namely the unconditional release_reference of the input
      owning ref.
      
      With insert that always succeeds, ownership of the node is always passed
      to the collection, since the node always ends up in the collection.
      
      With a possibly-failed insert w/ bpf_obj_drop, ownership of the node
      is always passed either to the collection (success), or to bpf_obj_drop
      (failure). Regardless, it's correct to continue unconditionally
      releasing the input owning ref, as something is always taking ownership
      from the calling program on insert.
      
      Keeping owning ref behavior unchanged results in a nice default UX for
      insert functions that can fail. If the program's reaction to a failed
      insert is "fine, just get rid of this owning ref for me and let me go
      on with my business", then there's no reason to check for failure since
      that's default behavior. e.g.:
      
        long important_failures = 0;
      
        int some_bpf_prog(void *ctx)
        {
          struct node *n, *m, *o; /* all bpf_obj_new'd */
      
          bpf_spin_lock(&glock);
          bpf_rbtree_add(&some_tree, &n->node, /* ... */);
          bpf_rbtree_add(&some_tree, &m->node, /* ... */);
          if (bpf_rbtree_add(&some_tree, &o->node, /* ... */)) {
            important_failures++;
          }
          bpf_spin_unlock(&glock);
        }
      
      If we instead chose to pass ownership back to the program on failed
      insert - by returning NULL on success or an owning ref on failure -
      programs would always have to do something with the returned ref on
      failure. The most likely action is probably "I'll just get rid of this
      owning ref and go about my business", which ideally would look like:
      
        if (n = bpf_rbtree_add(&some_tree, &n->node, /* ... */))
          bpf_obj_drop(n);
      
      But bpf_obj_drop isn't allowed in a critical section and inserts must
      occur within one, so in reality error handling would become a
      hard-to-parse mess.
      
      For refcounted nodes, we can replicate the "pass ownership back to
      program on failure" logic with this patch's semantics, albeit in an ugly
      way:
      
        struct node *n = bpf_obj_new(/* ... */), *m;
      
        bpf_spin_lock(&glock);
      
        m = bpf_refcount_acquire(n);
        if (bpf_rbtree_add(&some_tree, &n->node, /* ... */)) {
          /* Do something with m */
        }
      
        bpf_spin_unlock(&glock);
        bpf_obj_drop(m);
      
      bpf_refcount_acquire is used to simulate "return owning ref on failure".
      This should be an uncommon occurrence, though.
      
      Addition of two verifier-fixup'd args to collection inserts
      ===========================================================
      
      The actual bpf_obj_drop kfunc is
      bpf_obj_drop_impl(void *, struct btf_struct_meta *), with bpf_obj_drop
      macro populating the second arg with 0 and the verifier later filling in
      the arg during insn fixup.
      
      Because bpf_rbtree_add and bpf_list_push_{front,back} now might do
      bpf_obj_drop, these kfuncs need a btf_struct_meta parameter that can be
      passed to bpf_obj_drop_impl.
      
      Similarly, because the 'node' param to those insert functions is the
      bpf_{list,rb}_node within the node type, and bpf_obj_drop expects a
      pointer to the beginning of the node, the insert functions need to be
      able to find the beginning of the node struct. A second
      verifier-populated param is necessary: the offset of {list,rb}_node within the
      node type.
      
      These two new params allow the insert kfuncs to correctly call
      __bpf_obj_drop_impl:
      
        beginning_of_node = bpf_rb_node_ptr - offset
        if (already_inserted)
          __bpf_obj_drop_impl(beginning_of_node, btf_struct_meta->record);
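
The pointer arithmetic on the first line is the familiar container_of pattern. A minimal userspace sketch, with hypothetical struct layouts standing in for the BPF types and offsetof playing the role of the verifier-supplied offset:

```c
#include <assert.h>
#include <stddef.h>

struct rb_node_model { void *opaque; };	/* stand-in for struct bpf_rb_node */

struct node_data {
	long key;
	struct rb_node_model r;
};

/* Recover the start of the enclosing object from a pointer to its embedded field. */
static void *beginning_of_node(void *field_ptr, size_t field_off)
{
	assert(field_ptr != NULL);
	return (char *)field_ptr - field_off;
}
```

Given a pointer to the r member of a node_data object and offsetof(struct node_data, r), the helper returns the address of the object itself.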
      
      Similarly to other kfuncs with "hidden" verifier-populated params, the
      insert functions are renamed with an _impl suffix and a macro is provided
      for common usage. For example, bpf_rbtree_add kfunc is now
      bpf_rbtree_add_impl and bpf_rbtree_add is now a macro which sets
      "hidden" args to 0.
      
      Due to the two new args BPF progs will need to be recompiled to work
      with the new _impl kfuncs.
      
      This patch also rewrites the "hidden argument" explanation to more
      directly say why the BPF program writer doesn't need to populate the
      arguments with anything meaningful.
      
      How does this new logic affect non-owning references?
      =====================================================
      
      Currently, non-owning refs are valid until the end of the critical
      section in which they're created. We can make this guarantee because, if
      a non-owning ref exists, the referent was added to some collection. The
      collection will drop() its nodes when it goes away, but it can't go away
      while our program is accessing it, so that's not a problem. If the
      referent is removed from the collection in the same CS that it was added
      in, it can't be bpf_obj_drop'd until after CS end. Those are the only
      two ways to free the referent's memory and neither can happen until
      after the non-owning ref's lifetime ends.
      
      On first glance, having these collection insert functions potentially
      bpf_obj_drop their input seems like it breaks the "can't be
      bpf_obj_drop'd until after CS end" line of reasoning. But we care about
      the memory not being _freed_ until CS end, and a previous patch
      in the series modified bpf_obj_drop such that it doesn't free refcounted
      nodes until refcount == 0. So the statement can be more accurately
      rewritten as "can't be free'd until after CS end".
      
      We can prove that this rewritten statement holds for any non-owning
      reference produced by collection insert functions:
      
      * If the input to the insert function is _not_ refcounted
        * We have an owning reference to the input, and can conclude it isn't
          in any collection
          * Inserting a node in a collection turns owning refs into
            non-owning, and since our input type isn't refcounted, there's no
            way to obtain additional owning refs to the same underlying
            memory
        * Because our node isn't in any collection, the insert operation
          cannot fail, so bpf_obj_drop will not execute
        * If bpf_obj_drop is guaranteed not to execute, there's no risk of
          memory being free'd
      
      * Otherwise, the input to the insert function is refcounted
        * If the insert operation fails due to the node's bpf_list_node or
          bpf_rb_node already being in some collection, there was some previous
          successful insert which passed refcount to the collection
        * Since we have an owning reference to the input, it must have been
          acquired via bpf_refcount_acquire, which bumped the refcount
        * refcount must be >= 2 since there's a valid owning reference and the
          node is already in a collection
        * Insert triggering bpf_obj_drop will decr refcount to >= 1, never
          resulting in a free
      
      So although we may do bpf_obj_drop during the critical section, this
      will never result in memory being free'd, and no changes to non-owning
      ref logic are needed in this patch.
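
The refcounted branch of the argument above can be checked against a toy model. This is a sketch only: obj_model and the three helpers are hypothetical names, a "freed" flag replaces a real free, and -1 merely stands for "nonzero on failure".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a refcounted node; "freed" stands in for a real free(). */
struct obj_model {
	int refcount;
	bool in_collection;
	bool freed;
};

/* Give up one owning reference; the memory goes away only at refcount 0. */
static void obj_drop(struct obj_model *o)
{
	assert(o->refcount > 0);
	if (--o->refcount == 0)
		o->freed = true;
}

/* Produce a new owning reference to the same object. */
static struct obj_model *refcount_acquire(struct obj_model *o)
{
	o->refcount++;
	return o;
}

/*
 * Insert always consumes the caller's owning reference: the collection
 * takes it on success, obj_drop takes it on failure.
 */
static int collection_add(struct obj_model *o)
{
	if (o->in_collection) {
		obj_drop(o);
		return -1;
	}
	o->in_collection = true;
	return 0;
}
```

In this model a failing insert can only be reached with refcount >= 2 (one reference held by the collection, one by the caller), so the drop on failure leaves refcount >= 1 and never sets freed.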
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-6-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add bpf_refcount_acquire kfunc · 7c50b1cb
      Dave Marchevsky authored
      Currently, BPF programs can interact with the lifetime of refcounted
      local kptrs in the following ways:
      
        bpf_obj_new    - Initialize refcount to 1 as part of new object creation
        bpf_obj_drop   - Decrement refcount and free object if it's 0
        collection add - Pass ownership to the collection. No change to
                         refcount but collection is responsible for
                         bpf_obj_dropping it
      
      In order to be able to add a refcounted local kptr to multiple
      collections we need to be able to increment the refcount and acquire a
      new owning reference. This patch adds a kfunc, bpf_refcount_acquire,
      implementing such an operation.
      
      bpf_refcount_acquire takes a refcounted local kptr and returns a new
      owning reference to the same underlying memory as the input. The input
      can be either owning or non-owning. To reinforce why this is safe,
      consider the following code snippets:
      
        struct node *n = bpf_obj_new(typeof(*n)); // A
        struct node *m = bpf_refcount_acquire(n); // B
      
      In the above snippet, n will be alive with refcount=1 after (A), and
      since nothing changes that state before (B), it's obviously safe. If
      n is instead added to some rbtree, we can still safely refcount_acquire
      it:
      
        struct node *n = bpf_obj_new(typeof(*n));
        struct node *m;
      
        bpf_spin_lock(&glock);
        bpf_rbtree_add(&groot, &n->node, less);   // A
        m = bpf_refcount_acquire(n);              // B
        bpf_spin_unlock(&glock);
      
      In the above snippet, after (A) n is a non-owning reference, and after
      (B) m is an owning reference pointing to the same memory as n. Although
      n has no ownership of that memory's lifetime, it's guaranteed to be
      alive until the end of the critical section, and n would be clobbered if
      we were past the end of the critical section, so it's safe to bump
      refcount.
      
      Implementation details:
      
      * From verifier's perspective, bpf_refcount_acquire handling is similar
        to bpf_obj_new and bpf_obj_drop. Like the former, it returns a new
        owning reference matching input type, although like the latter, type
        can be inferred from concrete kptr input. Verifier changes in
        {check,fixup}_kfunc_call and check_kfunc_args are largely copied from
        aforementioned functions' verifier changes.
      
      * An exception to the above is the new KF_ARG_PTR_TO_REFCOUNTED_KPTR
        arg, indicated by new "__refcounted_kptr" kfunc arg suffix. This is
        necessary in order to handle both owning and non-owning input without
        adding special-casing to "__alloc" arg handling. Also a convenient
        place to confirm that input type has bpf_refcount field.
      
      * The implemented kfunc is actually bpf_refcount_acquire_impl, with
        'hidden' second arg that the verifier sets to the type's struct_meta
        in fixup_kfunc_call.
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-5-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Support refcounted local kptrs in existing semantics · 1512217c
      Dave Marchevsky authored
      A local kptr is considered 'refcounted' when it is of a type that has a
      bpf_refcount field. When such a kptr is created, its refcount should be
      initialized to 1; when destroyed, the object should be free'd only if a
      refcount decr results in 0 refcount.
      
      Existing logic always frees the underlying memory when destroying a
      local kptr, and 0-initializes all btf_record fields. This patch adds
      checks for "is local kptr refcounted?" and new logic for that case in
      the appropriate places.
      
      This patch focuses on changing existing semantics and thus conspicuously
      does _not_ provide a way for BPF programs to increment refcount. That
      follows later in the series.
      
      __bpf_obj_drop_impl is modified to do the right thing when it sees a
      refcounted type. Container types for graph nodes (list, tree, stashed in
      map) are migrated to use __bpf_obj_drop_impl as a destructor for their
      nodes instead of each having custom destruction code in their _free
      paths. Now that "drop" isn't a synonym for "free" when the type is
      refcounted it makes sense to centralize this logic.
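
Why a shared destructor matters once "drop" and "free" diverge can be seen in a toy container teardown. The sketch below is an illustration under assumptions: node_model, node_drop and list_free are hypothetical names, and a "freed" flag replaces a real free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted node on a singly linked list. */
struct node_model {
	int refcount;
	bool freed;		/* stands in for a real free() */
	struct node_model *next;
};

/* Single destructor: decrement, and "free" only when the count hits zero. */
static void node_drop(struct node_model *n)
{
	assert(n->refcount > 0);
	if (--n->refcount == 0)
		n->freed = true;
}

/* Container teardown reuses the same destructor for every node it owns. */
static void list_free(struct node_model *head)
{
	while (head) {
		struct node_model *next = head->next;

		head->next = NULL;
		node_drop(head);
		head = next;
	}
}
```

A node that something else still references survives the teardown with its count reduced by one, while a node owned only by the list is released.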
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-4-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Introduce opaque bpf_refcount struct and add btf_record plumbing · d54730b5
      Dave Marchevsky authored
      A 'struct bpf_refcount' is added to the set of opaque uapi/bpf.h types
      meant for use in BPF programs. Similarly to other opaque types like
      bpf_spin_lock and bpf_rb_node, the verifier needs to know where in
      user-defined struct types a bpf_refcount can be located, so necessary
      btf_record plumbing is added to enable this. bpf_refcount is sized to
      hold a refcount_t.
      
      Similarly to bpf_spin_lock, the offset of a bpf_refcount is cached in
      btf_record as refcount_off in addition to being in the field array.
      Caching refcount_off makes sense for this field because further patches
      in the series will modify functions that take local kptrs (e.g.
      bpf_obj_drop) to change their behavior if the type they're operating on
      is refcounted. So enabling fast "is this type refcounted?" checks is
      desirable.
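
The benefit of caching the offset can be sketched as follows. Everything here is a hypothetical model (record_model, make_record, record_is_refcounted are illustrative names, and a negative offset is assumed as the "absent" sentinel); it only contrasts a one-time scan with a constant-time check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum field_type { FIELD_SPIN_LOCK, FIELD_LIST_NODE, FIELD_REFCOUNT };

struct field_model {
	enum field_type type;
	int offset;
};

struct record_model {
	int refcount_off;	/* negative when the type has no refcount field */
	size_t cnt;
	const struct field_model *fields;
};

/* Slow path: walk the field array looking for a refcount field. */
static int scan_refcount_off(const struct field_model *f, size_t cnt)
{
	for (size_t i = 0; i < cnt; i++)
		if (f[i].type == FIELD_REFCOUNT)
			return f[i].offset;
	return -1;
}

/* Build a record, paying for the scan once at parse time. */
static struct record_model make_record(const struct field_model *f, size_t cnt)
{
	struct record_model rec = { scan_refcount_off(f, cnt), cnt, f };

	assert(cnt == 0 || f != NULL);
	return rec;
}

/* Fast path for hot callers: a single comparison. */
static bool record_is_refcounted(const struct record_model *rec)
{
	return rec && rec->refcount_off >= 0;
}
```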
      
      No such verifier behavior changes are introduced in this patch, just
      logic to recognize 'struct bpf_refcount' in btf_record.
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-3-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Remove btf_field_offs, use btf_record's fields instead · cd2a8079
      Dave Marchevsky authored
      The btf_field_offs struct contains (offset, size) for btf_record fields,
      sorted by offset. btf_field_offs is always used in conjunction with
      btf_record, whose btf_field 'fields' array holds (offset, type) pairs;
      the sizes in btf_field_offs are derived from those types via
      btf_field_type_size.
      
      This patch adds a size field to struct btf_field and sorts btf_record's
      fields by offset, making it possible to get rid of btf_field_offs. Less
      data duplication and less code complexity results.
      
      Since btf_field_offs' lifetime closely followed the btf_record used to
      populate it, most complexity wins are from removal of initialization
      code like:
      
        if (btf_record_successfully_initialized) {
          foffs = btf_parse_field_offs(rec);
          if (IS_ERR_OR_NULL(foffs))
            // free the btf_record and return err
        }
      
      Other changes in this patch are pretty mechanical:
      
        * foffs->field_off[i] -> rec->fields[i].offset
        * foffs->field_sz[i] -> rec->fields[i].size
        * Sort rec->fields in btf_parse_fields before returning
          * It's possible that this is necessary independently of other
            changes in this patch. btf_record_find in syscall.c expects
            btf_record's fields to be sorted by offset, yet there's no
            explicit sorting of them before this patch: the record's fields
            are populated in the order they're read from the BTF struct
            definition, and BTF docs don't say anything about the sortedness
            of struct fields.
        * All functions taking struct btf_field_offs * input now instead take
          struct btf_record *. All callsites of these functions already have
          access to the correct btf_record.
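
The dependency between sorting and lookup mentioned in the list above can be illustrated with the C library's qsort and bsearch. This is a sketch with hypothetical names (field_model, sort_fields, find_field), not the kernel's btf_parse_fields or btf_record_find.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down record field: offset, size, type tag. */
struct field_model {
	unsigned int offset;
	unsigned int size;
	int type;
};

static int cmp_by_offset(const void *a, const void *b)
{
	const struct field_model *fa = a, *fb = b;

	if (fa->offset < fb->offset)
		return -1;
	return fa->offset > fb->offset;
}

/* Sort once after parsing, so later lookups can rely on the order. */
static void sort_fields(struct field_model *fields, size_t cnt)
{
	qsort(fields, cnt, sizeof(*fields), cmp_by_offset);
}

/* Binary search by offset; only correct on an array sorted by offset. */
static const struct field_model *find_field(const struct field_model *fields,
					    size_t cnt, unsigned int offset)
{
	struct field_model key = { .offset = offset };

	assert(cnt == 0 || fields != NULL);
	return bsearch(&key, fields, cnt, sizeof(*fields), cmp_by_offset);
}
```

Keeping the size next to the offset in each entry is what makes a separate (offset, size) side table unnecessary in this model.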
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-2-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 14 Apr, 2023 5 commits
    • samples/bpf: sampleip: Replace PAGE_OFFSET with _text address · 4a1e885c
      Rong Tao authored
      Macro PAGE_OFFSET (0xffff880000000000) in sampleip_user.c is inaccurate.
      For example, on aarch64 the value depends on the CONFIG_ARM64_VA_BITS
      build option, which defaults to 48; the corresponding PAGE_OFFSET is
      0xffff800000000000. With the value defined in sampleip_user.c, every
      KSYM obtained by sampleip is reported as (user).

      Symbol errors caused by the wrong PAGE_OFFSET:
      
          $ sudo ./sampleip 1
          Sampling at 99 Hertz for 1 seconds. Ctrl-C also ends.
          ADDR                KSYM                             COUNT
          0xffff80000810ceb8  (user)                           1
          0xffffb28ec880      (user)                           1
          0xffff8000080c82b8  (user)                           1
          0xffffb23fed24      (user)                           1
          0xffffb28944fc      (user)                           1
          0xffff8000084628bc  (user)                           1
          0xffffb2a935c0      (user)                           1
          0xffff80000844677c  (user)                           1
          0xffff80000857a3a4  (user)                           1
          ...
      
      A few example addresses on aarch64 with CONFIG_ARM64_VA_BITS=48:
      
          $ sudo head /proc/kallsyms
          ffff8000080a0000 T _text
          ffff8000080b0000 t gic_handle_irq
          ffff8000080b0000 T _stext
          ffff8000080b0000 T __irqentry_text_start
          ffff8000080b00b0 t gic_handle_irq
          ffff8000080b0230 t gic_handle_irq
          ffff8000080b03b4 T __irqentry_text_end
          ffff8000080b03b8 T __softirqentry_text_start
          ffff8000080b03c0 T __do_softirq
          ffff8000080b0718 T __entry_text_start
      
      We just need to replace PAGE_OFFSET with the address of _text from
      /proc/kallsyms to solve this problem:
      
          $ sudo ./sampleip 1
          Sampling at 99 Hertz for 1 seconds. Ctrl-C also ends.
          ADDR                KSYM                             COUNT
          0xffffb2892ab0      (user)                           1
          0xffffb2b1edfc      (user)                           1
          0xffff800008462834  __arm64_sys_ppoll                1
          0xffff8000084b87f4  eventfd_read                     1
          0xffffb28e6788      (user)                           1
          0xffff8000081e96d8  rcu_all_qs                       1
          0xffffb2ada878      (user)                           1
          ...
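
One way to obtain the _text address at run time is to scan a kallsyms-style listing. The following is a hedged sketch, not the code in sampleip_user.c: find_text_addr and is_kernel_addr are hypothetical helpers, and the "address >= _text means kernel" rule is an assumption made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan "address type name" lines (the /proc/kallsyms layout) for _text.
 * Returns 0 and stores the address on success, -1 if the symbol is absent.
 */
static int find_text_addr(FILE *f, unsigned long long *addr)
{
	char line[256], name[128], type;
	unsigned long long a;

	assert(f != NULL && addr != NULL);
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%llx %c %127s", &a, &type, name) != 3)
			continue;	/* skip lines that do not parse */
		if (strcmp(name, "_text") == 0) {
			*addr = a;
			return 0;
		}
	}
	return -1;
}

/* Assumed classification rule: samples at or above _text are kernel text. */
static int is_kernel_addr(unsigned long long ip, unsigned long long text_addr)
{
	return ip >= text_addr;
}
```

A caller would open /proc/kallsyms with fopen, pass the stream to find_text_addr, and fall back to reporting (user) for every sample if the symbol cannot be found.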
      Signed-off-by: Rong Tao <rongtao@cestc.cn>
      Link: https://lore.kernel.org/r/tencent_A0E82E0BEE925285F8156D540731DF805F05@qq.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Support 64-bit pointers to kfuncs · 1cf3bfc6
      Ilya Leoshkevich authored
      test_ksyms_module fails to emit a kfunc call targeting a module on
      s390x, because the verifier stores the difference between kfunc
      address and __bpf_call_base in bpf_insn.imm, which is s32, and modules
      are roughly (1 << 42) bytes away from the kernel on s390x.
      
      Fix by keeping BTF id in bpf_insn.imm for BPF_PSEUDO_KFUNC_CALLs,
      and storing the absolute address in bpf_kfunc_desc.
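
The size problem is plain arithmetic: a signed 32-bit immediate covers roughly +/- 2 GiB, while (1 << 42) bytes is 4 TiB. A minimal sketch, with fits_in_s32 and call_delta as hypothetical helper names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when a signed displacement can be stored in a 32-bit immediate. */
static bool fits_in_s32(int64_t delta)
{
	return delta >= INT32_MIN && delta <= INT32_MAX;
}

/* Hypothetical helper: displacement between a call target and a base address. */
static int64_t call_delta(uint64_t target, uint64_t base)
{
	return (int64_t)(target - base);
}
```

A target a few pages from the base fits; one that is (1 << 42) bytes away does not, which is why an absolute address has to be kept elsewhere in that case.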
      
      Introduce bpf_jit_supports_far_kfunc_call() in order to limit this new
      behavior to the s390x JIT. Otherwise other JITs need to be modified,
      which is not desired.
      
      Introduce bpf_get_kfunc_addr() instead of exposing both
      find_kfunc_desc() and struct bpf_kfunc_desc.
      
      In addition to sorting kfuncs by imm, also sort them by offset, in
      order to handle conflicting imms from different modules. Do this on
      all architectures in order to simplify code.
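
Sorting on a two-part key can be sketched with qsort and bsearch. The struct and function names below are hypothetical and trimmed down for illustration; the point is only that two entries with the same imm stay distinguishable through their offset.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down descriptor: (imm, offset) is the lookup key. */
struct kfunc_desc_model {
	int32_t imm;		/* identifier stored in the instruction */
	uint16_t offset;	/* disambiguates entries that share an imm */
	uint64_t addr;		/* absolute address of the target */
};

static int cmp_imm_off(const void *a, const void *b)
{
	const struct kfunc_desc_model *d0 = a, *d1 = b;

	if (d0->imm != d1->imm)
		return d0->imm < d1->imm ? -1 : 1;
	if (d0->offset != d1->offset)
		return d0->offset < d1->offset ? -1 : 1;
	return 0;
}

static void sort_descs(struct kfunc_desc_model *tab, size_t cnt)
{
	qsort(tab, cnt, sizeof(*tab), cmp_imm_off);
}

/* Returns the stored address, or 0 when no entry matches (imm, offset). */
static uint64_t lookup_addr(const struct kfunc_desc_model *tab, size_t cnt,
			    int32_t imm, uint16_t offset)
{
	struct kfunc_desc_model key = { .imm = imm, .offset = offset };
	const struct kfunc_desc_model *d;

	assert(cnt == 0 || tab != NULL);
	d = bsearch(&key, tab, cnt, sizeof(*tab), cmp_imm_off);
	return d ? d->addr : 0;
}
```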
      
      Factor out resolving specialized kfuncs (XDP and dynptr) from
      fixup_kfunc_call(). This was required in the first place, because
      fixup_kfunc_call() uses find_kfunc_desc(), which returns a const
      pointer, so it's not possible to modify kfunc addr without stripping
      const, which is not nice. It also removes repetition of code like:
      
      	if (bpf_jit_supports_far_kfunc_call())
      		desc->addr = func;
      	else
      		insn->imm = BPF_CALL_IMM(func);
      
      and separates kfunc_desc_tab fixups from kfunc_call fixups.
      Suggested-by: Jiri Olsa <olsajiri@gmail.com>
      Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/r/20230412230632.885985-1-iii@linux.ibm.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add preempt_count_{sub,add} into btf id deny list · c11bd046
      Yafang authored
      The recursion check in __bpf_prog_enter* and __bpf_prog_exit*
      leaves preempt_count_{sub,add} unprotected. When attaching a trampoline
      to them, we get the following panic:
      
      [  867.843050] BUG: TASK stack guard page was hit at 0000000009d325cf (stack is 0000000046a46a15..00000000537e7b28)
      [  867.843064] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
      [  867.843067] CPU: 8 PID: 11009 Comm: trace Kdump: loaded Not tainted 6.2.0+ #4
      [  867.843100] Call Trace:
      [  867.843101]  <TASK>
      [  867.843104]  asm_exc_int3+0x3a/0x40
      [  867.843108] RIP: 0010:preempt_count_sub+0x1/0xa0
      [  867.843135]  __bpf_prog_enter_recur+0x17/0x90
      [  867.843148]  bpf_trampoline_6442468108_0+0x2e/0x1000
      [  867.843154]  ? preempt_count_sub+0x1/0xa0
      [  867.843157]  preempt_count_sub+0x5/0xa0
      [  867.843159]  ? migrate_enable+0xac/0xf0
      [  867.843164]  __bpf_prog_exit_recur+0x2d/0x40
      [  867.843168]  bpf_trampoline_6442468108_0+0x55/0x1000
      ...
      [  867.843788]  preempt_count_sub+0x5/0xa0
      [  867.843793]  ? migrate_enable+0xac/0xf0
      [  867.843829]  __bpf_prog_exit_recur+0x2d/0x40
      [  867.843837] BUG: IRQ stack guard page was hit at 0000000099bd8228 (stack is 00000000b23e2bc4..000000006d95af35)
      [  867.843841] BUG: IRQ stack guard page was hit at 000000005ae07924 (stack is 00000000ffd69623..0000000014eb594c)
      [  867.843843] BUG: IRQ stack guard page was hit at 00000000028320f0 (stack is 00000000034b6438..0000000078d1bcec)
      [  867.843842]  bpf_trampoline_6442468108_0+0x55/0x1000
      ...
      
      That is because in __bpf_prog_exit_recur, preempt_count_{sub,add} are
      called after prog->active is decreased.
      
      Fix this by adding these two functions to the BTF ID deny list.
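
      The recursion can be pictured with a toy userspace model. This is a
      simplified sketch, not kernel code: demo_active stands in for
      prog->active, demo_helper() for a preempt_count_{sub,add} call, and
      DEMO_MAX_DEPTH for the stack guard page. All names are hypothetical:

```c
#include <stdbool.h>

#define DEMO_MAX_DEPTH 64          /* stand-in for the stack guard page */

static int demo_active;            /* models prog->active on one CPU */
static int demo_depth, demo_max_depth;
static bool demo_helper_traced;    /* is a trampoline attached to the helper? */

static void demo_traced_call(void);

/* Models a helper such as preempt_count_sub(): if traced, it re-enters. */
static void demo_helper(void)
{
	if (demo_helper_traced)
		demo_traced_call();
}

/* Models the trampoline: enter (recursion check), then exit. */
static void demo_traced_call(void)
{
	if (demo_depth >= DEMO_MAX_DEPTH)
		return;                    /* the real kernel would have crashed */
	demo_depth++;
	if (demo_depth > demo_max_depth)
		demo_max_depth = demo_depth;

	if (++demo_active == 1) {
		/* the BPF program body would run here */
	}

	/* exit path: the guard is dropped first, then the helper is called */
	demo_active--;
	demo_helper();

	demo_depth--;
}

/* Returns the deepest nesting level reached by one traced call. */
static int demo_run(bool helper_traced)
{
	demo_active = 0;
	demo_depth = 0;
	demo_max_depth = 0;
	demo_helper_traced = helper_traced;
	demo_traced_call();
	return demo_max_depth;
}
```

      In this model, demo_run(false) corresponds to the helper being on the
      deny list and nests once; demo_run(true) nests until the artificial
      depth limit because the guard is already back to zero when the helper
      runs.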
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Yafang <laoar.shao@gmail.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Link: https://lore.kernel.org/r/20230413025248.79764-1-laoar.shao@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c11bd046
    • selftests/bpf: Workaround for older vm_sockets.h. · 75860b52
      Alexei Starovoitov authored
      Some distros ship with an older vm_sockets.h that doesn't have VMADDR_CID_LOCAL,
      which causes the selftests build to fail:
      /tmp/work/bpf/bpf/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c:261:18: error: ‘VMADDR_CID_LOCAL’ undeclared (first use in this function); did you mean ‘VMADDR_CID_HOST’?
          261 |  addr->svm_cid = VMADDR_CID_LOCAL;
              |                  ^~~~~~~~~~~~~~~~
              |                  VMADDR_CID_HOST
      
      Work around this issue by defining it on demand.
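
      A define-on-demand guard of this kind could look like the sketch
      below. The value 1 matches what newer vm_sockets.h headers define;
      the helper demo_init_local_addr is a hypothetical name added only to
      show the constant in use:

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

/* Older headers lack this constant; newer ones define it as 1. */
#ifndef VMADDR_CID_LOCAL
#define VMADDR_CID_LOCAL 1
#endif

/* Fill in a vsock address that targets the local (loopback) CID. */
static void demo_init_local_addr(struct sockaddr_vm *addr, unsigned int port)
{
	memset(addr, 0, sizeof(*addr));
	addr->svm_family = AF_VSOCK;
	addr->svm_port = port;
	addr->svm_cid = VMADDR_CID_LOCAL;
}
```

      Because the guard only takes effect when the header does not provide
      the macro, builds against newer headers are unaffected.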
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      75860b52
    • selftests/bpf: Fix merge conflict due to SYS() macro change. · c04135ab
      Alexei Starovoitov authored
      Fix merge conflict between bpf/bpf-next trees due to change of arguments in SYS() macro.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c04135ab
  3. 13 Apr, 2023 27 commits
    • Daniel Borkmann says: · c2865b11
      Jakub Kicinski authored
      ====================
      pull-request: bpf-next 2023-04-13
      
      We've added 260 non-merge commits during the last 36 day(s) which contain
      a total of 356 files changed, 21786 insertions(+), 11275 deletions(-).
      
      The main changes are:
      
      1) Rework BPF verifier log behavior and implement it as a rotating log
         by default with the option to retain old-style fixed log behavior,
         from Andrii Nakryiko.
      
      2) Add support for using {FOU,GUE} encap with an ipip device operating
         in collect_md mode and add a set of BPF kfuncs for controlling encap
         params, from Christian Ehrig.
      
      3) Allow BPF programs to detect at load time whether a particular kfunc
         exists or not, and also add support for this in light skeleton,
         from Alexei Starovoitov.
      
      4) Optimize hashmap lookups when key size is multiple of 4,
         from Anton Protopopov.
      
      5) Enable RCU semantics for task BPF kptrs and allow referenced kptr
         tasks to be stored in BPF maps, from David Vernet.
      
      6) Add support for stashing local BPF kptr into a map value via
         bpf_kptr_xchg(). This is useful e.g. for rbtree node creation
         for new cgroups, from Dave Marchevsky.
      
      7) Fix BTF handling of is_int_ptr to skip modifiers to work around
         tracing issues where a program cannot be attached, from Feng Zhou.
      
      8) Migrate a big portion of test_verifier unit tests over to
         test_progs -a verifier_* via inline asm to ease {read,debug}ability,
         from Eduard Zingerman.
      
      9) Several updates to the instruction-set.rst documentation
         which is subject to future IETF standardization
         (https://lwn.net/Articles/926882/), from Dave Thaler.
      
      10) Fix BPF verifier in the __reg_bound_offset's 64->32 tnum sub-register
          known bits information propagation, from Daniel Borkmann.
      
      11) Add skb bitfield compaction work related to BPF with the overall goal
          to make more of the sk_buff bits optional, from Jakub Kicinski.
      
      12) BPF selftest cleanups for build id extraction which stand on their own
          from the upcoming integration work of build id into struct file object,
          from Jiri Olsa.
      
      13) Add fixes and optimizations for xsk descriptor validation and several
          selftest improvements for xsk sockets, from Kal Conley.
      
      14) Add BPF links for struct_ops and enable switching implementations
          of BPF TCP cong-ctls under a given name by replacing backing
          struct_ops map, from Kui-Feng Lee.
      
      15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable
          offset stack read as earlier Spectre checks cover this,
          from Luis Gerhorst.
      
      16) Fix issues in copy_from_user_nofault() for BPF and other tracers
          to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner
          and Alexei Starovoitov.
      
      17) Add --json-summary option to test_progs in order for CI tooling to
          ease parsing of test results, from Manu Bretelle.
      
      18) Batch of improvements and refactoring to prep for upcoming
          bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator,
          from Martin KaFai Lau.
      
      19) Improve bpftool's visual program dump which produces the control
          flow graph in a DOT format by adding C source inline annotations,
          from Quentin Monnet.
      
      20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting
          the module name from BTF of the target and searching kallsyms of
          the correct module, from Viktor Malik.
      
      21) Improve BPF verifier handling of '<const> <cond> <non_const>'
          to better detect whether in particular jmp32 branches are taken,
          from Yonghong Song.
      
      22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock.
          A built-in cc or one from a kernel module is already able to write
          to app_limited, from Yixin Shen.
      
      Conflicts:
      
      Documentation/bpf/bpf_devel_QA.rst
        b7abcd9c ("bpf, doc: Link to submitting-patches.rst for general patch submission info")
        0f10f647 ("bpf, docs: Use internal linking for link to netdev subsystem doc")
      https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/
      
      include/net/ip_tunnels.h
        bc9d003d ("ip_tunnel: Preserve pointer const in ip_tunnel_info_opts")
        ac931d4c ("ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices")
      https://lore.kernel.org/all/20230413161235.4093777-1-broonie@kernel.org/
      
      net/bpf/test_run.c
        e5995bc7 ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption")
        294635a8 ("bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES")
      https://lore.kernel.org/all/20230320102619.05b80a98@canb.auug.org.au/
      ====================
      
      Link: https://lore.kernel.org/r/20230413191525.7295-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c2865b11
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 800e68c4
      Jakub Kicinski authored
      Conflicts:
      
      tools/testing/selftests/net/config
        62199e3f ("selftests: net: Add VXLAN MDB test")
        3a0385be ("selftests: add the missing CONFIG_IP_SCTP in net config")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      800e68c4
    • Merge tag 'net-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 829cca4d
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from bpf, and bluetooth.
      
        Not all that quiet given spring celebrations, but "current" fixes are
        thinning out, which is encouraging. One outstanding regression in the
        mlx5 driver when using old FW, not blocking but we're pushing for a
        fix.
      
        Current release - new code bugs:
      
         - eth: enetc: workaround for unresponsive pMAC after receiving
           express traffic
      
        Previous releases - regressions:
      
         - rtnetlink: restore RTM_NEW/DELLINK notification behavior, keep the
           pid/seq fields 0 for backward compatibility
      
        Previous releases - always broken:
      
         - sctp: fix a potential overflow in sctp_ifwdtsn_skip
      
         - mptcp:
            - use mptcp_schedule_work instead of open-coding it and make the
              worker check stricter, to avoid scheduling work on closed
              sockets
            - fix NULL pointer dereference on fastopen early fallback
      
         - skbuff: fix memory corruption due to a race between skb coalescing
           and releasing clones confusing page_pool reference counting
      
         - bonding: fix neighbor solicitation validation on backup slaves
      
         - bpf: tcp: use sock_gen_put instead of sock_put in bpf_iter_tcp
      
         - bpf: arm64: fixed a BTI error on returning to patched function
      
         - openvswitch: fix race on port output leading to inf loop
      
         - sfp: initialize sfp->i2c_block_size at sfp allocation to avoid
           returning a different errno than expected
      
         - phy: nxp-c45-tja11xx: unregister PTP, purge queues on remove
      
         - Bluetooth: fix printing errors if LE Connection times out
      
         - Bluetooth: assorted UaF, deadlock and data race fixes
      
         - eth: macb: fix memory corruption in extended buffer descriptor mode
      
        Misc:
      
         - adjust the XDP Rx flow hash API to also include the protocol layers
           over which the hash was computed"
      
      * tag 'net-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
        selftests/bpf: Adjust bpf_xdp_metadata_rx_hash for new arg
        mlx4: bpf_xdp_metadata_rx_hash add xdp rss hash type
        veth: bpf_xdp_metadata_rx_hash add xdp rss hash type
        mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash type
        xdp: rss hash types representation
        selftests/bpf: xdp_hw_metadata remove bpf_printk and add counters
        skbuff: Fix a race between coalescing and releasing SKBs
        net: macb: fix a memory corruption in extended buffer descriptor mode
        selftests: add the missing CONFIG_IP_SCTP in net config
        udp6: fix potential access to stale information
        selftests: openvswitch: adjust datapath NL message declaration
        selftests: mptcp: userspace pm: uniform verify events
        mptcp: fix NULL pointer dereference on fastopen early fallback
        mptcp: stricter state check in mptcp_worker
        mptcp: use mptcp_schedule_work instead of open-coding it
        net: enetc: workaround for unresponsive pMAC after receiving express traffic
        sctp: fix a potential overflow in sctp_ifwdtsn_skip
        net: qrtr: Fix an uninit variable access bug in qrtr_tx_resume()
        rtnetlink: Restore RTM_NEW/DELLINK notification behavior
        net: ti/cpsw: Add explicit platform_device.h and of_platform.h includes
        ...
      829cca4d
    • Merge tag 'devicetree-fixes-for-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux · 4413ad01
      Linus Torvalds authored
      Pull devicetree fixes from Rob Herring:
      
       - Fix interaction between fw_devlink and DT overlays causing devices to
         not be probed
      
       - Fix the compatible string for loongson,cpu-interrupt-controller
      
      * tag 'devicetree-fixes-for-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
        treewide: Fix probing of devices in DT overlays
        dt-bindings: interrupt-controller: loongarch: Fix mismatched compatible
      4413ad01
    • Merge tag 'pinctrl-v6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 531f27ad
      Linus Torvalds authored
      Pull pin control fix from Linus Walleij:
       "This is just a revert of the AMD fix, because the fix broke some
        laptops. We are working on a proper solution"
      
      * tag 'pinctrl-v6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
        Revert "pinctrl: amd: Disable and mask interrupts on resume"
      531f27ad
    • Merge tag 'drm-fixes-2023-04-13' of git://anongit.freedesktop.org/drm/drm · f1be7b6c
      Linus Torvalds authored
      Pull drm fixes from Daniel Vetter:
      
       - two fbcon regressions
      
       - amdgpu: dp mst, smu13
      
       - i915: dual link dsi for tgl+
      
       - armada, nouveau, drm/sched, fbmem
      
      * tag 'drm-fixes-2023-04-13' of git://anongit.freedesktop.org/drm/drm:
        fbcon: set_con2fb_map needs to set con2fb_map!
        fbcon: Fix error paths in set_con2fb_map
        drm/amd/pm: correct the pcie link state check for SMU13
        drm/amd/pm: correct SMU13.0.7 max shader clock reporting
        drm/amd/pm: correct SMU13.0.7 pstate profiling clock settings
        drm/amd/display: Pass the right info to drm_dp_remove_payload
        drm/armada: Fix a potential double free in an error handling path
        fbmem: Reject FB_ACTIVATE_KD_TEXT from userspace
        drm/nouveau/fb: add missing sysmen flush callbacks
        drm/i915/dsi: fix DSS CTL register offsets for TGL+
        drm/scheduler: Fix UAF race in drm_sched_entity_push_job()
      f1be7b6c
    • Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · d0f89c4c
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2023-04-13
      
      We've added 6 non-merge commits during the last 1 day(s) which contain
      a total of 14 files changed, 205 insertions(+), 38 deletions(-).
      
      The main changes are:
      
      1) One late straggler fix on the XDP hints side which fixes
         bpf_xdp_metadata_rx_hash kfunc API before the release goes out
         in order to provide information on the RSS hash type,
         from Jesper Dangaard Brouer.
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        selftests/bpf: Adjust bpf_xdp_metadata_rx_hash for new arg
        mlx4: bpf_xdp_metadata_rx_hash add xdp rss hash type
        veth: bpf_xdp_metadata_rx_hash add xdp rss hash type
        mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash type
        xdp: rss hash types representation
        selftests/bpf: xdp_hw_metadata remove bpf_printk and add counters
      ====================
      
      Link: https://lore.kernel.org/r/20230413192939.10202-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d0f89c4c
    • Merge tag 'drm-misc-fixes-2023-04-13' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes · cab29322
      Daniel Vetter authored
      Short summary of fixes pull:
      
       * armada: Fix double free
       * fb: Clear FB_ACTIVATE_KD_TEXT in ioctl
       * nouveau: Add missing callbacks
       * scheduler: Fix use-after-free error
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      From: Thomas Zimmermann <tzimmermann@suse.de>
      Link: https://patchwork.freedesktop.org/patch/msgid/20230413184233.GA8148@linux-uq9g
      cab29322
    • bpf, sockmap: Revert buggy deadlock fix in the sockhash and sockmap · 8c5c2a48
      Daniel Borkmann authored
      syzbot reported a splat and bisected it to recent commit ed17aa92 ("bpf,
      sockmap: fix deadlocks in the sockhash and sockmap"):
      
        [...]
        WARNING: CPU: 1 PID: 9280 at kernel/softirq.c:376 __local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376
        Modules linked in:
        CPU: 1 PID: 9280 Comm: syz-executor.1 Not tainted 6.2.0-syzkaller-13249-gd319f344 #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/30/2023
        RIP: 0010:__local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376
        [...]
        Call Trace:
        <TASK>
        spin_unlock_bh include/linux/spinlock.h:395 [inline]
        sock_map_del_link+0x2ea/0x510 net/core/sock_map.c:165
        sock_map_unref+0xb0/0x1d0 net/core/sock_map.c:184
        sock_hash_delete_elem+0x1ec/0x2a0 net/core/sock_map.c:945
        map_delete_elem kernel/bpf/syscall.c:1536 [inline]
        __sys_bpf+0x2edc/0x53e0 kernel/bpf/syscall.c:5053
        __do_sys_bpf kernel/bpf/syscall.c:5166 [inline]
        __se_sys_bpf kernel/bpf/syscall.c:5164 [inline]
        __x64_sys_bpf+0x79/0xc0 kernel/bpf/syscall.c:5164
        do_syscall_x64 arch/x86/entry/common.c:50 [inline]
        do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
        entry_SYSCALL_64_after_hwframe+0x63/0xcd
        RIP: 0033:0x7fe8f7c8c169
        </TASK>
        [...]
      
      Revert for now until we have a proper solution.
      
      Fixes: ed17aa92 ("bpf, sockmap: fix deadlocks in the sockhash and sockmap")
      Reported-by: syzbot+49f6cef45247ff249498@syzkaller.appspotmail.com
      Cc: Hsin-Wei Hung <hsinweih@uci.edu>
      Cc: Xin Liu <liuxin350@huawei.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/000000000000f1db9605f939720e@google.com/
      8c5c2a48
    • Merge branch 'XDP-hints: change RX-hash kfunc bpf_xdp_metadata_rx_hash' · b65ef48c
      Alexei Starovoitov authored
      Jesper Dangaard Brouer says:
      
      ====================
      
      Current API for bpf_xdp_metadata_rx_hash() returns the raw RSS hash value,
      but doesn't provide information on the RSS hash type (part of 6.3-rc).
      
      This patchset proposes changing the function call signature by adding
      a pointer value argument for providing the RSS hash type.
      
      Patchset also removes all bpf_printk's from xdp_hw_metadata program
      that we expect driver developers to use. Instead counters are introduced
      for relaying e.g. skip and fail info.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      b65ef48c
    • 0f26b74e
    • mlx4: bpf_xdp_metadata_rx_hash add xdp rss hash type · 9123397a
      Jesper Dangaard Brouer authored
      Update API for bpf_xdp_metadata_rx_hash() with arg for xdp rss hash type
      via matching individual Completion Queue Entry (CQE) status bits.
      
      Fixes: ab46182d ("net/mlx4_en: Support RX XDP metadata")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/168132893562.340624.12779118462402031248.stgit@firesoul
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9123397a
    • veth: bpf_xdp_metadata_rx_hash add xdp rss hash type · 96b1a098
      Jesper Dangaard Brouer authored
      Update API for bpf_xdp_metadata_rx_hash() with arg for xdp rss hash type.
      
      The veth driver currently only supports XDP-hints based on the SKB code path.
      The SKB has lost information about the RSS hash type by compressing
      the information down to a single bitfield, skb->l4_hash, that only knows
      whether this was an L4 hash value.
      
      In preparation for veth, the xdp_rss_hash_type has an L4 indication
      bit that allows us to return a meaningful L4 indication when working
      with SKB-based packets.
      
      Fixes: 306531f0 ("veth: Support RX XDP metadata")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/168132893055.340624.16209448340644513469.stgit@firesoul
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      96b1a098
    • mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash type · 67f245c2
      Jesper Dangaard Brouer authored
      Update API for bpf_xdp_metadata_rx_hash() with arg for xdp rss hash type
      via mapping table.
      
      The mlx5 hardware can also identify and RSS hash IPSEC. This indicates that the
      hash includes the SPI (Security Parameters Index) as part of the IPSEC hash.
      
      Extend xdp core enum xdp_rss_hash_type with IPSEC hash type.
      
      Fixes: bc8d405b ("net/mlx5e: Support RX XDP metadata")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/168132892548.340624.11185734579430124869.stgit@firesoul
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      67f245c2
    • xdp: rss hash types representation · 0cd917a4
      Jesper Dangaard Brouer authored
      The RSS hash type specifies what portion of packet data the NIC hardware used
      when calculating the RSS hash value. The RSS types are focused on Internet
      traffic protocols at OSI layers L3 and L4. L2 (e.g. ARP) often gets hash
      value zero and no RSS type. L3 is focused on IPv4 vs. IPv6, and L4
      primarily on TCP vs. UDP, but some hardware supports SCTP.
      
      Hardware RSS types are encoded differently for each hardware NIC. Most
      hardware represents the RSS hash type as a number. Determining L3 vs. L4 often
      requires a mapping table, as there often isn't a pattern or sorting
      according to OSI layer.
      
      The patch introduces an XDP RSS hash type (enum xdp_rss_hash_type) that
      contains both BITs for the L3/L4 types and combinations to be used by
      drivers for their mapping tables. The enum xdp_rss_type_bits gets exposed
      to BPF via BTF, and it is up to the BPF programmer to match using these
      defines.
      
      This proposal changes the kfunc API bpf_xdp_metadata_rx_hash(), adding
      a pointer value argument for providing the RSS hash type.
      Change signature for all xmo_rx_hash calls in drivers to make it compile.
      
      The RSS type implementations for each driver come as separate patches.
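
      The idea of bit flags plus combined values, a per-driver mapping
      table, and a pointer argument that returns the type can be sketched
      as follows. Every name and bit position here (demo_rss_bits,
      demo_rss_type, demo_hw_to_rss, demo_rx_hash) is hypothetical and
      chosen for illustration; the kernel's actual enum values and kfunc
      signature may differ:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-bit flags for the L3/L4 properties of a hash. */
enum demo_rss_bits {
	DEMO_RSS_L3_IPV4 = 1 << 0,
	DEMO_RSS_L3_IPV6 = 1 << 1,
	DEMO_RSS_L4      = 1 << 2,   /* hash covers L4 fields */
	DEMO_RSS_L4_TCP  = 1 << 3,
	DEMO_RSS_L4_UDP  = 1 << 4,
};

/* Combinations of the bits above, for use in driver mapping tables. */
enum demo_rss_type {
	DEMO_RSS_TYPE_NONE        = 0,
	DEMO_RSS_TYPE_L3_IPV4     = DEMO_RSS_L3_IPV4,
	DEMO_RSS_TYPE_L4_IPV4_TCP = DEMO_RSS_L3_IPV4 | DEMO_RSS_L4 | DEMO_RSS_L4_TCP,
	DEMO_RSS_TYPE_L4_IPV6_UDP = DEMO_RSS_L3_IPV6 | DEMO_RSS_L4 | DEMO_RSS_L4_UDP,
};

/* Hypothetical driver table: hardware-specific code -> generic type. */
static const enum demo_rss_type demo_hw_to_rss[] = {
	DEMO_RSS_TYPE_NONE,
	DEMO_RSS_TYPE_L3_IPV4,
	DEMO_RSS_TYPE_L4_IPV4_TCP,
	DEMO_RSS_TYPE_L4_IPV6_UDP,
};

/* Returns the hash and, through the added pointer argument, its type. */
static int demo_rx_hash(size_t hw_code, uint32_t hw_hash,
			uint32_t *hash, enum demo_rss_type *rss_type)
{
	if (hw_code >= sizeof(demo_hw_to_rss) / sizeof(demo_hw_to_rss[0]))
		return -1;
	*hash = hw_hash;
	*rss_type = demo_hw_to_rss[hw_code];
	return 0;
}

/* A consumer matches on individual bits rather than on hardware codes. */
static bool demo_is_l4(enum demo_rss_type type)
{
	return (type & DEMO_RSS_L4) != 0;
}
```

      Since the hardware codes carry no ordering by layer, the table does
      the translation once and consumers only test bits.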
      
      Fixes: 3d76a4d3 ("bpf: XDP metadata RX kfuncs")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/168132892042.340624.582563003880565460.stgit@firesoul
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      0cd917a4
    • selftests/bpf: xdp_hw_metadata remove bpf_printk and add counters · e8163b98
      Jesper Dangaard Brouer authored
      The tool xdp_hw_metadata can be used by driver developers
      implementing XDP-hints metadata kfuncs.
      
      Remove all bpf_printk calls, as the tool already transfers all the
      XDP-hints related information via metadata area to AF_XDP
      userspace process.
      
      Add counters for providing remaining information about failure and
      skipped packet events.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/168132891533.340624.7313781245316405141.stgit@firesoul
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e8163b98
    • fbcon: set_con2fb_map needs to set con2fb_map! · fffb0b52
      Daniel Vetter authored
      I got really badly confused in d443d938 ("fbcon: move more common
      code into fb_open()") because we set the con2fb_map before the failure
      points, which didn't look good.
      
      But in trying to fix that I moved the assignment into the wrong path -
      we need to do it for _all_ vc we take over, not just the first one
      (which additionally requires the call to con2fb_acquire_newinfo).
      
      I've figured this out because of a KASAN bug report, where the
      fbcon_registered_fb and fbcon_display arrays went out of sync in
      fbcon_mode_deleted() because the con2fb_map pointed at the old
      fb_info, but the modes and everything was updated for the new one.
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
      Acked-by: Helge Deller <deller@gmx.de>
      Tested-by: Xingyuan Mo <hdthky0@gmail.com>
      Fixes: d443d938 ("fbcon: move more common code into fb_open()")
      Reported-by: Xingyuan Mo <hdthky0@gmail.com>
      Cc: Thomas Zimmermann <tzimmermann@suse.de>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Xingyuan Mo <hdthky0@gmail.com>
      Cc: Thomas Zimmermann <tzimmermann@suse.de>
      Cc: Helge Deller <deller@gmx.de>
      Cc: <stable@vger.kernel.org> # v5.19+
      fffb0b52
    • fbcon: Fix error paths in set_con2fb_map · edf79dd2
      Daniel Vetter authored
      This is a regression introduced in b07db395 ("fbcon: Ditch error
      handling for con2fb_release_oldinfo"). I failed to realize what the if
      (!err) checks. The mentioned commit was dropping the
      con2fb_release_oldinfo() return value but the if (!err) was also
      checking whether the con2fb_acquire_newinfo() function call above
      failed or not.
      
      Fix this with an early return statement.
      
      Note that there's still a difference compared to the original state of
      the code, the below lines are now also skipped on error:
      
      	if (!search_fb_in_map(info_idx))
      		info_idx = newidx;
      
      These are only needed when we've actually thrown out an old fb_info
      from the console mappings, which only happens later on.
      
      Also move the fbcon_add_cursor_work() call into the same if block,
      it's all protected by console_lock so doesn't matter when we set up
      the blinking cursor delayed work anyway. This further simplifies the
      control flow and allows us to ditch the found local variable.
      
      v2: Clarify commit message (Javier)
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
      Acked-by: Helge Deller <deller@gmx.de>
      Tested-by: Xingyuan Mo <hdthky0@gmail.com>
      Fixes: b07db395 ("fbcon: Ditch error handling for con2fb_release_oldinfo")
      Cc: Thomas Zimmermann <tzimmermann@suse.de>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Xingyuan Mo <hdthky0@gmail.com>
      Cc: Thomas Zimmermann <tzimmermann@suse.de>
      Cc: Helge Deller <deller@gmx.de>
      Cc: <stable@vger.kernel.org> # v5.19+
      edf79dd2
    • skbuff: Fix a race between coalescing and releasing SKBs · 0646dc31
      Liang Chen authored
      Commit 1effe8ca ("skbuff: fix coalescing for page_pool fragment
      recycling") allowed coalescing to proceed with non page pool page and page
      pool page when @from is cloned, i.e.
      
      to->pp_recycle    --> false
      from->pp_recycle  --> true
      skb_cloned(from)  --> true
      
      However, it actually requires skb_cloned(@from) to hold true until
      coalescing finishes in this situation. If the other cloned SKB is
      released while the merging is in progress, from_shinfo->nr_frags will be
      set to 0 toward the end of the function, causing the increment of frag
      page _refcount to be unexpectedly skipped resulting in inconsistent
      reference counts. Later when SKB(@to) is released, it frees the page
      directly even though the page pool page is still in use, leading to
      use-after-free or double-free errors. So it should be prohibited.
      
      The double-free error message below prompted us to investigate:
      BUG: Bad page state in process swapper/1  pfn:0e0d1
      page:00000000c6548b28 refcount:-1 mapcount:0 mapping:0000000000000000
      index:0x2 pfn:0xe0d1
      flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
      raw: 000fffffc0000000 0000000000000000 ffffffff00000101 0000000000000000
      raw: 0000000000000002 0000000000000000 ffffffffffffffff 0000000000000000
      page dumped because: nonzero _refcount
      
      CPU: 1 PID: 0 Comm: swapper/1 Tainted: G            E      6.2.0+
      Call Trace:
       <IRQ>
      dump_stack_lvl+0x32/0x50
      bad_page+0x69/0xf0
      free_pcp_prepare+0x260/0x2f0
      free_unref_page+0x20/0x1c0
      skb_release_data+0x10b/0x1a0
      napi_consume_skb+0x56/0x150
      net_rx_action+0xf0/0x350
      ? __napi_schedule+0x79/0x90
      __do_softirq+0xc8/0x2b1
      __irq_exit_rcu+0xb9/0xf0
      common_interrupt+0x82/0xa0
      </IRQ>
      <TASK>
      asm_common_interrupt+0x22/0x40
      RIP: 0010:default_idle+0xb/0x20
      
      Fixes: 53e0961d ("page_pool: add frag page recycling support in page pool")
      Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20230413090353.14448-1-liangchen.linux@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0646dc31
    • net: macb: fix a memory corruption in extended buffer descriptor mode · e8b74453
      Roman Gushchin authored
      For quite some time we were chasing a bug which looked like a sudden
      permanent failure of networking and mmc on some of our devices.
      The bug was very sensitive to any software changes and even more to
      any kernel debug options.
      
      Finally we got a setup where the problem was reproducible with
      CONFIG_DMA_API_DEBUG=y and it revealed the issue with the rx dma:
      
      [   16.992082] ------------[ cut here ]------------
      [   16.996779] DMA-API: macb ff0b0000.ethernet: device driver tries to free DMA memory it has not allocated [device address=0x0000000875e3e244] [size=1536 bytes]
      [   17.011049] WARNING: CPU: 0 PID: 85 at kernel/dma/debug.c:1011 check_unmap+0x6a0/0x900
      [   17.018977] Modules linked in: xxxxx
      [   17.038823] CPU: 0 PID: 85 Comm: irq/55-8000f000 Not tainted 5.4.0 #28
      [   17.045345] Hardware name: xxxxx
      [   17.049528] pstate: 60000005 (nZCv daif -PAN -UAO)
      [   17.054322] pc : check_unmap+0x6a0/0x900
      [   17.058243] lr : check_unmap+0x6a0/0x900
      [   17.062163] sp : ffffffc010003c40
      [   17.065470] x29: ffffffc010003c40 x28: 000000004000c03c
      [   17.070783] x27: ffffffc010da7048 x26: ffffff8878e38800
      [   17.076095] x25: ffffff8879d22810 x24: ffffffc010003cc8
      [   17.081407] x23: 0000000000000000 x22: ffffffc010a08750
      [   17.086719] x21: ffffff8878e3c7c0 x20: ffffffc010acb000
      [   17.092032] x19: 0000000875e3e244 x18: 0000000000000010
      [   17.097343] x17: 0000000000000000 x16: 0000000000000000
      [   17.102647] x15: ffffff8879e4a988 x14: 0720072007200720
      [   17.107959] x13: 0720072007200720 x12: 0720072007200720
      [   17.113261] x11: 0720072007200720 x10: 0720072007200720
      [   17.118565] x9 : 0720072007200720 x8 : 000000000000022d
      [   17.123869] x7 : 0000000000000015 x6 : 0000000000000098
      [   17.129173] x5 : 0000000000000000 x4 : 0000000000000000
      [   17.134475] x3 : 00000000ffffffff x2 : ffffffc010a1d370
      [   17.139778] x1 : b420c9d75d27bb00 x0 : 0000000000000000
      [   17.145082] Call trace:
      [   17.147524]  check_unmap+0x6a0/0x900
      [   17.151091]  debug_dma_unmap_page+0x88/0x90
      [   17.155266]  gem_rx+0x114/0x2f0
      [   17.158396]  macb_poll+0x58/0x100
      [   17.161705]  net_rx_action+0x118/0x400
      [   17.165445]  __do_softirq+0x138/0x36c
      [   17.169100]  irq_exit+0x98/0xc0
      [   17.172234]  __handle_domain_irq+0x64/0xc0
      [   17.176320]  gic_handle_irq+0x5c/0xc0
      [   17.179974]  el1_irq+0xb8/0x140
      [   17.183109]  xiic_process+0x5c/0xe30
      [   17.186677]  irq_thread_fn+0x28/0x90
      [   17.190244]  irq_thread+0x208/0x2a0
      [   17.193724]  kthread+0x130/0x140
      [   17.196945]  ret_from_fork+0x10/0x20
      [   17.200510] ---[ end trace 7240980785f81d6f ]---
      
      [  237.021490] ------------[ cut here ]------------
      [  237.026129] DMA-API: exceeded 7 overlapping mappings of cacheline 0x0000000021d79e7b
      [  237.033886] WARNING: CPU: 0 PID: 0 at kernel/dma/debug.c:499 add_dma_entry+0x214/0x240
      [  237.041802] Modules linked in: xxxxx
      [  237.061637] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W         5.4.0 #28
      [  237.068941] Hardware name: xxxxx
      [  237.073116] pstate: 80000085 (Nzcv daIf -PAN -UAO)
      [  237.077900] pc : add_dma_entry+0x214/0x240
      [  237.081986] lr : add_dma_entry+0x214/0x240
      [  237.086072] sp : ffffffc010003c30
      [  237.089379] x29: ffffffc010003c30 x28: ffffff8878a0be00
      [  237.094683] x27: 0000000000000180 x26: ffffff8878e387c0
      [  237.099987] x25: 0000000000000002 x24: 0000000000000000
      [  237.105290] x23: 000000000000003b x22: ffffffc010a0fa00
      [  237.110594] x21: 0000000021d79e7b x20: ffffffc010abe600
      [  237.115897] x19: 00000000ffffffef x18: 0000000000000010
      [  237.121201] x17: 0000000000000000 x16: 0000000000000000
      [  237.126504] x15: ffffffc010a0fdc8 x14: 0720072007200720
      [  237.131807] x13: 0720072007200720 x12: 0720072007200720
      [  237.137111] x11: 0720072007200720 x10: 0720072007200720
      [  237.142415] x9 : 0720072007200720 x8 : 0000000000000259
      [  237.147718] x7 : 0000000000000001 x6 : 0000000000000000
      [  237.153022] x5 : ffffffc010003a20 x4 : 0000000000000001
      [  237.158325] x3 : 0000000000000006 x2 : 0000000000000007
      [  237.163628] x1 : 8ac721b3a7dc1c00 x0 : 0000000000000000
      [  237.168932] Call trace:
      [  237.171373]  add_dma_entry+0x214/0x240
      [  237.175115]  debug_dma_map_page+0xf8/0x120
      [  237.179203]  gem_rx_refill+0x190/0x280
      [  237.182942]  gem_rx+0x224/0x2f0
      [  237.186075]  macb_poll+0x58/0x100
      [  237.189384]  net_rx_action+0x118/0x400
      [  237.193125]  __do_softirq+0x138/0x36c
      [  237.196780]  irq_exit+0x98/0xc0
      [  237.199914]  __handle_domain_irq+0x64/0xc0
      [  237.204000]  gic_handle_irq+0x5c/0xc0
      [  237.207654]  el1_irq+0xb8/0x140
      [  237.210789]  arch_cpu_idle+0x40/0x200
      [  237.214444]  default_idle_call+0x18/0x30
      [  237.218359]  do_idle+0x200/0x280
      [  237.221578]  cpu_startup_entry+0x20/0x30
      [  237.225493]  rest_init+0xe4/0xf0
      [  237.228713]  arch_call_rest_init+0xc/0x14
      [  237.232714]  start_kernel+0x47c/0x4a8
      [  237.236367] ---[ end trace 7240980785f81d70 ]---
      
      Lars was fast to find an explanation: according to the datasheet
      bit 2 of the rx buffer descriptor entry has a different meaning in the
      extended mode:
        Address [2] of beginning of buffer, or
        in extended buffer descriptor mode (DMA configuration register [28] = 1),
        indicates a valid timestamp in the buffer descriptor entry.
      
      The macb driver didn't mask this bit while getting an address and it
      eventually caused a memory corruption and a dma failure.
      
      The problem is resolved by explicitly clearing the problematic bit
      if hw timestamping is used.
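
      The masking described above can be sketched as follows. This is a
      minimal, standalone illustration and not the driver's code: the macro
      and function names are hypothetical, and the assumption that the two
      lowest bits of the descriptor word are control bits is made only for
      this sketch. Note that the device address in the first DMA-API warning
      (0x...e244) has bit 2 set, which would be consistent with the
      explanation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for this sketch; the real driver has its own macros. */
#define RX_DESC_ADDR_MASK 0xfffffffcu /* assumed: bits 31:2 carry the address */
#define RX_DESC_TS_VALID  (1u << 2)   /* bit 2: "valid timestamp" in extended mode */

/*
 * Extract the buffer address from the first word of an rx descriptor.
 * In extended buffer descriptor mode, bit 2 is not part of the address,
 * so it has to be cleared before the value is used as a DMA address.
 */
static uint32_t rx_desc_buf_addr(uint32_t word0, bool ext_desc_mode)
{
	uint32_t addr = word0 & RX_DESC_ADDR_MASK;

	if (ext_desc_mode)
		addr &= ~RX_DESC_TS_VALID;

	return addr;
}
```

      Without the extra mask, a descriptor word ending in 0x244 would be
      read back as an address ending in 0x244 instead of 0x240, so the
      unmap call would be given an address that was never mapped.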
      
      Fixes: 7b429614 ("net: macb: Add support for PTP timestamps in DMA descriptors")
      Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
      Co-developed-by: Lars-Peter Clausen <lars@metafoo.de>
      Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
      Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20230412232144.770336-1-roman.gushchin@linux.dev
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e8b74453
    • Xin Long's avatar
      selftests: add the missing CONFIG_IP_SCTP in net config · 3a0385be
      Xin Long authored
      The selftest sctp_vrf needs CONFIG_IP_SCTP set in config
      when building the kernel, so add it.
      
      Fixes: a61bd7b9 ("selftests: add a selftest for sctp vrf")
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Link: https://lore.kernel.org/r/61dddebc4d2dd98fe7fb145e24d4b2430e42b572.1681312386.git.lucien.xin@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      3a0385be
    • Eric Dumazet's avatar
      udp6: fix potential access to stale information · 1c5950fc
      Eric Dumazet authored
      lena wang reported an issue caused by udpv6_sendmsg()
      mangling msg->msg_name and msg->msg_namelen, which
      are later read from ____sys_sendmsg() :
      
      	/*
      	 * If this is sendmmsg() and sending to current destination address was
      	 * successful, remember it.
      	 */
      	if (used_address && err >= 0) {
      		used_address->name_len = msg_sys->msg_namelen;
      		if (msg_sys->msg_name)
      			memcpy(&used_address->name, msg_sys->msg_name,
      			       used_address->name_len);
      	}
      
      udpv6_sendmsg() wants to pretend the remote address family
      is AF_INET in order to call udp_sendmsg().
      
      A fix would be to modify the address in-place, instead
      of using a local variable, but this could have other side effects.
      
      Instead, restore initial values before we return from udpv6_sendmsg().
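
      The save-and-restore approach can be sketched in plain C as below.
      This is an illustration under stated assumptions, not the kernel
      change itself: struct msg_hdr, struct v4_addr, inner_send() and
      outer_send() are hypothetical stand-ins for struct msghdr, the
      temporary AF_INET address, udp_sendmsg() and udpv6_sendmsg().

```c
#include <stddef.h>

/* Hypothetical stand-in for the two struct msghdr fields involved. */
struct msg_hdr {
	void *name;
	int namelen;
};

/* Hypothetical stand-in for the temporary IPv4 address. */
struct v4_addr {
	unsigned short family;
	unsigned int addr;
};

/* Stand-in for the inner send routine; it only reports the length it saw. */
static int inner_send(struct msg_hdr *msg)
{
	return msg->namelen;
}

/*
 * Temporarily point the message at a local address, call the inner
 * routine, then restore the caller's values so that code running after
 * the return never sees the pointer to the local variable.
 */
static int outer_send(struct msg_hdr *msg)
{
	void *saved_name = msg->name;
	int saved_len = msg->namelen;
	struct v4_addr v4 = { .family = 2, .addr = 0 }; /* illustrative values */
	int err;

	msg->name = &v4;
	msg->namelen = (int)sizeof(v4);
	err = inner_send(msg);

	/* Restore the initial values before returning. */
	msg->name = saved_name;
	msg->namelen = saved_len;

	return err;
}
```

      Once outer_send() returns, the caller's msg again refers to its own
      buffer and length, so a later copy based on msg->name and
      msg->namelen reads the original address and not a dead stack slot.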
      
      Fixes: c71d8ebe ("net: Fix security_socket_sendmsg() bypass problem.")
      Reported-by: lena wang <lena.wang@mediatek.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Maciej Żenczykowski <maze@google.com>
      Link: https://lore.kernel.org/r/20230412130308.1202254-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      1c5950fc
    • Aaron Conole's avatar
      selftests: openvswitch: adjust datapath NL message declaration · 306dc213
      Aaron Conole authored
      The netlink message for creating a new datapath takes an array
      of ports for the PID creation.  This shouldn't cause much of an issue,
      but correct it for future cases where we need to decode datapath
      information that could include the per-cpu PID map.
      
      Fixes: 25f16c87 ("selftests: add openvswitch selftest suite")
      Signed-off-by: Aaron Conole <aconole@redhat.com>
      Link: https://lore.kernel.org/r/20230412115828.3991806-1-aconole@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      306dc213
    • Jakub Kicinski's avatar
      Merge branch 'mptcp-more-fixes-for-6-3' · ecfcc6fb
      Jakub Kicinski authored
      Matthieu Baerts says:
      
      ====================
      mptcp: more fixes for 6.3
      
      Patch 1 avoids scheduling the MPTCP worker on a closed socket in some
      edge cases. It fixes issues that can be visible from v5.11.
      
      Patch 2 makes sure the MPTCP worker doesn't try to manipulate
      disconnected sockets. This is also a fix for an issue that can be
      visible from v5.11.
      
      Patch 3 fixes a NULL pointer dereference when MPTCP FastOpen is used
      and an early fallback is done. A fix for v6.2.
      
      Patch 4 improves the stability of the userspace PM selftest for a
      subtest added in v6.2.
      ====================
      
      Link: https://lore.kernel.org/r/20230411-upstream-net-20230411-mptcp-fixes-v1-0-ca540f3ef986@tessares.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ecfcc6fb
    • Matthieu Baerts's avatar
      selftests: mptcp: userspace pm: uniform verify events · 711ae788
      Matthieu Baerts authored
      Simply adding a "sleep" before checking something is usually not a good
      idea because the chosen duration may be too short or too long. It is
      better to wait for events with a timeout.
      
      In this selftest, 'sleep 0.5' is used more than 40 times. It is always
      used before calling a 'verify_*' function, except for
      verify_listener_events, which was added later.
      
      In the end, all these 'sleep 0.5' calls seem to work: the slow CIs
      have not complained so far. Because they do not take much time either,
      we can just add two more 'sleep 0.5' so that every 'verify_*' call is
      preceded by the same wait. For the same reasons, we can also postpone a
      bigger refactoring that would replace all these 'sleep 0.5' with
      functions waiting for events instead of waiting for a fixed time and
      hoping for the best.
      
      Fixes: 6c73008a ("selftests: mptcp: listener test for userspace PM")
      Cc: stable@vger.kernel.org
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      711ae788
    • Paolo Abeni's avatar
      mptcp: fix NULL pointer dereference on fastopen early fallback · c0ff6f6d
      Paolo Abeni authored
      In case of early fallback to TCP, subflow_syn_recv_sock() deletes
      the subflow context before returning the newly allocated sock to
      the caller.
      
      The fastopen path does not cope with the above and unconditionally
      dereferences the subflow context.
      
      Fixes: 36b122ba ("mptcp: add subflow_v(4,6)_send_synack()")
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c0ff6f6d
    • Paolo Abeni's avatar
      mptcp: stricter state check in mptcp_worker · d6a04437
      Paolo Abeni authored
      As reported by Christoph, the mptcp protocol can run the
      worker when the relevant msk socket is in an unexpected state:
      
      connect()
      // incoming reset + fastclose
      // the mptcp worker is scheduled
      mptcp_disconnect()
      // msk is now CLOSED
      listen()
      mptcp_worker()
      
      Leading to the following splat:
      
      divide error: 0000 [#1] PREEMPT SMP
      CPU: 1 PID: 21 Comm: kworker/1:0 Not tainted 6.3.0-rc1-gde5e8fd0123c #11
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      Workqueue: events mptcp_worker
      RIP: 0010:__tcp_select_window+0x22c/0x4b0 net/ipv4/tcp_output.c:3018
      RSP: 0018:ffffc900000b3c98 EFLAGS: 00010293
      RAX: 000000000000ffd7 RBX: 000000000000ffd7 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffffffff8214ce97 RDI: 0000000000000004
      RBP: 000000000000ffd7 R08: 0000000000000004 R09: 0000000000010000
      R10: 000000000000ffd7 R11: ffff888005afa148 R12: 000000000000ffd7
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88803ed00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000405270 CR3: 000000003011e006 CR4: 0000000000370ee0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       tcp_select_window net/ipv4/tcp_output.c:262 [inline]
       __tcp_transmit_skb+0x356/0x1280 net/ipv4/tcp_output.c:1345
       tcp_transmit_skb net/ipv4/tcp_output.c:1417 [inline]
       tcp_send_active_reset+0x13e/0x320 net/ipv4/tcp_output.c:3459
       mptcp_check_fastclose net/mptcp/protocol.c:2530 [inline]
       mptcp_worker+0x6c7/0x800 net/mptcp/protocol.c:2705
       process_one_work+0x3bd/0x950 kernel/workqueue.c:2390
       worker_thread+0x5b/0x610 kernel/workqueue.c:2537
       kthread+0x138/0x170 kernel/kthread.c:376
       ret_from_fork+0x2c/0x50 arch/x86/entry/entry_64.S:308
       </TASK>
      
      This change addresses the issue by explicitly checking for bad states
      before running the mptcp worker.
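
      One common way to express such a state check is a bitmask test, as
      sketched below. This is only an illustration: the enum, its numeric
      values, the macros and worker_may_run() are hypothetical names chosen
      for this sketch, and the choice of CLOSE and LISTEN as the rejected
      states is an assumption taken from the sequence shown above.

```c
#include <stdbool.h>

/* Illustrative socket states; names and values are assumptions. */
enum sk_state {
	ST_ESTABLISHED = 1,
	ST_CLOSE = 7,
	ST_LISTEN = 10,
};

/* One flag bit per state, so several states can be tested at once. */
#define STATE_FLAG(s)     (1u << (s))
#define BAD_WORKER_STATES (STATE_FLAG(ST_CLOSE) | STATE_FLAG(ST_LISTEN))

/* Return true only if the worker is allowed to touch the socket. */
static bool worker_may_run(enum sk_state state)
{
	return (STATE_FLAG(state) & BAD_WORKER_STATES) == 0;
}
```

      A worker function would call such a helper first, while holding the
      socket lock, and bail out early when it returns false.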
      
      Fixes: e16163b6 ("mptcp: refactor shutdown and close")
      Cc: stable@vger.kernel.org
      Reported-by: Christoph Paasch <cpaasch@apple.com>
      Link: https://github.com/multipath-tcp/mptcp_net-next/issues/374
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Tested-by: Christoph Paasch <cpaasch@apple.com>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d6a04437