1. 20 May, 2023 7 commits
    • Aditi Ghag
      bpf: Add bpf_sock_destroy kfunc · 4ddbcb88
      Aditi Ghag authored
      The socket destroy kfunc is used to forcefully terminate sockets from
      certain BPF contexts. We plan to use the capability in Cilium
      load-balancing to terminate client sockets that continue to connect to
      deleted backends. The other use case is on-the-fly policy enforcement,
      where existing socket connections newly disallowed by policies need to
      be forcefully terminated. The kfunc also allows terminating sockets
      whether or not they are actively sending traffic.
      
      The kfunc can currently be called only from BPF TCP and UDP iterators,
      where users can filter and terminate selected sockets. More
      specifically, it can only be called from BPF contexts that ensure
      socket locking in order to allow synchronous execution of the
      protocol-specific `diag_destroy` handlers. The previous commit that
      batches UDP sockets during iteration facilitated a synchronous
      invocation of the UDP destroy callback from BPF context by skipping
      socket locks in `udp_abort`. The TCP iterator already supported
      batching of the sockets being iterated. To that end, a
      `tracing_iter_filter` callback is added so that the verifier can
      restrict the kfunc to programs with the `BPF_TRACE_ITER` attach type
      and reject other programs. (A usage sketch follows this entry.)
      
      The kfunc takes a `sock_common` argument, even though it expects, and
      casts it to, a `sock` pointer. This enables the verifier to allow the
      sock_destroy kfunc to be called for TCP with a `sock_common` pointer
      and for UDP with a `sock` pointer. Furthermore, as `sock_common` only
      has a subset of the fields of `sock`, casting the pointer to the latter
      type might not always be safe for certain sockets, such as request
      sockets, but these have special handling in the `diag_destroy`
      handlers.
      
      Additionally, the kfunc is defined with the `KF_TRUSTED_ARGS` flag to
      avoid cases where a `PTR_TO_BTF_ID` sk is obtained by following another
      pointer, e.g. getting an sk pointer (possibly even NULL) by following
      another sk pointer. The socket pointer argument passed in the TCP and
      UDP iterators is tagged as `PTR_TRUSTED` in {tcp,udp}_reg_info. The
      TRUSTED arg changes are contributed by Martin KaFai Lau
      <martin.lau@kernel.org>.
      Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
      Link: https://lore.kernel.org/r/20230519225157.760788-8-aditi.ghag@isovalent.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      4ddbcb88
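      The following is a minimal usage sketch (not the in-tree selftest): a
      BPF TCP iterator program that destroys sockets bound to a hypothetical
      local port supplied by user space. It assumes a libbpf-style build with
      a BTF-generated `vmlinux.h`; `filter_port` and the program name are
      illustrative, while the `bpf_sock_destroy()` declaration mirrors the
      kfunc described above.

      ```
      /* Sketch: destroy matching TCP sockets from a BPF TCP iterator. */
      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>

      /* Declaration of the kfunc added by this commit. */
      int bpf_sock_destroy(struct sock_common *sk) __ksym;

      /* Hypothetical knob set by the loader before attaching the iterator. */
      const volatile __u16 filter_port = 0;

      SEC("iter/tcp")
      int iter_tcp_destroy(struct bpf_iter__tcp *ctx)
      {
              struct sock_common *sk_common = ctx->sk_common;

              if (!sk_common)
                      return 0;

              /* skc_num is the local port in host byte order. */
              if (sk_common->skc_num != filter_port)
                      return 0;

              bpf_sock_destroy(sk_common);
              return 0;
      }

      char _license[] SEC("license") = "GPL";
      ```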
    • Aditi Ghag
      bpf: Add kfunc filter function to 'struct btf_kfunc_id_set' · e924e80e
      Aditi Ghag authored
      This commit adds the ability to limit kfuncs to certain BPF program
      types. This is required to restrict the bpf_sock_destroy kfunc,
      implemented in follow-up commits, to programs with the attach type
      `BPF_TRACE_ITER`.

      The commit adds a filter callback to `struct btf_kfunc_id_set`. The
      filter has access to the `bpf_prog` construct, including properties
      such as `expected_attach_type`. (A registration sketch follows this
      entry.)
      Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
      Link: https://lore.kernel.org/r/20230519225157.760788-7-aditi.ghag@isovalent.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      e924e80e
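      For illustration, a sketch of how such a filter can be wired up on the
      kernel side, modeled on the bpf_sock_destroy restriction described in
      the previous entry. The set and function names here are illustrative;
      only the `filter` member and the callback shape (a `const struct
      bpf_prog *` plus the kfunc BTF ID) come from this commit.

      ```
      /* Sketch: restrict a kfunc ID set to BPF_TRACE_ITER programs. */
      BTF_SET8_START(bpf_sk_iter_kfunc_ids)
      BTF_ID_FLAGS(func, bpf_sock_destroy, KF_TRUSTED_ARGS)
      BTF_SET8_END(bpf_sk_iter_kfunc_ids)

      static int tracing_iter_filter(const struct bpf_prog *prog, u32 kfunc_id)
      {
              /* Reject the call unless the program is a BPF_TRACE_ITER program. */
              if (btf_id_set8_contains(&bpf_sk_iter_kfunc_ids, kfunc_id) &&
                  prog->expected_attach_type != BPF_TRACE_ITER)
                      return -EACCES;
              return 0;
      }

      static const struct btf_kfunc_id_set bpf_sk_iter_kfunc_set = {
              .owner  = THIS_MODULE,
              .set    = &bpf_sk_iter_kfunc_ids,
              .filter = tracing_iter_filter,
      };

      /* Registered for tracing programs, e.g.:
       * register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sk_iter_kfunc_set);
       */
      ```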
    • Aditi Ghag
      bpf: udp: Implement batching for sockets iterator · c96dac8d
      Aditi Ghag authored
      Batch UDP sockets in the BPF iterator so that locking can overlap with
      BPF/kernel helpers executed in BPF programs. This allows the BPF socket
      destroy kfunc (introduced in follow-up patches) to execute from BPF
      iterator programs.

      Previously, BPF iterators acquired the sock lock and the sockets hash
      table bucket lock while executing BPF programs. This prevented BPF
      helpers that acquire these locks again from being executed from BPF
      iterators. With the batching approach, we acquire the bucket lock,
      batch all of the bucket's sockets, and then release the bucket lock.
      This enables BPF or kernel helpers to skip sock locking when invoked
      in the supported BPF contexts. (A simplified sketch of the pattern
      follows this entry.)
      
      The batching logic is similar to the logic implemented in TCP iterator:
      https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.
      Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
      Link: https://lore.kernel.org/r/20230519225157.760788-6-aditi.ghag@isovalent.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      c96dac8d
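      A simplified sketch of the batching pattern described above (not the
      actual `bpf_iter_udp_batch()`, which also resizes the batch, filters by
      family/netns, and handles resuming mid-bucket). The function name and
      fixed-size batch are illustrative; the hslot fields follow the existing
      UDP seq_file code.

      ```
      /* Sketch: hold the bucket lock only long enough to snapshot the
       * sockets, so the BPF program later runs without it and helpers such
       * as bpf_sock_destroy() can take the sock lock themselves.
       */
      static unsigned int batch_udp_bucket(struct udp_hslot *hslot,
                                           struct sock **batch,
                                           unsigned int max_batch)
      {
              unsigned int n = 0;
              struct sock *sk;

              spin_lock_bh(&hslot->lock);
              sk_for_each(sk, &hslot->head) {
                      if (n == max_batch)
                              break;
                      sock_hold(sk);  /* keep the socket alive after unlock */
                      batch[n++] = sk;
              }
              spin_unlock_bh(&hslot->lock);

              return n;       /* caller iterates batch[0..n) and drops the refs */
      }
      ```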
    • Aditi Ghag
      udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state · e4fe1bf1
      Aditi Ghag authored
      This is a preparatory commit to remove the field, which was previously
      shared between the proc fs and BPF UDP socket iterators. As the
      follow-up commits will decouple the implementations of the iterators,
      remove the field. As for the BPF socket iterator, filtering of sockets
      is expected to be done in BPF programs.
      Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
      Link: https://lore.kernel.org/r/20230519225157.760788-5-aditi.ghag@isovalent.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      e4fe1bf1
    • Aditi Ghag
      bpf: udp: Encapsulate logic to get udp table · 7625d2e9
      Aditi Ghag authored
      This is a preparatory commit that encapsulates the logic for getting
      the UDP table in the iterator inside udp_get_table_afinfo, and renames
      the function to `udp_get_table_seq` accordingly. (A sketch of the
      accessor follows this entry.)
      Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
      Link: https://lore.kernel.org/r/20230519225157.760788-4-aditi.ghag@isovalent.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      7625d2e9
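      A minimal sketch of the renamed accessor, assuming the usual UDP
      seq_file plumbing (`udp_seq_afinfo` reached via `pde_data()`, and the
      per-netns `net->ipv4.udp_table` as the default); the in-tree helper
      additionally special-cases the BPF iterator, which has no proc dir
      entry.

      ```
      /* Sketch: resolve which UDP hash table a seq_file walk should use. */
      static struct udp_table *udp_get_table_seq(struct seq_file *seq,
                                                 struct net *net)
      {
              const struct udp_seq_afinfo *afinfo = pde_data(file_inode(seq->file));

              /* Fall back to the per-netns table when afinfo does not pin one. */
              return afinfo->udp_table ? : net->ipv4.udp_table;
      }
      ```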
    • Aditi Ghag
      udp: seq_file: Helper function to match socket attributes · f44b1c51
      Aditi Ghag authored
      This is a preparatory commit that refactors the code matching socket
      attributes in the iterators into a helper function, and uses it in the
      proc fs iterator. (A sketch of such a matcher follows this entry.)
      Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
      Link: https://lore.kernel.org/r/20230519225157.760788-3-aditi.ghag@isovalent.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      f44b1c51
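      A sketch of the kind of matcher this refactor introduces (a minimal
      version; the in-tree helper may differ in naming and detail). It checks
      the socket against the seq_file's address family and network namespace,
      assuming helpers like `seq_file_family()` and `seq_file_net()` from the
      surrounding seq_file code.

      ```
      /* Sketch: does this socket belong to the family/netns being dumped? */
      static inline bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
      {
              unsigned short family = seq_file_family(seq);

              return (family == AF_UNSPEC || family == sk->sk_family) &&
                     net_eq(sock_net(sk), seq_file_net(seq));
      }
      ```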
    • Aditi Ghag
      bpf: tcp: Avoid taking fast sock lock in iterator · 9378096e
      Aditi Ghag authored
      This is a preparatory commit to replace `lock_sock_fast` with
      `lock_sock`, and allow BPF programs executed from the TCP sockets
      iterator to destroy TCP sockets using the bpf_sock_destroy kfunc
      (implemented in follow-up commits). (The resulting change in the
      iterator is sketched after this entry.)

      Previously, the BPF TCP iterator acquired the sock lock with BH
      disabled. This led to scenarios where the sockets hash table bucket
      lock can be acquired with BH enabled in some paths versus disabled in
      others. In such a situation, the kernel issues a warning since it
      thinks that in the BH-enabled path the same bucket lock *might* be
      acquired again in softirq context (BH disabled), which would lead to a
      potential deadlock. Since bpf_sock_destroy also happens in process
      context, the potential deadlock warning is likely a false alarm.
      
      Here is a snippet of annotated stack trace that motivated this change:
      
      ```
      
      Possible interrupt unsafe locking scenario:
      
            CPU0                    CPU1
            ----                    ----
       lock(&h->lhash2[i].lock);
                                    local_bh_disable();
                                    lock(&h->lhash2[i].lock);
      kernel imagined possible scenario:
        local_bh_disable();  /* Possible softirq */
        lock(&h->lhash2[i].lock);
      *** Potential Deadlock ***
      
      process context:
      
      lock_acquire+0xcd/0x330
      _raw_spin_lock+0x33/0x40
      ------> Acquire (bucket) lhash2.lock with BH enabled
      __inet_hash+0x4b/0x210
      inet_csk_listen_start+0xe6/0x100
      inet_listen+0x95/0x1d0
      __sys_listen+0x69/0xb0
      __x64_sys_listen+0x14/0x20
      do_syscall_64+0x3c/0x90
      entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      bpf_sock_destroy run from iterator:
      
      lock_acquire+0xcd/0x330
      _raw_spin_lock+0x33/0x40
      ------> Acquire (bucket) lhash2.lock with BH disabled
      inet_unhash+0x9a/0x110
      tcp_set_state+0x6a/0x210
      tcp_abort+0x10d/0x200
      bpf_prog_6793c5ca50c43c0d_iter_tcp6_server+0xa4/0xa9
      bpf_iter_run_prog+0x1ff/0x340
      ------> lock_sock_fast that acquires sock lock with BH disabled
      bpf_iter_tcp_seq_show+0xca/0x190
      bpf_seq_read+0x177/0x450
      
      ```
      
      Also, Yonghong reported a deadlock for non-listening TCP sockets that
      this change resolves. Previously, `lock_sock_fast` held the sock spin
      lock with BH disabled, and the same lock was then acquired again in
      `tcp_abort`:
      
      ```
      watchdog: BUG: soft lockup - CPU#0 stuck for 86s! [test_progs:2331]
      RIP: 0010:queued_spin_lock_slowpath+0xd8/0x500
      Call Trace:
       <TASK>
       _raw_spin_lock+0x84/0x90
       tcp_abort+0x13c/0x1f0
       bpf_prog_88539c5453a9dd47_iter_tcp6_client+0x82/0x89
       bpf_iter_run_prog+0x1aa/0x2c0
       ? preempt_count_sub+0x1c/0xd0
       ? from_kuid_munged+0x1c8/0x210
       bpf_iter_tcp_seq_show+0x14e/0x1b0
       bpf_seq_read+0x36c/0x6a0
      
      bpf_iter_tcp_seq_show
         lock_sock_fast
           __lock_sock_fast
             spin_lock_bh(&sk->sk_lock.slock);
              /* Fast path return with bottom halves disabled and
               * sock::sk_lock.slock held.
               */
      
       ...
       tcp_abort
         local_bh_disable();
         spin_lock(&((sk)->sk_lock.slock)); // from bh_lock_sock(sk)
      
      ```
      
      With the switch to `lock_sock`, it calls `spin_unlock_bh` before returning:
      
      ```
      lock_sock
          lock_sock_nested
             spin_lock_bh(&sk->sk_lock.slock);
             :
             spin_unlock_bh(&sk->sk_lock.slock);
      ```
      Acked-by: Yonghong Song <yhs@meta.com>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
      Link: https://lore.kernel.org/r/20230519225157.760788-2-aditi.ghag@isovalent.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      9378096e
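      The resulting change in the iterator is small; a sketch of the
      before/after around the BPF program invocation in
      `bpf_iter_tcp_seq_show()`, with surrounding code elided.

      ```
      /* lock_sock() drops sk_lock.slock and re-enables BH before returning,
       * unlike lock_sock_fast(), which can return with BH disabled.
       */
      if (sk_fullsock(sk))
              lock_sock(sk);          /* was: slow = lock_sock_fast(sk); */

      ret = tcp_prog_seq_show(prog, &meta, v, uid);

      if (sk_fullsock(sk))
              release_sock(sk);       /* was: unlock_sock_fast(sk, slow); */
      ```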
  2. 19 May, 2023 3 commits
  3. 17 May, 2023 30 commits