  1. 01 Oct, 2023 1 commit
    • tcp: derive delack_max from rto_min · bbf80d71
      Eric Dumazet authored
       While BPF allows setting icsk->icsk_delack_max
       and/or icsk->icsk_rto_min, we have an ip route
       attribute (RTAX_RTO_MIN) to tune rto_min,
       but nothing to consequently adjust the max delayed ack,
       which varies from 40 ms to 200 ms (TCP_DELACK_{MIN|MAX}).
      
      This makes RTAX_RTO_MIN of almost no practical use,
      unless customers are in big trouble.
      
       Modern datacenter communications want to set
       rto_min to ~5 ms, and the max delayed ack one jiffy
       smaller, to avoid spurious retransmits.
      
      After this patch, an "rto_min 5" route attribute will
      effectively lower max delayed ack timers to 4 ms.
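
       As a back-of-the-envelope illustration (standalone C, not the kernel
       change itself; HZ=1000 is assumed so one jiffy is 1 ms), the intended
       derivation is simply "cap the delayed-ack maximum one jiffy below
       rto_min":

           #include <stdio.h>

           #define HZ             1000          /* assume CONFIG_HZ=1000 */
           #define TCP_DELACK_MAX (HZ / 5)      /* 200 ms */

           /* Sketch: derive the max delayed-ack timeout from rto_min (both in ms). */
           static unsigned int delack_max_from_rto_min(unsigned int rto_min_ms)
           {
               unsigned int one_jiffy_ms = 1000 / HZ;
               unsigned int cap = rto_min_ms > one_jiffy_ms ?
                                  rto_min_ms - one_jiffy_ms : one_jiffy_ms;
               return cap < TCP_DELACK_MAX ? cap : TCP_DELACK_MAX;
           }

           int main(void)
           {
               /* "rto_min 5" -> 4 ms, matching the commit text. */
               printf("delack_max = %u ms\n", delack_max_from_rto_min(5));
               return 0;
           }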
      
      Note in the following ss output, "rto:6 ... ato:4"
      
      $ ss -temoi dst XXXXXX
      State Recv-Q Send-Q           Local Address:Port       Peer Address:Port  Process
      ESTAB 0      0        [2002:a05:6608:295::]:52950   [2002:a05:6608:297::]:41597
           ino:255134 sk:1001 <->
               skmem:(r0,rb1707063,t872,tb262144,f0,w0,o0,bl0,d0) ts sack
       cubic wscale:8,8 rto:6 rtt:0.02/0.002 ato:4 mss:4096 pmtu:4500
       rcvmss:536 advmss:4096 cwnd:10 bytes_sent:54823160 bytes_acked:54823121
       bytes_received:54823120 segs_out:1370582 segs_in:1370580
       data_segs_out:1370579 data_segs_in:1370578 send 16.4Gbps
       pacing_rate 32.6Gbps delivery_rate 1.72Gbps delivered:1370579
       busy:26920ms unacked:1 rcv_rtt:34.615 rcv_space:65920
       rcv_ssthresh:65535 minrtt:0.015 snd_wnd:65536
      
      While we could argue this patch fixes a bug with RTAX_RTO_MIN,
      I do not add a Fixes: tag, so that we can soak it a bit before
      asking backports to stable branches.
       Signed-off-by: Eric Dumazet <edumazet@google.com>
       Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
       Acked-by: Neal Cardwell <ncardwell@google.com>
       Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 16 Sep, 2023 1 commit
    • tcp: new TCP_INFO stats for RTO events · 3868ab0f
      Aananth V authored
       The 2023 SIGCOMM paper "Improving Network Availability with Protective
       ReRoute" has indicated that Linux TCP's RTO-triggered txhash rehashing
       can effectively reduce application disruption during outages. To better
       measure the efficacy of this feature, this patch adds three more
       detailed stats during RTO recovery and exports them via TCP_INFO.
      Applications and monitoring systems can leverage this data to measure
      the network path diversity and end-to-end repair latency during network
      outages to improve their network infrastructure.
      
      The following counters are added to tcp_sock in order to track RTO
      events over the lifetime of a TCP socket.
      
      1. u16 total_rto - Counts the total number of RTO timeouts.
      2. u16 total_rto_recoveries - Counts the total number of RTO recoveries.
      3. u32 total_rto_time - Counts the total time spent (ms) in RTO
                              recoveries. (time spent in CA_Loss and
                              CA_Recovery states)
      
      To compute total_rto_time, we add a new u32 rto_stamp field to
      tcp_sock. rto_stamp records the start timestamp (ms) of the last RTO
      recovery (CA_Loss).
      
      Corresponding fields are also added to the tcp_info struct.
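
       As a consumer-side sketch (assuming the new fields are exported as
       tcpi_total_rto, tcpi_total_rto_recoveries and tcpi_total_rto_time, and
       that the uapi headers in use already declare them), the counters can be
       read with a plain getsockopt(TCP_INFO):

           #include <stdio.h>
           #include <sys/socket.h>
           #include <netinet/in.h>
           #include <netinet/tcp.h>

           /* Dump the lifetime RTO counters for a connected TCP socket fd. */
           static void dump_rto_stats(int fd)
           {
               struct tcp_info ti;
               socklen_t len = sizeof(ti);

               if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0)
                   return;
               printf("total_rto=%u recoveries=%u time_ms=%u\n",
                      ti.tcpi_total_rto, ti.tcpi_total_rto_recoveries,
                      ti.tcpi_total_rto_time);
           }
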
       Signed-off-by: Aananth V <aananthv@google.com>
       Signed-off-by: Neal Cardwell <ncardwell@google.com>
       Signed-off-by: Yuchung Cheng <ycheng@google.com>
       Reviewed-by: Eric Dumazet <edumazet@google.com>
       Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 01 Sep, 2023 1 commit
  4. 18 Aug, 2023 1 commit
  5. 16 Aug, 2023 1 commit
  6. 06 Aug, 2023 6 commits
  7. 20 Jul, 2023 11 commits
  8. 19 Jul, 2023 1 commit
    • tcp: get rid of sysctl_tcp_adv_win_scale · dfa2f048
      Eric Dumazet authored
      With modern NIC drivers shifting to full page allocations per
      received frame, we face the following issue:
      
       TCP has one per-netns sysctl used to tweak how to translate
       memory use into an expected payload (RWIN) in the RX path.
      
       The tcp_win_from_space() implementation is limited to a few cases.
      
       For hosts dealing with various MSS, we either underestimate
       or overestimate the RWIN we send to the remote peers.
      
      For instance with the default sysctl_tcp_adv_win_scale value,
      we expect to store 50% of payload per allocated chunk of memory.
      
       For the typical case of MTU=1500 traffic and order-0 page allocations
       by NIC drivers, we send an RWIN that is too big, leading to potential
       TCP collapse operations, which are extremely expensive and a source
       of latency spikes.
      
      This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
      uses a per socket scaling factor, so that we can precisely
      adjust the RWIN based on effective skb->len/skb->truesize ratio.
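
       To make the mismatch concrete, here is a standalone sketch (not kernel
       code; the numbers are illustrative) contrasting the old adv_win_scale
       heuristic with a per-socket len/truesize ratio for MTU-1500 frames
       backed by full-page allocations:

           #include <stdio.h>

           /* Old heuristic: tcp_adv_win_scale describes the assumed payload
            * fraction of receive memory (default 1 -> 50%), regardless of the
            * real skb->len/skb->truesize ratio. */
           static long long win_from_space_old(long long space, int adv_win_scale)
           {
               return adv_win_scale <= 0 ? space >> -adv_win_scale
                                         : space - (space >> adv_win_scale);
           }

           /* New idea: scale the advertised window by the measured ratio. */
           static long long win_from_space_new(long long space, int len, int truesize)
           {
               return space * len / truesize;
           }

           int main(void)
           {
               long long space = 1 << 20;   /* 1 MiB of receive memory */

               /* ~1448 payload bytes in a full 4 KiB page plus skb overhead: */
               printf("old RWIN: %lld\n", win_from_space_old(space, 1));
               printf("new RWIN: %lld\n", win_from_space_new(space, 1448, 4096 + 256));
               return 0;
           }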
      
       This patch alone can double TCP receive performance, either when
       receivers are too slow to drain their receive queue or by allowing
       a bigger RWIN when MSS is close to PAGE_SIZE.
       Signed-off-by: Eric Dumazet <edumazet@google.com>
       Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
       Link: https://lore.kernel.org/r/20230717152917.751987-1-edumazet@google.com
       Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  9. 24 Jun, 2023 2 commits
  10. 18 Jun, 2023 1 commit
    • tcp: Use per-vma locking for receive zerocopy · 7a7f0946
      Arjun Roy authored
      Per-VMA locking allows us to lock a struct vm_area_struct without
      taking the process-wide mmap lock in read mode.
      
       Consider a process workload where the mmap lock is taken constantly in
       write mode. In this scenario, all zerocopy receives are repeatedly
       blocked while that lock is held - though in principle, the memory
       ranges being used by TCP are not touched by the operations that need
       the mmap write lock. This results in performance degradation.
      
      Now consider another workload where the mmap lock is never taken in
      write mode, but there are many TCP connections using receive zerocopy
      that are concurrently receiving. These connections all take the mmap
      lock in read mode, but this does induce a lot of contention and atomic
      ops for this process-wide lock. This results in additional CPU
      overhead caused by contending on the cache line for this lock.
      
      However, with per-vma locking, both of these problems can be avoided.
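
       The lookup pattern this enables is roughly the following (a simplified
       sketch of the find_tcp_vma idea mentioned below; the real helper also
       validates that the VMA actually belongs to TCP):

           /* Try the per-VMA lock first; fall back to the mmap read lock. */
           static struct vm_area_struct *find_tcp_vma_sketch(struct mm_struct *mm,
                                                             unsigned long address,
                                                             bool *mmap_locked)
           {
                   struct vm_area_struct *vma = lock_vma_under_rcu(mm, address);

                   if (vma) {
                           *mmap_locked = false;   /* only this VMA is locked */
                           return vma;
                   }

                   mmap_read_lock(mm);             /* contended, process-wide path */
                   vma = vma_lookup(mm, address);
                   if (!vma) {
                           mmap_read_unlock(mm);
                           return NULL;
                   }
                   *mmap_locked = true;
                   return vma;
           }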
      
      As a test, I ran an RPC-style request/response workload with 4KB
      payloads and receive zerocopy enabled, with 100 simultaneous TCP
      connections. I measured perf cycles within the
      find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and
      without per-vma locking enabled.
      
      When using process-wide mmap semaphore read locking, about 1% of
      measured perf cycles were within this path. With per-VMA locking, this
      value dropped to about 0.45%.
       Signed-off-by: Arjun Roy <arjunroy@google.com>
       Reviewed-by: Eric Dumazet <edumazet@google.com>
       Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 16 Jun, 2023 1 commit
    • net: ioctl: Use kernel memory on protocol ioctl callbacks · e1d001fa
      Breno Leitao authored
       Most of the ioctls for net protocols operate directly on the userspace
       argument (arg), usually doing get_user()/put_user() directly in the
       ioctl callback.  This is not flexible, because it is hard to reuse these
       functions without passing userspace buffers.
      
      Change the "struct proto" ioctls to avoid touching userspace memory and
      operate on kernel buffers, i.e., all protocol's ioctl callbacks is
      adapted to operate on a kernel memory other than on userspace (so, no
      more {put,get}_user() and friends being called in the ioctl callback).
      
      This changes the "struct proto" ioctl format in the following way:
      
          int                     (*ioctl)(struct sock *sk, int cmd,
      -                                        unsigned long arg);
      +                                        int *karg);
      
      (Important to say that this patch does not touch the "struct proto_ops"
      protocols)
      
      So, the "karg" argument, which is passed to the ioctl callback, is a
      pointer allocated to kernel space memory (inside a function wrapper).
      This buffer (karg) may contain input argument (copied from userspace in
      a prep function) and it might return a value/buffer, which is copied
      back to userspace if necessary. There is not one-size-fits-all format
      (that is I am using 'may' above), but basically, there are three type of
      ioctls:
      
       1) Do not read from userspace, return a result to userspace.
       2) Read an input parameter from userspace, and do not return anything
         to userspace.
       3) Read an input from userspace, and return a buffer to userspace.
      
      The default case (1) (where no input parameter is given, and an "int" is
      returned to userspace) encompasses more than 90% of the cases, but there
      are two other exceptions. Here is a list of exceptions:
      
      * Protocol RAW:
         * cmd = SIOCGETVIFCNT:
           * input and output = struct sioc_vif_req
         * cmd = SIOCGETSGCNT
           * input and output = struct sioc_sg_req
         * Explanation: for the SIOCGETVIFCNT case, userspace passes the input
           argument, which is struct sioc_vif_req. Then the callback populates
           the struct, which is copied back to userspace.
      
      * Protocol RAW6:
         * cmd = SIOCGETMIFCNT_IN6
           * input and output = struct sioc_mif_req6
         * cmd = SIOCGETSGCNT_IN6
           * input and output = struct sioc_sg_req6
      
      * Protocol PHONET:
        * cmd == SIOCPNADDRESOURCE | SIOCPNDELRESOURCE
           * input int (4 bytes)
        * Nothing is copied back to userspace.
      
       For the exception cases, the function sock_sk_ioctl_inout() copies the
       userspace input into kernel space, and copies the result back to userspace.
      
       The wrapper that prepares the buffer and puts it back to userspace is
       sk_ioctl(); so, instead of calling sk->sk_prot->ioctl(), callers now
       call sk_ioctl(), which handles all cases.
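
       For the default case (1), the wrapper boils down to something like this
       sketch (simplified; the exact flow and naming in the patch may differ):

           /* Common path: no input copied in, an int result copied back out. */
           static int sk_ioctl_sketch(struct sock *sk, unsigned int cmd,
                                      int __user *arg)
           {
                   int karg = 0;
                   int ret = sk->sk_prot->ioctl(sk, cmd, &karg);

                   if (ret)
                           return ret;
                   return put_user(karg, arg);
           }
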
       Signed-off-by: Breno Leitao <leitao@debian.org>
       Reviewed-by: Willem de Bruijn <willemb@google.com>
       Reviewed-by: David Ahern <dsahern@kernel.org>
       Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
       Link: https://lore.kernel.org/r/20230609152800.830401-1-leitao@debian.org
       Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  12. 12 Jun, 2023 2 commits
  13. 09 Jun, 2023 1 commit
  14. 30 May, 2023 2 commits
    • tcp: Return user_mss for TCP_MAXSEG in CLOSE/LISTEN state if user_mss set · 34dfde4a
      Cambda Zhu authored
       This patch replaces the tp->mss_cache check in getting TCP_MAXSEG
       with a tp->rx_opt.user_mss check for CLOSE/LISTEN sockets. Since
       tp->mss_cache is initialized with TCP_MSS_DEFAULT, checking if
       it's zero is probably a bug.
      
      With this change, getting TCP_MAXSEG before connecting will return
      default MSS normally, and return user_mss if user_mss is set.
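
       In userspace terms, the behaviour described above can be checked with a
       snippet like this (illustrative only, error handling omitted):

           #include <stdio.h>
           #include <sys/socket.h>
           #include <netinet/in.h>
           #include <netinet/tcp.h>

           int main(void)
           {
               int fd = socket(AF_INET, SOCK_STREAM, 0);
               int mss = 1400;                 /* becomes tp->rx_opt.user_mss */
               socklen_t len = sizeof(mss);

               setsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss));
               mss = 0;
               getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len);
               /* Before the fix this reported the default MSS even though
                * TCP_MAXSEG was set; now it reports 1400 before connect(). */
               printf("TCP_MAXSEG before connect: %d\n", mss);
               return 0;
           }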
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
       Reported-by: Jack Yang <mingliang@linux.alibaba.com>
       Suggested-by: Eric Dumazet <edumazet@google.com>
       Link: https://lore.kernel.org/netdev/CANn89i+3kL9pYtkxkwxwNMzvC_w3LNUum_2=3u+UyLBmGmifHA@mail.gmail.com/#t
       Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
       Link: https://lore.kernel.org/netdev/14D45862-36EA-4076-974C-EA67513C92F6@linux.alibaba.com/
       Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
       Reviewed-by: Eric Dumazet <edumazet@google.com>
       Link: https://lore.kernel.org/r/20230527040317.68247-1-cambda@linux.alibaba.com
       Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: deny tcp_disconnect() when threads are waiting · 4faeee0c
      Eric Dumazet authored
      Historically connect(AF_UNSPEC) has been abused by syzkaller
      and other fuzzers to trigger various bugs.
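
       (For reference, the disconnect idiom in question is just a connect()
       with an AF_UNSPEC address, which the kernel maps to tcp_disconnect();
       a minimal sketch:)

           #include <string.h>
           #include <sys/socket.h>

           /* Disconnect an existing TCP socket: sa_family == AF_UNSPEC in
            * connect() is the userspace way to request tcp_disconnect(). */
           static int tcp_disconnect_fd(int fd)
           {
               struct sockaddr sa;

               memset(&sa, 0, sizeof(sa));
               sa.sa_family = AF_UNSPEC;
               return connect(fd, &sa, sizeof(sa));
           }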
      
      A recent one triggers a divide-by-zero [1], and Paolo Abeni
      was able to diagnose the issue.
      
       tcp_recvmsg_locked() checks that sk_state is not TCP_LISTEN
       and that TCP REPAIR mode is not in use.
      
       Then, later, if the socket lock is released in sk_wait_data(),
       another thread can call connect(AF_UNSPEC) and then make this
       socket a TCP listener.
      
      When recvmsg() is resumed, it can eventually call tcp_cleanup_rbuf()
      and attempt a divide by 0 in tcp_rcv_space_adjust() [1]
      
       This patch adds a new socket field, counting the number of threads
       blocked in sk_wait_event() and inet_wait_for_connect().
      
      If this counter is not zero, tcp_disconnect() returns an error.
      
       This patch adds code in blocking socket system calls, and thus should
       not hurt the performance of non-blocking ones.
      
      Note that we probably could revert commit 499350a5 ("tcp:
      initialize rcv_mss to TCP_MIN_MSS instead of 0") to restore
      original tcpi_rcv_mss meaning (was 0 if no payload was ever
      received on a socket)
      
      [1]
      divide error: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 13832 Comm: syz-executor.5 Not tainted 6.3.0-rc4-syzkaller-00224-g00c7b5f4 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023
      RIP: 0010:tcp_rcv_space_adjust+0x36e/0x9d0 net/ipv4/tcp_input.c:740
      Code: 00 00 00 00 fc ff df 4c 89 64 24 48 8b 44 24 04 44 89 f9 41 81 c7 80 03 00 00 c1 e1 04 44 29 f0 48 63 c9 48 01 e9 48 0f af c1 <49> f7 f6 48 8d 04 41 48 89 44 24 40 48 8b 44 24 30 48 c1 e8 03 48
      RSP: 0018:ffffc900033af660 EFLAGS: 00010206
      RAX: 4a66b76cbade2c48 RBX: ffff888076640cc0 RCX: 00000000c334e4ac
      RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000001
      RBP: 00000000c324e86c R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff8880766417f8
      R13: ffff888028fbb980 R14: 0000000000000000 R15: 0000000000010344
      FS: 00007f5bffbfe700(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000001b32f25000 CR3: 000000007ced0000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
      <TASK>
      tcp_recvmsg_locked+0x100e/0x22e0 net/ipv4/tcp.c:2616
      tcp_recvmsg+0x117/0x620 net/ipv4/tcp.c:2681
      inet6_recvmsg+0x114/0x640 net/ipv6/af_inet6.c:670
      sock_recvmsg_nosec net/socket.c:1017 [inline]
      sock_recvmsg+0xe2/0x160 net/socket.c:1038
      ____sys_recvmsg+0x210/0x5a0 net/socket.c:2720
      ___sys_recvmsg+0xf2/0x180 net/socket.c:2762
      do_recvmmsg+0x25e/0x6e0 net/socket.c:2856
      __sys_recvmmsg net/socket.c:2935 [inline]
      __do_sys_recvmmsg net/socket.c:2958 [inline]
      __se_sys_recvmmsg net/socket.c:2951 [inline]
      __x64_sys_recvmmsg+0x20f/0x260 net/socket.c:2951
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f5c0108c0f9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f5bffbfe168 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
      RAX: ffffffffffffffda RBX: 00007f5c011ac050 RCX: 00007f5c0108c0f9
      RDX: 0000000000000001 RSI: 0000000020000bc0 RDI: 0000000000000003
      RBP: 00007f5c010e7b39 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000122 R11: 0000000000000246 R12: 0000000000000000
      R13: 00007f5c012cfb1f R14: 00007f5bffbfe300 R15: 0000000000022000
      </TASK>
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
       Reported-by: syzbot <syzkaller@googlegroups.com>
       Reported-by: Paolo Abeni <pabeni@redhat.com>
       Diagnosed-by: Paolo Abeni <pabeni@redhat.com>
       Signed-off-by: Eric Dumazet <edumazet@google.com>
       Tested-by: Paolo Abeni <pabeni@redhat.com>
       Link: https://lore.kernel.org/r/20230526163458.2880232-1-edumazet@google.com
       Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  15. 24 May, 2023 3 commits
  16. 23 May, 2023 2 commits
    • bpf, sockmap: Incorrectly handling copied_seq · e5c6de5f
      John Fastabend authored
       The read_skb() logic increments tcp->copied_seq, which is used, among
       other things, to calculate how many outstanding bytes can be read by
       the application. This results in application errors: if the application
       does an ioctl(FIONREAD), we return zero because that value is calculated
       from copied_seq.
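
       For reference, FIONREAD here is the standard "how many bytes are ready
       to read" query; a minimal userspace check looks like this:

           #include <sys/ioctl.h>

           /* Returns the number of readable bytes, or -1 on error. Before
            * this fix, sockets behind a sockmap verdict program could report
            * 0 here even though data had been queued. */
           static int readable_bytes(int fd)
           {
               int n = 0;

               if (ioctl(fd, FIONREAD, &n) < 0)
                   return -1;
               return n;
           }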
      
       To fix this we move tcp->copied_seq accounting into the recv handler so
       that we update it when the recvmsg() hook is called and data is in
       fact copied into user buffers. This gives an accurate FIONREAD value
       as expected and improves ACK handling. Before, we were calling
       tcp_rcv_space_adjust(), which would update 'number of bytes copied to
       user in last RTT', which is wrong for programs returning SK_PASS: the
       bytes are only copied to the user when recvmsg is handled.
      
       Doing the fix for recvmsg is straightforward, but fixing redirect and
       SK_DROP pkts is a bit trickier. Build a tcp_psock_eat() helper and then
       call this from the skmsg handlers. This fixes another issue where a
       broken socket with a BPF program doing a resubmit could hang the
       receiver. This happened because although read_skb() consumed the skb
       through sock_drop() it did not update the copied_seq. Now if a single
       recv socket is redirecting to many sockets (for example for lb) the
       receiver sk will be hung even though we might expect it to continue.
       The hang comes from not updating the copied_seq numbers and the memory
       pressure resulting from that.
      
       We have a slight layering problem of calling tcp_eat_skb even if it's
       not a TCP socket. To fix this we could refactor and create per-type
       receiver handlers. I decided this is more work than we want in the fix,
       and we already have some small tweaks depending on the caller that use
       the helper skb_bpf_strparser(). So we extend that a bit and always set
       the strparser bit when it is in use, and then we can gate the
       copied_seq updates on this.
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
       Signed-off-by: John Fastabend <john.fastabend@gmail.com>
       Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
       Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-9-john.fastabend@gmail.com
    • bpf, sockmap: Pass skb ownership through read_skb · 78fa0d61
      John Fastabend authored
      The read_skb hook calls consume_skb() now, but this means that if the
      recv_actor program wants to use the skb it needs to inc the ref cnt
      so that the consume_skb() doesn't kfree the sk_buff.
      
      This is problematic because in some error cases under memory pressure
      we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
      Then we get this,
      
       skb_linearize()
         __pskb_pull_tail()
           pskb_expand_head()
             BUG_ON(skb_shared(skb))
      
       Because we incremented the users refcnt from sk_psock_verdict_recv(),
       we hit the BUG_ON with refcnt > 1 and trip it.
      
       To fix this, let's simply pass ownership of the sk_buff through the
       read_skb call. Then we can drop the consume from the read_skb handlers
       and assume the verdict recv does any required kfree.
      
      Bug found while testing in our CI which runs in VMs that hit memory
      constraints rather regularly. William tested TCP read_skb handlers.
      
      [  106.536188] ------------[ cut here ]------------
      [  106.536197] kernel BUG at net/core/skbuff.c:1693!
      [  106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
      [  106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
      [  106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
      [  106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
      [  106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
      [  106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
      [  106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
      [  106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
      [  106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
      [  106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
      [  106.540568] FS:  00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
      [  106.540954] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
      [  106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  106.542255] Call Trace:
      [  106.542383]  <IRQ>
      [  106.542487]  __pskb_pull_tail+0x4b/0x3e0
      [  106.542681]  skb_ensure_writable+0x85/0xa0
      [  106.542882]  sk_skb_pull_data+0x18/0x20
      [  106.543084]  bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
      [  106.543536]  ? migrate_disable+0x66/0x80
      [  106.543871]  sk_psock_verdict_recv+0xe2/0x310
      [  106.544258]  ? sk_psock_write_space+0x1f0/0x1f0
      [  106.544561]  tcp_read_skb+0x7b/0x120
      [  106.544740]  tcp_data_queue+0x904/0xee0
      [  106.544931]  tcp_rcv_established+0x212/0x7c0
      [  106.545142]  tcp_v4_do_rcv+0x174/0x2a0
      [  106.545326]  tcp_v4_rcv+0xe70/0xf60
      [  106.545500]  ip_protocol_deliver_rcu+0x48/0x290
      [  106.545744]  ip_local_deliver_finish+0xa7/0x150
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
       Reported-by: William Findlay <will@isovalent.com>
       Signed-off-by: John Fastabend <john.fastabend@gmail.com>
       Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
       Tested-by: William Findlay <will@isovalent.com>
       Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com
  17. 20 May, 2023 1 commit
    • bpf: Add bpf_sock_destroy kfunc · 4ddbcb88
      Aditi Ghag authored
      The socket destroy kfunc is used to forcefully terminate sockets from
      certain BPF contexts. We plan to use the capability in Cilium
      load-balancing to terminate client sockets that continue to connect to
      deleted backends.  The other use case is on-the-fly policy enforcement
      where existing socket connections prevented by policies need to be
      forcefully terminated.  The kfunc also allows terminating sockets that may
      or may not be actively sending traffic.
      
       The kfunc can currently be called only from BPF TCP and UDP iterators,
       where users can filter and terminate selected sockets. More
       specifically, it can only be called from BPF contexts that ensure
       socket locking in order to allow synchronous execution of
       protocol-specific `diag_destroy` handlers. The previous commit that
       batches UDP sockets during iteration facilitated a synchronous
       invocation of the UDP destroy callback from BPF context by skipping
       socket locks in `udp_abort`. The TCP iterator already supported
       batching of the sockets being iterated. To that end, a
       `tracing_iter_filter` callback filter is added so that the verifier can
       restrict the kfunc to programs with the `BPF_TRACE_ITER` attach type,
       and reject other programs.
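
       A minimal sketch of such an iterator program (modelled on this series'
       selftests; the declarations and field names here are assumptions):

           #include <vmlinux.h>
           #include <bpf/bpf_helpers.h>

           /* kfunc added by this patch, resolved as a kernel symbol. */
           int bpf_sock_destroy(struct sock_common *sk) __ksym;

           const volatile __u16 victim_port = 0;   /* set by the loader */

           SEC("iter/tcp")
           int destroy_by_port(struct bpf_iter__tcp *ctx)
           {
                   struct sock_common *sk_common = ctx->sk_common;

                   if (!sk_common)
                           return 0;
                   /* Filter on the local port, then terminate the socket. */
                   if (victim_port && sk_common->skc_num == victim_port)
                           bpf_sock_destroy(sk_common);
                   return 0;
           }

           char _license[] SEC("license") = "GPL";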
      
       The kfunc takes a `sock_common` type argument, even though it expects,
       and casts it to, a `sock` pointer. This enables the verifier to allow
       the sock_destroy kfunc to be called for TCP with `sock_common` and UDP
       with `sock` structs. Furthermore, as `sock_common` only has a subset of
       the fields of `sock`, casting the pointer to the latter type might not
       always be safe for certain sockets, like request sockets, but these
       have special handling in the diag_destroy handlers.
      
       Additionally, the kfunc is defined with the `KF_TRUSTED_ARGS` flag to
       avoid cases where a `PTR_TO_BTF_ID` sk is obtained by following another
       pointer, e.g. getting an sk pointer (maybe even NULL) by following
       another sk pointer. The socket pointer argument passed in the TCP and
       UDP iterators is tagged as `PTR_TRUSTED` in {tcp,udp}_reg_info.  The
       TRUSTED arg changes are contributed by Martin KaFai Lau
       <martin.lau@kernel.org>.
       Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
       Link: https://lore.kernel.org/r/20230519225157.760788-8-aditi.ghag@isovalent.com
       Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
  18. 17 May, 2023 2 commits