1. 23 May, 2023 8 commits
    • bpf, sockmap: Incorrectly handling copied_seq · e5c6de5f
      John Fastabend authored
      The read_skb() logic increments tcp->copied_seq, which is used for,
      among other things, calculating how many outstanding bytes can be
      read by the application. Because the sequence is advanced before data
      is actually copied to user buffers, this results in application
      errors: if the application does an ioctl(FIONREAD) we return zero,
      since the value is calculated from copied_seq.
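      As a minimal user-space illustration of the symptom (not part of the
      patch; socket setup omitted), FIONREAD reports zero here even though
      recv() would return data:

       #include <stdio.h>
       #include <sys/ioctl.h>

       /* Query how many bytes are readable on fd; with this bug the kernel
        * reports 0 despite data sitting in the receive queue.
        */
       int check_fionread(int fd)
       {
               int avail = 0;

               if (ioctl(fd, FIONREAD, &avail) < 0) {
                       perror("ioctl(FIONREAD)");
                       return -1;
               }
               printf("bytes readable: %d\n", avail);
               return avail;
       }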
      
      To fix this we move the tcp->copied_seq accounting into the recv
      handler, so that we update it when the recvmsg() hook is called and
      data is in fact copied into user buffers. This gives an accurate
      FIONREAD value as expected and improves ACK handling. Before, we were
      calling tcp_rcv_space_adjust(), which would update the 'number of
      bytes copied to user in last RTT'; this is wrong for programs
      returning SK_PASS, because the bytes are only copied to the user when
      recvmsg is handled.
      
      Doing the fix for recvmsg is straightforward, but fixing redirect and
      SK_DROP pkts is a bit trickier. Build a tcp_psock_eat() helper and
      then call this from the skmsg handlers. This fixes another issue,
      where a broken socket with a BPF program doing a resubmit could hang
      the receiver. This happened because, although read_skb() consumed the
      skb through sock_drop(), it did not update copied_seq. So if a single
      recv socket is redirecting to many sockets (for example for load
      balancing), the receiver sk will be hung even though we might expect
      it to continue. The hang comes from not updating the copied_seq
      numbers and the memory pressure resulting from that.
      
      We have a slight layering problem of calling tcp_eat_skb even if it's
      not a TCP socket. To fix this we could refactor and create per-type
      receive handlers. I decided this is more work than we want in the
      fix, and we already have some small tweaks, depending on the caller,
      that use the helper skb_bpf_strparser(). So we extend that a bit:
      always set the strparser bit when it is in use, and then we can gate
      the copied_seq updates on it.
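      As a rough sketch of that gating (hypothetical code, not the exact
      upstream diff: the helper name follows the commit text, and both the
      direction of the strparser check and the tcp_rcv_space_adjust()/
      __tcp_cleanup_rbuf() calls are assumptions here):

       /* Hypothetical sketch: advance copied_seq for skbs consumed by the
        * verdict path (redirect/SK_DROP). Skbs marked by the strparser are
        * assumed to have been accounted for already via tcp_read_sock(),
        * so they are skipped.
        */
       static void tcp_psock_eat(struct sock *sk, struct sk_buff *skb)
       {
               struct tcp_sock *tcp = tcp_sk(sk);
               int copied;

               if (!skb || !skb->len || skb_bpf_strparser(skb))
                       return;

               copied = skb->len;
               WRITE_ONCE(tcp->copied_seq, tcp->copied_seq + copied);
               tcp_rcv_space_adjust(sk);       /* assumed: refresh rcv space */
               __tcp_cleanup_rbuf(sk, copied); /* assumed: send ACK if needed */
       }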
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-9-john.fastabend@gmail.com
    • bpf, sockmap: Wake up polling after data copy · 6df7f764
      John Fastabend authored
      When the TCP stack has data ready to read, sk_data_ready() is called.
      Sockmap overwrites this with its own handler to call into the BPF
      verdict program. But the original TCP socket had sock_def_readable,
      which would additionally wake up any user space waiters with
      sk_wake_async().
      
      Sockmap saved the original callback when the socket was created, so
      call the saved data ready callback; then we can wake up any epoll()
      logic waiting on the read.
      
      Note we wake on 'copied >= 0' to account for read_skb() returning 0
      when a FIN is received, because we need to wake up the user for this
      as well so they can do the recvmsg() -> 0 and detect the shutdown.
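      A sketch of how the handler could look with the wake-up added (shape
      assumed from the description; saved_data_ready is the callback
      sockmap stashed when the psock was created):

       static void sk_psock_verdict_data_ready(struct sock *sk)
       {
               struct socket *sock = sk->sk_socket;
               int copied;

               if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
                       return;
               copied = sock->ops->read_skb(sk, sk_psock_verdict_recv);
               if (copied >= 0) {      /* 0 covers the FIN case */
                       struct sk_psock *psock;

                       rcu_read_lock();
                       psock = sk_psock(sk);
                       if (psock)
                               psock->saved_data_ready(sk);
                       rcu_read_unlock();
               }
       }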
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-8-john.fastabend@gmail.com
    • bpf, sockmap: TCP data stall on recv before accept · ea444185
      John Fastabend authored
      A common mechanism to put a TCP socket into the sockmap is to hook
      the BPF_SOCK_OPS_{ACTIVE,PASSIVE}_ESTABLISHED_CB events with a BPF
      program that can map the socket info to the correct BPF verdict
      parser. When the user adds the socket to the map, the psock is
      created and the new ops are assigned to ensure the verdict program
      will 'see' the sk_buffs as they arrive.
      
      Part of this process hooks the sk_data_ready op with a BPF specific
      handler to wake up the BPF verdict program when data is ready to read.
      The logic is simple enough (posted here for easy reading)
      
       static void sk_psock_verdict_data_ready(struct sock *sk)
       {
      	struct socket *sock = sk->sk_socket;
      
      	if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
      		return;
      	sock->ops->read_skb(sk, sk_psock_verdict_recv);
       }
      
      The oversight here is that sk->sk_socket is not assigned until the
      application accept()s the new socket. However, it's entirely OK for
      the peer application to do a connect() followed immediately by sends.
      The socket on the receiver side sits on the backlog queue of the
      listening socket until it is accepted and the data is queued up. If
      the application never accepts the socket, or is slow to do so, the
      session will eventually hit data limits and be rate limited. But,
      importantly for BPF sockmap hooks, when this data is received the TCP
      stack does the sk_data_ready() call, yet read_skb() is never called
      for this data because sk_socket is missing. The data sits on the
      sk_receive_queue.
      
      Then, once the socket is accepted, if we never receive more data from
      the peer there will be no further sk_data_ready calls and all the
      data is still on the sk_receive_queue. When the user calls recvmsg
      after accept(), for TCP sockets in sockmap we use the
      tcp_bpf_recvmsg_parser() handler. The handler checks for data in the
      sk_msg ingress queue, expecting that the BPF program has already run
      from the sk_data_ready hook and enqueued the data as needed. So we
      are stuck.
      
      To fix this, do an unlikely check in the recvmsg handler for data on
      the sk_receive_queue and, if it exists, wake up data_ready. We have
      the sock locked in both read_skb and recvmsg, so this should avoid
      having multiple runners.
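      Inside the recvmsg handler the check could take roughly this shape
      (a sketch; the exact wake-up call is an assumption):

               lock_sock(sk);
               /* Pre-accept data may sit on sk_receive_queue with no
                * verdict run yet; kick data_ready so read_skb() drains it.
                * Holding the sock lock here and in read_skb prevents
                * multiple runners.
                */
               if (unlikely(!skb_queue_empty(&sk->sk_receive_queue)))
                       sk->sk_data_ready(sk);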
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-7-john.fastabend@gmail.com
    • bpf, sockmap: Handle fin correctly · 901546fd
      John Fastabend authored
      The sockmap code is returning EAGAIN after a FIN packet is received
      and no more data is on the receive queue. Correct behavior is to
      return 0 to the user, and the user can then close the socket. The
      EAGAIN causes many apps to retry, which masks the problem. Eventually
      the socket is evicted from the sockmap because it is released from
      the sockmap sock free handling. The issue creates a delay and can
      cause some errors on the application side.
      
      To fix this, check on the sk_msg_recvmsg side: if the length is zero
      and the FIN flag is set, then return zero. A selftest will be added
      to check this condition.
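      A hypothetical sketch of the check (the peek helper and flag access
      are assumptions; the actual patch may differ in detail):

               /* Nothing copied: peek the next queued msg; a zero-length
                * entry whose skb carries TCPHDR_FIN means the peer shut
                * down, so report 0 (EOF) instead of -EAGAIN.
                */
               if (!copied) {
                       struct sk_msg *msg_rx = sk_psock_peek_msg(psock);

                       if (msg_rx && !msg_rx->sg.size && msg_rx->skb &&
                           TCP_SKB_CB(msg_rx->skb)->tcp_flags & TCPHDR_FIN)
                               return 0;
               }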
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: William Findlay <will@isovalent.com>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-6-john.fastabend@gmail.com
    • bpf, sockmap: Improved check for empty queue · 405df89d
      John Fastabend authored
      We noticed some rare sk_buffs were stepping past the queue when the
      system was under memory pressure. The general approach is to skip
      enqueueing sk_buffs when it's not necessary, which is the normal case
      on a system properly provisioned for the task, with no memory
      pressure and enough CPU assigned.
      
      But if we can't allocate memory, due to an ENOMEM error when
      enqueueing the sk_buff into the sockmap receive queue, we push it
      onto a delayed workqueue to retry later. When a new sk_buff is
      received we then check if that queue is empty. However, there is a
      problem with simply checking the queue length: while a sk_buff is
      being processed from the ingress queue, but is not yet on the sockmap
      msg receive queue, it's possible to also receive a sk_buff through
      the normal path. That path will check the ingress queue, see a length
      of zero, and skip ahead of the pkt being processed.
      
      Previously we used sock lock from both contexts which made the problem
      harder to hit, but not impossible.
      
      To fix this, instead of popping the skb from the queue entirely, we
      peek the skb from the queue and do the copy there. This ensures
      checks of the queue length are non-zero while the skb is being
      processed. Then, finally, when the entire skb has been copied to the
      user space queue or another socket, we pop it off the queue. This way
      the queue length check allows bypassing the queue only after the list
      has been completely processed.
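      In outline, the backlog loop then follows a peek-then-unlink pattern
      (a sketch; the send step is a hypothetical placeholder):

               while ((skb = skb_peek(&psock->ingress_skb))) {
                       if (send_one_skb(psock, skb))   /* hypothetical */
                               break;  /* retry later; skb stays queued */
                       /* Unlink only once the skb is fully consumed, so a
                        * concurrent queue-length check still sees it.
                        */
                       skb_unlink(skb, &psock->ingress_skb);
                       kfree_skb(skb);
               }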
      
      To reproduce the issue, we ran the NGINX compliance test with sockmap
      and observed some flakes in our testing that we attributed to this
      issue.
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: William Findlay <will@isovalent.com>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-5-john.fastabend@gmail.com
    • bpf, sockmap: Reschedule is now done through backlog · bce22552
      John Fastabend authored
      Now that the backlog manages the reschedule() logic correctly, we can
      drop the partial fix that rescheduled from the recvmsg hook.
      
      Rescheduling on the recvmsg hook was added to address a corner case
      where we still had data in the backlog state but had nothing to kick
      it and reschedule the backlog worker to run and finish copying data
      out of the state. This had a couple of limitations: first, it
      required user space to kick it, introducing an unnecessary EBUSY and
      retry. Second, it only handled the ingress case, and egress redirects
      would still be hung.
      
      With the correct fix, pushing the reschedule logic down to where the
      ENOMEM error occurs, we can drop this partial fix.
      
      Fixes: bec21719 ("skmsg: Schedule psock work if the cached skb exists on the psock")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-4-john.fastabend@gmail.com
    • bpf, sockmap: Convert schedule_work into delayed_work · 29173d07
      John Fastabend authored
      Sk_buffs are fed into sockmap verdict programs either from a
      strparser (when the user might want to decide how framing of the skb
      is done by attaching another parser program) or directly through
      tcp_read_sock. The tcp_read_sock path is the preferred method for
      performance when the BPF logic is itself the stream parser.
      
      The flow for Cilium's common use case with a stream parser is,
      
       tcp_read_sock()
        sk_psock_verdict_recv
          ret = bpf_prog_run_pin_on_cpu()
          sk_psock_verdict_apply(sock, skb, ret)
           // if system is under memory pressure or app is slow we may
           // need to queue skb. Do this queuing through ingress_skb and
           // then kick timer to wake up handler
           skb_queue_tail(ingress_skb, skb)
           schedule_work(work);
      
      The work queue is wired up to sk_psock_backlog(). This will then walk
      the ingress_skb list that holds our sk_buffs which could not be
      handled, but should be OK to run at some later point. However, it's
      possible that the workqueue doing this work still hits an error when
      sending the skb. When this happens, the skbuff is requeued on a
      temporary 'state' struct kept with the workqueue. This is necessary
      because it's possible to partially send an skbuff before hitting an
      error, and we need to know how and where to restart when the
      workqueue runs next.
      
      Now for the trouble: we don't re-kick the workqueue. This can cause a
      stall where the skbuff we just cached on the state variable might
      never be sent. This happens when it's the last packet in a flow and
      no further packets come along that would cause the system to kick the
      workqueue from that side.
      
      To fix this we could do a simple schedule_work(), but while under
      memory pressure it makes sense to back off some instead of continuing
      to retry repeatedly. So instead, convert schedule_work() to
      schedule_delayed_work() and add backoff logic to reschedule from the
      backlog queue on errors. It's not obvious what a good backoff is,
      though, so use '1'.
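      In outline (a sketch; the field and call sites follow the description
      above, surrounding code is assumed):

               /* setup: the psock work item becomes delayed work */
               INIT_DELAYED_WORK(&psock->work, sk_psock_backlog);

               /* error path: leave the skb cached in the 'state' struct
                * and retry with a small backoff instead of spinning on
                * ENOMEM.
                */
               schedule_delayed_work(&psock->work, 1); /* 1 jiffy backoff */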
      
      To test, we observed some flakes while running the NGINX compliance
      test with sockmap; we attributed these failed tests to this bug and
      the subsequent issue.
      
      From on-list discussion. This commit
      
       bec21719 ("skmsg: Schedule psock work if the cached skb exists on the psock")
      
      was intended to address a similar race, but it missed a couple of
      cases. Most obviously, it only accounted for receiving traffic on the
      local socket, so if we were redirecting into another socket we could
      still get an sk_buff stuck here. Next, it missed the case where
      copied=0 in the recv() handler, in which case we wouldn't kick the
      scheduler. Also, it's sub-optimal to require userspace to kick the
      internal mechanisms of sockmap to wake it up and copy data to the
      user: it results in an extra syscall and requires the app to actually
      handle the EAGAIN correctly.
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: William Findlay <will@isovalent.com>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com
    • bpf, sockmap: Pass skb ownership through read_skb · 78fa0d61
      John Fastabend authored
      The read_skb hook calls consume_skb() now, but this means that if the
      recv_actor program wants to use the skb it needs to inc the ref cnt
      so that the consume_skb() doesn't kfree the sk_buff.
      
      This is problematic because in some error cases under memory pressure
      we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
      Then we get this,
      
       skb_linearize()
         __pskb_pull_tail()
           pskb_expand_head()
             BUG_ON(skb_shared(skb))
      
      Because we incremented the users refcnt from sk_psock_verdict_recv()
      we hit the BUG_ON() with refcnt > 1 and trip it.
      
      To fix, let's simply pass ownership of the sk_buff through the
      read_skb call. Then we can drop the consume_skb() from the read_skb
      handlers and assume the verdict recv does any required kfree.
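      Schematically, the ownership contract changes like this (a sketch,
      not the literal diff):

               /* before: read_skb kept ownership, so actors that wanted
                * the skb took a reference first
                */
               skb_get(skb);                   /* refcount now > 1 */
               ret = recv_actor(sk, skb);
               consume_skb(skb);               /* shared skb trips the
                                                * BUG_ON in pskb_expand_head
                                                */

               /* after: hand the skb over outright; the actor frees or
                * queues it, so it is never shared when linearized
                */
               ret = recv_actor(sk, skb);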
      
      Bug found while testing in our CI which runs in VMs that hit memory
      constraints rather regularly. William tested TCP read_skb handlers.
      
      [  106.536188] ------------[ cut here ]------------
      [  106.536197] kernel BUG at net/core/skbuff.c:1693!
      [  106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
      [  106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
      [  106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
      [  106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
      [  106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
      [  106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
      [  106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
      [  106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
      [  106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
      [  106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
      [  106.540568] FS:  00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
      [  106.540954] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
      [  106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  106.542255] Call Trace:
      [  106.542383]  <IRQ>
      [  106.542487]  __pskb_pull_tail+0x4b/0x3e0
      [  106.542681]  skb_ensure_writable+0x85/0xa0
      [  106.542882]  sk_skb_pull_data+0x18/0x20
      [  106.543084]  bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
      [  106.543536]  ? migrate_disable+0x66/0x80
      [  106.543871]  sk_psock_verdict_recv+0xe2/0x310
      [  106.544258]  ? sk_psock_write_space+0x1f0/0x1f0
      [  106.544561]  tcp_read_skb+0x7b/0x120
      [  106.544740]  tcp_data_queue+0x904/0xee0
      [  106.544931]  tcp_rcv_established+0x212/0x7c0
      [  106.545142]  tcp_v4_do_rcv+0x174/0x2a0
      [  106.545326]  tcp_v4_rcv+0xe70/0xf60
      [  106.545500]  ip_protocol_deliver_rcu+0x48/0x290
      [  106.545744]  ip_local_deliver_finish+0xa7/0x150
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Reported-by: William Findlay <will@isovalent.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: William Findlay <will@isovalent.com>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com
  2. 22 May, 2023 1 commit
  3. 19 May, 2023 1 commit
    • bpf: Fix mask generation for 32-bit narrow loads of 64-bit fields · 0613d8ca
      Will Deacon authored
      A narrow load from a 64-bit context field results in a 64-bit load
      followed potentially by a 64-bit right-shift and then a bitwise AND
      operation to extract the relevant data.
      
      In the case of a 32-bit access, an immediate mask of 0xffffffff is
      used to construct a 64-bit BPF_AND operation, which then sign-extends
      the mask value and effectively acts as a glorified no-op. For example:
      
      0:	61 10 00 00 00 00 00 00	r0 = *(u32 *)(r1 + 0)
      
      results in the following code generation for a 64-bit field:
      
      	ldr	x7, [x7]	// 64-bit load
      	mov	x10, #0xffffffffffffffff
      	and	x7, x7, x10
      
      Fix the mask generation so that narrow loads always perform a 32-bit AND
      operation:
      
      	ldr	x7, [x7]	// 64-bit load
      	mov	w10, #0xffffffff
      	and	w7, w7, w10
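      In the verifier's context-access conversion this amounts to emitting
      a 32-bit ALU instruction for the mask (a sketch of the relevant
      instruction; the surrounding code is assumed):

       /* 32-bit AND: the 0xffffffff immediate is no longer sign-extended
        * to 64 bits, and the result is zero-extended as narrow loads
        * require.
        */
       insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg,
                                       (1ULL << size * 8) - 1);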
      
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Krzesimir Nowak <krzesimir@kinvolk.io>
      Cc: Andrey Ignatov <rdna@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Fixes: 31fd8581 ("bpf: permits narrower load from bpf program context fields")
      Signed-off-by: Will Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20230518102528.1341-1-will@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  4. 16 May, 2023 1 commit
  5. 15 May, 2023 1 commit
  6. 14 May, 2023 1 commit
  7. 13 May, 2023 10 commits
  8. 12 May, 2023 15 commits
  9. 11 May, 2023 2 commits
    • Merge tag 'net-6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 6e27831b
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from netfilter.
      
        Current release - regressions:
      
         - mtk_eth_soc: fix NULL pointer dereference
      
        Previous releases - regressions:
      
         - core:
            - skb_partial_csum_set() fix against transport header magic value
            - fix load-tearing on sk->sk_stamp in sock_recv_cmsgs().
            - annotate sk->sk_err write from do_recvmmsg()
            - add vlan_get_protocol_and_depth() helper
      
         - netlink: annotate accesses to nlk->cb_running
      
         - netfilter: always release netdev hooks from notifier
      
        Previous releases - always broken:
      
         - core: deal with most data-races in sk_wait_event()
      
         - netfilter: fix possible bug_on with enable_hooks=1
      
         - eth: bonding: fix send_peer_notif overflow
      
         - eth: xpcs: fix incorrect number of interfaces
      
         - eth: ipvlan: fix out-of-bounds caused by unclear skb->cb
      
         - eth: stmmac: Initialize MAC_ONEUS_TIC_COUNTER register"
      
      * tag 'net-6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (31 commits)
        af_unix: Fix data races around sk->sk_shutdown.
        af_unix: Fix a data race of sk->sk_receive_queue->qlen.
        net: datagram: fix data-races in datagram_poll()
        net: mscc: ocelot: fix stat counter register values
        ipvlan:Fix out-of-bounds caused by unclear skb->cb
        docs: networking: fix x25-iface.rst heading & index order
        gve: Remove the code of clearing PBA bit
        tcp: add annotations around sk->sk_shutdown accesses
        net: add vlan_get_protocol_and_depth() helper
        net: pcs: xpcs: fix incorrect number of interfaces
        net: deal with most data-races in sk_wait_event()
        net: annotate sk->sk_err write from do_recvmmsg()
        netlink: annotate accesses to nlk->cb_running
        kselftest: bonding: add num_grat_arp test
        selftests: forwarding: lib: add netns support for tc rule handle stats get
        Documentation: bonding: fix the doc of peer_notif_delay
        bonding: fix send_peer_notif overflow
        net: ethernet: mtk_eth_soc: fix NULL pointer dereference
        selftests: nft_flowtable.sh: check ingress/egress chain too
        selftests: nft_flowtable.sh: monitor result file sizes
        ...
    • Merge tag 'media/v6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · 691e1eee
      Linus Torvalds authored
      Pull media fixes from Mauro Carvalho Chehab:
      
       - fix some unused-variable warning in mtk-mdp3
      
       - ignore unused suspend operations in nxp
      
       - some driver fixes in rcar-vin
      
      * tag 'media/v6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
        media: platform: mtk-mdp3: work around unused-variable warning
        media: nxp: ignore unused suspend operations
        media: rcar-vin: Select correct interrupt mode for V4L2_FIELD_ALTERNATE
        media: rcar-vin: Fix NV12 size alignment
        media: rcar-vin: Gen3 can not scale NV12