1. 11 Aug, 2022 10 commits
      bpf: Shut up kern_sys_bpf warning. · 4e4588f1
      Alexei Starovoitov authored
      Shut up this warning:
      kernel/bpf/syscall.c:5089:5: warning: no previous prototype for function 'kern_sys_bpf' [-Wmissing-prototypes]
      int kern_sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
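
The warning fires when a non-static function is defined without an earlier declaration in scope. A minimal sketch of the usual remedy follows; `demo_sys_call` is a hypothetical function used only for illustration, and this is not the actual change made by this commit.

```c
/* Hypothetical example; not the kernel's actual fix. */

/*
 * Declaring the function before defining it satisfies
 * -Wmissing-prototypes. In real code this declaration would
 * normally live in a shared header.
 */
int demo_sys_call(int cmd, unsigned int size);

int demo_sys_call(int cmd, unsigned int size)
{
	/* Toy body: reject negative commands, otherwise echo the size. */
	if (cmd < 0)
		return -1;
	return (int)size;
}
```

Compiling with `-Wmissing-prototypes` and removing the declaration should reproduce a warning of the same kind as the one quoted above.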

      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      net/tls: Use RCU API to access tls_ctx->netdev · 94ce3b64
      Maxim Mikityanskiy authored
      Currently, tls_device_down synchronizes with tls_device_resync_rx using
      RCU, but the pointer to netdev is stored using WRITE_ONCE and loaded
      using READ_ONCE.

      Although this approach is technically correct (rcu_dereference is
      essentially a READ_ONCE, and rcu_assign_pointer uses WRITE_ONCE to store
      NULL), the dedicated RCU helpers for pointers are preferable: they
      include additional checks, and their implementation can change
      transparently to the callers.

      Mark the netdev pointer as __rcu and use the correct RCU helpers to
      access it. For non-concurrent access, pass the conditions that
      guarantee safe access (locks taken, refcount value). Also use the
      correct helper in mlx5e, where even READ_ONCE was missing.
      
      The transition to RCU exposes existing issues, fixed by this commit:
      
      1. bond_tls_device_xmit could read netdev twice, and it could become
      NULL the second time, after the NULL check passed.
      
      2. Drivers shouldn't stop processing the last packet if tls_device_down
      just set netdev to NULL, before tls_dev_del was called. This prevents a
      possible packet drop when transitioning to the fallback software mode.
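
The first issue above (reading the pointer twice) can be sketched in userspace C. This is only an illustration under stated assumptions: `demo_dev`, `demo_ctx`, and both functions are hypothetical, and C11 atomics stand in for the kernel's RCU helpers. Unlike RCU, this sketch does nothing to keep the pointed-to object alive.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for a device and a context holding a shared pointer. */
struct demo_dev {
	int id;
};

struct demo_ctx {
	_Atomic(struct demo_dev *) dev;
};

/*
 * Racy pattern: the pointer is loaded twice, so a concurrent writer
 * could set it to NULL between the check and the use.
 */
int demo_xmit_racy(struct demo_ctx *ctx)
{
	if (atomic_load(&ctx->dev) == NULL)
		return -1;
	return atomic_load(&ctx->dev)->id; /* second load may observe NULL */
}

/* Safer pattern: load once into a local and use only the local copy. */
int demo_xmit_once(struct demo_ctx *ctx)
{
	struct demo_dev *dev = atomic_load(&ctx->dev);

	if (dev == NULL)
		return -1;
	return dev->id;
}
```

With a single thread both functions behave the same; the difference only matters when another thread clears the pointer between the two loads in `demo_xmit_racy`.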
      
      Fixes: 89df6a81 ("net/bonding: Implement TLS TX device offload")
      Fixes: c55dcdd4 ("net/tls: Fix use-after-free after the TLS device goes down and up")
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Link: https://lore.kernel.org/r/20220810081602.1435800-1-maximmi@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      tls: rx: device: don't try to copy too much on detach · d800a7b3
      Jakub Kicinski authored
      Another device offload bug: we use the length of the output
      skb as an indication of how much data to copy. But that skb
      is sized to offset + record length, and we start from offset,
      so we end up double-counting the offset, which leads to
      skb_copy_bits() returning -EFAULT.
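
The arithmetic can be illustrated with a small userspace sketch. Everything here is hypothetical: `demo_buf` is a flat buffer standing in for an skb, and `demo_copy_bits` only mimics the bounds check of a copy routine; it is not the kernel's `skb_copy_bits()`.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical flat buffer standing in for an skb. */
struct demo_buf {
	const unsigned char *data;
	int len; /* total length, i.e. offset + record length */
};

/* Copy len bytes starting at offset; fail if the range leaves the buffer. */
int demo_copy_bits(const struct demo_buf *b, int offset, void *to, int len)
{
	if (offset < 0 || len < 0 || offset > b->len - len)
		return -EFAULT;
	memcpy(to, b->data + offset, (size_t)len);
	return 0;
}

/* Buggy length: the whole buffer length, which already includes the offset. */
int demo_copy_len_buggy(const struct demo_buf *b, int offset)
{
	(void)offset;
	return b->len;
}

/* Corrected length: only the bytes that follow the offset. */
int demo_copy_len_fixed(const struct demo_buf *b, int offset)
{
	return b->len - offset;
}
```

For a 16-byte buffer and an offset of 4, the buggy length asks for bytes 4 through 19 and the copy is rejected, while the corrected length asks for the 12 bytes that actually exist.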

      Reported-by: Tariq Toukan <tariqt@nvidia.com>
      Fixes: 84c61fe1 ("tls: rx: do not use the standard strparser")
      Tested-by: Ran Rozenstein <ranro@nvidia.com>
      Link: https://lore.kernel.org/r/20220809175544.354343-2-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      tls: rx: device: bound the frag walk · 86b259f6
      Jakub Kicinski authored
      We can't do skb_walk_frags() on the input skb, because
      the input skb is really just a pointer to the TCP read
      queue. We need to bound the "is decrypted" check by the
      amount of data in the message.

      Note that the walk in tls_device_reencrypt() happens after a
      CoW, so the skb there is safe to walk. In the current
      implementation it can't have frags at all, but maybe one
      day it will.
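
The idea of bounding a walk by the message length, rather than following a chain to its end, can be sketched as follows. The `demo_seg` type and `demo_all_decrypted` are hypothetical and far simpler than the real skb handling; they only show why an unbounded walk over a shared queue would look at data belonging to later messages.

```c
#include <stddef.h>

/* Hypothetical segment in a queue shared by several messages. */
struct demo_seg {
	int len;
	int decrypted; /* 1 if this segment was decrypted, 0 otherwise */
	struct demo_seg *next;
};

/*
 * Return 1 if every segment covering the first msg_len bytes is
 * decrypted. The walk stops once msg_len bytes have been covered,
 * so segments of later messages are never inspected.
 */
int demo_all_decrypted(const struct demo_seg *seg, int msg_len)
{
	int all = 1;

	while (seg != NULL && msg_len > 0) {
		all &= seg->decrypted;
		msg_len -= seg->len;
		seg = seg->next;
	}
	return all;
}
```

With three 10-byte segments where only the third is still encrypted, a 20-byte message is reported as fully decrypted, whereas a walk over all 30 bytes is not.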

      Reported-by: Tariq Toukan <tariqt@nvidia.com>
      Fixes: 84c61fe1 ("tls: rx: do not use the standard strparser")
      Tested-by: Ran Rozenstein <ranro@nvidia.com>
      Link: https://lore.kernel.org/r/20220809175544.354343-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      net_sched: cls_route: remove from list when handle is 0 · 9ad36309
      Thadeu Lima de Souza Cascardo authored
      When a route filter is replaced and the old filter has a 0 handle, the old
      one won't be removed from the hashtable, while it will still be freed.
      
      The test has been there since before commit 1109c005 ("net: sched: RCU
      cls_route"), when a new filter was not allocated if there was an old one.
      The old filter was reused, and reinserting was only necessary if an
      old filter was replaced. That was still wrong for the same case where the
      old handle was 0.

      Remove the old filter from the list regardless of its handle value.
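
The shape of the bug can be sketched with a singly linked bucket. All names here (`demo_filter`, `demo_unlink`, `demo_replace_buggy`, `demo_replace_fixed`) are hypothetical, and the code is a simplified illustration rather than the cls_route implementation.

```c
#include <stddef.h>

/* Hypothetical filter entry in a singly linked hash bucket. */
struct demo_filter {
	unsigned int handle;
	struct demo_filter *next;
};

/* Unlink old from the bucket; return 1 if it was found. */
int demo_unlink(struct demo_filter **head, struct demo_filter *old)
{
	struct demo_filter **pp;

	for (pp = head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == old) {
			*pp = old->next;
			old->next = NULL;
			return 1;
		}
	}
	return 0;
}

/* Buggy replace path: skips the unlink when the old handle is 0. */
int demo_replace_buggy(struct demo_filter **head, struct demo_filter *old)
{
	if (old->handle != 0)
		return demo_unlink(head, old);
	return 0; /* stale entry stays linked even if the caller frees it */
}

/* Fixed replace path: unlink regardless of the handle value. */
int demo_replace_fixed(struct demo_filter **head, struct demo_filter *old)
{
	return demo_unlink(head, old);
}
```

In the buggy variant a filter with handle 0 remains reachable from the bucket after the caller frees it, which is the dangling reference the commit message describes.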
      
      This fixes CVE-2022-2588, also reported as ZDI-CAN-17440.

      Reported-by: Zhenpeng Lin <zplin@u.northwestern.edu>
      Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
      Reviewed-by: Kamal Mostafa <kamal@canonical.com>
      Cc: <stable@vger.kernel.org>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Link: https://lore.kernel.org/r/20220809170518.164662-1-cascardo@canonical.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      selftests: forwarding: Fix failing tests with old libnet · 8bcfb4ae
      Ido Schimmel authored
      The custom multipath hash tests use mausezahn in order to test how
      changes in various packet fields affect the packet distribution across
      the available nexthops.
      
      The tool uses the libnet library for various low-level packet
      construction and injection. The library started using the
      "SO_BINDTODEVICE" socket option for IPv6 sockets in version 1.1.6 and
      for IPv4 sockets in version 1.2.
      
      When the option is not set, packets are not routed according to the
      table associated with the VRF master device and tests fail.
      
      Fix this by prefixing the command with "ip vrf exec", which will cause
      the route lookup to occur in the VRF routing table. This makes the tests
      pass regardless of the libnet library version.
      
      Fixes: 511e8db5 ("selftests: forwarding: Add test for custom multipath hash")
      Fixes: 185b0c19 ("selftests: forwarding: Add test for custom multipath hash with IPv4 GRE")
      Fixes: b7715acb ("selftests: forwarding: Add test for custom multipath hash with IPv6 GRE")
      Reported-by: Ivan Vecera <ivecera@redhat.com>
      Tested-by: Ivan Vecera <ivecera@redhat.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Amit Cohen <amcohen@nvidia.com>
      Link: https://lore.kernel.org/r/20220809113320.751413-1-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · fbe8870f
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      bpf 2022-08-10
      
      We've added 23 non-merge commits during the last 7 day(s) which contain
      a total of 19 files changed, 424 insertions(+), 35 deletions(-).
      
      The main changes are:
      
      1) Several fixes for BPF map iterator such as UAFs along with selftests, from Hou Tao.
      
      2) Fix BPF syscall program's {copy,strncpy}_from_bpfptr() to not fault, from Jinghao Jia.
      
      3) Reject BPF syscall programs calling BPF_PROG_RUN, from Alexei Starovoitov and YiFei Zhu.
      
      4) Fix attach_btf_obj_id info to pick proper target BTF, from Stanislav Fomichev.
      
      5) BPF design Q/A doc update to clarify what is not stable ABI, from Paul E. McKenney.
      
      6) Fix BPF map's prealloc_lru_pop to not reinitialize, from Kumar Kartikeya Dwivedi.
      
      7) Fix bpf_trampoline_put to avoid leaking ftrace hash, from Jiri Olsa.
      
      8) Fix arm64 JIT to address sparse errors around BPF trampoline, from Xu Kuohai.
      
      9) Fix arm64 JIT to use kvcalloc instead of kcalloc for internal program address
         offset buffer, from Aijun Sun.
      
      * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (23 commits)
        selftests/bpf: Ensure sleepable program is rejected by hash map iter
        selftests/bpf: Add write tests for sk local storage map iterator
        selftests/bpf: Add tests for reading a dangling map iter fd
        bpf: Only allow sleepable program for resched-able iterator
        bpf: Check the validity of max_rdwr_access for sock local storage map iterator
        bpf: Acquire map uref in .init_seq_private for sock{map,hash} iterator
        bpf: Acquire map uref in .init_seq_private for sock local storage map iterator
        bpf: Acquire map uref in .init_seq_private for hash map iterator
        bpf: Acquire map uref in .init_seq_private for array map iterator
        bpf: Disallow bpf programs call prog_run command.
        bpf, arm64: Fix bpf trampoline instruction endianness
        selftests/bpf: Add test for prealloc_lru_pop bug
        bpf: Don't reinit map value in prealloc_lru_pop
        bpf: Allow calling bpf_prog_test kfuncs in tracing programs
        bpf, arm64: Allocate program buffer using kvcalloc instead of kcalloc
        selftests/bpf: Excercise bpf_obj_get_info_by_fd for bpf2bpf
        bpf: Use proper target btf when exporting attach_btf_obj_id
        mptcp, btf: Add struct mptcp_sock definition when CONFIG_MPTCP is disabled
        bpf: Cleanup ftrace hash in bpf_trampoline_put
        BPF: Fix potential bad pointer dereference in bpf_sys_bpf()
        ...
      ====================
      
      Link: https://lore.kernel.org/r/20220810190624.10748-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Merge branch 'net-enhancements-to-sk_user_data-field' · dd48f383
      Jakub Kicinski authored
      Hawkins Jiawei says:
      
      ====================
      net: enhancements to sk_user_data field
      
      This patchset fixes a refcount bug by adding a SK_USER_DATA_PSOCK flag
      bit to the sk_user_data field. The bug causes the following warning:
      
      WARNING: CPU: 1 PID: 3605 at lib/refcount.c:19 refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19
      Modules linked in:
      CPU: 1 PID: 3605 Comm: syz-executor208 Not tainted 5.18.0-syzkaller-03023-g7e062cda #0
       <TASK>
       __refcount_add_not_zero include/linux/refcount.h:163 [inline]
       __refcount_inc_not_zero include/linux/refcount.h:227 [inline]
       refcount_inc_not_zero include/linux/refcount.h:245 [inline]
       sk_psock_get+0x3bc/0x410 include/linux/skmsg.h:439
       tls_data_ready+0x6d/0x1b0 net/tls/tls_sw.c:2091
       tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4983
       tcp_data_queue+0x25f2/0x4c90 net/ipv4/tcp_input.c:5057
       tcp_rcv_state_process+0x1774/0x4e80 net/ipv4/tcp_input.c:6659
       tcp_v4_do_rcv+0x339/0x980 net/ipv4/tcp_ipv4.c:1682
       sk_backlog_rcv include/net/sock.h:1061 [inline]
       __release_sock+0x134/0x3b0 net/core/sock.c:2849
       release_sock+0x54/0x1b0 net/core/sock.c:3404
       inet_shutdown+0x1e0/0x430 net/ipv4/af_inet.c:909
       __sys_shutdown_sock net/socket.c:2331 [inline]
       __sys_shutdown_sock net/socket.c:2325 [inline]
       __sys_shutdown+0xf1/0x1b0 net/socket.c:2343
       __do_sys_shutdown net/socket.c:2351 [inline]
       __se_sys_shutdown net/socket.c:2349 [inline]
       __x64_sys_shutdown+0x50/0x70 net/socket.c:2349
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
       </TASK>
      
      To improve code maintainability, this patchset refactors sk_user_data
      flags code to be more generic.
      ====================
      
      Link: https://lore.kernel.org/r/cover.1659676823.git.yin31149@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      net: refactor bpf_sk_reuseport_detach() · cf8c1e96
      Hawkins Jiawei authored
      Refactor the sk_user_data dereference to use the more generic function
      __rcu_dereference_sk_user_data_with_flags(), which improves its
      maintainability.

      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      net: fix refcount bug in sk_psock_get (2) · 2a013372
      Hawkins Jiawei authored
      Syzkaller reports a refcount bug as follows:
      ------------[ cut here ]------------
      refcount_t: saturated; leaking memory.
      WARNING: CPU: 1 PID: 3605 at lib/refcount.c:19 refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19
      Modules linked in:
      CPU: 1 PID: 3605 Comm: syz-executor208 Not tainted 5.18.0-syzkaller-03023-g7e062cda #0
       <TASK>
       __refcount_add_not_zero include/linux/refcount.h:163 [inline]
       __refcount_inc_not_zero include/linux/refcount.h:227 [inline]
       refcount_inc_not_zero include/linux/refcount.h:245 [inline]
       sk_psock_get+0x3bc/0x410 include/linux/skmsg.h:439
       tls_data_ready+0x6d/0x1b0 net/tls/tls_sw.c:2091
       tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4983
       tcp_data_queue+0x25f2/0x4c90 net/ipv4/tcp_input.c:5057
       tcp_rcv_state_process+0x1774/0x4e80 net/ipv4/tcp_input.c:6659
       tcp_v4_do_rcv+0x339/0x980 net/ipv4/tcp_ipv4.c:1682
       sk_backlog_rcv include/net/sock.h:1061 [inline]
       __release_sock+0x134/0x3b0 net/core/sock.c:2849
       release_sock+0x54/0x1b0 net/core/sock.c:3404
       inet_shutdown+0x1e0/0x430 net/ipv4/af_inet.c:909
       __sys_shutdown_sock net/socket.c:2331 [inline]
       __sys_shutdown_sock net/socket.c:2325 [inline]
       __sys_shutdown+0xf1/0x1b0 net/socket.c:2343
       __do_sys_shutdown net/socket.c:2351 [inline]
       __se_sys_shutdown net/socket.c:2349 [inline]
       __x64_sys_shutdown+0x50/0x70 net/socket.c:2349
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
       </TASK>
      
      During the SMC fallback process in the connect syscall, the kernel
      replaces TCP with SMC. To forward wakeups to the smc socket
      waitqueue after fallback, the kernel sets
      clcsk->sk_user_data to the original smc socket in
      smc_fback_replace_callbacks().

      Later, in the shutdown syscall, the kernel calls
      sk_psock_get(), which treats clcsk->sk_user_data
      as a psock, triggering the refcount warning.

      So the root cause is that smc and psock both use the
      sk_user_data field, and each can easily misinterpret
      the value stored by the other.

      This patch solves it by using another bit (defined as
      SK_USER_DATA_PSOCK) in PTRMASK to mark whether
      sk_user_data points to a psock object or not.
      This patch depends on the PTRMASK introduced in commit f1ff5ce2
      ("net, sk_msg: Clear sk_user_data pointer on clone if tagged").

      Because more flags may be added to the sk_user_data field,
      this patch also refactors the sk_user_data flags code to be more
      generic, improving its maintainability.
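
The flag-bit idea can be sketched in userspace C. This is an illustration only: the `DEMO_*` macros and both functions are hypothetical names, the flag values are chosen for the example, and the sketch assumes pointers aligned to at least 8 bytes so the three low bits are free. It is not the kernel's sk_user_data code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits kept in the low bits of an 8-byte-aligned pointer. */
#define DEMO_FLAG_NOCOPY ((uintptr_t)1)
#define DEMO_FLAG_BPF    ((uintptr_t)2)
#define DEMO_FLAG_PSOCK  ((uintptr_t)4)
#define DEMO_FLAG_MASK   ((uintptr_t)7)
#define DEMO_PTRMASK     (~DEMO_FLAG_MASK)

/* Combine a pointer with flag bits; the pointer's low bits must be clear. */
uintptr_t demo_tag(void *ptr, uintptr_t flags)
{
	uintptr_t raw = (uintptr_t)ptr;

	assert((raw & DEMO_FLAG_MASK) == 0);
	assert((flags & DEMO_PTRMASK) == 0);
	return raw | flags;
}

/*
 * Return the untagged pointer only if all requested flags are set,
 * otherwise NULL. A caller expecting a psock would pass DEMO_FLAG_PSOCK
 * and so never misread a value stored by a different user of the field.
 */
void *demo_deref_with_flags(uintptr_t tagged, uintptr_t flags)
{
	if ((tagged & flags) == flags)
		return (void *)(tagged & DEMO_PTRMASK);
	return NULL;
}
```

A value stored without `DEMO_FLAG_PSOCK` yields NULL when dereferenced with that flag, which mirrors how the tag prevents one subsystem from treating another's pointer as its own type.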
      
      Reported-and-tested-by: syzbot+5f26f85569bd179c18ce@syzkaller.appspotmail.com
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 10 Aug, 2022 30 commits