1. 22 Jun, 2023 12 commits
    • revert "net: align SO_RCVMARK required privileges with SO_MARK" · a9628e88
      Maciej Żenczykowski authored
      This reverts commit 1f86123b ("net: align SO_RCVMARK required
      privileges with SO_MARK") because the reasoning in the commit message
      is not really correct:
        SO_RCVMARK is used for 'reading' incoming skb mark (via cmsg), as such
        it is more equivalent to 'getsockopt(SO_MARK)' which has no priv check
        and retrieves the socket mark, rather than 'setsockopt(SO_MARK)' which
        sets the socket mark and does require privs.
      
        Additionally incoming skb->mark may already be visible if
        sysctl_fwmark_reflect and/or sysctl_tcp_fwmark_accept are enabled.
      
        Furthermore, it is easier to block the getsockopt via bpf
        (either cgroup setsockopt hook, or via syscall filters)
        than to unblock it if it requires CAP_NET_RAW/ADMIN.
      
      On Android the socket mark is (among other things) used to store
      the network identifier a socket is bound to.  Setting it is privileged,
      but retrieving it is not.  We'd like unprivileged userspace to be able
      to read the network id of incoming packets (where mark is set via
      iptables [to be moved to bpf])...
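The unprivileged read path described above can be sketched from userspace as follows. This is a minimal illustration, not code from the commit: the function names `enable_rcvmark` and `find_rcvmark` are hypothetical, the fallback values 75 and 36 are the asm-generic definitions (some architectures use different numbers), and it assumes the kernel delivers the mark as a `SOL_SOCKET`/`SO_MARK` control message carrying a 32-bit value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SO_MARK
#define SO_MARK 36      /* asm-generic value, assumed if headers lack it */
#endif
#ifndef SO_RCVMARK
#define SO_RCVMARK 75   /* asm-generic value, assumed if headers lack it */
#endif

/* Ask the kernel to attach the skb mark to every received message. */
int enable_rcvmark(int fd)
{
    int one = 1;

    return setsockopt(fd, SOL_SOCKET, SO_RCVMARK, &one, sizeof(one));
}

/*
 * Walk the control messages filled in by recvmsg() and extract the mark.
 * Returns 1 and stores the mark if one is present, 0 otherwise.
 */
int find_rcvmark(struct msghdr *msg, uint32_t *mark)
{
    struct cmsghdr *cmsg;

    assert(msg != NULL && mark != NULL);
    for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SO_MARK &&
            cmsg->cmsg_len >= CMSG_LEN(sizeof(*mark))) {
            memcpy(mark, CMSG_DATA(cmsg), sizeof(*mark));
            return 1;
        }
    }
    return 0;
}
```

A caller would invoke `enable_rcvmark()` once on the socket, then pass each `msghdr` returned by `recvmsg()` to `find_rcvmark()`.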
      
      An alternative would be to add another sysctl to control whether
      setting SO_RCVMARK is privileged or not.
      (or even a MASK of which bits in the mark can be exposed)
      But this seems like over-engineering...
      
      Note: This is a non-trivial revert, due to later merged commit e42c7bee
      ("bpf: net: Consider has_current_bpf_ctx() when testing capable() in sk_setsockopt()")
      which changed both 'ns_capable' into 'sockopt_ns_capable' calls.
      
      Fixes: 1f86123b ("net: align SO_RCVMARK required privileges with SO_MARK")
      Cc: Larysa Zaremba <larysa.zaremba@intel.com>
      Cc: Simon Horman <simon.horman@corigine.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Eyal Birger <eyal.birger@gmail.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Patrick Rohr <prohr@google.com>
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20230618103130.51628-1-maze@google.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: wwan: iosm: Convert single instance struct member to flexible array · dec24b3b
      Kees Cook authored
      struct mux_adth actually ends with multiple struct mux_adth_dg members.
      This is seen both in the comments about the member:
      
      /**
       * struct mux_adth - Structure of the Aggregated Datagram Table Header.
       ...
       * @dg:		datagramm table with variable length
       */
      
      and in the preparation for populating it:
      
                              adth_dg_size = offsetof(struct mux_adth, dg) +
                                              ul_adb->dg_count[i] * sizeof(*dg);
      			...
                              adth_dg_size -= offsetof(struct mux_adth, dg);
                              memcpy(&adth->dg, ul_adb->dg[i], adth_dg_size);
      
      This was reported as a run-time false positive warning:
      
      memcpy: detected field-spanning write (size 16) of single field "&adth->dg" at drivers/net/wwan/iosm/iosm_ipc_mux_codec.c:852 (size 8)
      
      Adjust the struct mux_adth definition and associated sizeof() math; no binary
      output differences are observed in the resulting object file.
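The conversion can be illustrated with a self-contained sketch. The struct and field names below are hypothetical stand-ins (the real `struct mux_adth` layout is not shown in the message); the only properties taken from the text are that the header ends in a variable number of 8-byte entries and that the size math is built on `offsetof()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical 8-byte table entry, standing in for struct mux_adth_dg. */
struct entry {
    uint32_t index;
    uint16_t length;
    uint8_t  service_class;
    uint8_t  reserved;
};

/* Old style: a single declared entry, with more entries written past it. */
struct table_old {
    uint32_t signature;
    uint16_t table_length;
    uint16_t reserved;
    struct entry dg;
};

/* New style: the trailing entries are a flexible array member. */
struct table_new {
    uint32_t signature;
    uint16_t table_length;
    uint16_t reserved;
    struct entry dg[];
};

/* Bytes needed for the header followed by count entries. */
size_t table_bytes(size_t count)
{
    return offsetof(struct table_new, dg) + count * sizeof(struct entry);
}

/* Allocate a table and copy count entries into the flexible array. */
struct table_new *table_build(const struct entry *src, size_t count)
{
    struct table_new *t;

    assert(count == 0 || src != NULL);
    t = calloc(1, table_bytes(count));
    if (!t)
        return NULL;
    t->table_length = (uint16_t)table_bytes(count);
    if (count)
        memcpy(t->dg, src, count * sizeof(struct entry));
    return t;
}
```

Because `dg` sits at the same offset in both layouts, the `offsetof()`-based size is unchanged, which is consistent with the commit's observation that the object file does not differ. The copy now targets an array of unknown length rather than one fixed-size field.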
      Reported-by: Florian Klink <flokli@flokli.de>
      Closes: https://lore.kernel.org/lkml/dbfa25f5-64c8-5574-4f5d-0151ba95d232@gmail.com/
      Fixes: 1f52d7b6 ("net: wwan: iosm: Enable M.2 7360 WWAN card support")
      Cc: M Chetan Kumar <m.chetan.kumar@intel.com>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Cc: Intel Corporation <linuxwwan@intel.com>
      Cc: Loic Poulain <loic.poulain@linaro.org>
      Cc: Sergey Ryazanov <ryazanov.s.a@gmail.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
      Cc: netdev@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/20230620194234.never.023-kees@kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • sch_netem: acquire qdisc lock in netem_change() · 2174a08d
      Eric Dumazet authored
      syzbot managed to trigger a divide error [1] in netem.
      
      It could happen if q->rate changes while netem_enqueue()
      is running, since q->rate is read twice.
      
      It turns out netem_change() always lacked proper synchronization.
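The hazard is a check-then-divide on a value that another thread can zero in between the two reads. A userspace sketch of the safe pattern follows, assuming a pthread mutex as a stand-in for the qdisc lock; `struct shaper` and both function names are hypothetical and only mirror the roles of `netem_change()` and the enqueue path.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical stand-in for the qdisc private data. */
struct shaper {
    pthread_mutex_t lock;  /* plays the role of the qdisc lock */
    uint64_t rate;         /* bytes per second, 0 means no rate limit */
};

/* Configuration path: change the rate only while holding the lock. */
void shaper_change(struct shaper *s, uint64_t new_rate)
{
    assert(s != NULL);
    pthread_mutex_lock(&s->lock);
    s->rate = new_rate;
    pthread_mutex_unlock(&s->lock);
}

/* Packet path: the zero check and the division use the same value. */
uint64_t shaper_packet_time_ns(struct shaper *s, uint64_t len)
{
    uint64_t rate, delay = 0;

    assert(s != NULL);
    pthread_mutex_lock(&s->lock);
    rate = s->rate;                 /* read once, under the lock */
    if (rate)
        delay = len * NSEC_PER_SEC / rate;
    pthread_mutex_unlock(&s->lock);
    return delay;
}
```

With both paths serialized on the same lock, the rate cannot change between the test and the division.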
      
      [1]
      divide error: 0000 [#1] SMP KASAN
      CPU: 1 PID: 7867 Comm: syz-executor.1 Not tainted 6.1.30-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023
      RIP: 0010:div64_u64 include/linux/math64.h:69 [inline]
      RIP: 0010:packet_time_ns net/sched/sch_netem.c:357 [inline]
      RIP: 0010:netem_enqueue+0x2067/0x36d0 net/sched/sch_netem.c:576
      Code: 89 e2 48 69 da 00 ca 9a 3b 42 80 3c 28 00 4c 8b a4 24 88 00 00 00 74 0d 4c 89 e7 e8 c3 4f 3b fd 48 8b 4c 24 18 48 89 d8 31 d2 <49> f7 34 24 49 01 c7 4c 8b 64 24 48 4d 01 f7 4c 89 e3 48 c1 eb 03
      RSP: 0018:ffffc9000dccea60 EFLAGS: 00010246
      RAX: 000001a442624200 RBX: 000001a442624200 RCX: ffff888108a4f000
      RDX: 0000000000000000 RSI: 000000000000070d RDI: 000000000000070d
      RBP: ffffc9000dcceb90 R08: ffffffff849c5e26 R09: fffffbfff10e1297
      R10: 0000000000000000 R11: dffffc0000000001 R12: ffff888108a4f358
      R13: dffffc0000000000 R14: 0000001a8cd9a7ec R15: 0000000000000000
      FS: 00007fa73fe18700(0000) GS:ffff8881f6b00000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fa73fdf7718 CR3: 000000011d36e000 CR4: 0000000000350ee0
      Call Trace:
      <TASK>
      [<ffffffff84714385>] __dev_xmit_skb net/core/dev.c:3931 [inline]
      [<ffffffff84714385>] __dev_queue_xmit+0xcf5/0x3370 net/core/dev.c:4290
      [<ffffffff84d22df2>] dev_queue_xmit include/linux/netdevice.h:3030 [inline]
      [<ffffffff84d22df2>] neigh_hh_output include/net/neighbour.h:531 [inline]
      [<ffffffff84d22df2>] neigh_output include/net/neighbour.h:545 [inline]
      [<ffffffff84d22df2>] ip_finish_output2+0xb92/0x10d0 net/ipv4/ip_output.c:235
      [<ffffffff84d21e63>] __ip_finish_output+0xc3/0x2b0
      [<ffffffff84d10a81>] ip_finish_output+0x31/0x2a0 net/ipv4/ip_output.c:323
      [<ffffffff84d10f14>] NF_HOOK_COND include/linux/netfilter.h:298 [inline]
      [<ffffffff84d10f14>] ip_output+0x224/0x2a0 net/ipv4/ip_output.c:437
      [<ffffffff84d123b5>] dst_output include/net/dst.h:444 [inline]
      [<ffffffff84d123b5>] ip_local_out net/ipv4/ip_output.c:127 [inline]
      [<ffffffff84d123b5>] __ip_queue_xmit+0x1425/0x2000 net/ipv4/ip_output.c:542
      [<ffffffff84d12fdc>] ip_queue_xmit+0x4c/0x70 net/ipv4/ip_output.c:556
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/20230620184425.1179809-1-edumazet@google.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • selftests: forwarding: Fix race condition in mirror installation · c7c059fb
      Danielle Ratson authored
      When mirroring to a gretap in hardware the device expects to be
      programmed with the egress port and all the encapsulating headers. This
      requires the driver to resolve the path the packet will take in the
      software data path and program the device accordingly.
      
      If the path cannot be resolved (in this case because of an unresolved
      neighbor), then mirror installation fails until the path is resolved.
      This results in a race that causes the test to sometimes fail.
      
      Fix this by setting the neighbor's state to permanent in a couple of
      tests, so that it is always valid.
      
      Fixes: 35c31d5c ("selftests: forwarding: Test mirror-to-gretap w/ UL 802.1d")
      Fixes: 239e754a ("selftests: forwarding: Test mirror-to-gretap w/ UL 802.1q")
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Link: https://lore.kernel.org/r/268816ac729cb6028c7a34d4dda6f4ec7af55333.1687264607.git.petrm@nvidia.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • wifi: mac80211: report all unusable beacon frames · 7f4e0970
      Benjamin Berg authored
      Properly check for RX_DROP_UNUSABLE now that the new drop reason
      infrastructure is used. Without this change, the comparison will always
      be false as a more specific reason is given in the lower bits of result.
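Why the equality test stops working can be shown with a small sketch. The bit layout and constants here are hypothetical (a class in the high 16 bits, a specific reason in the low 16 bits) and are not the actual mac80211 definitions; the point is only that a class check has to ignore the detail bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: class in the high 16 bits, detail in the low 16. */
#define DROP_CLASS_MASK      0xffff0000u
#define DROP_CLASS_MONITOR   0x00010000u
#define DROP_CLASS_UNUSABLE  0x00020000u

/* Compose a result from a drop class and a specific reason code. */
uint32_t drop_result(uint32_t drop_class, uint16_t detail)
{
    assert((drop_class & ~DROP_CLASS_MASK) == 0);
    return drop_class | detail;
}

/* Broken check: true only when the detail bits happen to be zero. */
bool is_unusable_exact(uint32_t result)
{
    return result == DROP_CLASS_UNUSABLE;
}

/* Working check: mask off the detail bits before comparing. */
bool is_unusable_masked(uint32_t result)
{
    return (result & DROP_CLASS_MASK) == DROP_CLASS_UNUSABLE;
}
```

For a result built with a non-zero detail code, `is_unusable_exact()` is false while `is_unusable_masked()` is true.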
      
      Fixes: baa951a1 ("mac80211: use the new drop reasons infrastructure")
      Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Link: https://lore.kernel.org/r/20230621120543.412920-2-johannes@sipsolutions.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'mptcp-fixes-for-6-4' · 533aa0ba
      Jakub Kicinski authored
      Matthieu Baerts says:
      
      ====================
      mptcp: fixes for 6.4
      
      Patch 1 correctly handles disconnect() failures that can happen in some
      specific cases: now the socket state is set as unconnected as expected.
      That fixes an issue introduced in v6.2.
      
      Patch 2 fixes a divide by zero bug in mptcp_recvmsg() with a fix similar
      to a recent one from Eric Dumazet for TCP introducing sk_wait_pending
      flag. It should address an issue present in MPTCP from almost the
      beginning, from v5.9.
      
      Patch 3 fixes a possible list corruption on passive MPJ. Even if the race
      seems very unlikely, better be safe than sorry. The possible issue is
      present from v5.17.
      
      Patch 4 consolidates fallback and non fallback state machines to avoid
      leaking some MPTCP sockets. The fix is likely needed for versions from
      v5.11.
      
      Patch 5 drops code that is no longer used after the introduction of
      patch 4/6. This is not really a fix but this patch can probably land in
      the -net tree as well not to leave unused code.
      
      Patch 6 ensures listeners are unhashed before updating their sk status
      to avoid possible deadlocks when diag info is retrieved
      under a lock. Even if it should not be visible with the way we are
      currently getting diag info, the issue is present from v5.17.
      ====================
      
      Link: https://lore.kernel.org/r/20230620-upstream-net-20230620-misc-fixes-for-v6-4-v1-0-f36aa5eae8b9@tessares.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: ensure listener is unhashed before updating the sk status · 57fc0f1c
      Paolo Abeni authored
      The MPTCP protocol accesses the listener subflow in a lockless
      manner in a couple of places (poll, diag). That works only if
      the msk itself leaves the listener status only after the
      subflow itself has been closed/disconnected. Otherwise we risk
      deadlock in diag, as reported by Christoph.

      Address the issue by ensuring that the first subflow (the listener
      one) is always disconnected before updating the msk socket status.
      Reported-by: Christoph Paasch <cpaasch@apple.com>
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/407
      Fixes: b29fcfb5 ("mptcp: full disconnect implementation")
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: drop legacy code around RX EOF · b7535cfe
      Paolo Abeni authored
      Thanks to the previous patch -- "mptcp: consolidate fallback and non
      fallback state machine" -- we can finally drop the "temporary hack"
      used to detect rx eof.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <martineau@kernel.org>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: consolidate fallback and non fallback state machine · 81c1d029
      Paolo Abeni authored
      An orphaned msk releases the used resources via the worker,
      when the latter first sees the msk in CLOSED status.
      
      If the msk status transitions to TCP_CLOSE in the release callback
      invoked by the worker's final release_sock(), such instance of the
      workqueue will not take any action.
      
      Additionally the MPTCP code prevents scheduling the worker once the
      socket reaches the CLOSE status: such msk resources will be leaked.
      
      The only code path that can trigger the above scenario is the
      __mptcp_check_send_data_fin() in fallback mode.
      
      Address the issue by removing the special handling of fallback sockets
      in __mptcp_check_send_data_fin(), consolidating the state machine
      for fallback and non-fallback sockets.
      
      Since fallback sockets do not send and do not receive data_fin,
      the mptcp code can update the msk internal status to match the next
      step in the SM every time data fin (ack) should be generated or
      received.
      
      As a consequence we can remove a bunch of checks for fallback from
      the fastpath.
      
      Fixes: 6e628cd3 ("mptcp: use mptcp release_cb for delayed tasks")
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <martineau@kernel.org>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: fix possible list corruption on passive MPJ · 56a666c4
      Paolo Abeni authored
      At passive MPJ time, if the msk socket lock is held by the user,
      the new subflow is appended to the msk->join_list under the msk
      data lock.
      
      In mptcp_release_cb()/__mptcp_flush_join_list(), the subflows in
      that list are moved from the join_list into the conn_list under the
      msk socket lock.
      
      Append and removal could race, possibly corrupting such list.
      Address the issue by splicing the join list into a temporary one while
      still under the msk data lock.

      Found by code inspection; the race itself should be almost impossible
      to trigger in practice.
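The splice-then-process pattern can be sketched in userspace as below. Everything here is a hypothetical stand-in: a pthread mutex plays the msk data lock, a singly linked list replaces the kernel list helpers, and the names `owner_add_pending` and `owner_flush_pending` are invented for the illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct node {
    struct node *next;
    int id;
};

/* Hypothetical stand-in for the msk: a pending list guarded by a lock. */
struct owner {
    pthread_mutex_t data_lock;
    struct node *join_head;   /* filled by the producer */
    struct node *conn_head;   /* touched only by the consumer */
};

/* Producer: add a new element while holding the data lock. */
void owner_add_pending(struct owner *o, struct node *n)
{
    assert(o != NULL && n != NULL);
    pthread_mutex_lock(&o->data_lock);
    n->next = o->join_head;
    o->join_head = n;
    pthread_mutex_unlock(&o->data_lock);
}

/*
 * Consumer: detach the whole pending list while holding the data lock,
 * then move the elements from the private copy without it.
 * Returns the number of elements moved.
 */
int owner_flush_pending(struct owner *o)
{
    struct node *tmp, *n;
    int moved = 0;

    assert(o != NULL);
    pthread_mutex_lock(&o->data_lock);
    tmp = o->join_head;
    o->join_head = NULL;
    pthread_mutex_unlock(&o->data_lock);

    while ((n = tmp) != NULL) {
        tmp = n->next;
        n->next = o->conn_head;
        o->conn_head = n;
        moved++;
    }
    return moved;
}
```

Since the shared list is only ever modified under the data lock, a concurrent add can no longer interleave with the removal.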
      
      Fixes: 3e501490 ("mptcp: cleanup MPJ subflow list handling")
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: fix possible divide by zero in recvmsg() · 0ad529d9
      Paolo Abeni authored
      Christoph reported a divide by zero bug in mptcp_recvmsg():
      
      divide error: 0000 [#1] PREEMPT SMP
      CPU: 1 PID: 19978 Comm: syz-executor.6 Not tainted 6.4.0-rc2-gffcc7899081b #20
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      RIP: 0010:__tcp_select_window+0x30e/0x420 net/ipv4/tcp_output.c:3018
      Code: 11 ff 0f b7 cd c1 e9 0c b8 ff ff ff ff d3 e0 89 c1 f7 d1 01 cb 21 c3 eb 17 e8 2e 83 11 ff 31 db eb 0e e8 25 83 11 ff 89 d8 99 <f7> 7c 24 04 29 d3 65 48 8b 04 25 28 00 00 00 48 3b 44 24 10 75 60
      RSP: 0018:ffffc90000a07a18 EFLAGS: 00010246
      RAX: 000000000000ffd7 RBX: 000000000000ffd7 RCX: 0000000000040000
      RDX: 0000000000000000 RSI: 000000000003ffff RDI: 0000000000040000
      RBP: 000000000000ffd7 R08: ffffffff820cf297 R09: 0000000000000001
      R10: 0000000000000000 R11: ffffffff8103d1a0 R12: 0000000000003f00
      R13: 0000000000300000 R14: ffff888101cf3540 R15: 0000000000180000
      FS:  00007f9af4c09640(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000001b33824000 CR3: 000000012f241001 CR4: 0000000000170ee0
      Call Trace:
       <TASK>
       __tcp_cleanup_rbuf+0x138/0x1d0 net/ipv4/tcp.c:1611
       mptcp_recvmsg+0xcb8/0xdd0 net/mptcp/protocol.c:2034
       inet_recvmsg+0x127/0x1f0 net/ipv4/af_inet.c:861
       ____sys_recvmsg+0x269/0x2b0 net/socket.c:1019
       ___sys_recvmsg+0xe6/0x260 net/socket.c:2764
       do_recvmmsg+0x1a5/0x470 net/socket.c:2858
       __do_sys_recvmmsg net/socket.c:2937 [inline]
       __se_sys_recvmmsg net/socket.c:2953 [inline]
       __x64_sys_recvmmsg+0xa6/0x130 net/socket.c:2953
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x47/0xa0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      RIP: 0033:0x7f9af58fc6a9
      Code: 5c c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4f 37 0d 00 f7 d8 64 89 01 48
      RSP: 002b:00007f9af4c08cd8 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
      RAX: ffffffffffffffda RBX: 00000000006bc050 RCX: 00007f9af58fc6a9
      RDX: 0000000000000001 RSI: 0000000020000140 RDI: 0000000000000004
      RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000f00 R11: 0000000000000246 R12: 00000000006bc05c
      R13: fffffffffffffea8 R14: 00000000006bc050 R15: 000000000001fe40
       </TASK>
      
      mptcp_recvmsg is allowed to release the msk socket lock when
      blocking, and before re-acquiring it another thread could have
      switched the sock to TCP_LISTEN status - with a prior
      connect(AF_UNSPEC) - also clearing icsk_ack.rcv_mss.
      
      Address the issue by preventing the disconnect if some other process is
      concurrently performing a blocking syscall on the same socket, similar to
      commit 4faeee0c ("tcp: deny tcp_disconnect() when threads are waiting").
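A simplified model of the guard is sketched below. The struct, its fields and the function names are hypothetical, and the choice of `-EBUSY` as the refusal code is an assumption made for the illustration; the sketch only shows the idea of refusing to reset state that a blocked thread will use as a divisor once it resumes.

```c
#include <assert.h>
#include <errno.h>

enum sock_state { STATE_ESTABLISHED, STATE_CLOSED };

/* Hypothetical socket model with a counter of blocked callers. */
struct fake_sock {
    enum sock_state state;
    int wait_pending;      /* threads currently blocked in a syscall */
    unsigned int rcv_mss;  /* used as a divisor on the receive path */
};

/* Called by a thread right before it blocks on the socket. */
void fake_wait_begin(struct fake_sock *sk)
{
    sk->wait_pending++;
}

/* Called by the same thread once it has woken up again. */
void fake_wait_end(struct fake_sock *sk)
{
    assert(sk->wait_pending > 0);
    sk->wait_pending--;
}

/* Refuse to reset the socket while any thread is still waiting on it. */
int fake_disconnect(struct fake_sock *sk)
{
    if (sk->wait_pending)
        return -EBUSY;
    sk->state = STATE_CLOSED;
    sk->rcv_mss = 0;
    return 0;
}
```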
      
      Fixes: a6b118fe ("mptcp: add receive buffer auto-tuning")
      Cc: stable@vger.kernel.org
      Reported-by: Christoph Paasch <cpaasch@apple.com>
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/404
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Tested-by: Christoph Paasch <cpaasch@apple.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: handle correctly disconnect() failures · c2b2ae39
      Paolo Abeni authored
      Currently the mptcp code assumes that disconnect() can fail only
      at mptcp_sendmsg_fastopen() time - to avoid a deadlock scenario - and
      doesn't even bother returning an error code.
      
      Soon mptcp_disconnect() will handle more error conditions: let's track
      them explicitly.
      
      As a bonus, explicitly annotate TCP-level disconnect as not failing:
      the mptcp code never blocks for events on the subflows.
      
      Fixes: 7d803344 ("mptcp: fix deadlock in fastopen error path")
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Tested-by: Christoph Paasch <cpaasch@apple.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 21 Jun, 2023 5 commits
  3. 20 Jun, 2023 12 commits
  4. 19 Jun, 2023 2 commits
  5. 18 Jun, 2023 3 commits
  6. 17 Jun, 2023 1 commit
    • sfc: use budget for TX completions · 4aaf2c52
      Íñigo Huguet authored
      When running workloads heavily unbalanced towards TX (high TX, low RX
      traffic), the sfc driver can hold the CPU for too long. Although
      in many cases this is not enough to be visible, it can affect
      performance and system responsiveness.
      
      A way to reproduce it is to use a debug kernel and run some parallel
      netperf TX tests. In some systems, this will lead to this message being
      logged:
        kernel:watchdog: BUG: soft lockup - CPU#12 stuck for 22s!
      
      The reason is that the sfc driver doesn't account any NAPI budget for the TX
      completion events work. With high-TX/low-RX traffic, this means the
      CPU is held for a long time in NAPI poll.
      
      The documentation says "drivers can process completions for any number of Tx
      packets but should only process up to budget number of Rx packets".
      However, many drivers do limit the amount of TX completions that they
      process in a single NAPI poll.
      
      In the same way, this patch adds a limit for the TX work in sfc. With
      the patch applied, the watchdog warning never appears.
      
      Tested with netperf in different combinations: single process / parallel
      processes, TCP / UDP and different sizes of UDP messages. Repeated the
      tests before and after the patch, without any noticeable difference in
      network or CPU performance.
      
      Test hardware:
      Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz (4 cores, 2 threads/core)
      Solarflare Communications XtremeScale X2522-25G Network Adapter
      
      Fixes: 5227eccc ("sfc: remove tx and MCDI handling from NAPI budget consideration")
      Fixes: d19a5372 ("sfc_ef100: TX path for EF100 NICs")
      Reported-by: Fei Liu <feliu@redhat.com>
      Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
      Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
      Link: https://lore.kernel.org/r/20230615084929.10506-1-ihuguet@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  7. 16 Jun, 2023 5 commits
    • ieee802154: Replace strlcpy with strscpy · cd912503
      Azeem Shaikh authored
      strlcpy() reads the entire source buffer first.
      This read may exceed the destination size limit.
      This is both inefficient and can lead to linear read
      overflows if a source string is not NUL-terminated [1].
      In an effort to remove strlcpy() completely [2], replace
      strlcpy() here with strscpy().
      
      Direct replacement is safe here since the return values
      from the helper macros are ignored by the callers.
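The behavioral difference that matters here is the bounded read and the return value. The function below is a userspace approximation written for this illustration, not the kernel's strscpy() implementation; `sketch_strscpy` is a hypothetical name. It never reads more than `size` bytes from the source, and it returns the copied length or `-E2BIG` on truncation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Copy src into dst, always NUL-terminating when size > 0.
 * Reads at most size bytes from src. Returns the number of characters
 * copied (excluding the NUL), or -E2BIG if src did not fit.
 */
ssize_t sketch_strscpy(char *dst, const char *src, size_t size)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    if (size == 0)
        return -E2BIG;
    for (i = 0; i < size - 1 && src[i] != '\0'; i++)
        dst[i] = src[i];
    dst[i] = '\0';
    return src[i] == '\0' ? (ssize_t)i : -E2BIG;
}
```

Callers that ignore the return value, as the message notes for this driver, see no difference other than the bounded read. Callers that relied on strlcpy() returning the full source length would need adjusting.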
      
      [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
      [2] https://github.com/KSPP/linux/issues/89
      Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/20230613003326.3538391-1-azeemshaikh38@gmail.com
      Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
    • net/mlx5e: Fix scheduling of IPsec ASO query while in atomic · a128f9d4
      Leon Romanovsky authored
      An ASO query can be scheduled in atomic context, so it can't use usleep.
      Use udelay as recommended in Documentation/timers/timers-howto.rst.
      
      Fixes: 76e463f6 ("net/mlx5e: Overcome slow response for first IPsec ASO WQE")
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Drop XFRM state lock when modifying flow steering · c75b9425
      Leon Romanovsky authored
      An XFRM state that has been changed to XFRM_STATE_EXPIRED doesn't really
      need to hold the lock while modifying flow steering rules to drop traffic.

      Such a state can only be deleted, so the mlx5e_ipsec_handle_tx_limit()
      work will be canceled anyway and won't run in parallel.
      
      Fixes: b2f7b01d ("net/mlx5e: Simulate missing IPsec TX limits hardware functionality")
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Fix ESN update kernel panic · fef06678
      Patrisious Haddad authored
      Previously during mlx5e_ipsec_handle_event the driver tried to execute
      an operation that could sleep, while holding a spinlock, which caused
      the kernel panic mentioned below.
      
      Move the function call that can sleep outside of the spinlock context.
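The restructuring can be sketched in userspace as follows. All names are hypothetical, a pthread mutex stands in for the kernel spinlock, and the `lock_held` flag is a debugging aid for this single-threaded illustration only. The shape is: update and snapshot the state under the lock, drop the lock, then issue the call that may sleep.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical security-association state shared with an event handler. */
struct fake_sa {
    pthread_mutex_t lock;  /* stand-in for the kernel spinlock */
    int lock_held;         /* debug flag, valid for single-threaded use */
    uint32_t esn;          /* software copy of the sequence number state */
    uint32_t hw_esn;       /* value last pushed to the device */
};

/* Stand-in for a firmware command that may sleep. */
void fake_hw_update(struct fake_sa *sa, uint32_t esn)
{
    assert(!sa->lock_held);   /* must not run in the locked section */
    sa->hw_esn = esn;
}

/* Event handler: touch shared state under the lock, sleep outside it. */
void fake_handle_event(struct fake_sa *sa, uint32_t new_esn)
{
    uint32_t snapshot;

    pthread_mutex_lock(&sa->lock);
    sa->lock_held = 1;
    sa->esn = new_esn;
    snapshot = sa->esn;       /* copy what the slow call needs */
    sa->lock_held = 0;
    pthread_mutex_unlock(&sa->lock);

    fake_hw_update(sa, snapshot);
}
```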
      
       Call Trace:
       <TASK>
       dump_stack_lvl+0x49/0x6c
       __schedule_bug.cold+0x42/0x4e
       schedule_debug.constprop.0+0xe0/0x118
       __schedule+0x59/0x58a
       ? __mod_timer+0x2a1/0x3ef
       schedule+0x5e/0xd4
       schedule_timeout+0x99/0x164
       ? __pfx_process_timeout+0x10/0x10
       __wait_for_common+0x90/0x1da
       ? __pfx_schedule_timeout+0x10/0x10
       wait_func+0x34/0x142 [mlx5_core]
       mlx5_cmd_invoke+0x1f3/0x313 [mlx5_core]
       cmd_exec+0x1fe/0x325 [mlx5_core]
       mlx5_cmd_do+0x22/0x50 [mlx5_core]
       mlx5_cmd_exec+0x1c/0x40 [mlx5_core]
       mlx5_modify_ipsec_obj+0xb2/0x17f [mlx5_core]
       mlx5e_ipsec_update_esn_state+0x69/0xf0 [mlx5_core]
       ? wake_affine+0x62/0x1f8
       mlx5e_ipsec_handle_event+0xb1/0xc0 [mlx5_core]
       process_one_work+0x1e2/0x3e6
       ? __pfx_worker_thread+0x10/0x10
       worker_thread+0x54/0x3ad
       ? __pfx_worker_thread+0x10/0x10
       kthread+0xda/0x101
       ? __pfx_kthread+0x10/0x10
       ret_from_fork+0x29/0x37
       </TASK>
       BUG: workqueue leaked lock or atomic: kworker/u256:4/0x7fffffff/189754#012     last function: mlx5e_ipsec_handle_event [mlx5_core]
       CPU: 66 PID: 189754 Comm: kworker/u256:4 Kdump: loaded Tainted: G        W          6.2.0-2596.20230309201517_5.el8uek.rc1.x86_64 #2
       Hardware name: Oracle Corporation ORACLE SERVER X9-2/ASMMBX9-2, BIOS 61070300 08/17/2022
       Workqueue: mlx5e_ipsec: eth%d mlx5e_ipsec_handle_event [mlx5_core]
       Call Trace:
       <TASK>
       dump_stack_lvl+0x49/0x6c
       process_one_work.cold+0x2b/0x3c
       ? __pfx_worker_thread+0x10/0x10
       worker_thread+0x54/0x3ad
       ? __pfx_worker_thread+0x10/0x10
       kthread+0xda/0x101
       ? __pfx_kthread+0x10/0x10
       ret_from_fork+0x29/0x37
       </TASK>
       BUG: scheduling while atomic: kworker/u256:4/189754/0x00000000
      
      Fixes: cee137a6 ("net/mlx5e: Handle ESN update events")
      Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Don't delay release of hardware objects · cf5bb023
      Leon Romanovsky authored
      XFRM core provides two callbacks to release resources, one is .xdo_dev_policy_delete()
      and another is .xdo_dev_policy_free(). This separation allows delayed release so
      "ip xfrm policy free" commands won't starve. Unfortunately, mlx5 command interface
      can't run in .xdo_dev_policy_free() callbacks as the latter runs in ATOMIC context.
      
       BUG: scheduling while atomic: swapper/7/0/0x00000100
       Modules linked in: act_mirred act_tunnel_key cls_flower sch_ingress vxlan mlx5_vdpa vringh vhost_iotlb vdpa rpcrdma rdma_ucm ib_iser libiscsi ib_umad scsi_transport_iscsi rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay mlx5_core zram zsmalloc fuse
       CPU: 7 PID: 0 Comm: swapper/7 Not tainted 6.3.0+ #1
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Call Trace:
        <IRQ>
        dump_stack_lvl+0x33/0x50
        __schedule_bug+0x4e/0x60
        __schedule+0x5d5/0x780
        ? __mod_timer+0x286/0x3d0
        schedule+0x50/0x90
        schedule_timeout+0x7c/0xf0
        ? __bpf_trace_tick_stop+0x10/0x10
        __wait_for_common+0x88/0x190
        ? usleep_range_state+0x90/0x90
        cmd_exec+0x42e/0xb40 [mlx5_core]
        mlx5_cmd_do+0x1e/0x40 [mlx5_core]
        mlx5_cmd_exec+0x18/0x30 [mlx5_core]
        mlx5_cmd_delete_fte+0xa8/0xd0 [mlx5_core]
        del_hw_fte+0x60/0x120 [mlx5_core]
        mlx5_del_flow_rules+0xec/0x270 [mlx5_core]
        ? default_send_IPI_single_phys+0x26/0x30
        mlx5e_accel_ipsec_fs_del_pol+0x1a/0x60 [mlx5_core]
        mlx5e_xfrm_free_policy+0x15/0x20 [mlx5_core]
        xfrm_policy_destroy+0x5a/0xb0
        xfrm4_dst_destroy+0x7b/0x100
        dst_destroy+0x37/0x120
        rcu_core+0x2d6/0x540
        __do_softirq+0xcd/0x273
        irq_exit_rcu+0x82/0xb0
        sysvec_apic_timer_interrupt+0x72/0x90
        </IRQ>
        <TASK>
        asm_sysvec_apic_timer_interrupt+0x16/0x20
       RIP: 0010:default_idle+0x13/0x20
       Code: c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 72 ff ff ff cc cc cc cc 8b 05 7a 4d ee 00 85 c0 7e 07 0f 00 2d 2f 98 2e 00 fb f4 <fa> c3 66 66 2e 0f 1f 84 00 00 00 00 00 65 48 8b 04 25 40 b4 02 00
       RSP: 0018:ffff888100843ee0 EFLAGS: 00000242
       RAX: 0000000000000001 RBX: ffff888100812b00 RCX: 4000000000000000
       RDX: 0000000000000001 RSI: 0000000000000083 RDI: 000000000002d2ec
       RBP: 0000000000000007 R08: 00000021daeded59 R09: 0000000000000001
       R10: 0000000000000000 R11: 000000000000000f R12: 0000000000000000
       R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
        default_idle_call+0x30/0xb0
        do_idle+0x1c1/0x1d0
        cpu_startup_entry+0x19/0x20
        start_secondary+0xfe/0x120
        secondary_startup_64_no_verify+0xf3/0xfb
        </TASK>
       bad: scheduling from the idle thread!
      
      Fixes: a5b8ca94 ("net/mlx5e: Add XFRM policy offload logic")
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>