1. 24 Jun, 2021 23 commits
  2. 23 Jun, 2021 15 commits
    • David S. Miller's avatar
      Merge branch 'devlink-rate-limit-fixes' · 35713d9b
      David S. Miller authored
      Dmytro Linkin says:
      
      ====================
      Fixes for devlink rate objects API
      
      Patch #1 fixes not decreased refcount of parent node for destroyed leaf
      object.
      
      Patch #2 fixes incorect eswitch mode check.
      
      Patch #3 protects list traversing with a lock.
      
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      35713d9b
    • Dmytro Linkin's avatar
      devlink: Protect rate list with lock while switching modes · a3e5e579
      Dmytro Linkin authored
      Devlink eswitch set command doesn't hold devlink->lock, which makes
      possible race condition between rate list traversing and others devlink
      rate KAPI calls, like devlink_rate_nodes_destroy().
      Hold devlink lock while traversing the list.
      
      Fixes: a8ecb93e ("devlink: Introduce rate nodes")
      Signed-off-by: default avatarDmytro Linkin <dlinkin@nvidia.com>
      Reviewed-by: default avatarParav Pandit <parav@nvidia.com>
      Reviewed-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a3e5e579
    • Dmytro Linkin's avatar
      devlink: Remove eswitch mode check for mode set call · ff99324d
      Dmytro Linkin authored
      When eswitch is disabled, querying its current mode results in error.
      Due to this when trying to set the eswitch mode for mlx5 devices, it
      fails to set the eswitch switchdev mode.
      Hence remove such check.
      
      Fixes: a8ecb93e ("devlink: Introduce rate nodes")
      Signed-off-by: default avatarDmytro Linkin <dlinkin@nvidia.com>
      Reviewed-by: default avatarParav Pandit <parav@nvidia.com>
      Reviewed-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ff99324d
    • Dmytro Linkin's avatar
      devlink: Decrease refcnt of parent rate object on leaf destroy · 1321ed5e
      Dmytro Linkin authored
      Port functions, like SFs, can be deleted by the user when its leaf rate
      object has parent node. In such case node refcnt won't be decreased
      which blocks the node from deletion later.
      Do simple refcnt decrease, since driver in cleanup stage. This:
      1) assumes that driver took proper internal parent unset action;
      2) allows to avoid nested callbacks call and deadlock.
      
      Fixes: d7555984 ("devlink: Allow setting parent node of rate objects")
      Signed-off-by: default avatarDmytro Linkin <dlinkin@nvidia.com>
      Reviewed-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1321ed5e
    • Xianting Tian's avatar
      virtio_net: Use virtio_find_vqs_ctx() helper · a2f7dc00
      Xianting Tian authored
      virtio_find_vqs_ctx() is defined but never be called currently,
      it is the right place to use it.
      Signed-off-by: default avatarXianting Tian <xianting.tian@linux.alibaba.com>
      Reviewed-by: default avatarStefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a2f7dc00
    • Kuniyuki Iwashima's avatar
      net/tls: Remove the __TLS_DEC_STATS() macro. · 10ed7ce4
      Kuniyuki Iwashima authored
      The commit d26b698d ("net/tls: add skeleton of MIB statistics")
      introduced __TLS_DEC_STATS(), but it is not used and __SNMP_DEC_STATS() is
      not defined also. Let's remove it.
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      10ed7ce4
    • Kuniyuki Iwashima's avatar
      tcp: Add stats for socket migration. · 55d444b3
      Kuniyuki Iwashima authored
      This commit adds two stats for the socket migration feature to evaluate the
      effectiveness: LINUX_MIB_TCPMIGRATEREQ(SUCCESS|FAILURE).
      
      If the migration fails because of the own_req race in receiving ACK and
      sending SYN+ACK paths, we do not increment the failure stat. Then another
      CPU is responsible for the req.
      
      Link: https://lore.kernel.org/bpf/CAK6E8=cgFKuGecTzSCSQ8z3YJ_163C0uwO9yRvfDSE7vOe9mJA@mail.gmail.com/Suggested-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.co.jp>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      55d444b3
    • David Wilder's avatar
      ibmveth: Set CHECKSUM_PARTIAL if NULL TCP CSUM. · 7525de25
      David Wilder authored
      TCP checksums on received packets may be set to NULL by the sender if CSO
      is enabled. The hypervisor flags these packets as check-sum-ok and the
      skb is then flagged CHECKSUM_UNNECESSARY. If these packets are then
      forwarded the sender will not request CSO due to the CHECKSUM_UNNECESSARY
      flag. The result is a TCP packet sent with a bad checksum. This change
      sets up CHECKSUM_PARTIAL on these packets causing the sender to correctly
      request CSUM offload.
      Signed-off-by: default avatarDavid Wilder <dwilder@us.ibm.com>
      Reviewed-by: default avatarPradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
      Tested-by: default avatarCristobal Forno <cforno12@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7525de25
    • David S. Miller's avatar
      Merge tag 'mlx5-net-next-2021-06-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · fe87797b
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5-net-next-2021-06-22
      
      1) Various minor cleanups and fixes from net-next branch
      2) Optimize mlx5 feature check on tx and
         a fix to allow Vxlan with Ipsec offloads
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fe87797b
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · a7b62112
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net-next
      
      The following patchset contains Netfilter updates for net-next:
      
      1) Skip non-SCTP packets in the new SCTP chunk support for nft_exthdr,
         from Phil Sutter.
      
      2) Simplify TCP option sanity check for TCP packets, also from Phil.
      
      3) Add a new expression to store when the rule has been used last time.
      
      4) Pass the hook state object to log function, from Florian Westphal.
      
      5) Document the new sysctl knobs to tune the flowtable timeouts,
         from Oz Shlomo.
      
      6) Fix snprintf error check in the new nfnetlink_hook infrastructure,
         from Dan Carpenter.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a7b62112
    • Andrea Righi's avatar
      selftests: icmp_redirect: support expected failures · 0a36a75c
      Andrea Righi authored
      According to a comment in commit 99513cfa ("selftest: Fixes for
      icmp_redirect test") the test "IPv6: mtu exception plus redirect" is
      expected to fail, because of a bug in the IPv6 logic that hasn't been
      fixed yet apparently.
      
      We should probably consider this failure as an "expected failure",
      therefore change the script to return XFAIL for that particular test and
      also report the total amount of expected failures at the end of the run.
      Signed-off-by: default avatarAndrea Righi <andrea.righi@canonical.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0a36a75c
    • David S. Miller's avatar
      Merge branch 'lockless-qdisc-opts' · e940eb3c
      David S. Miller authored
      Yunsheng Lin says:
      
      ====================
      Some optimization for lockless qdisc
      
      Patch 1: remove unnecessary seqcount operation.
      Patch 2: implement TCQ_F_CAN_BYPASS.
      Patch 3: remove qdisc->empty.
      
      Performance data for pktgen in queue_xmit mode + dummy netdev
      with pfifo_fast:
      
       threads    unpatched           patched             delta
          1       2.60Mpps            3.21Mpps             +23%
          2       3.84Mpps            5.56Mpps             +44%
          4       5.52Mpps            5.58Mpps             +1%
          8       2.77Mpps            2.76Mpps             -0.3%
         16       2.24Mpps            2.23Mpps             -0.4%
      
      Performance for IP forward testing: 1.05Mpps increases to
      1.16Mpps, about 10% improvement.
      
      V3: Add 'Acked-by' from Jakub and 'Tested-by' from Vladimir,
          and resend based on latest net-next.
      V2: Adjust the comment and commit log according to discussion
          in V1.
      V1: Drop RFC tag, add nolock_qdisc_is_empty() and do the qdisc
          empty checking without the protection of qdisc->seqlock to
          aviod doing unnecessary spin_trylock() for contention case.
      RFC v4: Use STATE_MISSED and STATE_DRAINING to indicate non-empty
              qdisc, and add patch 1 and 3.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e940eb3c
    • Yunsheng Lin's avatar
      net: sched: remove qdisc->empty for lockless qdisc · d3e0f575
      Yunsheng Lin authored
      As MISSED and DRAINING state are used to indicate a non-empty
      qdisc, qdisc->empty is not longer needed, so remove it.
      Acked-by: default avatarJakub Kicinski <kuba@kernel.org>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
      Signed-off-by: default avatarYunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d3e0f575
    • Yunsheng Lin's avatar
      net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc · c4fef01b
      Yunsheng Lin authored
      Currently pfifo_fast has both TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
      flag set, but queue discipline by-pass does not work for lockless
      qdisc because skb is always enqueued to qdisc even when the qdisc
      is empty, see __dev_xmit_skb().
      
      This patch calls sch_direct_xmit() to transmit the skb directly
      to the driver for empty lockless qdisc, which aviod enqueuing
      and dequeuing operation.
      
      As qdisc->empty is not reliable to indicate a empty qdisc because
      there is a time window between enqueuing and setting qdisc->empty.
      So we use the MISSED state added in commit a90c57f2 ("net:
      sched: fix packet stuck problem for lockless qdisc"), which
      indicate there is lock contention, suggesting that it is better
      not to do the qdisc bypass in order to avoid packet out of order
      problem.
      
      In order to make MISSED state reliable to indicate a empty qdisc,
      we need to ensure that testing and clearing of MISSED state is
      within the protection of qdisc->seqlock, only setting MISSED state
      can be done without the protection of qdisc->seqlock. A MISSED
      state testing is added without the protection of qdisc->seqlock to
      aviod doing unnecessary spin_trylock() for contention case.
      
      As the enqueuing is not within the protection of qdisc->seqlock,
      there is still a potential data race as mentioned by Jakub [1]:
      
            thread1               thread2             thread3
      qdisc_run_begin() # true
                              qdisc_run_begin(q)
                                   set(MISSED)
      pfifo_fast_dequeue
        clear(MISSED)
        # recheck the queue
      qdisc_run_end()
                                  enqueue skb1
                                                   qdisc empty # true
                                                qdisc_run_begin() # true
                                                sch_direct_xmit() # skb2
                               qdisc_run_begin()
                                  set(MISSED)
      
      When above happens, skb1 enqueued by thread2 is transmited after
      skb2 is transmited by thread3 because MISSED state setting and
      enqueuing is not under the qdisc->seqlock. If qdisc bypass is
      disabled, skb1 has better chance to be transmited quicker than
      skb2.
      
      This patch does not take care of the above data race, because we
      view this as similar as below:
      Even at the same time CPU1 and CPU2 write the skb to two socket
      which both heading to the same qdisc, there is no guarantee that
      which skb will hit the qdisc first, because there is a lot of
      factor like interrupt/softirq/cache miss/scheduling afffecting
      that.
      
      There are below cases that need special handling:
      1. When MISSED state is cleared before another round of dequeuing
         in pfifo_fast_dequeue(), and __qdisc_run() might not be able to
         dequeue all skb in one round and call __netif_schedule(), which
         might result in a non-empty qdisc without MISSED set. In order
         to avoid this, the MISSED state is set for lockless qdisc and
         __netif_schedule() will be called at the end of qdisc_run_end.
      
      2. The MISSED state also need to be set for lockless qdisc instead
         of calling __netif_schedule() directly when requeuing a skb for
         a similar reason.
      
      3. For netdev queue stopped case, the MISSED case need clearing
         while the netdev queue is stopped, otherwise there may be
         unnecessary __netif_schedule() calling. So a new DRAINING state
         is added to indicate this case, which also indicate a non-empty
         qdisc.
      
      4. As there is already netif_xmit_frozen_or_stopped() checking in
         dequeue_skb() and sch_direct_xmit(), which are both within the
         protection of qdisc->seqlock, but the same checking in
         __dev_xmit_skb() is without the protection, which might cause
         empty indication of a lockless qdisc to be not reliable. So
         remove the checking in __dev_xmit_skb(), and the checking in
         the protection of qdisc->seqlock seems enough to avoid the cpu
         consumption problem for netdev queue stopped case.
      
      1. https://lkml.org/lkml/2021/5/29/215Acked-by: default avatarJakub Kicinski <kuba@kernel.org>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
      Signed-off-by: default avatarYunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c4fef01b
    • Yunsheng Lin's avatar
      net: sched: avoid unnecessary seqcount operation for lockless qdisc · dd25296a
      Yunsheng Lin authored
      qdisc->running seqcount operation is mainly used to do heuristic
      locking on q->busylock for locked qdisc, see qdisc_is_running()
      and __dev_xmit_skb().
      
      So avoid doing seqcount operation for qdisc with TCQ_F_NOLOCK
      flag.
      Acked-by: default avatarJakub Kicinski <kuba@kernel.org>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
      Signed-off-by: default avatarYunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      dd25296a
  3. 22 Jun, 2021 2 commits
    • Huy Nguyen's avatar
      net/mlx5: Fix checksum issue of VXLAN and IPsec crypto offload · f1267798
      Huy Nguyen authored
      The packet is VXLAN packet over IPsec transport mode tunnel
      which has the following format: [IP1 | ESP | UDP | VXLAN | IP2 | TCP]
      NVIDIA ConnectX card cannot do checksum offload for two L4 headers.
      The solution is using the checksum partial offload similar to
      VXLAN | TCP packet. Hardware calculates IP1, IP2 and TCP checksums and
      software calculates UDP checksum. However, unlike VXLAN | TCP case,
      IPsec's mlx5 driver cannot access the inner plaintext IP protocol type.
      Therefore, inner_ipproto is added in the sec_path structure
      to provide this information. Also, utilize the skb's csum_start to
      program L4 inner checksum offset.
      
      While at it, remove the call to mlx5e_set_eseg_swp and setup software parser
      fields directly in mlx5e_ipsec_set_swp. mlx5e_set_eseg_swp is not
      needed as the two features (GENEVE and IPsec) are different and adding
      this sharing layer creates unnecessary complexity and affect
      performance.
      
      For the case VXLAN packet over IPsec tunnel mode tunnel, checksum offload
      is disabled because the hardware does not support checksum offload for
      three L3 (IP) headers.
      Signed-off-by: default avatarRaed Salem <raeds@nvidia.com>
      Signed-off-by: default avatarHuy Nguyen <huyn@nvidia.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      f1267798
    • Huy Nguyen's avatar
      net/xfrm: Add inner_ipproto into sec_path · fa453523
      Huy Nguyen authored
      The inner_ipproto saves the inner IP protocol of the plain
      text packet. This allows vendor's IPsec feature making offload
      decision at skb's features_check and configuring hardware at
      ndo_start_xmit.
      
      For example, ConnectX6-DX IPsec device needs the plaintext's
      IP protocol to support partial checksum offload on
      VXLAN/GENEVE packet over IPsec transport mode tunnel.
      Signed-off-by: default avatarRaed Salem <raeds@nvidia.com>
      Signed-off-by: default avatarHuy Nguyen <huyn@nvidia.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Acked-by: default avatarSteffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      fa453523