1. 05 Oct, 2023 13 commits
  2. 04 Oct, 2023 20 commits
    • tcp: fix delayed ACKs for MSS boundary condition · 4720852e
      Neal Cardwell authored
      This commit fixes poor delayed ACK behavior that can cause poor TCP
      latency in a particular boundary condition: when an application makes
      a TCP socket write that is an exact multiple of the MSS size.
      
      The problem is that there is a painful boundary discontinuity in the
      current delayed ACK behavior. With the current delayed ACK behavior,
      we have:
      
      (1) If an app reads data when > 1*MSS is unacknowledged, then
          tcp_cleanup_rbuf() ACKs immediately because of:
      
           tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
      
      (2) If an app reads all received data, and the packets were < 1*MSS,
          and either (a) the app is not ping-pong or (b) we received two
          packets < 1*MSS, then tcp_cleanup_rbuf() ACKs immediately because
          of:
      
           ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
            ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
             !inet_csk_in_pingpong_mode(sk))) &&
      
      (3) *However*: if an app reads exactly 1*MSS of data,
          tcp_cleanup_rbuf() does not send an immediate ACK. This is true
          even if the app is not ping-pong and the 1*MSS of data had the PSH
          bit set, suggesting the sending application completed an
          application write.
      
      Thus if the app is not ping-pong, we have this painful case where
      >1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a
      write whose last skb is an exact multiple of 1*MSS can get a 40ms
      delayed ACK. This means that any app that transfers data in one
      direction and takes care to align write size or packet size with MSS
      can suffer this problem. With receive zero copy making 4KB MSS values
      more common, it is becoming more common to have application writes
      naturally align with MSS, and more applications are likely to
      encounter this delayed ACK problem.
      
      The fix in this commit is to refine the delayed ACK heuristics with a
      simple check: immediately ACK a received 1*MSS skb with PSH bit set if
      the app reads all data. Why? If an skb has a len of exactly 1*MSS and
      has the PSH bit set then it is likely the end of an application
      write. So more data may not be arriving soon, and yet the data sender
      may be waiting for an ACK if cwnd-bound or using TX zero copy. Thus we
      set ICSK_ACK_PUSHED in this case so that tcp_cleanup_rbuf() will send
      an ACK immediately if the app reads all of the data and is not
      ping-pong. Note that this logic is also executed for the case where
      len > MSS, but in that case this logic does not matter (and does not
      hurt) because tcp_cleanup_rbuf() will always ACK immediately if the
      app reads data and there is more than an MSS of unACKed data.
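
The three cases and the effect of the fix can be sketched as a small userspace model. This is only an illustration under simplifying assumptions, not the kernel code: `rx_model`, `ACK_PUSHED`, `ACK_PUSHED2`, `on_segment()` and `ack_now_after_full_read()` are hypothetical names, and the model assumes the app has just read all queued data.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for ICSK_ACK_PUSHED / ICSK_ACK_PUSHED2. */
enum { ACK_PUSHED = 1 << 0, ACK_PUSHED2 = 1 << 1 };

/* Simplified receiver state; not the kernel's struct layout. */
struct rx_model {
	uint32_t rcv_mss;      /* estimated sender MSS */
	uint32_t unacked;      /* models tp->rcv_nxt - tp->rcv_wup */
	unsigned int pending;  /* models icsk->icsk_ack.pending */
	bool pingpong;         /* models inet_csk_in_pingpong_mode() */
};

/* Account for one received segment. 'with_fix' selects the new heuristic. */
void on_segment(struct rx_model *m, uint32_t len, bool psh, bool with_fix)
{
	m->unacked += len;
	if (len < m->rcv_mss) {
		/* Sub-MSS segment: the first sets PUSHED, a second sets PUSHED2. */
		if (m->pending & ACK_PUSHED)
			m->pending |= ACK_PUSHED2;
		m->pending |= ACK_PUSHED;
	} else if (with_fix && psh) {
		/* Full-sized segment with PSH: likely the end of a write. */
		m->pending |= ACK_PUSHED;
	}
}

/* Would the receiver ACK at once after the app has read all queued data? */
bool ack_now_after_full_read(const struct rx_model *m)
{
	if (m->unacked > m->rcv_mss)                      /* case (1) */
		return true;
	return (m->pending & ACK_PUSHED2) ||              /* case (2) */
	       ((m->pending & ACK_PUSHED) && !m->pingpong);
}
```

In this model, a single 1460-byte segment with PSH on a non-ping-pong connection gets no immediate ACK when `with_fix` is false and gets one when it is true, which is the boundary case (3) described above.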
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Yuchung Cheng <ycheng@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Cc: Xin Guo <guoxin0309@gmail.com>
      Link: https://lore.kernel.org/r/20231001151239.1866845-2-ncardwell.sw@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: fix quick-ack counting to count actual ACKs of new data · 059217c1
      Neal Cardwell authored
      This commit fixes quick-ack counting so that it only considers that a
      quick-ack has been provided if we are sending an ACK that newly
      acknowledges data.
      
      The code was erroneously using the number of data segments in outgoing
      skbs when deciding how many quick-ack credits to remove. This logic
      does not make sense, and could cause poor performance in
      request-response workloads, like RPC traffic, where requests or
      responses can be multi-segment skbs.
      
      When a TCP connection decides to send N quick-acks, that is to
      accelerate the cwnd growth of the congestion control module
      controlling the remote endpoint of the TCP connection. That quick-ack
      decision is purely about the incoming data and outgoing ACKs. It has
      nothing to do with the outgoing data or the size of outgoing data.
      
      And in particular, an ACK only serves the intended purpose of allowing
      the remote congestion control to grow the congestion window quickly if
      the ACK is ACKing or SACKing new data.
      
      The fix is simple: only count packets as serving the goal of the
      quickack mechanism if they are ACKing/SACKing new data. We can tell
      whether this is the case by checking inet_csk_ack_scheduled(), since
      we schedule an ACK exactly when we are ACKing/SACKing new data.
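
The difference between the two accounting schemes can be sketched as follows. This is a userspace model under stated assumptions, not the kernel function: `quickack_model`, `dec_quickack_old()` and `dec_quickack_new()` are hypothetical names, and `ack_scheduled` stands in for the result of inet_csk_ack_scheduled().

```c
#include <stdbool.h>

/* Simplified per-socket quick-ack state (hypothetical model). */
struct quickack_model {
	unsigned int quick;   /* remaining quick-ack credits */
	bool ack_scheduled;   /* models inet_csk_ack_scheduled(sk) */
};

/* Old accounting: charge one credit per data segment in the outgoing skb. */
void dec_quickack_old(struct quickack_model *q, unsigned int tx_segs)
{
	q->quick = (tx_segs >= q->quick) ? 0 : q->quick - tx_segs;
}

/* New accounting: charge one credit only if this packet ACKs new data. */
void dec_quickack_new(struct quickack_model *q)
{
	unsigned int pkts = q->ack_scheduled ? 1 : 0;

	q->quick = (pkts >= q->quick) ? 0 : q->quick - pkts;
}
```

With four credits left, sending a ten-segment response wipes out all credits under the old scheme, while the new scheme removes at most one credit, and only when an ACK of new data was actually pending.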
      
      Fixes: fc6415bc ("[TCP]: Fix quick-ack decrementing with TSO.")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Yuchung Cheng <ycheng@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20231001151239.1866845-1-ncardwell.sw@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge tag 'nf-23-10-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · c56e67f3
      Jakub Kicinski authored
      Florian Westphal says:
      
      ====================
      netfilter patches for net
      
      First patch resolves a regression with vlan header matching; this has
      been broken since the 6.5 release.  From myself.
      
      Second patch fixes an ancient problem with sctp connection tracking in
      case INIT_ACK packets are delayed.  This comes with a selftest, both
      patches from Xin Long.
      
      Patch 4 extends the existing nftables audit selftest, from
      Phil Sutter.
      
      Patch 5, also from Phil, avoids a situation where nftables
      would emit an audit record twice. This has been broken since 5.13.
      
      Patch 6, from myself, avoids spurious insertion failure if we encounter an
      overlapping but expired range during element insertion with the
      'nft_set_rbtree' backend. This problem has existed since 6.2.
      
      * tag 'nf-23-10-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
        netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure
        netfilter: nf_tables: Deduplicate nft_register_obj audit logs
        selftests: netfilter: Extend nft_audit.sh
        selftests: netfilter: test for sctp collision processing in nf_conntrack
        netfilter: handle the connecting collision properly in nf_conntrack_proto_sctp
        netfilter: nft_payload: rebuild vlan header on h_proto access
      ====================
      
      Link: https://lore.kernel.org/r/20231004141405.28749-1-fw@strlen.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • page_pool: fix documentation typos · 513dbc10
      Randy Dunlap authored
      Correct grammar for better readability.
      
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jesper Dangaard Brouer <hawk@kernel.org>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Link: https://lore.kernel.org/r/20231001003846.29541-1-rdunlap@infradead.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tipc: fix a potential deadlock on &tx->lock · 08e50cf0
      Chengfeng Ye authored
      It seems that tipc_crypto_key_revoke() could be invoked by the
      workqueue tipc_crypto_work_rx() under process context and by the
      timer/rx callback under softirq context, so the lock acquisition
      on &tx->lock should use spin_lock_bh() to prevent a possible
      deadlock.
      
      This flaw was found by an experimental static analysis tool I am
      developing for irq-related deadlock.
      
      tipc_crypto_work_rx() <workqueue>
      --> tipc_crypto_key_distr()
      --> tipc_bcast_xmit()
      --> tipc_bcbase_xmit()
      --> tipc_bearer_bc_xmit()
      --> tipc_crypto_xmit()
      --> tipc_ehdr_build()
      --> tipc_crypto_key_revoke()
      --> spin_lock(&tx->lock)
      <timer interrupt>
         --> tipc_disc_timeout()
         --> tipc_bearer_xmit_skb()
         --> tipc_crypto_xmit()
         --> tipc_ehdr_build()
         --> tipc_crypto_key_revoke()
         --> spin_lock(&tx->lock) <deadlock here>
      
      Signed-off-by: Chengfeng Ye <dg573847474@gmail.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Fixes: fc1b6d6d ("tipc: introduce TIPC encryption & authentication")
      Link: https://lore.kernel.org/r/20230927181414.59928-1-dg573847474@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: stmmac: dwmac-stm32: fix resume on STM32 MCU · 6f195d6b
      Ben Wolsieffer authored
      The STM32MP1 keeps clk_rx enabled during suspend, and therefore the
      driver does not enable the clock in stm32_dwmac_init() if the device was
      suspended. The problem is that this same code runs on STM32 MCUs, which
      do disable clk_rx during suspend, causing the clock to never be
      re-enabled on resume.
      
      This patch adds a variant flag to indicate that clk_rx remains enabled
      during suspend, and uses this to decide whether to enable the clock in
      stm32_dwmac_init() if the device was suspended.
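
The decision the variant flag encodes can be sketched as a small truth function. This is an illustration only: `variant_model`, its field name and `must_enable_clk_rx()` are hypothetical and are not taken from the driver.

```c
#include <stdbool.h>

/* Hypothetical per-variant description; the field name is illustrative. */
struct variant_model {
	bool clk_rx_stays_on_in_suspend;
};

/* Decide whether init must enable clk_rx.
 * 'resume' is true when init runs on the resume path. */
bool must_enable_clk_rx(const struct variant_model *v, bool resume)
{
	/* Skip the enable only when resuming a variant whose clock never went off. */
	return !resume || !v->clk_rx_stays_on_in_suspend;
}
```

Under this model an STM32MP1-like variant skips the enable on resume, while an MCU-like variant (flag cleared) re-enables the clock on both probe and resume.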
      
      This approach fixes this specific bug with limited opportunity for
      unintended side-effects, but I have a follow up patch that will refactor
      the clock configuration and hopefully make it less error prone.
      
      Fixes: 6528e02c ("net: ethernet: stmmac: add adaptation for stm32mp157c.")
      Signed-off-by: Ben Wolsieffer <ben.wolsieffer@hefring.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20230927175749.1419774-1-ben.wolsieffer@hefring.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv4: Set offload_failed flag in fibmatch results · 0add5c59
      Benjamin Poirier authored
      Due to a small omission, the offload_failed flag is missing from ipv4
      fibmatch results. Make sure it is set correctly.
      
      The issue can be witnessed using the following commands:
      echo "1 1" > /sys/bus/netdevsim/new_device
      ip link add dummy1 up type dummy
      ip route add 192.0.2.0/24 dev dummy1
      echo 1 > /sys/kernel/debug/netdevsim/netdevsim1/fib/fail_route_offload
      ip route add 198.51.100.0/24 dev dummy1
      ip route
      	# 192.0.2.0/24 has rt_trap
      	# 198.51.100.0/24 has rt_offload_failed
      ip route get 192.0.2.1 fibmatch
      	# Result has rt_trap
      ip route get 198.51.100.1 fibmatch
      	# Result differs from the route shown by `ip route`, it is missing
      	# rt_offload_failed
      ip link del dev dummy1
      echo 1 > /sys/bus/netdevsim/del_device
      
      Fixes: 36c5100e ("IPv4: Add "offload failed" indication to routes")
      Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20230926182730.231208-1-bpoirier@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge tag 'wireless-2023-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless · 72897b29
      Jakub Kicinski authored
      Johannes Berg says:
      
      ====================
      
      Quite a collection of fixes this time, really too many
      to list individually. Many stack fixes, even rfkill
      (found by simulation and the new eevdf scheduler)!
      
      Also a bigger maintainers file cleanup, to remove old
      and redundant information.
      
      * tag 'wireless-2023-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (32 commits)
        wifi: iwlwifi: mvm: Fix incorrect usage of scan API
        wifi: mac80211: Create resources for disabled links
        wifi: cfg80211: avoid leaking stack data into trace
        wifi: mac80211: allow transmitting EAPOL frames with tainted key
        wifi: mac80211: work around Cisco AP 9115 VHT MPDU length
        wifi: cfg80211: Fix 6GHz scan configuration
        wifi: mac80211: fix potential key leak
        wifi: mac80211: fix potential key use-after-free
        wifi: mt76: mt76x02: fix MT76x0 external LNA gain handling
        wifi: brcmfmac: Replace 1-element arrays with flexible arrays
        wifi: mwifiex: Fix oob check condition in mwifiex_process_rx_packet
        wifi: rtw88: rtw8723d: Fix MAC address offset in EEPROM
        rfkill: sync before userspace visibility/changes
        wifi: mac80211: fix mesh id corruption on 32 bit systems
        wifi: cfg80211: add missing kernel-doc for cqm_rssi_work
        wifi: cfg80211: fix cqm_config access race
        wifi: iwlwifi: mvm: Fix a memory corruption issue
        wifi: iwlwifi: Ensure ack flag is properly cleared.
        wifi: iwlwifi: dbg_ini: fix structure packing
        iwlwifi: mvm: handle PS changes in vif_cfg_changed
        ...
      ====================
      
      Link: https://lore.kernel.org/r/20230927095835.25803-2-johannes@sipsolutions.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 1eb3dee1
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2023-10-02
      
      We've added 11 non-merge commits during the last 12 day(s) which contain
      a total of 12 files changed, 176 insertions(+), 41 deletions(-).
      
      The main changes are:
      
      1) Fix BPF verifier to reset backtrack_state masks on global function
         exit as otherwise subsequent precision tracking would reuse them,
         from Andrii Nakryiko.
      
      2) Several sockmap fixes for available bytes accounting,
         from John Fastabend.
      
      3) Reject sk_msg egress redirects to non-TCP sockets given this
         is only supported for TCP sockets today, from Jakub Sitnicki.
      
      4) Fix a syzkaller splat in bpf_mprog when hitting maximum program
         limits with BPF_F_BEFORE directive, from Daniel Borkmann
         and Nikolay Aleksandrov.
      
      5) Fix BPF memory allocator to use kmalloc_size_roundup() to adjust
         size_index for selecting a bpf_mem_cache, from Hou Tao.
      
      6) Fix arch_prepare_bpf_trampoline return code for s390 JIT,
         from Song Liu.
      
      7) Fix bpf_trampoline_get when CONFIG_BPF_JIT is turned off,
         from Leon Hwang.
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        bpf: Use kmalloc_size_roundup() to adjust size_index
        selftest/bpf: Add various selftests for program limits
        bpf, mprog: Fix maximum program check on mprog attachment
        bpf, sockmap: Reject sk_msg egress redirects to non-TCP sockets
        bpf, sockmap: Add tests for MSG_F_PEEK
        bpf, sockmap: Do not inc copied_seq when PEEK flag set
        bpf: tcp_read_skb needs to pop skb regardless of seq
        bpf: unconditionally reset backtrack_state masks on global func exit
        bpf: Fix tr dereferencing
        selftests/bpf: Check bpf_cubic_acked() is called via struct_ops
        s390/bpf: Let arch_prepare_bpf_trampoline return program size
      ====================
      
      Link: https://lore.kernel.org/r/20231002113417.2309-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure · 08738827
      Florian Westphal authored
      nft_rbtree_gc_elem() walks back and removes the end interval element that
      comes before the expired element.
      
      There is a small chance that we've cached this element as 'rbe_ge'.
      If this happens, we hold and test a pointer that has been queued for
      freeing.
      
      It also causes spurious insertion failures:
      
      $ cat test-testcases-sets-0044interval_overlap_0.1/testout.log
      Error: Could not process rule: File exists
      add element t s {  0 -  2 }
                         ^^^^^^
      Failed to insert  0 -  2 given:
      table ip t {
              set s {
                      type inet_service
                      flags interval,timeout
                      timeout 2s
                      gc-interval 2s
              }
      }
      
      The set (rbtree) is empty. The 'failure' doesn't happen on next attempt.
      
      The reason is that when we try to insert, the tree may hold an expired
      element that collides with the range we're adding.
      While we do evict/erase this element, we can trip over this check:
      
      if (rbe_ge && nft_rbtree_interval_end(rbe_ge) && nft_rbtree_interval_end(new))
            return -ENOTEMPTY;
      
      rbe_ge was erased by the synchronous gc, so we should not have done
      this check.  The next attempt won't find it, so a retry results in a
      successful insertion.
      
      Restart in-kernel to avoid such spurious errors.
      
      Such restarts are rare, unless userspace intentionally adds very large
      numbers of elements with very short timeouts while setting a huge
      gc interval.
      
      Even in this case, this cannot loop forever: on each retry an existing
      element has been removed.
      
      As the caller is holding the transaction mutex, it's impossible
      for a second entity to add more expiring elements to the tree.
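
The termination argument can be sketched with a toy set. This is a deliberately simplified userspace model, not the rbtree backend: `set_model`, `insert_once()` and `insert_with_restart()` are hypothetical, keys stand in for ranges, and -EAGAIN is used here as the "restart" signal by assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_CAP 8

struct elem_model { int key; bool expired; bool used; };
struct set_model { struct elem_model e[MODEL_CAP]; };

/* One insertion attempt. Evicting an expired, colliding element invalidates
 * whatever the walk has cached so far, so the attempt reports -EAGAIN. */
int insert_once(struct set_model *s, int key)
{
	size_t i, free_slot = MODEL_CAP;

	for (i = 0; i < MODEL_CAP; i++) {
		if (!s->e[i].used) {
			if (free_slot == MODEL_CAP)
				free_slot = i;
			continue;
		}
		if (s->e[i].key != key)
			continue;
		if (s->e[i].expired) {
			s->e[i].used = false;   /* synchronous gc */
			return -EAGAIN;
		}
		return -EEXIST;                 /* live collision */
	}
	if (free_slot == MODEL_CAP)
		return -ENOSPC;
	s->e[free_slot] = (struct elem_model){ .key = key, .used = true };
	return 0;
}

/* Retry in place; every -EAGAIN removed one element, so this terminates. */
int insert_with_restart(struct set_model *s, int key, unsigned int *restarts)
{
	int err;

	*restarts = 0;
	while ((err = insert_once(s, key)) == -EAGAIN)
		(*restarts)++;
	return err;
}
```

The number of restarts is bounded by the number of expired elements present when the insertion starts, because nothing else can add elements while the loop runs.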
      
      After this it also becomes feasible to remove the async gc worker
      and perform all garbage collection from the commit path.
      
      Fixes: c9e6978e ("netfilter: nft_set_rbtree: Switch to node list walk for overlap detection")
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • netfilter: nf_tables: Deduplicate nft_register_obj audit logs · 0d880dc6
      Phil Sutter authored
      When adding/updating an object, the transaction handler already emits
      suitable audit log entries, so the one in nft_obj_notify() is redundant.
      To fix that (and retain the audit logging from objects' 'update'
      callback), introduce an "audit log free" variant for internal use.
      
      Fixes: c520292f ("audit: log nftables configuration change events once per table")
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
      Acked-by: Paul Moore <paul@paul-moore.com> (Audit)
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • selftests: netfilter: Extend nft_audit.sh · 203bb9d3
      Phil Sutter authored
      Add tests for sets and elements and deletion of all kinds. Also
      reorder rule reset tests: By moving the bulk rule add command up, the
      two 'reset rules' tests become identical.
      
      While at it, fix a failing bulk rule add test's error status getting
      lost due to its use in a pipe. Avoid this by using a temporary file.
      
      Headings in diff output for failing tests contain no useful data, so
      strip them.
      
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • selftests: netfilter: test for sctp collision processing in nf_conntrack · cf791b22
      Xin Long authored
      This patch adds a test case to reproduce the SCTP DATA chunk retransmission
      timeout issue caused by the improper SCTP collision processing in netfilter
      nf_conntrack_proto_sctp.
      
      In this test, the client sends an INIT chunk, but the INIT_ACK reply from
      the server is delayed until the server sends an INIT chunk to start a new
      connection from its side. After the connection is complete from the
      server side, the delayed INIT_ACK arrives in nf_conntrack_proto_sctp.
      
      The delayed INIT_ACK should be dropped in nf_conntrack_proto_sctp instead
      of updating the vtag with the out-of-date init_tag. Otherwise, the vtag
      in DATA chunks later sent by the client doesn't match the vtag in the
      conntrack entry and the DATA chunks get dropped.
      
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • netfilter: handle the connecting collision properly in nf_conntrack_proto_sctp · 8e56b063
      Xin Long authored
      In Scenario A and B below, as the delayed INIT_ACK always changes the peer
      vtag, SCTP ct with the incorrect vtag may cause packet loss.
      
      Scenario A: INIT_ACK is delayed until the peer receives its own INIT_ACK
      
        192.168.1.2 > 192.168.1.1: [INIT] [init tag: 1328086772]
          192.168.1.1 > 192.168.1.2: [INIT] [init tag: 1414468151]
          192.168.1.2 > 192.168.1.1: [INIT ACK] [init tag: 1328086772]
        192.168.1.1 > 192.168.1.2: [INIT ACK] [init tag: 1650211246] *
        192.168.1.2 > 192.168.1.1: [COOKIE ECHO]
          192.168.1.1 > 192.168.1.2: [COOKIE ECHO]
          192.168.1.2 > 192.168.1.1: [COOKIE ACK]
      
      Scenario B: INIT_ACK is delayed until the peer completes its own handshake
      
        192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
          192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
          192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408]
          192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO]
          192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK]
        192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021] *
      
      This patch fixes it as below:
      
      In SCTP_CID_INIT processing:
      - clear ct->proto.sctp.init[!dir] if ct->proto.sctp.init[dir] &&
        ct->proto.sctp.init[!dir]. (Scenario E)
      - set ct->proto.sctp.init[dir].
      
      In SCTP_CID_INIT_ACK processing:
      - drop it if !ct->proto.sctp.init[!dir] && ct->proto.sctp.vtag[!dir] &&
        ct->proto.sctp.vtag[!dir] != ih->init_tag. (Scenario B, Scenario C)
      - drop it if ct->proto.sctp.init[dir] && ct->proto.sctp.init[!dir] &&
        ct->proto.sctp.vtag[!dir] != ih->init_tag. (Scenario A)
      
      In SCTP_CID_COOKIE_ACK processing:
      - clear ct->proto.sctp.init[dir] and ct->proto.sctp.init[!dir].
        (Scenario D)
      
      Also, it's important to allow the ct state to move forward with cookie_echo
      and cookie_ack from the opposite dir for the collision scenarios.
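
The processing rules listed above can be sketched as a small state model and replayed against Scenarios A and B. This is a userspace illustration only: `sctp_ct_model` and the three handlers are hypothetical names, and recording the tag in `vtag[!dir]` when an INIT or an accepted INIT_ACK is seen is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified conntrack state; models only init[] and vtag[] from the text. */
struct sctp_ct_model {
	bool init[2];       /* INIT seen from this direction */
	uint32_t vtag[2];   /* verification tag expected in each direction */
};

/* INIT from direction 'dir' carrying 'init_tag'. */
void on_init(struct sctp_ct_model *ct, int dir, uint32_t init_tag)
{
	if (ct->init[dir] && ct->init[!dir])
		ct->init[!dir] = false;      /* rule tagged Scenario E */
	ct->init[dir] = true;
	ct->vtag[!dir] = init_tag;       /* assumption: the peer must use this tag */
}

/* INIT_ACK from 'dir'. Returns true if the packet should be dropped. */
bool on_init_ack(struct sctp_ct_model *ct, int dir, uint32_t init_tag)
{
	if (!ct->init[!dir] && ct->vtag[!dir] && ct->vtag[!dir] != init_tag)
		return true;                 /* rule tagged Scenario B, Scenario C */
	if (ct->init[dir] && ct->init[!dir] && ct->vtag[!dir] != init_tag)
		return true;                 /* rule tagged Scenario A */
	ct->vtag[!dir] = init_tag;
	return false;
}

/* COOKIE_ACK: the handshake is done, forget both INIT markers. */
void on_cookie_ack(struct sctp_ct_model *ct)
{
	ct->init[0] = ct->init[1] = false;   /* rule tagged Scenario D */
}
```

Taking direction 0 as 192.168.1.2 > 192.168.1.1, replaying Scenario A drops the INIT_ACK with init tag 1650211246, and replaying Scenario B drops the one with init tag 3914796021, so in both cases the stored vtag is left untouched by the delayed packet.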
      
      There are also other Scenarios where it should allow the packet through,
      addressed by the processing above:
      
      Scenario C: new CT is created by INIT_ACK.
      
      Scenario D: start INIT on the existing ESTABLISHED ct.
      
      Scenario E: start INIT after the old collision on the existing ESTABLISHED
      ct.
      
        192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
        192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
        (both sides are stopped, then start a new connection again hours later)
        192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 242308742]
      
      Fixes: 9fb9cbb1 ("[NETFILTER]: Add nf_conntrack subsystem.")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • netfilter: nft_payload: rebuild vlan header on h_proto access · af84f9e4
      Florian Westphal authored
      nft can perform merging of adjacent payload requests.
      This means that:
      
      ether saddr 00:11 ... ether type 8021ad ...
      
      is a single payload expression, for 8 bytes, starting at the
      ethernet source offset.
      
      Check that offset+length is fully within the source/destination mac
      addresses.
      
      This bug prevents 'ether type' from matching the correct h_proto when
      the vlan tag has been stripped.
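
A boundary check of this kind can be sketched as below. This is an illustration under assumptions, not the kernel helper: `eth_hdr_model` is a local copy of the Ethernet header layout and `needs_vlan_rebuild()` is a hypothetical name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local copy of the Ethernet header layout, to keep the sketch standalone. */
struct eth_hdr_model {
	uint8_t  h_dest[6];
	uint8_t  h_source[6];
	uint16_t h_proto;
};

/* True if [offset, offset + len) reaches past the two MAC addresses,
 * i.e. the request touches h_proto and a stripped vlan tag matters. */
bool needs_vlan_rebuild(unsigned int offset, unsigned int len)
{
	return offset + len > offsetof(struct eth_hdr_model, h_proto);
}
```

For the merged expression from the example above (8 bytes starting at the source address, offset 6), the range ends at byte 14 and therefore covers h_proto, whereas a plain 6-byte source address match at offset 6 stays within the addresses.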
      
      Fixes: de6843be ("netfilter: nft_payload: rebuild vlan header when needed")
      Reported-by: David Ward <david.ward@ll.mit.edu>
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • ibmveth: Remove condition to recompute TCP header checksum. · 51e7a666
      David Wilder authored
      In some OVS environments the TCP pseudo header checksum may need to be
      recomputed. Currently this is only done when the interface instance is
      configured for "Trunk Mode". We found the issue also occurs in some
      Kubernetes environments. These environments do not use "Trunk Mode",
      therefore the condition is removed.
      
      Performance tests with this change show only a fractional decrease in
      throughput (< 0.2%).
      
      Fixes: 7525de25 ("ibmveth: Set CHECKSUM_PARTIAL if NULL TCP CSUM.")
      Signed-off-by: David Wilder <dwilder@us.ibm.com>
      Reviewed-by: Nick Child <nnac123@linux.ibm.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dmaengine: ti: k3-udma-glue: clean up k3_udma_glue_tx_get_irq() return · f9a1d321
      Dan Carpenter authored
      The k3_udma_glue_tx_get_irq() function currently returns negative error
      codes or zero on error, and positive values on success.  This
      complicates life for the callers who need to propagate the error code.
      Also GCC will not warn about unsigned comparisons when you check:
      
      	if (unsigned_irq <= 0)
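
Why this check is fragile can be shown with a small standalone sketch. The helper names `caught_when_unsigned()` and `caught_when_signed()` are hypothetical and exist only for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Does an "irq <= 0" check catch 'ret' when irq is declared unsigned? */
bool caught_when_unsigned(int ret)
{
	unsigned int irq = (unsigned int)ret;   /* -ENODEV wraps to a huge value */

	return irq <= 0;                        /* only true for exactly 0 */
}

/* The same check with a signed variable. */
bool caught_when_signed(int ret)
{
	int irq = ret;

	return irq <= 0;
}
```

With an unsigned variable, a negative error code such as -ENODEV passes the check as if it were a valid IRQ number, and only a return value of exactly zero is caught.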
      
      All the callers have been fixed now but let's just make this easy going
      forward.
      
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Reviewed-by: Roger Quadros <rogerq@kernel.org>
      Acked-by: Vinod Koul <vkoul@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ti: icssg-prueth: Fix signedness bug in prueth_init_tx_chns() · a325f174
      Dan Carpenter authored
      The "tx_chn->irq" variable is unsigned so the error checking does not
      work correctly.
      
      Fixes: 128d5874 ("net: ti: icssg-prueth: Add ICSSG ethernet driver")
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Reviewed-by: Roger Quadros <rogerq@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: ti: am65-cpsw: Fix error code in am65_cpsw_nuss_init_tx_chns() · 37d4f555
      Dan Carpenter authored
      This accidentally returns success, but it should return a negative error
      code.
      
      Fixes: 93a76530 ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver")
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Reviewed-by: Roger Quadros <rogerq@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vringh: don't use vringh_kiov_advance() in vringh_iov_xfer() · 7aed44ba
      Stefano Garzarella authored
      In the while loop of vringh_iov_xfer(), `partlen` could be 0 if one of
      the `iov` has 0 length.
      In this case, we should skip the iov and go to the next one.
      But calling vringh_kiov_advance() with 0 length does not cause the
      advancement, since it returns immediately if asked to advance by 0 bytes.
      
      Let's restore the code that was there before commit b8c06ad4
      ("vringh: implement vringh_kiov_advance()"), avoiding using
      vringh_kiov_advance().
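
The difference between the two ways of advancing can be sketched with a toy iov. This is a simplified userspace model, not the vringh code: `kiov_model`, `advance_helper()` and `advance_inline()` are hypothetical names, and only the remaining length and the current index are tracked.

```c
#include <stddef.h>

#define MODEL_IOVS 4

/* Simplified stand-in for a kernel iov cursor. */
struct kiov_model {
	size_t len[MODEL_IOVS];   /* remaining bytes in each element */
	size_t i;                 /* current element */
	size_t used;              /* number of valid elements */
};

/* Models a helper that returns immediately when asked to advance by 0. */
void advance_helper(struct kiov_model *iov, size_t n)
{
	while (n && iov->i < iov->used) {
		size_t part = iov->len[iov->i] < n ? iov->len[iov->i] : n;

		iov->len[iov->i] -= part;
		if (!iov->len[iov->i])
			iov->i++;
		n -= part;
	}
}

/* Open-coded step: consume 'part' bytes, then move on if the element is empty. */
void advance_inline(struct kiov_model *iov, size_t part)
{
	iov->len[iov->i] -= part;
	if (!iov->len[iov->i])
		iov->i++;
}
```

When the current element has length 0, the helper leaves the index where it is, so a caller looping on that element makes no progress, while the open-coded step moves on to the next element.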
      
      Fixes: b8c06ad4 ("vringh: implement vringh_kiov_advance()")
      Cc: stable@vger.kernel.org
      Reported-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 03 Oct, 2023 7 commits
    • rswitch: Fix PHY station management clock setting · a0c55bba
      Yoshihiro Shimoda authored
      Fix the MPIC.PSMCS value following the programming example in the
      section 6.4.2 Management Data Clock (MDC) Setting, Ethernet MAC IP,
      S4 Hardware User Manual Rev.1.00.
      
      The value is calculated by
          MPIC.PSMCS = clk[MHz] / (MDC frequency[MHz] * 2) - 1
      with the input clock frequency from clk_get_rate() and MDC frequency
      of 2.5MHz. Otherwise, this driver cannot communicate with PHYs on the
      R-Car S4 Starter Kit board.
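
The formula can be evaluated with integer arithmetic in Hz, as in the sketch below. The function name `psmcs_from_clk()` and the 320 MHz input used in the note are assumptions for illustration; the actual clock rate comes from clk_get_rate() on the board.

```c
/* MDC target from the text: 2.5 MHz. */
#define MDC_HZ 2500000UL

/* PSMCS = clk / (MDC * 2) - 1, computed in Hz with integer division.
 * Assumes clk_hz >= 2 * MDC_HZ so the subtraction cannot wrap. */
unsigned long psmcs_from_clk(unsigned long clk_hz)
{
	return clk_hz / (MDC_HZ * 2) - 1;
}
```

For a hypothetical 320 MHz input clock this gives 320 / (2.5 * 2) - 1 = 63.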
      
      Fixes: 3590918b ("net: ethernet: renesas: Add support for "Ethernet Switch"")
      Reported-by: Tam Nguyen <tam.nguyen.xa@renesas.com>
      Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
      Tested-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20230926123054.3976752-1-yoshihiro.shimoda.uh@renesas.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: nfc: llcp: Add lock when modifying device list · dfc7f7a9
      Jeremy Cline authored
      The device list needs its associated lock held when modifying it, or the
      list could become corrupted, as syzbot discovered.
      
      Reported-and-tested-by: syzbot+c1d0a03d305972dbbe14@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=c1d0a03d305972dbbe14
      Signed-off-by: Jeremy Cline <jeremy@jcline.org>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Fixes: 6709d4b7 ("net: nfc: Fix use-after-free caused by nfc_llcp_find_local")
      Link: https://lore.kernel.org/r/20230908235853.1319596-1-jeremy@jcline.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ethtool: plca: fix plca enable data type while parsing the value · 8957261c
      Parthiban Veerasooran authored
      The ETHTOOL_A_PLCA_ENABLED data type is u8. But while parsing the
      value from the attribute, nla_get_u32() is used in the plca_update_sint()
      function instead of nla_get_u8(). So the plca_cfg.enabled variable is
      updated with some garbage value instead of 0 or 1, and plca is always
      enabled even though it is disabled through the ethtool application. This
      bug is fixed by parsing the values based on the attribute's type in the
      policy.
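
The effect of reading a one-byte payload with a four-byte getter can be sketched in userspace. This is a model only: `read_as_u8()` and `read_as_u32()` are hypothetical helpers, and the nonzero bytes after the value are an assumption standing in for whatever follows the u8 in the message.

```c
#include <stdint.h>
#include <string.h>

/* Read a 1-byte attribute payload (what a u8 getter does). */
uint8_t read_as_u8(const unsigned char *payload)
{
	return payload[0];
}

/* Read 4 bytes from the same payload (what a u32 getter does). */
uint32_t read_as_u32(const unsigned char *payload)
{
	uint32_t v;

	memcpy(&v, payload, sizeof(v));   /* pulls in the 3 bytes after the value */
	return v;
}
```

With a payload whose first byte is 0 ("disabled") followed by nonzero bytes, the u8 read yields 0 while the u32 read yields a nonzero value, which a caller then treats as "enabled".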
      
      Fixes: 8580e16c ("net/ethtool: add netlink interface for the PLCA RS")
      Signed-off-by: Parthiban Veerasooran <Parthiban.Veerasooran@microchip.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20230908044548.5878-1-Parthiban.Veerasooran@microchip.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8957261c
    • Gustavo A. R. Silva's avatar
      qed/red_ll2: Fix undefined behavior bug in struct qed_ll2_info · eea03d18
      Gustavo A. R. Silva authored
      The flexible structure (a structure that contains a flexible-array member
      at the end) `qed_ll2_tx_packet` is nested within the second layer of
      `struct qed_ll2_info`:
      
      struct qed_ll2_tx_packet {
              ...
              /* Flexible Array of bds_set determined by max_bds_per_packet */
              struct {
                      struct core_tx_bd *txq_bd;
                      dma_addr_t tx_frag;
                      u16 frag_len;
              } bds_set[];
      };
      
      struct qed_ll2_tx_queue {
              ...
              struct qed_ll2_tx_packet cur_completing_packet;
      };
      
      struct qed_ll2_info {
              ...
              struct qed_ll2_tx_queue tx_queue;
              struct qed_ll2_cbs cbs;
      };
      
      The problem is that member `cbs` in `struct qed_ll2_info` is placed just
      after an object of type `struct qed_ll2_tx_queue`, which is in itself
      an implicit flexible structure, which by definition ends in a flexible
      array member, in this case `bds_set`. This causes an undefined behavior
      bug at run-time when dynamic memory is allocated for `bds_set`, which
      could lead to a serious issue if `cbs` in `struct qed_ll2_info` is
      overwritten by the contents of `bds_set`. Notice that the type of `cbs`
      is a structure full of function pointers (and a cookie :) ):
      
      include/linux/qed/qed_ll2_if.h:
      typedef
      void (*qed_ll2_complete_rx_packet_cb)(void *cxt,
                                            struct qed_ll2_comp_rx_data *data);
      
      typedef
      void (*qed_ll2_release_rx_packet_cb)(void *cxt,
                                           u8 connection_handle,
                                           void *cookie,
                                           dma_addr_t rx_buf_addr,
                                           bool b_last_packet);
      
      typedef
      void (*qed_ll2_complete_tx_packet_cb)(void *cxt,
                                            u8 connection_handle,
                                            void *cookie,
                                            dma_addr_t first_frag_addr,
                                            bool b_last_fragment,
                                            bool b_last_packet);
      
      typedef
      void (*qed_ll2_release_tx_packet_cb)(void *cxt,
                                           u8 connection_handle,
                                           void *cookie,
                                           dma_addr_t first_frag_addr,
                                           bool b_last_fragment, bool b_last_packet);
      
      typedef
      void (*qed_ll2_slowpath_cb)(void *cxt, u8 connection_handle,
                                  u32 opaque_data_0, u32 opaque_data_1);
      
      struct qed_ll2_cbs {
              qed_ll2_complete_rx_packet_cb rx_comp_cb;
              qed_ll2_release_rx_packet_cb rx_release_cb;
              qed_ll2_complete_tx_packet_cb tx_comp_cb;
              qed_ll2_release_tx_packet_cb tx_release_cb;
              qed_ll2_slowpath_cb slowpath_cb;
              void *cookie;
      };
      
      Fix this by moving the declaration of `cbs` to the middle of its
      containing structure `qed_ll2_info`, preventing it from being
      overwritten by the contents of `bds_set` at run-time.
      
      This bug was introduced in 2017, when `bds_set` was converted to a
      one-element array, and started to be used as a Variable Length Object
      (VLO) at run-time.
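
The overlap can be checked with plain offset arithmetic. The sketch below is a simplified, hypothetical mirror of the structures above, not the driver's real definitions: field names such as `hdr`, `state` and `id` are invented, `bds_set` is declared with one element as in the 2017 layout, and the "fixed" layout simply assumes that `cbs` ends up ahead of the queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical mirror of the structures in the commit message. */
struct bd { void *txq_bd; uint64_t tx_frag; uint16_t frag_len; };

struct tx_packet { int hdr; struct bd bds_set[1]; };
struct tx_queue  { int state; struct tx_packet cur_completing_packet; };
struct cbs       { void (*tx_comp_cb)(void); void *cookie; };

/* Buggy layout: cbs sits directly behind the trailing array. */
struct info_buggy { int id; struct tx_queue tx_queue; struct cbs cbs; };

/* Assumed fixed layout: the queue is last, so nothing follows bds_set. */
struct info_fixed { int id; struct cbs cbs; struct tx_queue tx_queue; };

/* In the fixed layout the callbacks precede the growing array. */
static_assert(offsetof(struct info_fixed, cbs) <
	      offsetof(struct info_fixed, tx_queue),
	      "cbs must precede tx_queue");

/* Offset one past bds_set[n_bds - 1] inside struct info_buggy. */
size_t end_of_bds_buggy(size_t n_bds)
{
	return offsetof(struct info_buggy, tx_queue)
	     + offsetof(struct tx_queue, cur_completing_packet)
	     + offsetof(struct tx_packet, bds_set)
	     + n_bds * sizeof(struct bd);
}

/* Returns 1 if writing n_bds descriptors would run into cbs. */
int bds_overlaps_cbs_buggy(size_t n_bds)
{
	return end_of_bds_buggy(n_bds) > offsetof(struct info_buggy, cbs);
}
```

With a single descriptor the array stays inside its declared bounds; as soon as a second one is written at run time, the buggy layout places it on top of the function pointers.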
      
      Fixes: f5823fe6 ("qed: Add ll2 option to limit the number of bds per packet")
      Cc: stable@vger.kernel.org
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/ZQ+Nz8DfPg56pIzr@work
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      eea03d18
    • Shigeru Yoshida's avatar
      net: usb: smsc75xx: Fix uninit-value access in __smsc75xx_read_reg · e9c65989
      Shigeru Yoshida authored
      syzbot reported the following uninit-value access issue:
      
      =====================================================
      BUG: KMSAN: uninit-value in smsc75xx_wait_ready drivers/net/usb/smsc75xx.c:975 [inline]
      BUG: KMSAN: uninit-value in smsc75xx_bind+0x5c9/0x11e0 drivers/net/usb/smsc75xx.c:1482
      CPU: 0 PID: 8696 Comm: kworker/0:3 Not tainted 5.8.0-rc5-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: usb_hub_wq hub_event
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x21c/0x280 lib/dump_stack.c:118
       kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:121
       __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
       smsc75xx_wait_ready drivers/net/usb/smsc75xx.c:975 [inline]
       smsc75xx_bind+0x5c9/0x11e0 drivers/net/usb/smsc75xx.c:1482
       usbnet_probe+0x1152/0x3f90 drivers/net/usb/usbnet.c:1737
       usb_probe_interface+0xece/0x1550 drivers/usb/core/driver.c:374
       really_probe+0xf20/0x20b0 drivers/base/dd.c:529
       driver_probe_device+0x293/0x390 drivers/base/dd.c:701
       __device_attach_driver+0x63f/0x830 drivers/base/dd.c:807
       bus_for_each_drv+0x2ca/0x3f0 drivers/base/bus.c:431
       __device_attach+0x4e2/0x7f0 drivers/base/dd.c:873
       device_initial_probe+0x4a/0x60 drivers/base/dd.c:920
       bus_probe_device+0x177/0x3d0 drivers/base/bus.c:491
       device_add+0x3b0e/0x40d0 drivers/base/core.c:2680
       usb_set_configuration+0x380f/0x3f10 drivers/usb/core/message.c:2032
       usb_generic_driver_probe+0x138/0x300 drivers/usb/core/generic.c:241
       usb_probe_device+0x311/0x490 drivers/usb/core/driver.c:272
       really_probe+0xf20/0x20b0 drivers/base/dd.c:529
       driver_probe_device+0x293/0x390 drivers/base/dd.c:701
       __device_attach_driver+0x63f/0x830 drivers/base/dd.c:807
       bus_for_each_drv+0x2ca/0x3f0 drivers/base/bus.c:431
       __device_attach+0x4e2/0x7f0 drivers/base/dd.c:873
       device_initial_probe+0x4a/0x60 drivers/base/dd.c:920
       bus_probe_device+0x177/0x3d0 drivers/base/bus.c:491
       device_add+0x3b0e/0x40d0 drivers/base/core.c:2680
       usb_new_device+0x1bd4/0x2a30 drivers/usb/core/hub.c:2554
       hub_port_connect drivers/usb/core/hub.c:5208 [inline]
       hub_port_connect_change drivers/usb/core/hub.c:5348 [inline]
       port_event drivers/usb/core/hub.c:5494 [inline]
       hub_event+0x5e7b/0x8a70 drivers/usb/core/hub.c:5576
       process_one_work+0x1688/0x2140 kernel/workqueue.c:2269
       worker_thread+0x10bc/0x2730 kernel/workqueue.c:2415
       kthread+0x551/0x590 kernel/kthread.c:292
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:293
      
      Local variable ----buf.i87@smsc75xx_bind created at:
       __smsc75xx_read_reg drivers/net/usb/smsc75xx.c:83 [inline]
       smsc75xx_wait_ready drivers/net/usb/smsc75xx.c:968 [inline]
       smsc75xx_bind+0x485/0x11e0 drivers/net/usb/smsc75xx.c:1482
       __smsc75xx_read_reg drivers/net/usb/smsc75xx.c:83 [inline]
       smsc75xx_wait_ready drivers/net/usb/smsc75xx.c:968 [inline]
       smsc75xx_bind+0x485/0x11e0 drivers/net/usb/smsc75xx.c:1482
      
      This issue occurs because usbnet_read_cmd() reads fewer bytes than requested
      (zero bytes in the reproducer). In this case, 'buf' is not properly filled.
      
      This patch fixes the issue by returning -ENODATA if usbnet_read_cmd() reads
      fewer bytes than requested.
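
The short-read check can be sketched in user space as follows. `fake_read_cmd()` is a hypothetical stand-in for usbnet_read_cmd() and `read_reg()` is an illustrative wrapper, not the driver's actual function; only the -ENODATA handling is taken from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a USB control read: copies up to `avail` bytes
 * into buf and returns the byte count, or a negative errno. */
int fake_read_cmd(const void *dev_data, int avail, void *buf, int size)
{
	int n = avail < size ? avail : size;

	if (n < 0)
		return -EIO;
	memcpy(buf, dev_data, (size_t)n);
	return n;
}

/* Read a 32-bit register; treat a short read as an error instead of
 * handing back a partly uninitialized value. */
int read_reg(const void *dev_data, int avail, uint32_t *out)
{
	uint32_t buf;
	int ret;

	assert(out != NULL);
	ret = fake_read_cmd(dev_data, avail, &buf, (int)sizeof(buf));
	if (ret < 0)
		return ret;
	if (ret < (int)sizeof(buf))
		return -ENODATA;	/* fewer bytes than requested */

	*out = buf;
	return 0;
}
```

The caller then only ever sees a fully populated value or a negative error code, which is what removes the uninitialized read.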
      
      Fixes: d0cad871 ("smsc75xx: SMSC LAN75xx USB gigabit ethernet adapter driver")
      Reported-and-tested-by: syzbot+6966546b78d050bb0b5d@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=6966546b78d050bb0b5d
      Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20230923173549.3284502-1-syoshida@redhat.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      e9c65989
    • Ilya Maximets's avatar
      ipv6: tcp: add a missing nf_reset_ct() in 3WHS handling · 9593c7cb
      Ilya Maximets authored
      Commit b0e214d2 ("netfilter: keep conntrack reference until
      IPsecv6 policy checks are done") is a direct copy of the old
      commit b59c2701 ("[NETFILTER]: Keep conntrack reference until
      IPsec policy checks are done") but for IPv6.  However, it also
      copies a bug that this old commit had.  That is: when the third
      packet of 3WHS connection establishment contains payload, it is
      added into socket receive queue without the XFRM check and the
      drop of connection tracking context.
      
      That leads to nf_conntrack module being impossible to unload as
      it waits for all the conntrack references to be dropped while
      the packet release is deferred in per-cpu cache indefinitely, if
      not consumed by the application.
      
      The issue for IPv4 was fixed in commit 6f0012e3 ("tcp: add a
      missing nf_reset_ct() in 3WHS handling") by adding a missing XFRM
      check and correctly dropping the conntrack context.  However, the
      issue was introduced to IPv6 code afterwards.  Fixing it the
      same way for IPv6 now.
      
      Fixes: b0e214d2 ("netfilter: keep conntrack reference until IPsecv6 policy checks are done")
      Link: https://lore.kernel.org/netdev/d589a999-d4dd-2768-b2d5-89dec64a4a42@ovn.org/
      Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
      Acked-by: Florian Westphal <fw@strlen.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20230922210530.2045146-1-i.maximets@ovn.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      9593c7cb
    • Hangbin Liu's avatar
      ipv4/fib: send notify when delete source address routes · 4b2b6060
      Hangbin Liu authored
      After deleting an interface address in fib_del_ifaddr(), the function
      scans the fib_info list for stray entries and calls fib_flush() and
      fib_table_flush(). Then the stray entries will be deleted silently and no
      RTM_DELROUTE notification will be sent.
      
      This lack of notification can cause routing daemons, or monitors such as
      `ip monitor route`, to miss the routing changes. For example:
      
      + ip link add dummy1 type dummy
      + ip link add dummy2 type dummy
      + ip link set dummy1 up
      + ip link set dummy2 up
      + ip addr add 192.168.5.5/24 dev dummy1
      + ip route add 7.7.7.0/24 dev dummy2 src 192.168.5.5
      + ip -4 route
      7.7.7.0/24 dev dummy2 scope link src 192.168.5.5
      192.168.5.0/24 dev dummy1 proto kernel scope link src 192.168.5.5
      + ip monitor route
      + ip addr del 192.168.5.5/24 dev dummy1
      Deleted 192.168.5.0/24 dev dummy1 proto kernel scope link src 192.168.5.5
      Deleted broadcast 192.168.5.255 dev dummy1 table local proto kernel scope link src 192.168.5.5
      Deleted local 192.168.5.5 dev dummy1 table local proto kernel scope host src 192.168.5.5
      
      As Ido pointed out, fib_table_flush() isn't only called when an address is
      deleted, but also when an interface is deleted or put down. The lack of
      notification in these cases is deliberate. And commit 7c6bb7d2
      ("net/ipv6: Add knob to skip DELROUTE message on device down") introduced
      a sysctl to make IPv6 behave like IPv4 in this regard. So we can't send
      the route delete notification blindly in fib_table_flush().
      
      To fix this issue, let's add a new flag in "struct fib_info" to track
      routes whose preferred source address was deleted, and only send
      notifications for them.
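
The flag-based approach can be sketched as below. Everything here is hypothetical and heavily simplified: `struct route`, its fields and both helpers are invented for illustration and do not reflect the layout of "struct fib_info" or the kernel's flush code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified route entry. */
struct route {
	uint32_t prefsrc;	/* preferred source address, 0 if none */
	bool dead;		/* marked for removal by the next flush */
	bool prefsrc_removed;	/* set only when prefsrc itself was deleted */
};

/* Mark routes that use the deleted address as their preferred source. */
size_t mark_prefsrc_routes(struct route *r, size_t n, uint32_t addr)
{
	size_t marked = 0;

	assert(r != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (r[i].prefsrc == addr) {
			r[i].dead = true;
			r[i].prefsrc_removed = true;
			marked++;
		}
	}
	return marked;
}

/* Count the notifications a flush would send: only dead routes carrying
 * the flag qualify. Routes that died for other reasons stay silent. */
size_t flush_and_count_notifications(const struct route *r, size_t n)
{
	size_t notified = 0;

	for (size_t i = 0; i < n; i++)
		if (r[i].dead && r[i].prefsrc_removed)
			notified++;	/* a real flush would notify here */
	return notified;
}
```

Keeping the decision in a per-route flag leaves the deliberate silence on interface removal or link down untouched, since those paths never set it.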
      
      After update:
      + ip monitor route
      + ip addr del 192.168.5.5/24 dev dummy1
      Deleted 192.168.5.0/24 dev dummy1 proto kernel scope link src 192.168.5.5
      Deleted broadcast 192.168.5.255 dev dummy1 table local proto kernel scope link src 192.168.5.5
      Deleted local 192.168.5.5 dev dummy1 table local proto kernel scope host src 192.168.5.5
      Deleted 7.7.7.0/24 dev dummy2 scope link src 192.168.5.5
      
      Suggested-by: Thomas Haller <thaller@redhat.com>
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20230922075508.848925-1-liuhangbin@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      4b2b6060