1. 11 Apr, 2024 20 commits
    • af_unix: Fix garbage collector racing against connect() · 47d8ac01
      Michal Luczaj authored
      The garbage collector does not take into account the risk of an embryo
      getting enqueued during garbage collection. If such an embryo has a peer
      that carries SCM_RIGHTS, two consecutive passes of scan_children() may see
      a different set of children, leading to an incorrectly elevated inflight
      count and then a dangling pointer within the gc_inflight_list.
      
      sockets are AF_UNIX/SOCK_STREAM
      S is an unconnected socket
      L is a listening in-flight socket bound to addr, not in fdtable
      V's fd will be passed via sendmsg(), gets inflight count bumped
      
      connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
      ----------------	-------------------------	-----------
      
      NS = unix_create1()
      skb1 = sock_wmalloc(NS)
      L = unix_find_other(addr)
      unix_state_lock(L)
      unix_peer(S) = NS
      			// V count=1 inflight=0
      
       			NS = unix_peer(S)
       			skb2 = sock_alloc()
      			skb_queue_tail(NS, skb2[V])
      
      			// V became in-flight
      			// V count=2 inflight=1
      
      			close(V)
      
      			// V count=1 inflight=1
      			// GC candidate condition met
      
      						for u in gc_inflight_list:
      						  if (total_refs == inflight_refs)
      						    add u to gc_candidates
      
      						// gc_candidates={L, V}
      
      						for u in gc_candidates:
      						  scan_children(u, dec_inflight)
      
      						// embryo (skb1) was not
      						// reachable from L yet, so V's
      						// inflight remains unchanged
      __skb_queue_tail(L, skb1)
      unix_state_unlock(L)
      						for u in gc_candidates:
      						  if (u.inflight)
      						    scan_children(u, inc_inflight_move_tail)
      
      						// V count=1 inflight=2 (!)
      
      If there is a GC-candidate listening socket, lock/unlock its state. This
      makes GC wait until the end of any ongoing connect() to that socket. After
      flipping the lock, a possibly SCM-laden embryo is already enqueued. And if
      there is another embryo coming, it can not possibly carry SCM_RIGHTS. At
      this point, unix_inflight() can not happen because unix_gc_lock is already
      taken. Inflight graph remains unaffected.
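
The counting problem in the diagram above can be reproduced with a small userspace toy model. This is only a sketch under simplifying assumptions: `toy_sock` and `toy_gc_passes` are hypothetical names, there is no real locking, and the two booleans merely stand for whether the embryo was already on L's receive queue when each scan_children() pass ran.

```c
#include <stdbool.h>

/* Toy model, not kernel code: V's inflight counter as seen by the GC. */
struct toy_sock {
    int inflight;
};

/*
 * Run the two GC passes over L's children and return V's final inflight.
 * Each flag says whether the embryo (whose peer queue carries V) was
 * already enqueued on L when that pass scanned L.
 */
static int toy_gc_passes(bool embryo_seen_by_dec, bool embryo_seen_by_inc)
{
    struct toy_sock v = { .inflight = 1 };  /* state after sendmsg() and close(V) */

    if (embryo_seen_by_dec)
        v.inflight--;   /* models scan_children(L, dec_inflight) reaching V */

    if (embryo_seen_by_inc)
        v.inflight++;   /* models scan_children(L, inc_inflight_move_tail) reaching V */

    return v.inflight;
}
```

In this model, `toy_gc_passes(false, true)` is the racy interleaving and yields 2, while any run in which both passes see the same set of children yields 1. Making both passes agree is what flipping the listener's state lock before scanning is meant to achieve.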
      
      Fixes: 1fd05ba5 ("[AF_UNIX]: Rewrite garbage collector, fixes race.")
      Signed-off-by: Michal Luczaj <mhal@rbox.co>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240409201047.1032217-1-mhal@rbox.co
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: dsa: mt7530: trap link-local frames regardless of ST Port State · 17c56011
      Arınç ÜNAL authored
      In Clause 5 of IEEE Std 802-2014, two sublayers of the data link layer
      (DLL) of the Open Systems Interconnection basic reference model (OSI/RM)
      are described; the medium access control (MAC) and logical link control
      (LLC) sublayers. The MAC sublayer is the one facing the physical layer.
      
      In 8.2 of IEEE Std 802.1Q-2022, the Bridge architecture is described. A
      Bridge component comprises a MAC Relay Entity for interconnecting the Ports
      of the Bridge, at least two Ports, and higher layer entities with at least
      a Spanning Tree Protocol Entity included.
      
      Each Bridge Port also functions as an end station and shall provide the MAC
      Service to an LLC Entity. Each instance of the MAC Service is provided to a
      distinct LLC Entity that supports protocol identification, multiplexing,
      and demultiplexing, for protocol data unit (PDU) transmission and reception
      by one or more higher layer entities.
      
      It is described in 8.13.9 of IEEE Std 802.1Q-2022 that in a Bridge, the LLC
      Entity associated with each Bridge Port is modeled as being directly
      connected to the attached Local Area Network (LAN).
      
      On the switch with CPU port architecture, CPU port functions as Management
      Port, and the Management Port functionality is provided by software which
      functions as an end station. Software is connected to an IEEE 802 LAN that
      is wholly contained within the system that incorporates the Bridge.
      Software provides access to the LLC Entity associated with each Bridge Port
      by the value of the source port field on the special tag on the frame
      received by software.
      
      We call frames that carry control information to determine the active
      topology and current extent of each Virtual Local Area Network (VLAN),
      i.e., spanning tree or Shortest Path Bridging (SPB) and Multiple VLAN
      Registration Protocol Data Units (MVRPDUs), and frames from other link
      constrained protocols, such as Extensible Authentication Protocol over LAN
      (EAPOL) and Link Layer Discovery Protocol (LLDP), link-local frames. They
      are not forwarded by a Bridge. Permanently configured entries in the
      filtering database (FDB) ensure that such frames are discarded by the
      Forwarding Process. In 8.6.3 of IEEE Std 802.1Q-2022, this is described in
      detail:
      
      Each of the reserved MAC addresses specified in Table 8-1
      (01-80-C2-00-00-[00,01,02,03,04,05,06,07,08,09,0A,0B,0C,0D,0E,0F]) shall be
      permanently configured in the FDB in C-VLAN components and ERs.
      
      Each of the reserved MAC addresses specified in Table 8-2
      (01-80-C2-00-00-[01,02,03,04,05,06,07,08,09,0A,0E]) shall be permanently
      configured in the FDB in S-VLAN components.
      
      Each of the reserved MAC addresses specified in Table 8-3
      (01-80-C2-00-00-[01,02,04,0E]) shall be permanently configured in the FDB
      in TPMR components.
      
      The FDB entries for reserved MAC addresses shall specify filtering for all
      Bridge Ports and all VIDs. Management shall not provide the capability to
      modify or remove entries for reserved MAC addresses.
      
      The addresses in Table 8-1, Table 8-2, and Table 8-3 determine the scope of
      propagation of PDUs within a Bridged Network, as follows:
      
        The Nearest Bridge group address (01-80-C2-00-00-0E) is an address that
        no conformant Two-Port MAC Relay (TPMR) component, Service VLAN (S-VLAN)
        component, Customer VLAN (C-VLAN) component, or MAC Bridge can forward.
        PDUs transmitted using this destination address, or any other addresses
        that appear in Table 8-1, Table 8-2, and Table 8-3
        (01-80-C2-00-00-[00,01,02,03,04,05,06,07,08,09,0A,0B,0C,0D,0E,0F]), can
        therefore travel no further than those stations that can be reached via a
        single individual LAN from the originating station.
      
        The Nearest non-TPMR Bridge group address (01-80-C2-00-00-03), is an
        address that no conformant S-VLAN component, C-VLAN component, or MAC
        Bridge can forward; however, this address is relayed by a TPMR component.
        PDUs using this destination address, or any of the other addresses that
        appear in both Table 8-1 and Table 8-2 but not in Table 8-3
        (01-80-C2-00-00-[00,03,05,06,07,08,09,0A,0B,0C,0D,0F]), will be relayed
        by any TPMRs but will propagate no further than the nearest S-VLAN
        component, C-VLAN component, or MAC Bridge.
      
        The Nearest Customer Bridge group address (01-80-C2-00-00-00) is an
        address that no conformant C-VLAN component, MAC Bridge can forward;
        however, it is relayed by TPMR components and S-VLAN components. PDUs
        using this destination address, or any of the other addresses that appear
        in Table 8-1 but not in either Table 8-2 or Table 8-3
        (01-80-C2-00-00-[00,0B,0C,0D,0F]), will be relayed by TPMR components and
        S-VLAN components but will propagate no further than the nearest C-VLAN
        component or MAC Bridge.
      
      Because the LLC Entity associated with each Bridge Port is provided via CPU
      port, we must not filter these frames but forward them to CPU port.
      
      In a Bridge, the transmission Port is mainly decided by ingress and egress
      rules, FDB, and spanning tree Port State functions of the Forwarding
      Process. For link-local frames, only CPU port should be designated as
      destination port in the FDB, and the other functions of the Forwarding
      Process must not interfere with the decision of the transmission Port. We
      call this process trapping frames to CPU port.
      
      Therefore, on the switch with CPU port architecture, link-local frames must
      be trapped to CPU port, and certain link-local frames received by a Port of
      a Bridge comprising a TPMR component or an S-VLAN component must be
      excluded from it.
      
      A Bridge of the switch with CPU port architecture cannot comprise a
      Two-Port MAC Relay (TPMR) component as a TPMR component supports only a
      subset of the functionality of a MAC Bridge. A Bridge comprising two Ports
      (Management Port doesn't count) of this architecture will either function
      as a standard MAC Bridge or a standard VLAN Bridge.
      
      Therefore, a Bridge of this architecture can only comprise S-VLAN
      components, C-VLAN components, or MAC Bridge components. Since there's no
      TPMR component, we don't need to relay PDUs using the destination addresses
      specified in the Nearest non-TPMR section, nor those in the portion of the
      Nearest Customer Bridge section where they must be relayed by TPMR
      components.
      
      One option to trap link-local frames to CPU port is to add static FDB
      entries with CPU port designated as destination port. However, because
      Independent VLAN Learning (IVL) is used on every VID, each entry only
      applies to a single VLAN Identifier (VID). For a Bridge comprising a MAC
      Bridge component or a C-VLAN component, there would have to be 16 times
      4096 entries. This switch intellectual property can only hold a maximum of
      2048 entries. Using this option, there also isn't a mechanism to prevent
      link-local frames from being discarded when the spanning tree Port State of
      the reception Port is discarding.
      
      The remaining option is to utilise the BPC, RGAC1, RGAC2, RGAC3, and RGAC4
      registers. Whilst this applies to every VID, it doesn't contain all of the
      reserved MAC addresses without affecting the remaining Standard Group MAC
      Addresses. The REV_UN frame tag utilised using the RGAC4 register covers
      the remaining 01-80-C2-00-00-[04,05,06,07,08,09,0A,0B,0C,0D,0F] destination
      addresses. It also includes the 01-80-C2-00-00-22 to 01-80-C2-00-00-FF
      destination addresses which may be relayed by MAC Bridges or VLAN Bridges.
      The latter option provides better but not complete conformance.
      
      This switch intellectual property also does not provide a mechanism to trap
      link-local frames with specific destination addresses to CPU port by
      Bridge, to conform to the filtering rules for the distinct Bridge
      components.
      
      Therefore, regardless of the type of the Bridge component, link-local
      frames with these destination addresses will be trapped to CPU port:
      
      01-80-C2-00-00-[00,01,02,03,0E]
      
      In a Bridge comprising a MAC Bridge component or a C-VLAN component:
      
        Link-local frames with these destination addresses won't be trapped to
        CPU port which won't conform to IEEE Std 802.1Q-2022:
      
        01-80-C2-00-00-[04,05,06,07,08,09,0A,0B,0C,0D,0F]
      
      In a Bridge comprising an S-VLAN component:
      
        Link-local frames with these destination addresses will be trapped to CPU
        port which won't conform to IEEE Std 802.1Q-2022:
      
        01-80-C2-00-00-00
      
        Link-local frames with these destination addresses won't be trapped to
        CPU port which won't conform to IEEE Std 802.1Q-2022:
      
        01-80-C2-00-00-[04,05,06,07,08,09,0A]
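
As an illustration of the address sets listed above, the following sketch classifies a destination MAC address. It is a userspace example, not driver code: `is_reserved_link_local` and `is_trapped_to_cpu` are hypothetical helper names, and the trapped set is taken directly from the list 01-80-C2-00-00-[00,01,02,03,0E] given in this commit message.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* 01-80-C2-00-00-0X: the 16 reserved addresses of Table 8-1. */
static bool is_reserved_link_local(const uint8_t addr[6])
{
    static const uint8_t prefix[5] = { 0x01, 0x80, 0xC2, 0x00, 0x00 };

    return memcmp(addr, prefix, sizeof(prefix)) == 0 &&
           (addr[5] & 0xF0) == 0;
}

/*
 * Hypothetical helper: true for the subset described above as trapped
 * to the CPU port regardless of the Bridge component type.
 */
static bool is_trapped_to_cpu(const uint8_t addr[6])
{
    if (!is_reserved_link_local(addr))
        return false;

    switch (addr[5]) {
    case 0x00:
    case 0x01:
    case 0x02:
    case 0x03:
    case 0x0E:
        return true;
    default:
        return false;
    }
}
```

An address such as 01-80-C2-00-00-04 is reserved but not in the trapped set, which corresponds to the non-conformant cases listed above.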
      
      Currently on this switch intellectual property, if the spanning tree Port
      State of the reception Port is discarding, link-local frames will be
      discarded.
      
      To trap link-local frames regardless of the spanning tree Port State, make
      the switch regard them as Bridge Protocol Data Units (BPDUs). This switch
      intellectual property only lets the frames regarded as BPDUs bypass the
      spanning tree Port State function of the Forwarding Process.
      
      With this change, the only remaining interference is the ingress rules.
      When the reception Port has no PVID assigned on software, VLAN-untagged
      frames won't be allowed in. There doesn't seem to be a mechanism on the
      switch intellectual property to have link-local frames bypass this function
      of the Forwarding Process.
      
      Fixes: b8f126a8 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
      Reviewed-by: Daniel Golle <daniel@makrotopia.org>
      Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
      Link: https://lore.kernel.org/r/20240409-b4-for-net-mt7530-fix-link-local-when-stp-discarding-v2-1-07b1150164ac@arinc9.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Revert "s390/ism: fix receive message buffer allocation" · d51dc8dd
      Gerd Bayer authored
      This reverts commit 58effa34.
      Review was not finished on this patch. So it's not ready for
      upstreaming.
      Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
      Link: https://lore.kernel.org/r/20240409113753.2181368-1-gbayer@linux.ibm.com
      Fixes: 58effa34 ("s390/ism: fix receive message buffer allocation")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: sparx5: fix wrong config being used when reconfiguring PCS · 33623113
      Daniel Machon authored
      The wrong port config is being used if the PCS is reconfigured. Fix this
      by correctly using the new config instead of the old one.
      
      Fixes: 946e7fd5 ("net: sparx5: add port module support")
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20240409-link-mode-reconfiguration-fix-v2-1-db6a507f3627@microchip.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net/mlx5: fix possible stack overflows · fe87922c
      Arnd Bergmann authored
      A couple of debug functions use a 512 byte temporary buffer and call another
      function that has another buffer of the same size, which in turn exceeds the
      usual warning limit for excessive stack usage:
      
      drivers/net/ethernet/mellanox/mlx5/core/steering/dr_dbg.c:1073:1: error: stack frame size (1448) exceeds limit (1024) in 'dr_dump_start' [-Werror,-Wframe-larger-than]
      dr_dump_start(struct seq_file *file, loff_t *pos)
      drivers/net/ethernet/mellanox/mlx5/core/steering/dr_dbg.c:1009:1: error: stack frame size (1120) exceeds limit (1024) in 'dr_dump_domain' [-Werror,-Wframe-larger-than]
      dr_dump_domain(struct seq_file *file, struct mlx5dr_domain *dmn)
      drivers/net/ethernet/mellanox/mlx5/core/steering/dr_dbg.c:705:1: error: stack frame size (1104) exceeds limit (1024) in 'dr_dump_matcher_rx_tx' [-Werror,-Wframe-larger-than]
      dr_dump_matcher_rx_tx(struct seq_file *file, bool is_rx,
      
      Rework these so that each of the various code paths only ever has one of
      these buffers in it, and exactly the functions that declare one have
      the 'noinline_for_stack' annotation that prevents them from all being
      inlined into the same caller.
      
      Fixes: 917d1e79 ("net/mlx5: DR, Change SWS usage to debug fs seq_file interface")
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Link: https://lore.kernel.org/all/20240219100506.648089-1-arnd@kernel.org/
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240408074142.3007036-1-arnd@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'mlx5-misc-fixes' · 186abfcd
      Jakub Kicinski authored
      Tariq Toukan says:
      
      ====================
      mlx5 misc fixes
      
      This patchset provides bug fixes to mlx5 driver.
      
      This is V2 of the series previously submitted as PR by Saeed:
      https://lore.kernel.org/netdev/20240326144646.2078893-1-saeed@kernel.org/T/
      
      Series generated against:
      commit 237f3cf1 ("xsk: validate user input for XDP_{UMEM|COMPLETION}_FILL_RING")
      ====================
      
      Link: https://lore.kernel.org/r/20240409190820.227554-1-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5: Disallow SRIOV switchdev mode when in multi-PF netdev · 7772dc74
      Tariq Toukan authored
      Adaptations need to be made to the auxiliary device management at the
      core driver level. Block this combination for now.
      
      Fixes: 678eb448 ("net/mlx5: SD, Implement basic query and instantiation")
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
      Reviewed-by: Gal Pressman <gal@nvidia.com>
      Link: https://lore.kernel.org/r/20240409190820.227554-12-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: RSS, Block XOR hash with over 128 channels · 49e6c938
      Carolina Jubran authored
      When supporting more than 128 channels, the RQT size is
      calculated by multiplying the number of channels by 2
      and rounding up to the nearest power of 2.
      
      The index of the RQT is derived from the RSS hash
      calculations. If XOR8 is used as the RSS hash function,
      there are only 256 possible hash results, and therefore,
      only 256 indexes can be reached in the RQT.
      
      Block setting the RSS hash function to XOR when the number
      of channels exceeds 128.
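
The arithmetic behind this restriction can be sketched as follows. This is an illustrative userspace example, not the driver's code: `roundup_pow2`, `rqt_size_large`, and `xor_hash_allowed` are hypothetical names, and the size formula is only the one stated above for more than 128 channels.

```c
#include <stdbool.h>
#include <stdint.h>

#define XOR8_HASH_RESULTS 256u  /* an 8-bit hash has 2^8 possible values */

/* Smallest power of two that is >= n (for n >= 1). */
static uint64_t roundup_pow2(uint64_t n)
{
    uint64_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* RQT size as described above for the more-than-128-channels case. */
static uint64_t rqt_size_large(unsigned int channels)
{
    return roundup_pow2((uint64_t)channels * 2);
}

/* XOR hashing stays usable only while every RQT index is reachable. */
static bool xor_hash_allowed(unsigned int channels)
{
    return channels <= 128;
}
```

With 129 channels the table has 512 entries under this formula, so an 8-bit hash could reach at most half of them.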
      
      Fixes: 74a8dada ("net/mlx5e: Preparations for supporting larger number of channels")
      Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240409190820.227554-11-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: Do not produce metadata freelist entries in Tx port ts WQE xmit · 86b0ca5b
      Rahul Rameshbabu authored
      Free Tx port timestamping metadata entries in the NAPI poll context and
      consume metadata entries in the WQE xmit path. Do not free a Tx port
      timestamping metadata entry in the WQE xmit path even in the error path to
      avoid a race between two metadata entry producers.
      
      Fixes: 3178308a ("net/mlx5e: Make tx_port_ts logic resilient to out-of-order CQEs")
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240409190820.227554-10-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: HTB, Fix inconsistencies with QoS SQs number · 2f436f18
      Carolina Jubran authored
      When creating a new HTB class while the interface is down,
      the variable that tracks the number of QoS SQs (htb_max_qos_sqs)
      may not be consistent with the number of HTB classes.
      
      Previously, we compared these two values to ensure that
      the node_qid is lower than the number of QoS SQs, and we
      allocated stats for that SQ when they are equal.
      
      Change the check to compare the node_qid with the current
      number of leaf nodes and fix the checking conditions to
      ensure allocation of stats_list and stats for each node.
      
      Fixes: 214baf22 ("net/mlx5e: Support HTB offload")
      Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240409190820.227554-9-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: Fix mlx5e_priv_init() cleanup flow · ecb82945
      Carolina Jubran authored
      When mlx5e_priv_init() fails, the cleanup flow calls mlx5e_selq_cleanup(),
      which calls mlx5e_selq_apply(), which in turn asserts via lockdep_is_held()
      that `priv->state_lock` is held.
      
      Acquire the state_lock in mlx5e_selq_cleanup().
      
      Kernel log:
      =============================
      WARNING: suspicious RCU usage
      6.8.0-rc3_net_next_841a9b5 #1 Not tainted
      -----------------------------
      drivers/net/ethernet/mellanox/mlx5/core/en/selq.c:124 suspicious rcu_dereference_protected() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      2 locks held by systemd-modules/293:
       #0: ffffffffa05067b0 (devices_rwsem){++++}-{3:3}, at: ib_register_client+0x109/0x1b0 [ib_core]
       #1: ffff8881096c65c0 (&device->client_data_rwsem){++++}-{3:3}, at: add_client_context+0x104/0x1c0 [ib_core]
      
      stack backtrace:
      CPU: 4 PID: 293 Comm: systemd-modules Not tainted 6.8.0-rc3_net_next_841a9b5 #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl+0x8a/0xa0
       lockdep_rcu_suspicious+0x154/0x1a0
       mlx5e_selq_apply+0x94/0xa0 [mlx5_core]
       mlx5e_selq_cleanup+0x3a/0x60 [mlx5_core]
       mlx5e_priv_init+0x2be/0x2f0 [mlx5_core]
       mlx5_rdma_setup_rn+0x7c/0x1a0 [mlx5_core]
       rdma_init_netdev+0x4e/0x80 [ib_core]
       ? mlx5_rdma_netdev_free+0x70/0x70 [mlx5_core]
       ipoib_intf_init+0x64/0x550 [ib_ipoib]
       ipoib_intf_alloc+0x4e/0xc0 [ib_ipoib]
       ipoib_add_one+0xb0/0x360 [ib_ipoib]
       add_client_context+0x112/0x1c0 [ib_core]
       ib_register_client+0x166/0x1b0 [ib_core]
       ? 0xffffffffa0573000
       ipoib_init_module+0xeb/0x1a0 [ib_ipoib]
       do_one_initcall+0x61/0x250
       do_init_module+0x8a/0x270
       init_module_from_file+0x8b/0xd0
       idempotent_init_module+0x17d/0x230
       __x64_sys_finit_module+0x61/0xb0
       do_syscall_64+0x71/0x140
       entry_SYSCALL_64_after_hwframe+0x46/0x4e
       </TASK>
      
      Fixes: 8bf30be7 ("net/mlx5e: Introduce select queue parameters")
      Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240409190820.227554-8-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: RSS, Block changing channels number when RXFH is configured · ee357240
      Carolina Jubran authored
      Changing the channels number after configuring the receive flow hash
      indirection table may affect the RSS table size. The previous
      configuration may no longer be compatible with the new receive flow
      hash indirection table.
      
      Block changing the channels number when RXFH is configured and changing
      the channels number requires resizing the RSS table size.
      
      Fixes: 74a8dada ("net/mlx5e: Preparations for supporting larger number of channels")
      Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240409190820.227554-7-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5: Correctly compare pkt reformat ids · 9eca93f4
      Cosmin Ratiu authored
      struct mlx5_pkt_reformat contains a naked union of a u32 id and a
      dr_action pointer which is used when the action is SW-managed (when
      pkt_reformat.owner is set to MLX5_FLOW_RESOURCE_OWNER_SW). Using id
      directly in that case is incorrect, as it maps to the least significant
      32 bits of the 64-bit pointer in mlx5_fs_dr_action and not to the pkt
      reformat id allocated in firmware.
      
      For the purpose of comparing whether two rules are identical,
      interpreting the least significant 32 bits of the mlx5_fs_dr_action
      pointer as an id mostly works... until it breaks horribly and produces
      the outcome described in [1].
      
      This patch fixes mlx5_flow_dests_cmp to correctly compare ids using
      mlx5_fs_dr_action_get_pkt_reformat_id for the SW-managed rules.
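
The pitfall described above can be illustrated with a simplified userspace model. Everything here is a hypothetical stand-in: `toy_pkt_reformat`, `toy_dr_action`, `toy_reformat_id`, and `toy_same_reformat` are not the mlx5 types or functions, and the layout only mirrors the idea of a union whose valid member depends on the owner field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the firmware-owned and SW-managed owner values. */
enum toy_owner { TOY_OWNER_FW, TOY_OWNER_SW };

struct toy_dr_action {
    uint32_t reformat_id;               /* id backing the SW-managed action */
};

struct toy_pkt_reformat {
    enum toy_owner owner;
    union {
        uint32_t id;                    /* valid only when owner == TOY_OWNER_FW */
        struct toy_dr_action *action;   /* valid only when owner == TOY_OWNER_SW */
    };
};

/* Pick the id through the member that is actually valid for this owner. */
static uint32_t toy_reformat_id(const struct toy_pkt_reformat *r)
{
    return r->owner == TOY_OWNER_SW ? r->action->reformat_id : r->id;
}

/* Compare ids, never the raw union bits. */
static bool toy_same_reformat(const struct toy_pkt_reformat *a,
                              const struct toy_pkt_reformat *b)
{
    return toy_reformat_id(a) == toy_reformat_id(b);
}
```

Two SW-managed entries that point at different action objects carrying the same id compare equal here, whereas comparing the union's `id` member directly would compare fragments of two different pointers.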
      
      Link: https://lore.kernel.org/netdev/ea5264d6-6b55-4449-a602-214c6f509c1e@163.com/T/#u [1]
      
      Fixes: 6a48faee ("net/mlx5: Add direct rule fs_cmd implementation")
      Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240409190820.227554-6-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5: Properly link new fs rules into the tree · 7c6782ad
      Cosmin Ratiu authored
      Previously, add_rule_fg would only add newly created rules from the
      handle into the tree when they had a refcount of 1. On the other hand,
      create_flow_handle tries hard to find and reference already existing
      identical rules instead of creating new ones.
      
      These two behaviors can result in a situation where create_flow_handle
      1) creates a new rule and references it, then
      2) in a subsequent step during the same handle creation references it
         again,
      resulting in a rule with a refcount of 2 that is not linked into the
      tree, will have a NULL parent and root and will result in a crash when
      the flow group is deleted because del_sw_hw_rule, invoked on rule
      deletion, assumes node->parent is != NULL.
      
      This happened in the wild, due to another bug related to incorrect
      handling of duplicate pkt_reformat ids, which led to the code in
      create_flow_handle incorrectly referencing a just-added rule in the same
      flow handle, resulting in the problem described above. Full details are
      at [1].
      
      This patch changes add_rule_fg to add new rules without parents into
      the tree, properly initializing them and avoiding the crash. This makes
      it more consistent with how rules are added to an FTE in
      create_flow_handle.
      
      Fixes: 74491de9 ("net/mlx5: Add multi dest support")
      Link: https://lore.kernel.org/netdev/ea5264d6-6b55-4449-a602-214c6f509c1e@163.com/T/#u [1]
      Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240409190820.227554-5-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5: offset comp irq index in name by one · 9f7e8fbb
      Michael Liang authored
      The mlx5 comp irq naming scheme changed slightly between
      commit 3663ad34 ("net/mlx5: Shift control IRQ to the last index")
      and commit 3354822c ("net/mlx5: Use dynamic msix vectors allocation").
      The index in the comp irq name used to start from 0 but now it starts
      from 1. There is nothing critical here, but it's harmless to change
      back to the old behavior, i.e., starting from 0.
      
      Fixes: 3354822c ("net/mlx5: Use dynamic msix vectors allocation")
      Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com>
      Reviewed-by: Yuanyuan Zhong <yzhong@purestorage.com>
      Signed-off-by: Michael Liang <mliang@purestorage.com>
      Reviewed-by: Shay Drory <shayd@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240409190820.227554-4-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5: Register devlink first under devlink lock · c6e77aa9
      Shay Drory authored
      If the device hits a non-fatal FW error during probe, the
      driver reports the error to the user via devlink. This triggers
      a WARN_ON, since mlx5 calls devlink_register() last.
      To avoid the WARN_ON [1], change mlx5 to invoke devl_register()
      first under devlink lock.
      
      [1]
      WARNING: CPU: 5 PID: 227 at net/devlink/health.c:483 devlink_recover_notify.constprop.0+0xb8/0xc0
      CPU: 5 PID: 227 Comm: kworker/u16:3 Not tainted 6.4.0-rc5_for_upstream_min_debug_2023_06_12_12_38 #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      Workqueue: mlx5_health0000:08:00.0 mlx5_fw_reporter_err_work [mlx5_core]
      RIP: 0010:devlink_recover_notify.constprop.0+0xb8/0xc0
      Call Trace:
       <TASK>
       ? __warn+0x79/0x120
       ? devlink_recover_notify.constprop.0+0xb8/0xc0
       ? report_bug+0x17c/0x190
       ? handle_bug+0x3c/0x60
       ? exc_invalid_op+0x14/0x70
       ? asm_exc_invalid_op+0x16/0x20
       ? devlink_recover_notify.constprop.0+0xb8/0xc0
       devlink_health_report+0x4a/0x1c0
       mlx5_fw_reporter_err_work+0xa4/0xd0 [mlx5_core]
       process_one_work+0x1bb/0x3c0
       ? process_one_work+0x3c0/0x3c0
       worker_thread+0x4d/0x3c0
       ? process_one_work+0x3c0/0x3c0
       kthread+0xc6/0xf0
       ? kthread_complete_and_exit+0x20/0x20
       ret_from_fork+0x1f/0x30
       </TASK>
      
      Fixes: cf530217 ("devlink: Notify users when objects are accessible")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240409190820.227554-3-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c6e77aa9
    • Shay Drory's avatar
      net/mlx5: E-switch, store eswitch pointer before registering devlink_param · 0553e753
      Shay Drory authored
      The next patch moves the devlink register call to be first, so
      whenever mlx5 registers a param, the user is notified.
      To notify the user, devlink uses the get() callback of the param.
      Hence, resources used by the get() callback must be set before the
      devlink param is registered.
      
      Therefore, store the eswitch pointer inside mdev before registering
      the param.
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240409190820.227554-2-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0553e753
    • Eric Dumazet's avatar
      netfilter: complete validation of user input · 65acf6e0
      Eric Dumazet authored
      In my recent commit, I missed that do_replace() handlers
      use copy_from_sockptr() (which I fixed), followed
      by unsafe copy_from_sockptr_offset() calls.
      
      In all functions, we can perform the @optlen validation
      before even calling xt_alloc_table_info() with the following
      check:
      
      if ((u64)optlen < (u64)tmp.size + sizeof(tmp))
              return -EINVAL;
      
      Fixes: 0c83842d ("netfilter: validate user input for expected length")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Link: https://lore.kernel.org/r/20240409120741.3538135-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      65acf6e0
    • Heiner Kallweit's avatar
      r8169: add missing conditional compiling for call to r8169_remove_leds · 97e176fc
      Heiner Kallweit authored
      Add missing dependency on CONFIG_R8169_LEDS. As-is a link error occurs
      if config option CONFIG_R8169_LEDS isn't enabled.
      
      Fixes: 19fa4f2a ("r8169: fix LED-related deadlock on module removal")
      Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Tested-By: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
      Link: https://lore.kernel.org/r/d080038c-eb6b-45ac-9237-b8c1cdd7870f@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      97e176fc
    • Arınç ÜNAL's avatar
      net: dsa: mt7530: fix enabling EEE on MT7531 switch on all boards · 06dfcd40
      Arınç ÜNAL authored
      The commit 40b5d2f1 ("net: dsa: mt7530: Add support for EEE features")
      brought EEE support but did not enable EEE on MT7531 switch MACs. EEE is
      enabled on MT7531 switch MACs by pulling the LAN2LED0 pin low on the board
      (bootstrapping), unsetting the EEE_DIS bit on the trap register, or setting
      the internal EEE switch bit on the CORE_PLL_GROUP4 register. Thanks to
      SkyLake Huang (黃啟澤) from MediaTek for providing information on the
      internal EEE switch bit.
      
      There are existing boards that were not designed to pull the pin low.
      Because of that, the EEE status currently depends on the board design.
      
      The EEE_DIS bit on the trap pertains to the LAN2LED0 pin which is usually
      used to control an LED. Once the bit is unset, the pin will be low. That
      will make the active low LED turn on. The pin is controlled by the switch
      PHY. It seems that the PHY controls the pin in the way that it inverts the
      pin state. That means depending on the wiring of the LED connected to
      LAN2LED0 on the board, the LED may be on without an active link.
      
      To not cause this unwanted behaviour whilst enabling EEE on all boards, set
      the internal EEE switch bit on the CORE_PLL_GROUP4 register.
      
      My testing on MT7531 shows a certain amount of traffic loss when EEE is
      enabled. That said, I haven't come across a board that enables EEE. So
      enable EEE on the switch MACs but disable EEE advertisement on the switch
      PHYs. This way, we don't change the behaviour of the majority of the boards
      that have this switch. The mediatek-ge PHY driver already disables EEE
      advertisement on the switch PHYs but my testing shows that it is somehow
      enabled afterwards. Disabling EEE advertisement before the PHY driver
      initialises keeps it off.
      
      With this change, EEE can now be enabled using ethtool.
      
      Fixes: 40b5d2f1 ("net: dsa: mt7530: Add support for EEE features")
      Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
      Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
      Tested-by: Daniel Golle <daniel@makrotopia.org>
      Reviewed-by: Daniel Golle <daniel@makrotopia.org>
      Link: https://lore.kernel.org/r/20240408-for-net-mt7530-fix-eee-for-mt7531-mt7988-v3-1-84fdef1f008b@arinc9.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      06dfcd40
  2. 10 Apr, 2024 7 commits
    • Heiner Kallweit's avatar
      r8169: fix LED-related deadlock on module removal · 19fa4f2a
      Heiner Kallweit authored
      Binding devm_led_classdev_register() to the netdev is problematic
      because on module removal we get a RTNL-related deadlock. Fix this
      by avoiding the device-managed LED functions.
      
      Note: We can safely call led_classdev_unregister() for a LED even
      if registering it failed, because led_classdev_unregister() detects
      this and is a no-op in this case.
      
      Fixes: 18764b88 ("r8169: add support for LED's on RTL8168/RTL8101")
      Cc: stable@vger.kernel.org
      Reported-by: Lukas Wunner <lukas@wunner.de>
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      19fa4f2a
    • Brett Creeley's avatar
      pds_core: Fix pdsc_check_pci_health function to use work thread · 81665adf
      Brett Creeley authored
      When the driver notices fw_status == 0xff it tries to perform a PCI
      reset on itself via pci_reset_function() in the context of the driver's
      health thread. However, pdsc_reset_prepare calls
      pdsc_stop_health_thread(), which attempts to stop/flush the health
      thread. This results in a deadlock because the stop/flush will never
      complete since the driver called pci_reset_function() from the health
      thread context. Fix by changing the pdsc_check_pci_health_function()
      to queue a newly introduced pdsc_pci_reset_thread() on the pdsc's
      work queue.
      
      Unloading the driver in the fw_down/dead state uncovered another issue,
      which can be seen in the following trace:
      
      WARNING: CPU: 51 PID: 6914 at kernel/workqueue.c:1450 __queue_work+0x358/0x440
      [...]
      RIP: 0010:__queue_work+0x358/0x440
      [...]
      Call Trace:
       <TASK>
       ? __warn+0x85/0x140
       ? __queue_work+0x358/0x440
       ? report_bug+0xfc/0x1e0
       ? handle_bug+0x3f/0x70
       ? exc_invalid_op+0x17/0x70
       ? asm_exc_invalid_op+0x1a/0x20
       ? __queue_work+0x358/0x440
       queue_work_on+0x28/0x30
       pdsc_devcmd_locked+0x96/0xe0 [pds_core]
       pdsc_devcmd_reset+0x71/0xb0 [pds_core]
       pdsc_teardown+0x51/0xe0 [pds_core]
       pdsc_remove+0x106/0x200 [pds_core]
       pci_device_remove+0x37/0xc0
       device_release_driver_internal+0xae/0x140
       driver_detach+0x48/0x90
       bus_remove_driver+0x6d/0xf0
       pci_unregister_driver+0x2e/0xa0
       pdsc_cleanup_module+0x10/0x780 [pds_core]
       __x64_sys_delete_module+0x142/0x2b0
       ? syscall_trace_enter.isra.18+0x126/0x1a0
       do_syscall_64+0x3b/0x90
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      RIP: 0033:0x7fbd9d03a14b
      [...]
      
      Fix this by preventing the devcmd reset if the FW is not running.
      
      Fixes: d9407ff1 ("pds_core: Prevent health thread from running during reset/remove")
      Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: Brett Creeley <brett.creeley@amd.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      81665adf
    • Jiri Benc's avatar
      ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr · 7633c4da
      Jiri Benc authored
      Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
      still means hlist_for_each_entry_rcu can return an item that got removed
      from the list. The memory itself of such item is not freed thanks to RCU
      but nothing guarantees the actual content of the memory is sane.
      
      In particular, the reference count can be zero. This can happen if
      ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
      from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
      references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
      timing, this can happen:
      
      1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.
      
      2. Then, the whole ipv6_del_addr is executed for the given entry. The
         reference count drops to zero and kfree_rcu is scheduled.
      
      3. ipv6_get_ifaddr continues and tries to increment the reference count
         (in6_ifa_hold).
      
      4. The rcu is unlocked and the entry is freed.
      
      5. The freed entry is returned.
      
      Prevent increasing of the reference count in such case. The name
      in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.
      
      [   41.506330] refcount_t: addition on 0; use-after-free.
      [   41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
      [   41.507413] Modules linked in: veth bridge stp llc
      [   41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be8 #14
      [   41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
      [   41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
      [   41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
      [   41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
      [   41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
      [   41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
      [   41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
      [   41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
      [   41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
      [   41.514086] FS:  00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
      [   41.514726] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
      [   41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   41.516799] Call Trace:
      [   41.517037]  <TASK>
      [   41.517249]  ? __warn+0x7b/0x120
      [   41.517535]  ? refcount_warn_saturate+0xa5/0x130
      [   41.517923]  ? report_bug+0x164/0x190
      [   41.518240]  ? handle_bug+0x3d/0x70
      [   41.518541]  ? exc_invalid_op+0x17/0x70
      [   41.520972]  ? asm_exc_invalid_op+0x1a/0x20
      [   41.521325]  ? refcount_warn_saturate+0xa5/0x130
      [   41.521708]  ipv6_get_ifaddr+0xda/0xe0
      [   41.522035]  inet6_rtm_getaddr+0x342/0x3f0
      [   41.522376]  ? __pfx_inet6_rtm_getaddr+0x10/0x10
      [   41.522758]  rtnetlink_rcv_msg+0x334/0x3d0
      [   41.523102]  ? netlink_unicast+0x30f/0x390
      [   41.523445]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
      [   41.523832]  netlink_rcv_skb+0x53/0x100
      [   41.524157]  netlink_unicast+0x23b/0x390
      [   41.524484]  netlink_sendmsg+0x1f2/0x440
      [   41.524826]  __sys_sendto+0x1d8/0x1f0
      [   41.525145]  __x64_sys_sendto+0x1f/0x30
      [   41.525467]  do_syscall_64+0xa5/0x1b0
      [   41.525794]  entry_SYSCALL_64_after_hwframe+0x72/0x7a
      [   41.526213] RIP: 0033:0x7fbc4cfcea9a
      [   41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
      [   41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      [   41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
      [   41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
      [   41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
      [   41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      [   41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
      [   41.531573]  </TASK>
      
      Fixes: 5c578aed ("IPv6: convert addrconf hash list to RCU")
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jiri Benc <jbenc@redhat.com>
      Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      7633c4da
    • Jakub Kicinski's avatar
      Merge branch 'net-start-to-replace-copy_from_sockptr' · 7b6575c6
      Jakub Kicinski authored
      Eric Dumazet says:
      
      ====================
      net: start to replace copy_from_sockptr()
      
      We got several syzbot reports about unsafe copy_from_sockptr()
      calls. After fixing some of them, it appears that we could
      use a new helper to factorize all the checks in one place.
      
      This series targets net tree, we can later start converting
      many call sites in net-next.
      ====================
      
      Link: https://lore.kernel.org/r/20240408082845.3957374-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      7b6575c6
    • Eric Dumazet's avatar
      nfc: llcp: fix nfc_llcp_setsockopt() unsafe copies · 7a87441c
      Eric Dumazet authored
      syzbot reported unsafe calls to copy_from_sockptr() [1]
      
      Use copy_safe_from_sockptr() instead.
      
      [1]
      
      BUG: KASAN: slab-out-of-bounds in copy_from_sockptr_offset include/linux/sockptr.h:49 [inline]
       BUG: KASAN: slab-out-of-bounds in copy_from_sockptr include/linux/sockptr.h:55 [inline]
       BUG: KASAN: slab-out-of-bounds in nfc_llcp_setsockopt+0x6c2/0x850 net/nfc/llcp_sock.c:255
      Read of size 4 at addr ffff88801caa1ec3 by task syz-executor459/5078
      
      CPU: 0 PID: 5078 Comm: syz-executor459 Not tainted 6.8.0-syzkaller-08951-gfe46a7dd #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
      Call Trace:
       <TASK>
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
        print_address_description mm/kasan/report.c:377 [inline]
        print_report+0x169/0x550 mm/kasan/report.c:488
        kasan_report+0x143/0x180 mm/kasan/report.c:601
        copy_from_sockptr_offset include/linux/sockptr.h:49 [inline]
        copy_from_sockptr include/linux/sockptr.h:55 [inline]
        nfc_llcp_setsockopt+0x6c2/0x850 net/nfc/llcp_sock.c:255
        do_sock_setsockopt+0x3b1/0x720 net/socket.c:2311
        __sys_setsockopt+0x1ae/0x250 net/socket.c:2334
        __do_sys_setsockopt net/socket.c:2343 [inline]
        __se_sys_setsockopt net/socket.c:2340 [inline]
        __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
       do_syscall_64+0xfd/0x240
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      RIP: 0033:0x7f7fac07fd89
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 91 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007fff660eb788 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f7fac07fd89
      RDX: 0000000000000000 RSI: 0000000000000118 RDI: 0000000000000004
      RBP: 0000000000000000 R08: 0000000000000002 R09: 0000000000000000
      R10: 0000000020000a80 R11: 0000000000000246 R12: 0000000000000000
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Link: https://lore.kernel.org/r/20240408082845.3957374-4-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      7a87441c
    • Eric Dumazet's avatar
      mISDN: fix MISDN_TIME_STAMP handling · 138b7878
      Eric Dumazet authored
      syzbot reports one unsafe call to copy_from_sockptr() [1]
      
      Use copy_safe_from_sockptr() instead.
      
      [1]
      
       BUG: KASAN: slab-out-of-bounds in copy_from_sockptr_offset include/linux/sockptr.h:49 [inline]
       BUG: KASAN: slab-out-of-bounds in copy_from_sockptr include/linux/sockptr.h:55 [inline]
       BUG: KASAN: slab-out-of-bounds in data_sock_setsockopt+0x46c/0x4cc drivers/isdn/mISDN/socket.c:417
      Read of size 4 at addr ffff0000c6d54083 by task syz-executor406/6167
      
      CPU: 1 PID: 6167 Comm: syz-executor406 Not tainted 6.8.0-rc7-syzkaller-g707081b61156 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
      Call trace:
        dump_backtrace+0x1b8/0x1e4 arch/arm64/kernel/stacktrace.c:291
        show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:298
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0xd0/0x124 lib/dump_stack.c:106
        print_address_description mm/kasan/report.c:377 [inline]
        print_report+0x178/0x518 mm/kasan/report.c:488
        kasan_report+0xd8/0x138 mm/kasan/report.c:601
        __asan_report_load_n_noabort+0x1c/0x28 mm/kasan/report_generic.c:391
        copy_from_sockptr_offset include/linux/sockptr.h:49 [inline]
        copy_from_sockptr include/linux/sockptr.h:55 [inline]
        data_sock_setsockopt+0x46c/0x4cc drivers/isdn/mISDN/socket.c:417
        do_sock_setsockopt+0x2a0/0x4e0 net/socket.c:2311
        __sys_setsockopt+0x128/0x1a8 net/socket.c:2334
        __do_sys_setsockopt net/socket.c:2343 [inline]
        __se_sys_setsockopt net/socket.c:2340 [inline]
        __arm64_sys_setsockopt+0xb8/0xd4 net/socket.c:2340
        __invoke_syscall arch/arm64/kernel/syscall.c:34 [inline]
        invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:48
        el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:133
        do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:152
        el0_svc+0x54/0x168 arch/arm64/kernel/entry-common.c:712
        el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730
        el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
      
      Fixes: 1b2b03f8 ("Add mISDN core files")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Karsten Keil <isdn@linux-pingi.de>
      Link: https://lore.kernel.org/r/20240408082845.3957374-3-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      138b7878
    • Eric Dumazet's avatar
      net: add copy_safe_from_sockptr() helper · 6309863b
      Eric Dumazet authored
      copy_from_sockptr() helper is unsafe, unless callers
      did the prior check against user provided optlen.
      
      Too many callers get this wrong, let's add a helper to
      fix them and avoid future copy/paste bugs.
      
      Instead of :
      
         if (optlen < sizeof(opt)) {
             err = -EINVAL;
             break;
         }
         if (copy_from_sockptr(&opt, optval, sizeof(opt))) {
             err = -EFAULT;
             break;
         }
      
      Use :
      
         err = copy_safe_from_sockptr(&opt, sizeof(opt),
                                      optval, optlen);
         if (err)
             break;
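
The behaviour described above, a length check followed by a copy, can be modelled in userspace as follows. This is a sketch under stated assumptions: `copy_safe_model()` is a hypothetical name, a plain pointer replaces `sockptr_t`, and `memcpy()` stands in for `copy_from_sockptr()`, so the -EFAULT path is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of the helper: validate the caller-supplied optlen
 * before copying ksize bytes into dst.
 */
static int copy_safe_model(void *dst, size_t ksize,
			   const void *optval, unsigned int optlen)
{
	assert(dst != NULL && optval != NULL);

	if (optlen < ksize)
		return -EINVAL;	/* buffer is shorter than the expected object */
	memcpy(dst, optval, ksize);
	return 0;
}
```

Folding the length check into the helper means a call site cannot copy without validating optlen first, which is the copy/paste hazard the commit message points at.
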
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240408082845.3957374-2-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6309863b
  3. 09 Apr, 2024 6 commits
    • Arnd Bergmann's avatar
      ipv4/route: avoid unused-but-set-variable warning · cf1b7201
      Arnd Bergmann authored
      The log_martians variable is only used in an #ifdef, causing a 'make W=1'
      warning with gcc:
      
      net/ipv4/route.c: In function 'ip_rt_send_redirect':
      net/ipv4/route.c:880:13: error: variable 'log_martians' set but not used [-Werror=unused-but-set-variable]
      
      Change the #ifdef to an equivalent IS_ENABLED() to let the compiler
      see where the variable is used.
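
The difference between the two forms can be illustrated with a simplified userspace model. Everything here is an assumption made for illustration: `MODEL_CONFIG_VERBOSE` is a hypothetical option that is always defined as 0 or 1, and `MODEL_IS_ENABLED()` is a trivial stand-in, whereas the kernel's IS_ENABLED() also copes with undefined CONFIG_ symbols.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical option, always defined, as 0 (off) or 1 (on). */
#define MODEL_CONFIG_VERBOSE 0
#define MODEL_IS_ENABLED(option) (option)

static_assert(MODEL_CONFIG_VERBOSE == 0 || MODEL_CONFIG_VERBOSE == 1,
	      "model option must be 0 or 1");

static int log_calls;	/* counts how often the logging branch ran */

static void send_redirect_model(bool log_martians)
{
	/*
	 * The branch is parsed and type-checked in every configuration, so
	 * log_martians counts as used. With the option at 0 the compiler
	 * can still discard the branch as dead code.
	 */
	if (MODEL_IS_ENABLED(MODEL_CONFIG_VERBOSE) && log_martians)
		log_calls++;
}
```

With an #ifdef around the branch instead, a build with the option off would never show the compiler the read of log_martians, which is what triggers the set-but-unused warning.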
      
      Fixes: 30038fc6 ("net: ip_rt_send_redirect() optimization")
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240408074219.3030256-2-arnd@kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      cf1b7201
    • Arnd Bergmann's avatar
      ipv6: fib: hide unused 'pn' variable · 74043489
      Arnd Bergmann authored
      When CONFIG_IPV6_SUBTREES is disabled, the only user is hidden, causing
      a 'make W=1' warning:
      
      net/ipv6/ip6_fib.c: In function 'fib6_add':
      net/ipv6/ip6_fib.c:1388:32: error: variable 'pn' set but not used [-Werror=unused-but-set-variable]
      
      Add another #ifdef around the variable declaration, matching the other
      uses in this file.
      
      Fixes: 66729e18 ("[IPV6] ROUTE: Make sure we have fn->leaf when adding a node on subtree.")
      Link: https://lore.kernel.org/netdev/20240322131746.904943-1-arnd@kernel.org/
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240408074219.3030256-1-arnd@kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      74043489
    • Geetha sowjanya's avatar
      octeontx2-af: Fix NIX SQ mode and BP config · faf23006
      Geetha sowjanya authored
      NIX SQ mode and link backpressure configuration are required for
      all platforms, but in the current driver this code is wrongly placed
      under a platform-specific check. Fix the issue by moving the code
      out of the platform check.
      
      Fixes: 5d9b976d ("octeontx2-af: Support fixed transmit scheduler topology")
      Signed-off-by: Geetha sowjanya <gakula@marvell.com>
      Link: https://lore.kernel.org/r/20240408063643.26288-1-gakula@marvell.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      faf23006
    • Kuniyuki Iwashima's avatar
      af_unix: Clear stale u->oob_skb. · b46f4eaa
      Kuniyuki Iwashima authored
      syzkaller started to report deadlock of unix_gc_lock after commit
      4090fa37 ("af_unix: Replace garbage collection algorithm."), but
      it just uncovers the bug that has been there since commit 314001f0
      ("af_unix: Add OOB support").
      
      The repro basically does the following.
      
        from socket import *
        from array import array
      
        c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
        c1.sendmsg([b'a'], [(SOL_SOCKET, SCM_RIGHTS, array("i", [c2.fileno()]))], MSG_OOB)
        c2.recv(1)  # blocked as no normal data in recv queue
      
        c2.close()  # done async and unblock recv()
        c1.close()  # done async and trigger GC
      
      A socket sends its file descriptor to itself as OOB data and tries to
      receive normal data, but finally recv() fails due to async close().
      
      The problem here is wrong handling of OOB skb in manage_oob().  When
      recvmsg() is called without MSG_OOB, manage_oob() is called to check
      if the peeked skb is OOB skb.  In such a case, manage_oob() pops it
      out of the receive queue but does not clear unix_sock(sk)->oob_skb.
      This is wrong in terms of uAPI.
      
      Let's say we send "hello" with MSG_OOB, and "world" without MSG_OOB.
      The 'o' is handled as OOB data.  When recv() is called twice without
      MSG_OOB, the OOB data should be lost.
      
        >>> from socket import *
        >>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM, 0)
        >>> c1.send(b'hello', MSG_OOB)  # 'o' is OOB data
        5
        >>> c1.send(b'world')
        5
        >>> c2.recv(5)  # OOB data is not received
        b'hell'
        >>> c2.recv(5)  # OOB data is skipped
        b'world'
        >>> c2.recv(5, MSG_OOB)  # This should return an error
        b'o'
      
      In the same situation, TCP actually returns -EINVAL for the last
      recv().
      
      Also, if we do not clear unix_sk(sk)->oob_skb, unix_poll() always sets
      EPOLLPRI even though the data has already been passed over by a
      previous recv().
      
      To avoid these issues, we must clear unix_sk(sk)->oob_skb when dequeuing
      it from recv queue.
      
      The reason why the old GC did not trigger the deadlock is because the
      old GC relied on the receive queue to detect the loop.
      
      When it is triggered, the socket with OOB data is marked as GC candidate
      because file refcount == inflight count (1).  However, after traversing
      all inflight sockets, the socket still has a positive inflight count (1),
      thus the socket is excluded from candidates.  Then, the old GC loses the
      chance to garbage-collect the socket.
      
      With the old GC, the repro continues to create true garbage that will
      never be freed nor detected by kmemleak as it's linked to the global
      inflight list.  That's why we couldn't even notice the issue.
      
      Fixes: 314001f0 ("af_unix: Add OOB support")
      Reported-by: syzbot+7f7f201cc2668a8fd169@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=7f7f201cc2668a8fd169
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240405221057.2406-1-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b46f4eaa
    • Marek Vasut's avatar
      net: ks8851: Handle softirqs at the end of IRQ thread to fix hang · be0384bf
      Marek Vasut authored
      The ks8851_irq() thread may call ks8851_rx_pkts() in case there are
      any packets in the MAC FIFO, which calls netif_rx(). This netif_rx()
      implementation is guarded by local_bh_disable() and local_bh_enable().
      The local_bh_enable() may call do_softirq() to run softirqs in case
      any are pending. One of the softirqs is net_rx_action, which ultimately
      reaches the driver .start_xmit callback. If that happens, the system
      hangs. The entire call chain is below:
      
      ks8851_start_xmit_par from netdev_start_xmit
      netdev_start_xmit from dev_hard_start_xmit
      dev_hard_start_xmit from sch_direct_xmit
      sch_direct_xmit from __dev_queue_xmit
      __dev_queue_xmit from __neigh_update
      __neigh_update from neigh_update
      neigh_update from arp_process.constprop.0
      arp_process.constprop.0 from __netif_receive_skb_one_core
      __netif_receive_skb_one_core from process_backlog
      process_backlog from __napi_poll.constprop.0
      __napi_poll.constprop.0 from net_rx_action
      net_rx_action from __do_softirq
      __do_softirq from call_with_stack
      call_with_stack from do_softirq
      do_softirq from __local_bh_enable_ip
      __local_bh_enable_ip from netif_rx
      netif_rx from ks8851_irq
      ks8851_irq from irq_thread_fn
      irq_thread_fn from irq_thread
      irq_thread from kthread
      kthread from ret_from_fork
      
      The hang happens because ks8851_irq() first locks a spinlock in
      ks8851_par.c ks8851_lock_par() spin_lock_irqsave(&ksp->lock, ...)
      and with that spinlock locked, calls netif_rx(). Once the execution
      reaches ks8851_start_xmit_par(), it calls ks8851_lock_par() again
      which attempts to claim the already locked spinlock again, and the
      hang happens.
      
      Move the do_softirq() call outside of the spinlock protected section
      of ks8851_irq() by disabling BHs around the entire spinlock protected
      section of ks8851_irq() handler. Place local_bh_enable() outside of
      the spinlock protected section, so that it can trigger do_softirq()
      without the ks8851_par.c ks8851_lock_par() spinlock being held, and
      safely call ks8851_start_xmit_par() without attempting to lock the
      already locked spinlock.
      
      Since ks8851_irq() is protected by local_bh_disable()/local_bh_enable()
      now, replace netif_rx() with __netif_rx() which is not duplicating the
      local_bh_disable()/local_bh_enable() calls.
      
      Fixes: 797047f8 ("net: ks8851: Implement Parallel bus operations")
      Signed-off-by: Marek Vasut <marex@denx.de>
      Link: https://lore.kernel.org/r/20240405203204.82062-2-marex@denx.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      be0384bf
    • Marek Vasut's avatar
      net: ks8851: Inline ks8851_rx_skb() · f96f7004
      Marek Vasut authored
      Both ks8851_rx_skb_par() and ks8851_rx_skb_spi() call netif_rx(skb),
      inline the netif_rx(skb) call directly into ks8851_common.c and drop
      the .rx_skb callback and ks8851_rx_skb() wrapper. This removes one
      indirect call from the driver, no functional change otherwise.
      Signed-off-by: Marek Vasut <marex@denx.de>
      Link: https://lore.kernel.org/r/20240405203204.82062-1-marex@denx.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f96f7004
  4. 08 Apr, 2024 7 commits