1. 17 Apr, 2024 2 commits
    • netfilter: nf_tables: restore set elements when delete set fails · e79b47a8
      Pablo Neira Ayuso authored
      From the abort path, nft_mapelem_activate() needs to restore refcounters to
      the original state. Currently, it uses set->ops->walk() to iterate
      over these set elements. The existing set iterator skips elements that are
      inactive in the next generation. This does not work from the abort path:
      to restore the original state, it has to skip active elements
      instead (not inactive ones).
      
      This patch moves the check for inactive elements to the set iterator
      callback, then it reverses the logic for the .activate case which
      needs to skip active elements.
      
      Toggle next generation bit for elements when delete set command is
      invoked and call nft_clear() from .activate (abort) path to restore the
      next generation bit.
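
      The iterator change can be pictured with a toy model. The sketch below
      is only an illustration in Python, not the kernel code: the Elem class,
      its active_next flag and obj_refs counter, and all helper functions are
      hypothetical names standing in for the next generation bit and the
      refcounters mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Elem:
    """Hypothetical stand-in for a map element."""
    name: str
    active_next: bool = True   # models the next generation bit
    obj_refs: int = 1          # models the reference held on the mapped object


def walk(elems, callback):
    """Iterator that filters nothing; each callback decides what to skip."""
    for elem in elems:
        callback(elem)


def walk_skip_inactive(elems, callback):
    """Assumed old-style iterator: silently skips inactive elements."""
    for elem in elems:
        if elem.active_next:
            callback(elem)


def deactivate(elem):
    """Delete-set path: skip inactive elements, drop the reference."""
    if not elem.active_next:
        return
    elem.active_next = False
    elem.obj_refs -= 1


def activate(elem):
    """Abort path: skip active elements, restore the bit and the reference."""
    if elem.active_next:
        return
    elem.active_next = True
    elem.obj_refs += 1


def refs_after_abort(iterator):
    """Deactivate every element, then abort using the given iterator."""
    elems = [Elem("a"), Elem("b")]
    walk(elems, deactivate)      # delete set command
    iterator(elems, activate)    # transaction aborted
    return [e.obj_refs for e in elems]


old_result = refs_after_abort(walk_skip_inactive)
new_result = refs_after_abort(walk)
print(old_result, new_result)
```

      In this model the filtering iterator never reaches the deactivated
      elements on abort, so their references stay dropped. With the check
      moved into the callbacks, the references are restored.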
      
      The splat below shows a memory leak of an object in mappings:
      
      [43929.457523] ------------[ cut here ]------------
      [43929.457532] WARNING: CPU: 0 PID: 1139 at include/net/netfilter/nf_tables.h:1237 nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables]
      [...]
      [43929.458014] RIP: 0010:nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables]
      [43929.458076] Code: 83 f8 01 77 ab 49 8d 7c 24 08 e8 37 5e d0 de 49 8b 6c 24 08 48 8d 7d 50 e8 e9 5c d0 de 8b 45 50 8d 50 ff 89 55 50 85 c0 75 86 <0f> 0b eb 82 0f 0b eb b3 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90
      [43929.458081] RSP: 0018:ffff888140f9f4b0 EFLAGS: 00010246
      [43929.458086] RAX: 0000000000000000 RBX: ffff8881434f5288 RCX: dffffc0000000000
      [43929.458090] RDX: 00000000ffffffff RSI: ffffffffa26d28a7 RDI: ffff88810ecc9550
      [43929.458093] RBP: ffff88810ecc9500 R08: 0000000000000001 R09: ffffed10281f3e8f
      [43929.458096] R10: 0000000000000003 R11: ffff0000ffff0000 R12: ffff8881434f52a0
      [43929.458100] R13: ffff888140f9f5f4 R14: ffff888151c7a800 R15: 0000000000000002
      [43929.458103] FS:  00007f0c687c4740(0000) GS:ffff888390800000(0000) knlGS:0000000000000000
      [43929.458107] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [43929.458111] CR2: 00007f58dbe5b008 CR3: 0000000123602005 CR4: 00000000001706f0
      [43929.458114] Call Trace:
      [43929.458118]  <TASK>
      [43929.458121]  ? __warn+0x9f/0x1a0
      [43929.458127]  ? nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables]
      [43929.458188]  ? report_bug+0x1b1/0x1e0
      [43929.458196]  ? handle_bug+0x3c/0x70
      [43929.458200]  ? exc_invalid_op+0x17/0x40
      [43929.458211]  ? nft_setelem_data_deactivate+0xd7/0xf0 [nf_tables]
      [43929.458271]  ? nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables]
      [43929.458332]  nft_mapelem_deactivate+0x24/0x30 [nf_tables]
      [43929.458392]  nft_rhash_walk+0xdd/0x180 [nf_tables]
      [43929.458453]  ? __pfx_nft_rhash_walk+0x10/0x10 [nf_tables]
      [43929.458512]  ? rb_insert_color+0x2e/0x280
      [43929.458520]  nft_map_deactivate+0xdc/0x1e0 [nf_tables]
      [43929.458582]  ? __pfx_nft_map_deactivate+0x10/0x10 [nf_tables]
      [43929.458642]  ? __pfx_nft_mapelem_deactivate+0x10/0x10 [nf_tables]
      [43929.458701]  ? __rcu_read_unlock+0x46/0x70
      [43929.458709]  nft_delset+0xff/0x110 [nf_tables]
      [43929.458769]  nft_flush_table+0x16f/0x460 [nf_tables]
      [43929.458830]  nf_tables_deltable+0x501/0x580 [nf_tables]
      
      Fixes: 628bd3e4 ("netfilter: nf_tables: drop map element references from preparation phase")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: missing iterator type in lookup walk · efefd4f0
      Pablo Neira Ayuso authored
      Add missing decorator type to lookup expression and tighten WARN_ON_ONCE
      check in pipapo to spot earlier that this is unset.
      
      Fixes: 29b359cf ("netfilter: nft_set_pipapo: walk over current view on netlink dump")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  2. 15 Apr, 2024 2 commits
  3. 14 Apr, 2024 1 commit
    • net: change maximum number of UDP segments to 128 · 1382e3b6
      Yuri Benditovich authored
      Commit fc8b2a61
      ("net: more strict VIRTIO_NET_HDR_GSO_UDP_L4 validation")
      adds a check of the potential number of UDP segments against
      UDP_MAX_SEGMENTS in linux/virtio_net.h.
      After this change, the certification test of USO guest-to-guest
      transmit on the Windows driver for the virtio-net device fails,
      for example with a packet size of ~64K and an mss of 536 bytes.
      In general, USO should not be more restrictive than TSO.
      Indeed, with an unreasonably small mss, a large number of segments
      can cause queue overflow and packet loss at the destination.
      A limit of 128 segments is good for any practical purpose:
      with a minimal meaningful mss of 536, the maximal UDP packet will
      be divided into ~120 segments.
      The number of segments for UDP packets is also validated against
      UDP_MAX_SEGMENTS in udp.c (v4, v6); this does not affect the
      guest-to-guest path but does affect packets sent to the host, for
      example.
      It is important to mention that UDP_MAX_SEGMENTS is a kernel-only
      define and is not available to user-mode socket applications.
      To request an MSS smaller than the MTU, an application
      just uses setsockopt with SOL_UDP and UDP_SEGMENT, and there are
      no limitations at the socket API level.
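
      The segment arithmetic can be checked with a few lines of Python. This
      is only a back-of-the-envelope sketch: the payload size assumes an IPv4
      datagram without options, and OLD_LIMIT = 64 is an assumption about the
      previous value of UDP_MAX_SEGMENTS, which the message does not state.

```python
import math

MAX_UDP_PAYLOAD = 65507  # 65535 - 20 (IPv4 header) - 8 (UDP header)
MSS = 536                # the "minimal meaningful mss" from the text
OLD_LIMIT = 64           # assumption: the limit before this change
NEW_LIMIT = 128          # the limit named in the commit title

# Number of segments needed to carry the largest UDP payload.
segments = math.ceil(MAX_UDP_PAYLOAD / MSS)

print(segments, segments <= OLD_LIMIT, segments <= NEW_LIMIT)
```

      Under these assumptions the result is 123 segments, which matches the
      "~120" estimate and fits under the new limit of 128.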
      
      Fixes: fc8b2a61 ("net: more strict VIRTIO_NET_HDR_GSO_UDP_L4 validation")
      Signed-off-by: Yuri Benditovich <yuri.benditovich@daynix.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 13 Apr, 2024 11 commits
    • Merge branch 'mlx5-fixes' · 72041e53
      Jakub Kicinski authored
      Tariq Toukan says:
      
      ====================
      mlx5 fixes
      
      This patchset provides bug fixes to mlx5 core and Eth drivers.
      ====================
      
      Link: https://lore.kernel.org/r/20240411115444.374475-1-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: Prevent deadlock while disabling aRFS · fef96576
      Carolina Jubran authored
      When disabling aRFS under the `priv->state_lock`, any scheduled
      aRFS works are canceled using the `cancel_work_sync` function,
      which waits for the work to end if it has already started.
      However, while waiting for the work handler, the handler will
      try to acquire the `state_lock` which is already acquired.
      
      The worker acquires the lock to delete the rules if the state
      is down, which is not the worker's responsibility since
      disabling aRFS deletes the rules.
      
      Add an aRFS state variable, which indicates whether aRFS is
      enabled, and prevent adding rules when aRFS is disabled.
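
      The locking pattern can be sketched in user space. The Python below is
      a loose analogy, not the driver code: ArfsSketch, arfs_lock, schedule()
      and disable() are hypothetical names, threads stand in for work items,
      and Thread.join() stands in for cancel_work_sync(). The point is that
      the worker checks an enabled flag under a narrower lock and never takes
      state_lock, so waiting for it while holding state_lock cannot deadlock.

```python
import threading


class ArfsSketch:
    def __init__(self):
        self.state_lock = threading.Lock()  # stands in for priv->state_lock
        self.arfs_lock = threading.Lock()   # hypothetical narrower lock
        self.enabled = True                 # the aRFS state variable
        self.rules = []
        self.workers = []

    def _handle_work(self, rule):
        # The worker only consults the state flag; it never takes state_lock.
        with self.arfs_lock:
            if not self.enabled:
                return
            self.rules.append(rule)

    def schedule(self, rule):
        worker = threading.Thread(target=self._handle_work, args=(rule,))
        self.workers.append(worker)
        worker.start()

    def disable(self):
        with self.state_lock:
            with self.arfs_lock:
                self.enabled = False        # no new rules from now on
            for worker in self.workers:
                worker.join()               # analogous to cancel_work_sync()
            with self.arfs_lock:
                self.rules.clear()          # disabling deletes the rules


arfs = ArfsSketch()
arfs.schedule("rule-1")
arfs.disable()
arfs.schedule("rule-2")                     # ignored: aRFS is disabled
for worker in arfs.workers:
    worker.join()
print(arfs.enabled, arfs.rules)
```

      If the worker instead tried to take state_lock, the join() inside
      disable() would wait forever, which is the scenario the lockdep report
      below describes.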
      
      Kernel log:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      6.7.0-rc4_net_next_mlx5_5483eb2 #1 Tainted: G          I
      ------------------------------------------------------
      ethtool/386089 is trying to acquire lock:
      ffff88810f21ce68 ((work_completion)(&rule->arfs_work)){+.+.}-{0:0}, at: __flush_work+0x74/0x4e0
      
      but task is already holding lock:
      ffff8884a1808cc0 (&priv->state_lock){+.+.}-{3:3}, at: mlx5e_ethtool_set_channels+0x53/0x200 [mlx5_core]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (&priv->state_lock){+.+.}-{3:3}:
             __mutex_lock+0x80/0xc90
             arfs_handle_work+0x4b/0x3b0 [mlx5_core]
             process_one_work+0x1dc/0x4a0
             worker_thread+0x1bf/0x3c0
             kthread+0xd7/0x100
             ret_from_fork+0x2d/0x50
             ret_from_fork_asm+0x11/0x20
      
      -> #0 ((work_completion)(&rule->arfs_work)){+.+.}-{0:0}:
             __lock_acquire+0x17b4/0x2c80
             lock_acquire+0xd0/0x2b0
             __flush_work+0x7a/0x4e0
             __cancel_work_timer+0x131/0x1c0
             arfs_del_rules+0x143/0x1e0 [mlx5_core]
             mlx5e_arfs_disable+0x1b/0x30 [mlx5_core]
             mlx5e_ethtool_set_channels+0xcb/0x200 [mlx5_core]
             ethnl_set_channels+0x28f/0x3b0
             ethnl_default_set_doit+0xec/0x240
             genl_family_rcv_msg_doit+0xd0/0x120
             genl_rcv_msg+0x188/0x2c0
             netlink_rcv_skb+0x54/0x100
             genl_rcv+0x24/0x40
             netlink_unicast+0x1a1/0x270
             netlink_sendmsg+0x214/0x460
             __sock_sendmsg+0x38/0x60
             __sys_sendto+0x113/0x170
             __x64_sys_sendto+0x20/0x30
             do_syscall_64+0x40/0xe0
             entry_SYSCALL_64_after_hwframe+0x46/0x4e
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&priv->state_lock);
                                     lock((work_completion)(&rule->arfs_work));
                                     lock(&priv->state_lock);
        lock((work_completion)(&rule->arfs_work));
      
       *** DEADLOCK ***
      
      3 locks held by ethtool/386089:
       #0: ffffffff82ea7210 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40
       #1: ffffffff82e94c88 (rtnl_mutex){+.+.}-{3:3}, at: ethnl_default_set_doit+0xd3/0x240
       #2: ffff8884a1808cc0 (&priv->state_lock){+.+.}-{3:3}, at: mlx5e_ethtool_set_channels+0x53/0x200 [mlx5_core]
      
      stack backtrace:
      CPU: 15 PID: 386089 Comm: ethtool Tainted: G          I        6.7.0-rc4_net_next_mlx5_5483eb2 #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl+0x60/0xa0
       check_noncircular+0x144/0x160
       __lock_acquire+0x17b4/0x2c80
       lock_acquire+0xd0/0x2b0
       ? __flush_work+0x74/0x4e0
       ? save_trace+0x3e/0x360
       ? __flush_work+0x74/0x4e0
       __flush_work+0x7a/0x4e0
       ? __flush_work+0x74/0x4e0
       ? __lock_acquire+0xa78/0x2c80
       ? lock_acquire+0xd0/0x2b0
       ? mark_held_locks+0x49/0x70
       __cancel_work_timer+0x131/0x1c0
       ? mark_held_locks+0x49/0x70
       arfs_del_rules+0x143/0x1e0 [mlx5_core]
       mlx5e_arfs_disable+0x1b/0x30 [mlx5_core]
       mlx5e_ethtool_set_channels+0xcb/0x200 [mlx5_core]
       ethnl_set_channels+0x28f/0x3b0
       ethnl_default_set_doit+0xec/0x240
       genl_family_rcv_msg_doit+0xd0/0x120
       genl_rcv_msg+0x188/0x2c0
       ? ethnl_ops_begin+0xb0/0xb0
       ? genl_family_rcv_msg_dumpit+0xf0/0xf0
       netlink_rcv_skb+0x54/0x100
       genl_rcv+0x24/0x40
       netlink_unicast+0x1a1/0x270
       netlink_sendmsg+0x214/0x460
       __sock_sendmsg+0x38/0x60
       __sys_sendto+0x113/0x170
       ? do_user_addr_fault+0x53f/0x8f0
       __x64_sys_sendto+0x20/0x30
       do_syscall_64+0x40/0xe0
       entry_SYSCALL_64_after_hwframe+0x46/0x4e
       </TASK>
      
      Fixes: 45bf454a ("net/mlx5e: Enabling aRFS mechanism")
      Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240411115444.374475-7-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: Acquire RTNL lock before RQs/SQs activation/deactivation · fdce06bd
      Carolina Jubran authored
      netif_queue_set_napi asserts whether RTNL lock is held if
      the netdev is initialized.
      
      Acquire the RTNL lock before activating or deactivating
      RQs/SQs if the lock has not been held before in the flow.
      
      Fixes: f25e7b82 ("net/mlx5e: link NAPI instances to queues and IRQs")
      Cc: Joe Damato <jdamato@fastly.com>
      Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
      Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240411115444.374475-6-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: Use channel mdev reference instead of global mdev instance for coalescing · 6c685bdb
      Rahul Rameshbabu authored
      Channels can potentially have independent mdev instances. Do not refer to
      the global mdev instance in the mlx5e_priv instance for channel FW
      operations related to coalescing. CQ numbers that would be valid on the
      channel's mdev instance may not be correctly referenced if using the
      mlx5e_priv instance.
      
      Fixes: 67936e13 ("net/mlx5e: Let channels be SD-aware")
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240411115444.374475-5-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5: Restore mistakenly dropped parts in register devlink flow · bf729988
      Shay Drory authored
      Code parts from cited commit were mistakenly dropped while rebasing
      before submission. Add them here.
      
      Fixes: c6e77aa9 ("net/mlx5: Register devlink first under devlink lock")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240411115444.374475-4-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5: SD, Handle possible devcom ERR_PTR · aa4ac90d
      Tariq Toukan authored
      Check if devcom holds an error pointer and return immediately.
      
      This fixes Smatch static checker warning:
      drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c:221 sd_register()
      error: 'devcom' dereferencing possible ERR_PTR()
      
      Enhance mlx5_devcom_register_component() so it stops returning NULL,
      making it easier for its callers.
      
      Fixes: d3d05766 ("net/mlx5: SD, Implement devcom communication and primary election")
      Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
      Link: https://lore.kernel.org/all/f09666c8-e604-41f6-958b-4cc55c73faf9@gmail.com/T/
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Gal Pressman <gal@nvidia.com>
      Link: https://lore.kernel.org/r/20240411115444.374475-3-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5: Lag, restore buckets number to default after hash LAG deactivation · 37cc10da
      Shay Drory authored
      The cited patch introduces the concept of buckets in LAG in hash mode.
      However, the patch doesn't clear the number of buckets on LAG
      deactivation. This results in using the wrong number of buckets when
      a user creates a hash mode LAG and afterwards creates a non-hash
      mode LAG.
      
      Hence, restore buckets number to default after hash mode LAG
      deactivation.
      
      Fixes: 352899f3 ("net/mlx5: Lag, use buckets in hash mode")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240411115444.374475-2-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: sparx5: flower: fix fragment flags handling · 68aba004
      Asbjørn Sloth Tønnesen authored
      I noticed that only 3 out of the 4 input bits were used:
      mt.key->flags & FLOW_DIS_IS_FRAGMENT was never checked.
      
      In order to avoid a complicated maze, I converted it to
      use a 16 byte mapping table.
      
      As shown in the table below, the old heuristic doesn't
      always do the right thing, i.e. when FLOW_DIS_IS_FRAGMENT=1/1
      it used to match only follow-up fragment packets.
      
      Here are all the combinations, and their resulting new/old
      VCAP key/mask filter:
      
        /- FLOW_DIS_IS_FRAGMENT (key/mask)
        |    /- FLOW_DIS_FIRST_FRAG (key/mask)
        |    |    /-- new VCAP fragment (key/mask)
        v    v    v    v- old VCAP fragment (key/mask)
      
       0/0  0/0  -/-  -/-     impossible (due to entry cond. on mask)
       0/0  0/1  -/-  0/3 !!  invalid (can't match non-fragment + follow-up frag)
       0/0  1/0  -/-  -/-     impossible (key > mask)
       0/0  1/1  1/3  1/3     first fragment
      
       0/1  0/0  0/3  3/3 !!  not fragmented
       0/1  0/1  0/3  3/3 !!  not fragmented (+ not first fragment)
       0/1  1/0  -/-  -/-     impossible (key > mask)
       0/1  1/1  -/-  1/3 !!  invalid (non-fragment and first frag)
      
       1/0  0/0  -/-  -/-     impossible (key > mask)
       1/0  0/1  -/-  -/-     impossible (key > mask)
       1/0  1/0  -/-  -/-     impossible (key > mask)
       1/0  1/1  -/-  -/-     impossible (key > mask)
      
       1/1  0/0  1/1  3/3 !!  some fragment
       1/1  0/1  3/3  3/3     follow-up fragment
       1/1  1/0  -/-  -/-     impossible (key > mask)
       1/1  1/1  1/3  1/3     first fragment
      
      In the datasheet the VCAP fragment values are documented as:
       0 = no fragment
       1 = initial fragment
       2 = suspicious fragment
       3 = valid follow-up fragment
      
      Result: 3 combinations match the old behavior,
              3 combinations have been corrected,
              2 combinations are now invalid, and fail,
              8 combinations are impossible.
      
      It should now be aligned with how FLOW_DIS_IS_FRAGMENT
      and FLOW_DIS_FIRST_FRAG are set in __skb_flow_dissect() in
      net/core/flow_dissector.c
      
      Since the VCAP fragment values are not a bitfield, we have
      to ignore the suspicious fragment value, e.g. when matching
      on any kind of fragment with FLOW_DIS_IS_FRAGMENT=1/1.
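
      The table above can be written down as a small lookup. The Python
      sketch below is not the driver's mapping table. It only transcribes the
      "new VCAP fragment" column to check the counts in the result summary,
      and NEW_VCAP and lookup() are hypothetical names. The rule used for
      "impossible" (key > mask, or both masks zero) is read off the comments
      in the table.

```python
from collections import Counter
from itertools import product

# (is_frag_key, is_frag_mask, first_frag_key, first_frag_mask)
#   -> new VCAP fragment (key, mask), transcribed from the table
NEW_VCAP = {
    (0, 0, 1, 1): (1, 3),  # first fragment
    (0, 1, 0, 0): (0, 3),  # not fragmented
    (0, 1, 0, 1): (0, 3),  # not fragmented (+ not first fragment)
    (1, 1, 0, 0): (1, 1),  # some fragment
    (1, 1, 0, 1): (3, 3),  # follow-up fragment
    (1, 1, 1, 1): (1, 3),  # first fragment
}


def lookup(is_frag, first_frag):
    """Map two (key, mask) bit pairs to a VCAP (key, mask) or a status."""
    (ik, im), (fk, fm) = is_frag, first_frag
    if ik > im or fk > fm or (im == 0 and fm == 0):
        return "impossible"
    return NEW_VCAP.get((ik, im, fk, fm), "invalid")


bit_pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
summary = Counter(
    result if isinstance(result, str) else "valid"
    for result in (lookup(a, b) for a, b in product(bit_pairs, bit_pairs))
)
print(dict(summary))
```

      The 6 valid entries are the 3 unchanged plus the 3 corrected
      combinations. The "some fragment" entry (1/1) matches values 1 and 3
      only, which is how the suspicious value 2 ends up being ignored.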
      
      Only compile tested, and logic tested in userspace, as I
      unfortunately don't have access to this switch chip (yet).
      
      Fixes: d6c2964d ("net: microchip: sparx5: Adding more tc flower keys for the IS2 VCAP")
      Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
      Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
      Tested-by: Daniel Machon <daniel.machon@microchip.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20240411111321.114095-1-ast@fiberby.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'af_unix-fix-msg_oob-bugs-with-msg_peek' · 27f58f7f
      Jakub Kicinski authored
      Kuniyuki Iwashima says:
      
      ====================
      af_unix: Fix MSG_OOB bugs with MSG_PEEK.
      
      Currently, OOB data can be read without MSG_OOB accidentally
      in two cases, and this series fixes the bugs.
      
      v1: https://lore.kernel.org/netdev/20240409225209.58102-1-kuniyu@amazon.com/
      ====================
      
      Link: https://lore.kernel.org/r/20240410171016.7621-1-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • af_unix: Don't peek OOB data without MSG_OOB. · 22dd70eb
      Kuniyuki Iwashima authored
      Currently, we can read OOB data without MSG_OOB by using MSG_PEEK
      when OOB data is sitting on the front row, which is apparently
      wrong.
      
        >>> from socket import *
        >>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
        >>> c1.send(b'a', MSG_OOB)
        1
        >>> c2.recv(1, MSG_PEEK | MSG_DONTWAIT)
        b'a'
      
      If manage_oob() is called when no data has been copied, we only
      check if the socket enables SO_OOBINLINE or MSG_PEEK is not used.
      Otherwise, the skb is returned as is.
      
      However, here we should return NULL if MSG_PEEK is set and no data
      has been copied.
      
      Also, in such a case, we should not jump to the redo label because
      we will be caught in the loop and hog the CPU until normal data
      comes in.
      
      Then, we need to handle skb == NULL case with the if-clause below
      the manage_oob() block.
      
      With this patch:
      
        >>> from socket import *
        >>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
        >>> c1.send(b'a', MSG_OOB)
        1
        >>> c2.recv(1, MSG_PEEK | MSG_DONTWAIT)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        BlockingIOError: [Errno 11] Resource temporarily unavailable
      
      Fixes: 314001f0 ("af_unix: Add OOB support")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240410171016.7621-3-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • af_unix: Call manage_oob() for every skb in unix_stream_read_generic(). · 283454c8
      Kuniyuki Iwashima authored
      When we call recv() for an AF_UNIX socket, we first peek one skb and
      call manage_oob() to check if the skb is sent with MSG_OOB.
      
      However, when we fetch the next (and the following) skb, manage_oob()
      is not called now, leading to wrong behaviour.
      
      Let's say a socket send()s "hello" with MSG_OOB and the peer tries
      to recv() 5 bytes with MSG_PEEK.  Here, we should get only "hell"
      without 'o', but actually not:
      
        >>> from socket import *
        >>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
        >>> c1.send(b'hello', MSG_OOB)
        5
        >>> c2.recv(5, MSG_PEEK)
        b'hello'
      
      The first skb fills 4 bytes, and the next skb is peeked but not
      properly checked by manage_oob().
      
      Let's move up the again label to call manage_oob() for every skb.
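
      The effect of checking every skb can be shown with a toy model. This
      is a simplified Python sketch, not the kernel's read loop: the receive
      queue is modelled as a list of (payload, is_oob) tuples, and peek() and
      its check_every_skb flag are hypothetical names for illustration only.

```python
def peek(queue, size, check_every_skb):
    """Toy MSG_PEEK read over a queue of (payload, is_oob) skbs."""
    out = b""
    for index, (payload, is_oob) in enumerate(queue):
        if len(out) >= size:
            break
        # Before the patch, only the first skb went through the OOB check.
        oob_checked = check_every_skb or index == 0
        if oob_checked and is_oob:
            break  # stop in front of the OOB byte
        out += payload[: size - len(out)]
    return out


# send(b'hello', MSG_OOB): "hell" as normal data, 'o' as a separate OOB skb.
queue = [(b"hell", False), (b"o", True)]

before = peek(queue, 5, check_every_skb=False)
after = peek(queue, 5, check_every_skb=True)
print(before, after)
```

      In this model the unchecked second skb leaks the OOB byte (b'hello'),
      while checking every skb stops at b'hell'.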
      
      With this patch:
      
        >>> from socket import *
        >>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
        >>> c1.send(b'hello', MSG_OOB)
        5
        >>> c2.recv(5, MSG_PEEK)
        b'hell'
      
      Fixes: 314001f0 ("af_unix: Add OOB support")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240410171016.7621-2-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  5. 12 Apr, 2024 1 commit
    • Merge tag 'nf-24-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · 90be7a5c
      David S. Miller authored
      netfilter pull request 24-04-11
      
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for net:
      
      Patches #1 and #2 add missing rcu read side lock when iterating over
      expression and object type list which could race with module removal.
      
      Patch #3 prevents promisc packet from visiting the bridge/input hook
      	 to amend a recent fix to address conntrack confirmation race
      	 in br_netfilter and nf_conntrack_bridge.
      
      Patch #4 adds and uses iterate decorator type to fetch the current
      	 pipapo set backend datastructure view when netlink dumps the
      	 set elements.
      
      Patch #5 fixes removal of duplicate elements in the pipapo set backend.
      
      Patch #6 flowtable validates pppoe header before accessing it.
      
      Patch #7 fixes flowtable datapath for pppoe packets, otherwise lookup
               fails and pppoe packets follow classic path.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 11 Apr, 2024 23 commits
    • Merge tag 'net-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 2ae9a897
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from bluetooth.
      
        Current release - new code bugs:
      
         - netfilter: complete validation of user input
      
         - mlx5: disallow SRIOV switchdev mode when in multi-PF netdev
      
        Previous releases - regressions:
      
         - core: fix u64_stats_init() for lockdep when used repeatedly in one
           file
      
         - ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr
      
         - bluetooth: fix memory leak in hci_req_sync_complete()
      
         - batman-adv: avoid infinite loop trying to resize local TT
      
         - drv: geneve: fix header validation in geneve[6]_xmit_skb
      
         - drv: bnxt_en: fix possible memory leak in
           bnxt_rdma_aux_device_init()
      
         - drv: mlx5: offset comp irq index in name by one
      
         - drv: ena: avoid double-free clearing stale tx_info->xdpf value
      
         - drv: pds_core: fix pdsc_check_pci_health deadlock
      
        Previous releases - always broken:
      
         - xsk: validate user input for XDP_{UMEM|COMPLETION}_FILL_RING
      
         - bluetooth: fix setsockopt not validating user input
      
         - af_unix: clear stale u->oob_skb.
      
         - nfc: llcp: fix nfc_llcp_setsockopt() unsafe copies
      
         - drv: virtio_net: fix guest hangup on invalid RSS update
      
         - drv: mlx5e: Fix mlx5e_priv_init() cleanup flow
      
         - dsa: mt7530: trap link-local frames regardless of ST Port State"
      
      * tag 'net-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (59 commits)
        net: ena: Set tx_info->xdpf value to NULL
        net: ena: Fix incorrect descriptor free behavior
        net: ena: Wrong missing IO completions check order
        net: ena: Fix potential sign extension issue
        af_unix: Fix garbage collector racing against connect()
        net: dsa: mt7530: trap link-local frames regardless of ST Port State
        Revert "s390/ism: fix receive message buffer allocation"
        net: sparx5: fix wrong config being used when reconfiguring PCS
        net/mlx5: fix possible stack overflows
        net/mlx5: Disallow SRIOV switchdev mode when in multi-PF netdev
        net/mlx5e: RSS, Block XOR hash with over 128 channels
        net/mlx5e: Do not produce metadata freelist entries in Tx port ts WQE xmit
        net/mlx5e: HTB, Fix inconsistencies with QoS SQs number
        net/mlx5e: Fix mlx5e_priv_init() cleanup flow
        net/mlx5e: RSS, Block changing channels number when RXFH is configured
        net/mlx5: Correctly compare pkt reformat ids
        net/mlx5: Properly link new fs rules into the tree
        net/mlx5: offset comp irq index in name by one
        net/mlx5: Register devlink first under devlink lock
        net/mlx5: E-switch, store eswitch pointer before registering devlink_param
        ...
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · ab4319fd
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "The most important fix is the sg one because the regression it fixes
        (spurious warning and use after final put) is already backported to
        stable.
      
        The next biggest impact is the target fix for wrong credentials used
        to load a module because it's affecting new kernels installed on
        selinux based distributions.
      
        The other three fixes are an obvious off by one and SATA protocol
        issues"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: qla2xxx: Fix off by one in qla_edif_app_getstats()
        scsi: hisi_sas: Modify the deadline for ata_wait_after_reset()
        scsi: hisi_sas: Handle the NCQ error returned by D2H frame
        scsi: target: Fix SELinux error when systemd-modules loads the target module
        scsi: sg: Avoid race in error handling & drop bogus warn
    • Merge tag 'loongarch-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson · 5de6b467
      Linus Torvalds authored
      
      Pull LoongArch fixes from Huacai Chen:
      
       - make {virt, phys, page, pfn} translation work with KFENCE for
         LoongArch (otherwise NVMe and virtio-blk cannot work with KFENCE
         enabled)
      
       - update dts files for Loongson-2K series to make devices work
         correctly
      
       - fix a build error
      
      * tag 'loongarch-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
        LoongArch: Include linux/sizes.h in addrspace.h to prevent build errors
        LoongArch: Update dts for Loongson-2K2000 to support GMAC/GNET
        LoongArch: Update dts for Loongson-2K2000 to support PCI-MSI
        LoongArch: Update dts for Loongson-2K2000 to support ISA/LPC
        LoongArch: Update dts for Loongson-2K1000 to support ISA/LPC
        LoongArch: Make virt_addr_valid()/__virt_addr_valid() work with KFENCE
        LoongArch: Make {virt, phys, page, pfn} translation work with KFENCE
        mm: Move lowmem_page_address() a little later
    • Merge tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefs · e1dc191d
      Linus Torvalds authored
      Pull more bcachefs fixes from Kent Overstreet:
       "Notable user impacting bugs
      
         - On multi device filesystems, recovery was looping in
           btree_trans_too_many_iters(). This checks if a transaction has
           touched too many btree paths (because of iteration over many keys),
           and issues a restart to drop unneeded paths.
      
           But it's now possible for some paths to exceed the previous limit
           without iteration in the interior btree update path, since the
           transaction commit will do alloc updates for every old and new
           btree node, and during journal replay we don't use the btree write
           buffer for locking reasons and thus those updates use btree paths
           when they wouldn't normally.
      
         - Fix a corner case in rebalance when moving extents on a
           durability=0 device. This wouldn't be hit when a device was
           formatted with durability=0 since in that case we'll only use it as
           a write through cache (only cached extents will live on it), but
           durability can now be changed on an existing device.
      
         - bch2_get_acl() could rarely forget to handle a transaction restart;
           this manifested as the occasional missing acl that came back after
           dropping caches.
      
         - Fix a major performance regression on high iops multithreaded write
           workloads (only since 6.9-rc1); a previous fix for a deadlock in
           the interior btree update path to check the journal watermark
           introduced a dependency on the state of btree write buffer flushing
           that we didn't want.
      
         - Assorted other repair paths and recovery fixes"
      
      * tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefs: (25 commits)
        bcachefs: Fix __bch2_btree_and_journal_iter_init_node_iter()
        bcachefs: Kill read lock dropping in bch2_btree_node_lock_write_nofail()
        bcachefs: Fix a race in btree_update_nodes_written()
        bcachefs: btree_node_scan: Respect member.data_allowed
        bcachefs: Don't scan for btree nodes when we can reconstruct
        bcachefs: Fix check_topology() when using node scan
        bcachefs: fix eytzinger0_find_gt()
        bcachefs: fix bch2_get_acl() transaction restart handling
        bcachefs: fix the count of nr_freed_pcpu after changing bc->freed_nonpcpu list
        bcachefs: Fix gap buffer bug in bch2_journal_key_insert_take()
        bcachefs: Rename struct field swap to prevent macro naming collision
        MAINTAINERS: Add entry for bcachefs documentation
        Documentation: filesystems: Add bcachefs toctree
        bcachefs: JOURNAL_SPACE_LOW
        bcachefs: Disable errors=panic for BCH_IOCTL_FSCK_OFFLINE
        bcachefs: Fix BCH_IOCTL_FSCK_OFFLINE for encrypted filesystems
        bcachefs: fix rand_delete unit test
        bcachefs: fix ! vs ~ typo in __clear_bit_le64()
        bcachefs: Fix rebalance from durability=0 device
        bcachefs: Print shutdown journal sequence number
        ...
      e1dc191d
    • Linus Torvalds's avatar
      Merge tag 'tag-chrome-platform-fixes-for-v6.9-rc4' of... · 346668f0
      Linus Torvalds authored
      Merge tag 'tag-chrome-platform-fixes-for-v6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux
      
      Pull chrome platform fix from Tzung-Bi Shih:
       "Fix a NULL pointer dereference"
      
      * tag 'tag-chrome-platform-fixes-for-v6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux:
        platform/chrome: cros_ec_uart: properly fix race condition
      346668f0
    • Pablo Neira Ayuso's avatar
      netfilter: flowtable: incorrect pppoe tuple · 6db5dc7b
      Pablo Neira Ayuso authored
      pppoe traffic reaching ingress path does not match the flowtable entry
      because the pppoe header is expected to be at the network header offset.
      This bug causes a mismatch in the flow table lookup, so pppoe packets
      enter the classical forwarding path.
      
      Fixes: 72efd585 ("netfilter: flowtable: add pppoe support")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      6db5dc7b
    • Pablo Neira Ayuso's avatar
      netfilter: flowtable: validate pppoe header · 87b3593b
      Pablo Neira Ayuso authored
      Ensure there is sufficient room to access the protocol field of the
      PPPoe header. Validate it once before the flowtable lookup, then use a
      helper function to access protocol field.
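
      The validate-once-then-access pattern described above can be sketched in
      userspace C as follows. This is only an illustration: the function names,
      the flat byte buffer, and the constants are assumptions, not the helpers
      added by this patch. It assumes the usual 6-byte PPPoE header followed by
      a 2-byte PPP protocol field in network byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 6-byte PPPoE header followed by a 2-byte PPP protocol. */
#define SKETCH_PPPOE_HDR_LEN   6u
#define SKETCH_PPPOE_SES_HLEN  8u

/* Hypothetical check: is the protocol field fully inside the buffer? */
static bool sketch_pppoe_hdr_ok(size_t len, size_t offset)
{
	return offset <= len && len - offset >= SKETCH_PPPOE_SES_HLEN;
}

/* Hypothetical accessor: call only after sketch_pppoe_hdr_ok() succeeded. */
static uint16_t sketch_pppoe_proto(const uint8_t *buf, size_t len, size_t offset)
{
	assert(sketch_pppoe_hdr_ok(len, offset));

	const uint8_t *p = buf + offset + SKETCH_PPPOE_HDR_LEN;

	return (uint16_t)((p[0] << 8) | p[1]);	/* network byte order */
}
```

      Keeping the length check in one place means the accessor can stay
      trivial, and a caller cannot read the protocol field from a truncated
      packet without tripping the assertion.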
      
      Reported-by: syzbot+b6f07e1c07ef40199081@syzkaller.appspotmail.com
      Fixes: 72efd585 ("netfilter: flowtable: add pppoe support")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      87b3593b
    • Florian Westphal's avatar
      netfilter: nft_set_pipapo: do not free live element · 3cfc9ec0
      Florian Westphal authored
      Pablo reports a crash with large batches of elements with a
      back-to-back add/remove pattern.  Quoting Pablo:
      
        add_elem("00000000") timeout 100 ms
        ...
        add_elem("0000000X") timeout 100 ms
        del_elem("0000000X") <---------------- delete one that was just added
        ...
        add_elem("00005000") timeout 100 ms
      
        1) nft_pipapo_remove() removes element 0000000X
        Then, KASAN shows a splat.
      
      Looking at the remove function there is a chance that we will drop a
      rule that maps to a non-deactivated element.
      
      Removal happens in two steps, first we do a lookup for key k and return the
      to-be-removed element and mark it as inactive in the next generation.
      Then, in a second step, the element gets removed from the set/map.
      
      The _remove function does not work correctly if more than one
      element shares the same key.
      
      This can happen if we insert an element into a set when the set already
      holds an element with same key, but the element mapping to the existing
      key has timed out or is not active in the next generation.
      
      In such a case it's possible that removal will unmap the wrong element.
      If this happens, we will leak the non-deactivated element, because it
      becomes unreachable.
      
      The element that got deactivated (and will be freed later) will
      remain reachable in the set data structure, this can result in
      a crash when such an element is retrieved during lookup (stale
      pointer).
      
      Add a check that the fully matching key does in fact map to the element
      that we have marked as inactive in the deactivation step.
      If not, we need to continue searching.
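
      The added check can be illustrated with a small userspace sketch. The
      structures and the function below are hypothetical stand-ins, not the
      pipapo data structures: each "rule" maps a key to an element, and removal
      must unmap the rule whose key matches and whose element is the one that
      was marked inactive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a set element and a rule mapping to it. */
struct sketch_elem { int key; bool inactive_next_gen; };
struct sketch_rule { int key; struct sketch_elem *e; bool mapped; };

/*
 * Unmap the rule for the element 'victim' that the deactivation step
 * marked inactive. Returns the index unmapped, or -1 if none was found.
 */
static int sketch_remove(struct sketch_rule *rules, size_t n,
			 struct sketch_elem *victim)
{
	assert(victim->inactive_next_gen);

	for (size_t i = 0; i < n; i++) {
		if (!rules[i].mapped || rules[i].key != victim->key)
			continue;
		/* Same key but a different element: keep searching. */
		if (rules[i].e != victim)
			continue;
		rules[i].mapped = false;
		return (int)i;
	}
	return -1;	/* caller bug: the element was not reachable */
}
```

      Without the pointer comparison, the first rule with a matching key would
      be unmapped even if it belongs to a live element.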
      
      Add a bug/warn trap at the end of the function as well, the remove
      function must not ever be called with an invisible/unreachable/non-existent
      element.
      
      v2: avoid unneeded temporary variable (Stefano)
      
      Fixes: 3c4287f6 ("nf_tables: Add set type for arbitrary concatenation of ranges")
      Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      3cfc9ec0
    • Pablo Neira Ayuso's avatar
      netfilter: nft_set_pipapo: walk over current view on netlink dump · 29b359cf
      Pablo Neira Ayuso authored
      The generation mask can be updated while netlink dump is in progress.
      The pipapo set backend walk iterator cannot rely on it to infer what
      view of the datastructure is to be used. Add notation to specify if user
      wants to read/update the set.
      
      Based on patch from Florian Westphal.
      
      Fixes: 2b84e215 ("netfilter: nft_set_pipapo: .walk does not deal with generations")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      29b359cf
    • Pablo Neira Ayuso's avatar
      netfilter: br_netfilter: skip conntrack input hook for promisc packets · 751de201
      Pablo Neira Ayuso authored
      For historical reasons, when bridge device is in promisc mode, packets
      that are directed to the taps follow bridge input hook path. This patch
      adds a workaround to reset conntrack for these packets.
      
      Jianbo Liu reports warning splats in their test infrastructure where
      cloned packets reach the br_netfilter input hook to confirm the
      conntrack object.
      
      Scratch one bit from BR_INPUT_SKB_CB to annotate that this packet has
      reached the input hook because it is passed up to the bridge device to
      reach the taps.
      
      [   57.571874] WARNING: CPU: 1 PID: 0 at net/bridge/br_netfilter_hooks.c:616 br_nf_local_in+0x157/0x180 [br_netfilter]
      [   57.572749] Modules linked in: xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat xt_addrtype xt_conntrack nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_isc si ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core mlx5ctl mlx5_core
      [   57.575158] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.8.0+ #19
      [   57.575700] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      [   57.576662] RIP: 0010:br_nf_local_in+0x157/0x180 [br_netfilter]
      [   57.577195] Code: fe ff ff 41 bd 04 00 00 00 be 04 00 00 00 e9 4a ff ff ff be 04 00 00 00 48 89 ef e8 f3 a9 3c e1 66 83 ad b4 00 00 00 04 eb 91 <0f> 0b e9 f1 fe ff ff 0f 0b e9 df fe ff ff 48 89 df e8 b3 53 47 e1
      [   57.578722] RSP: 0018:ffff88885f845a08 EFLAGS: 00010202
      [   57.579207] RAX: 0000000000000002 RBX: ffff88812dfe8000 RCX: 0000000000000000
      [   57.579830] RDX: ffff88885f845a60 RSI: ffff8881022dc300 RDI: 0000000000000000
      [   57.580454] RBP: ffff88885f845a60 R08: 0000000000000001 R09: 0000000000000003
      [   57.581076] R10: 00000000ffff1300 R11: 0000000000000002 R12: 0000000000000000
      [   57.581695] R13: ffff8881047ffe00 R14: ffff888108dbee00 R15: ffff88814519b800
      [   57.582313] FS:  0000000000000000(0000) GS:ffff88885f840000(0000) knlGS:0000000000000000
      [   57.583040] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   57.583564] CR2: 000000c4206aa000 CR3: 0000000103847001 CR4: 0000000000370eb0
      [   57.584194] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   57.584820] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   57.585440] Call Trace:
      [   57.585721]  <IRQ>
      [   57.585976]  ? __warn+0x7d/0x130
      [   57.586323]  ? br_nf_local_in+0x157/0x180 [br_netfilter]
      [   57.586811]  ? report_bug+0xf1/0x1c0
      [   57.587177]  ? handle_bug+0x3f/0x70
      [   57.587539]  ? exc_invalid_op+0x13/0x60
      [   57.587929]  ? asm_exc_invalid_op+0x16/0x20
      [   57.588336]  ? br_nf_local_in+0x157/0x180 [br_netfilter]
      [   57.588825]  nf_hook_slow+0x3d/0xd0
      [   57.589188]  ? br_handle_vlan+0x4b/0x110
      [   57.589579]  br_pass_frame_up+0xfc/0x150
      [   57.589970]  ? br_port_flags_change+0x40/0x40
      [   57.590396]  br_handle_frame_finish+0x346/0x5e0
      [   57.590837]  ? ipt_do_table+0x32e/0x430
      [   57.591221]  ? br_handle_local_finish+0x20/0x20
      [   57.591656]  br_nf_hook_thresh+0x4b/0xf0 [br_netfilter]
      [   57.592286]  ? br_handle_local_finish+0x20/0x20
      [   57.592802]  br_nf_pre_routing_finish+0x178/0x480 [br_netfilter]
      [   57.593348]  ? br_handle_local_finish+0x20/0x20
      [   57.593782]  ? nf_nat_ipv4_pre_routing+0x25/0x60 [nf_nat]
      [   57.594279]  br_nf_pre_routing+0x24c/0x550 [br_netfilter]
      [   57.594780]  ? br_nf_hook_thresh+0xf0/0xf0 [br_netfilter]
      [   57.595280]  br_handle_frame+0x1f3/0x3d0
      [   57.595676]  ? br_handle_local_finish+0x20/0x20
      [   57.596118]  ? br_handle_frame_finish+0x5e0/0x5e0
      [   57.596566]  __netif_receive_skb_core+0x25b/0xfc0
      [   57.597017]  ? __napi_build_skb+0x37/0x40
      [   57.597418]  __netif_receive_skb_list_core+0xfb/0x220
      
      Fixes: 62e7151a ("netfilter: bridge: confirm multicast packets before passing them up the stack")
      Reported-by: Jianbo Liu <jianbol@nvidia.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      751de201
    • Ziyang Xuan's avatar
      netfilter: nf_tables: Fix potential data-race in __nft_obj_type_get() · d78d867d
      Ziyang Xuan authored
      nft_unregister_obj() can run concurrently with __nft_obj_type_get(),
      and nothing protects the iteration over the nf_tables_objects list in
      __nft_obj_type_get(). Therefore, there is a potential data race on
      nf_tables_objects list entries.
      
      Use list_for_each_entry_rcu() to iterate over nf_tables_objects
      list in __nft_obj_type_get(), and use rcu_read_lock() in the caller
      nft_obj_type_get() to protect the entire type query process.
      
      Fixes: e5009240 ("netfilter: nf_tables: add stateful objects")
      Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      d78d867d
    • Ziyang Xuan's avatar
      netfilter: nf_tables: Fix potential data-race in __nft_expr_type_get() · f969eb84
      Ziyang Xuan authored
      nft_unregister_expr() can run concurrently with __nft_expr_type_get(),
      and nothing protects the iteration over the nf_tables_expressions list
      in __nft_expr_type_get(). Therefore, there is a potential data race on
      nf_tables_expressions list entries.
      
      Use list_for_each_entry_rcu() to iterate over nf_tables_expressions
      list in __nft_expr_type_get(), and use rcu_read_lock() in the caller
      nft_expr_type_get() to protect the entire type query process.
      
      Fixes: ef1f7df9 ("netfilter: nf_tables: expression ops overloading")
      Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      f969eb84
    • Paolo Abeni's avatar
      Merge branch 'ena-driver-bug-fixes' · 4e1ad31c
      Paolo Abeni authored
      David Arinzon says:
      
      ====================
      ENA driver bug fixes
      
      From: David Arinzon <darinzon@amazon.com>
      
      This patchset contains multiple bug fixes for the
      ENA driver.
      ====================
      
      Link: https://lore.kernel.org/r/20240410091358.16289-1-darinzon@amazon.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      4e1ad31c
    • David Arinzon's avatar
      net: ena: Set tx_info->xdpf value to NULL · 36a1ca01
      David Arinzon authored
      The patch mentioned in the `Fixes` tag removed the explicit assignment
      of tx_info->xdpf to NULL with the justification that there's no need
      to set tx_info->xdpf to NULL and tx_info->num_of_bufs to 0 in case
      of a mapping error. Both values won't be used once the mapping function
      returns an error, and their values would be overridden by the next
      transmitted packet.
      
      While both values do indeed get overridden in the next transmission
      call, the value of tx_info->xdpf is also used to check whether a TX
      descriptor's transmission has been completed (i.e. a completion for it
      was polled).
      
      An example scenario:
      1. Mapping failed, tx_info->xdpf wasn't set to NULL
      2. A VF reset occurred leading to IO resource destruction and
         a call to ena_free_tx_bufs() function
      3. Although the descriptor whose mapping failed was freed by the
         transmission function, it still passes the check
           if (!tx_info->skb)
      
         (skb and xdp_frame are in a union)
      4. The xdp_frame associated with the descriptor is freed twice
      
      This patch restores the assignment of NULL to tx_info->xdpf so that the
      cleaning function knows that the descriptor is already freed.
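
      The scenario above can be reduced to a small userspace sketch. The
      structure and helpers are hypothetical simplifications, not the driver's
      definitions; the only property carried over from the text is that skb and
      xdpf share storage in a union, so a stale xdpf pointer makes the
      "if (!tx_info->skb)" style check believe the entry still owns a buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified TX bookkeeping entry; skb and xdpf share storage. */
struct sketch_tx_info {
	union {
		void *skb;
		void *xdpf;
	};
	int num_of_bufs;
};

/* Error path of a mapping attempt: clear the pointer so cleanup skips it. */
static void sketch_map_failed(struct sketch_tx_info *ti)
{
	assert(ti != NULL);
	ti->xdpf = NULL;
	ti->num_of_bufs = 0;
}

/* Cleanup check: does this entry still own an unfreed buffer? */
static bool sketch_needs_free(const struct sketch_tx_info *ti)
{
	return ti->skb != NULL;
}
```

      If sketch_map_failed() did not clear the pointer, sketch_needs_free()
      would keep returning true for a frame that was already released, which is
      the double free described in step 4.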
      
      Fixes: 504fd6a5 ("net: ena: fix DMA mapping function issues in XDP")
      Signed-off-by: Shay Agroskin <shayagr@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      36a1ca01
    • David Arinzon's avatar
      net: ena: Fix incorrect descriptor free behavior · bf02d9fe
      David Arinzon authored
      ENA has two types of TX queues:
      - queues which only process TX packets arriving from the network stack
      - queues which only process TX packets forwarded to it by XDP_REDIRECT
        or XDP_TX instructions
      
      The ena_free_tx_bufs() cycles through all descriptors in a TX queue
      and unmaps + frees every descriptor that hasn't been acknowledged yet
      by the device (uncompleted TX transactions).
      The function assumes that the processed TX queue is necessarily from
      the first category listed above and ends up using napi_consume_skb()
      for descriptors belonging to an XDP specific queue.
      
      This patch solves a bug in which, in case of a VF reset, the
      descriptors aren't freed correctly, leading to crashes.
      
      Fixes: 548c4940 ("net: ena: Implement XDP_TX action")
      Signed-off-by: Shay Agroskin <shayagr@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      bf02d9fe
    • David Arinzon's avatar
      net: ena: Wrong missing IO completions check order · f7e41718
      David Arinzon authored
      Missing IO completions check is called every second (HZ jiffies).
      This commit fixes several issues with this check:
      
      1. Duplicate queues check:
         Max of 4 queues are scanned on each check due to monitor budget.
         Once reaching the budget, this check exits under the assumption that
         the next check will continue to scan the remainder of the queues,
         but in practice, next check will first scan the last already scanned
         queue which is not necessary and may cause the full queue scan to
         last a couple of seconds longer.
         The fix is to start every check with the next queue to scan.
         For example, on 8 IO queues:
         Bug: [0,1,2,3], [3,4,5,6], [6,7]
         Fix: [0,1,2,3], [4,5,6,7]
      
      2. Unbalanced queues check:
         In case the number of active IO queues is not a multiple of budget,
         there will be checks which don't utilize the full budget
         because the full scan exits when reaching the last queue id.
         The fix is to run every TX completion check with exact queue budget
         regardless of the queue id.
         For example, on 7 IO queues:
         Bug: [0,1,2,3], [4,5,6], [0,1,2,3]
         Fix: [0,1,2,3], [4,5,6,0], [1,2,3,4]
         The budget may be lowered in case the number of IO queues is less
         than the budget (4) to make sure there are no duplicate queues on
         the same check.
         For example, on 3 IO queues:
         Bug: [0,1,2,0], [1,2,0,1]
         Fix: [0,1,2], [0,1,2]
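
      The fixed scan order can be sketched as a round-robin walk with a capped
      budget. This is an illustration under stated assumptions, not the driver
      code: the function name, the "next" cursor argument, and the output array
      are hypothetical; the budget of 4 and the expected sequences come from
      the examples above.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MONITOR_BUDGET 4u

/*
 * Fill 'out' with the queue ids scanned by one check, starting at *next,
 * and advance *next. Returns the number of ids written.
 */
static size_t sketch_scan_once(unsigned int num_queues, unsigned int *next,
			       unsigned int *out)
{
	assert(num_queues > 0 && *next < num_queues);

	/* Lower the budget so no queue is scanned twice in one check. */
	unsigned int budget = num_queues < SKETCH_MONITOR_BUDGET ?
			      num_queues : SKETCH_MONITOR_BUDGET;

	for (unsigned int i = 0; i < budget; i++) {
		out[i] = *next;
		/* Wrap around instead of stopping at the last queue id. */
		*next = (*next + 1) % num_queues;
	}
	return budget;
}
```

      Called repeatedly with 7 queues, the sketch yields [0,1,2,3], [4,5,6,0],
      [1,2,3,4]; with 3 queues it yields [0,1,2] on every check.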
      
      Fixes: 1738cd3e ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)")
      Signed-off-by: Amit Bernstein <amitbern@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      f7e41718
    • David Arinzon's avatar
      net: ena: Fix potential sign extension issue · 713a8519
      David Arinzon authored
      Small unsigned types are promoted to larger signed types in
      the case of multiplication, the result of which may overflow.
      In case the result of such a multiplication has its MSB
      turned on, it will be sign extended with '1's.
      This changes the multiplication result.
      
      Code example of the phenomenon:
      -------------------------------
      u16 x, y;
      size_t z1, z2;
      
      x = y = 0xffff;
      printk("x=%x y=%x\n",x,y);
      
      z1 = x*y;
      z2 = (size_t)x*y;
      
      printk("z1=%lx z2=%lx\n", z1, z2);
      
      Output:
      -------
      x=ffff y=ffff
      z1=fffffffffffe0001 z2=fffe0001
      
      The expected result of ffff*ffff is fffe0001, and without the
      explicit casting to avoid the unwanted sign extension we got
      fffffffffffe0001.
      
      This commit adds an explicit casting to avoid the sign extension
      issue.
      
      Fixes: 689b2bda ("net: ena: add functions for handling Low Latency Queues in ena_com")
      Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      713a8519
    • Paolo Abeni's avatar
      Merge tag 'for-net-2024-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth · fe3eb406
      Paolo Abeni authored
      Luiz Augusto von Dentz says:
      
      ====================
      bluetooth pull request for net:
      
        - L2CAP: Don't double set the HCI_CONN_MGMT_CONNECTED bit
        - Fix memory leak in hci_req_sync_complete
        - hci_sync: Fix using the same interval and window for Coded PHY
        - Fix not validating setsockopt user input
      
      * tag 'for-net-2024-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
        Bluetooth: l2cap: Don't double set the HCI_CONN_MGMT_CONNECTED bit
        Bluetooth: hci_sock: Fix not validating setsockopt user input
        Bluetooth: ISO: Fix not validating setsockopt user input
        Bluetooth: L2CAP: Fix not validating setsockopt user input
        Bluetooth: RFCOMM: Fix not validating setsockopt user input
        Bluetooth: SCO: Fix not validating setsockopt user input
        Bluetooth: Fix memory leak in hci_req_sync_complete()
        Bluetooth: hci_sync: Fix using the same interval and window for Coded PHY
        Bluetooth: ISO: Don't reject BT_ISO_QOS if parameters are unset
      ====================
      
      Link: https://lore.kernel.org/r/20240410191610.4156653-1-luiz.dentz@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      fe3eb406
    • Michal Luczaj's avatar
      af_unix: Fix garbage collector racing against connect() · 47d8ac01
      Michal Luczaj authored
      Garbage collector does not take into account the risk of embryo getting
      enqueued during the garbage collection. If such embryo has a peer that
      carries SCM_RIGHTS, two consecutive passes of scan_children() may see a
      different set of children. This leads to an incorrectly elevated inflight
      count, and then a dangling pointer within the gc_inflight_list.
      
      sockets are AF_UNIX/SOCK_STREAM
      S is an unconnected socket
      L is a listening in-flight socket bound to addr, not in fdtable
      V's fd will be passed via sendmsg(), gets inflight count bumped
      
      connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
      ----------------	-------------------------	-----------
      
      NS = unix_create1()
      skb1 = sock_wmalloc(NS)
      L = unix_find_other(addr)
      unix_state_lock(L)
      unix_peer(S) = NS
      			// V count=1 inflight=0
      
       			NS = unix_peer(S)
       			skb2 = sock_alloc()
      			skb_queue_tail(NS, skb2[V])
      
      			// V became in-flight
      			// V count=2 inflight=1
      
      			close(V)
      
      			// V count=1 inflight=1
      			// GC candidate condition met
      
      						for u in gc_inflight_list:
      						  if (total_refs == inflight_refs)
      						    add u to gc_candidates
      
      						// gc_candidates={L, V}
      
      						for u in gc_candidates:
      						  scan_children(u, dec_inflight)
      
      						// embryo (skb1) was not
      						// reachable from L yet, so V's
      						// inflight remains unchanged
      __skb_queue_tail(L, skb1)
      unix_state_unlock(L)
      						for u in gc_candidates:
      						  if (u.inflight)
      						    scan_children(u, inc_inflight_move_tail)
      
      						// V count=1 inflight=2 (!)
      
      If there is a GC-candidate listening socket, lock/unlock its state. This
      makes GC wait until the end of any ongoing connect() to that socket. After
      flipping the lock, a possibly SCM-laden embryo is already enqueued. And if
      there is another embryo coming, it can not possibly carry SCM_RIGHTS. At
      this point, unix_inflight() can not happen because unix_gc_lock is already
      taken. Inflight graph remains unaffected.
      
      Fixes: 1fd05ba5 ("[AF_UNIX]: Rewrite garbage collector, fixes race.")
      Signed-off-by: Michal Luczaj <mhal@rbox.co>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240409201047.1032217-1-mhal@rbox.co
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      47d8ac01
    • Arınç ÜNAL's avatar
      net: dsa: mt7530: trap link-local frames regardless of ST Port State · 17c56011
      Arınç ÜNAL authored
      In Clause 5 of IEEE Std 802-2014, two sublayers of the data link layer
      (DLL) of the Open Systems Interconnection basic reference model (OSI/RM)
      are described; the medium access control (MAC) and logical link control
      (LLC) sublayers. The MAC sublayer is the one facing the physical layer.
      
      In 8.2 of IEEE Std 802.1Q-2022, the Bridge architecture is described. A
      Bridge component comprises a MAC Relay Entity for interconnecting the Ports
      of the Bridge, at least two Ports, and higher layer entities with at least
      a Spanning Tree Protocol Entity included.
      
      Each Bridge Port also functions as an end station and shall provide the MAC
      Service to an LLC Entity. Each instance of the MAC Service is provided to a
      distinct LLC Entity that supports protocol identification, multiplexing,
      and demultiplexing, for protocol data unit (PDU) transmission and reception
      by one or more higher layer entities.
      
      It is described in 8.13.9 of IEEE Std 802.1Q-2022 that in a Bridge, the LLC
      Entity associated with each Bridge Port is modeled as being directly
      connected to the attached Local Area Network (LAN).
      
      On the switch with CPU port architecture, CPU port functions as Management
      Port, and the Management Port functionality is provided by software which
      functions as an end station. Software is connected to an IEEE 802 LAN that
      is wholly contained within the system that incorporates the Bridge.
      Software provides access to the LLC Entity associated with each Bridge Port
      by the value of the source port field on the special tag on the frame
      received by software.
      
      We call frames that carry control information to determine the active
      topology and current extent of each Virtual Local Area Network (VLAN),
      i.e., spanning tree or Shortest Path Bridging (SPB) and Multiple VLAN
      Registration Protocol Data Units (MVRPDUs), and frames from other link
      constrained protocols, such as Extensible Authentication Protocol over LAN
      (EAPOL) and Link Layer Discovery Protocol (LLDP), link-local frames. They
      are not forwarded by a Bridge. Permanently configured entries in the
      filtering database (FDB) ensure that such frames are discarded by the
      Forwarding Process. In 8.6.3 of IEEE Std 802.1Q-2022, this is described in
      detail:
      
      Each of the reserved MAC addresses specified in Table 8-1
      (01-80-C2-00-00-[00,01,02,03,04,05,06,07,08,09,0A,0B,0C,0D,0E,0F]) shall be
      permanently configured in the FDB in C-VLAN components and ERs.
      
      Each of the reserved MAC addresses specified in Table 8-2
      (01-80-C2-00-00-[01,02,03,04,05,06,07,08,09,0A,0E]) shall be permanently
      configured in the FDB in S-VLAN components.
      
      Each of the reserved MAC addresses specified in Table 8-3
      (01-80-C2-00-00-[01,02,04,0E]) shall be permanently configured in the FDB
      in TPMR components.
      
      The FDB entries for reserved MAC addresses shall specify filtering for all
      Bridge Ports and all VIDs. Management shall not provide the capability to
      modify or remove entries for reserved MAC addresses.
      
      The addresses in Table 8-1, Table 8-2, and Table 8-3 determine the scope of
      propagation of PDUs within a Bridged Network, as follows:
      
        The Nearest Bridge group address (01-80-C2-00-00-0E) is an address that
        no conformant Two-Port MAC Relay (TPMR) component, Service VLAN (S-VLAN)
        component, Customer VLAN (C-VLAN) component, or MAC Bridge can forward.
        PDUs transmitted using this destination address, or any other addresses
        that appear in Table 8-1, Table 8-2, and Table 8-3
        (01-80-C2-00-00-[00,01,02,03,04,05,06,07,08,09,0A,0B,0C,0D,0E,0F]), can
        therefore travel no further than those stations that can be reached via a
        single individual LAN from the originating station.
      
        The Nearest non-TPMR Bridge group address (01-80-C2-00-00-03), is an
        address that no conformant S-VLAN component, C-VLAN component, or MAC
        Bridge can forward; however, this address is relayed by a TPMR component.
        PDUs using this destination address, or any of the other addresses that
        appear in both Table 8-1 and Table 8-2 but not in Table 8-3
        (01-80-C2-00-00-[00,03,05,06,07,08,09,0A,0B,0C,0D,0F]), will be relayed
        by any TPMRs but will propagate no further than the nearest S-VLAN
        component, C-VLAN component, or MAC Bridge.
      
        The Nearest Customer Bridge group address (01-80-C2-00-00-00) is an
        address that no conformant C-VLAN component, MAC Bridge can forward;
        however, it is relayed by TPMR components and S-VLAN components. PDUs
        using this destination address, or any of the other addresses that appear
        in Table 8-1 but not in either Table 8-2 or Table 8-3
        (01-80-C2-00-00-[00,0B,0C,0D,0F]), will be relayed by TPMR components and
        S-VLAN components but will propagate no further than the nearest C-VLAN
        component or MAC Bridge.
      
      Because the LLC Entity associated with each Bridge Port is provided via CPU
      port, we must not filter these frames but forward them to CPU port.
      
      In a Bridge, the transmission Port is mainly decided by ingress and egress
      rules, FDB, and spanning tree Port State functions of the Forwarding
      Process. For link-local frames, only CPU port should be designated as
      destination port in the FDB, and the other functions of the Forwarding
      Process must not interfere with the decision of the transmission Port. We
      call this process trapping frames to CPU port.
      
      Therefore, on the switch with CPU port architecture, link-local frames must
      be trapped to CPU port, and certain link-local frames received by a Port of
      a Bridge comprising a TPMR component or an S-VLAN component must be
      excluded from it.
      
      A Bridge of the switch with CPU port architecture cannot comprise a
      Two-Port MAC Relay (TPMR) component as a TPMR component supports only a
      subset of the functionality of a MAC Bridge. A Bridge comprising two Ports
      (Management Port doesn't count) of this architecture will either function
      as a standard MAC Bridge or a standard VLAN Bridge.
      
      Therefore, a Bridge of this architecture can only comprise S-VLAN
      components, C-VLAN components, or MAC Bridge components. Since there's no
      TPMR component, we don't need to relay PDUs using the destination addresses
      specified in the Nearest non-TPMR section, and the portion of the
      Nearest Customer Bridge section where they must be relayed by TPMR
      components.
      
      One option to trap link-local frames to CPU port is to add static FDB
      entries with CPU port designated as destination port. However, because
      Independent VLAN Learning (IVL) is being used on every VID, each entry only
      applies to a single VLAN Identifier (VID). For a Bridge comprising a MAC
      Bridge component or a C-VLAN component, there would have to be 16 times
      4096 entries. This switch intellectual property can only hold a maximum of
      2048 entries. Using this option, there also isn't a mechanism to prevent
      link-local frames from being discarded when the spanning tree Port State of
      the reception Port is discarding.
      
      The remaining option is to utilise the BPC, RGAC1, RGAC2, RGAC3, and RGAC4
      registers. Whilst this applies to every VID, it doesn't contain all of the
      reserved MAC addresses without affecting the remaining Standard Group MAC
      Addresses. The REV_UN frame tag utilised using the RGAC4 register covers
      the remaining 01-80-C2-00-00-[04,05,06,07,08,09,0A,0B,0C,0D,0F] destination
      addresses. It also includes the 01-80-C2-00-00-22 to 01-80-C2-00-00-FF
      destination addresses which may be relayed by MAC Bridges or VLAN Bridges.
      The latter option provides better but not complete conformance.
      
      This switch intellectual property also does not provide a mechanism to trap
      link-local frames with specific destination addresses to CPU port by
      Bridge, to conform to the filtering rules for the distinct Bridge
      components.
      
      Therefore, regardless of the type of the Bridge component, link-local
      frames with these destination addresses will be trapped to CPU port:
      
      01-80-C2-00-00-[00,01,02,03,0E]
      
      In a Bridge comprising a MAC Bridge component or a C-VLAN component:
      
        Link-local frames with these destination addresses won't be trapped to
        CPU port, which doesn't conform to IEEE Std 802.1Q-2022:
      
        01-80-C2-00-00-[04,05,06,07,08,09,0A,0B,0C,0D,0F]
      
      In a Bridge comprising an S-VLAN component:
      
        Link-local frames with these destination addresses will be trapped to CPU
        port, which doesn't conform to IEEE Std 802.1Q-2022:
      
        01-80-C2-00-00-00
      
        Link-local frames with these destination addresses won't be trapped to
        CPU port, which doesn't conform to IEEE Std 802.1Q-2022:
      
        01-80-C2-00-00-[04,05,06,07,08,09,0A]
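
      The address lists above can be restated as a small lookup. This is a
      sketch built only from the lists in this message, not the driver's
      logic; all names (TRAPPED, MISSED, OVER_TRAPPED, the component keys)
      are hypothetical.

```python
# Restates the address lists from the commit message; not driver code.
PREFIX = "01-80-C2-00-00-"

# Last octets trapped to the CPU port for every Bridge component type.
TRAPPED = {0x00, 0x01, 0x02, 0x03, 0x0E}

# Last octets that are not trapped and are listed as non-conforming.
MISSED = {
    "mac_bridge_or_c_vlan": set(range(0x04, 0x0E)) | {0x0F},  # 04..0D, 0F
    "s_vlan": set(range(0x04, 0x0B)),                         # 04..0A
}

# Last octets that are trapped and are listed as non-conforming.
OVER_TRAPPED = {
    "mac_bridge_or_c_vlan": set(),
    "s_vlan": {0x00},
}


def last_octet(mac):
    """Return the final octet of a 01-80-C2-00-00-XX address as an int."""
    mac = mac.upper()
    if not mac.startswith(PREFIX) or len(mac) != len(PREFIX) + 2:
        raise ValueError("not a 01-80-C2-00-00-XX address: " + mac)
    return int(mac[len(PREFIX):], 16)


def is_trapped(mac):
    """True if the message says frames to this address are trapped."""
    return last_octet(mac) in TRAPPED


def listed_as_nonconforming(mac, component):
    """True if the message lists this address as non-conforming for component."""
    octet = last_octet(mac)
    return octet in MISSED[component] or octet in OVER_TRAPPED[component]
```

      For a MAC Bridge or C-VLAN component, the trapped set and the missed
      set together account for all 16 addresses from 00 to 0F.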
      
      Currently on this switch intellectual property, if the spanning tree Port
      State of the reception Port is discarding, link-local frames will be
      discarded.
      
      To trap link-local frames regardless of the spanning tree Port State, make
      the switch regard them as Bridge Protocol Data Units (BPDUs). This switch
      intellectual property only lets the frames regarded as BPDUs bypass the
      spanning tree Port State function of the Forwarding Process.
      
      With this change, the only remaining interference is the ingress rules.
      When the reception Port has no PVID assigned in software, VLAN-untagged
      frames won't be allowed in. There doesn't seem to be a mechanism on the
      switch intellectual property to have link-local frames bypass this function
      of the Forwarding Process.
      
      Fixes: b8f126a8 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
      Reviewed-by: Daniel Golle <daniel@makrotopia.org>
      Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
      Link: https://lore.kernel.org/r/20240409-b4-for-net-mt7530-fix-link-local-when-stp-discarding-v2-1-07b1150164ac@arinc9.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Gerd Bayer's avatar
      Revert "s390/ism: fix receive message buffer allocation" · d51dc8dd
      Gerd Bayer authored
      This reverts commit 58effa34.
      Review was not finished on this patch. So it's not ready for
      upstreaming.
      Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
      Link: https://lore.kernel.org/r/20240409113753.2181368-1-gbayer@linux.ibm.com
      Fixes: 58effa34 ("s390/ism: fix receive message buffer allocation")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Daniel Machon's avatar
      net: sparx5: fix wrong config being used when reconfiguring PCS · 33623113
      Daniel Machon authored
      The wrong port config is being used if the PCS is reconfigured. Fix this
      by correctly using the new config instead of the old one.
      
      Fixes: 946e7fd5 ("net: sparx5: add port module support")
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20240409-link-mode-reconfiguration-fix-v2-1-db6a507f3627@microchip.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Arnd Bergmann's avatar
      net/mlx5: fix possible stack overflows · fe87922c
      Arnd Bergmann authored
      A couple of debug functions use a 512 byte temporary buffer and call another
      function that has another buffer of the same size, which in turn exceeds the
      usual warning limit for excessive stack usage:
      
      drivers/net/ethernet/mellanox/mlx5/core/steering/dr_dbg.c:1073:1: error: stack frame size (1448) exceeds limit (1024) in 'dr_dump_start' [-Werror,-Wframe-larger-than]
      dr_dump_start(struct seq_file *file, loff_t *pos)
      drivers/net/ethernet/mellanox/mlx5/core/steering/dr_dbg.c:1009:1: error: stack frame size (1120) exceeds limit (1024) in 'dr_dump_domain' [-Werror,-Wframe-larger-than]
      dr_dump_domain(struct seq_file *file, struct mlx5dr_domain *dmn)
      drivers/net/ethernet/mellanox/mlx5/core/steering/dr_dbg.c:705:1: error: stack frame size (1104) exceeds limit (1024) in 'dr_dump_matcher_rx_tx' [-Werror,-Wframe-larger-than]
      dr_dump_matcher_rx_tx(struct seq_file *file, bool is_rx,
      
      Rework these so that each of the various code paths only ever has one of
      these buffers in it, and exactly the functions that declare one have
      the 'noinline_for_stack' annotation that prevents them from all being
      inlined into the same caller.
      
      Fixes: 917d1e79 ("net/mlx5: DR, Change SWS usage to debug fs seq_file interface")
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Link: https://lore.kernel.org/all/20240219100506.648089-1-arnd@kernel.org/
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240408074142.3007036-1-arnd@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>