1. 07 Sep, 2023 9 commits
  2. 06 Sep, 2023 17 commits
    • selftests/bpf: Check bpf_sk_storage has uncharged sk_omem_alloc · a96d1cfb
      Martin KaFai Lau authored
      This patch checks that sk_omem_alloc has been uncharged by bpf_sk_storage
      during __sk_destruct().
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230901231129.578493-4-martin.lau@linux.dev
      a96d1cfb
    • bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc · 55d49f75
      Martin KaFai Lau authored
      Commit c83597fa ("bpf: Refactor some inode/task/sk storage functions
      for reuse") refactored bpf_{sk,task,inode}_storage_free() into
      bpf_local_storage_unlink_nolock(), which was later renamed to
      bpf_local_storage_destroy(). The commit accidentally passed the
      "bool uncharge_mem = false" argument to bpf_selem_unlink_storage_nolock(),
      which stopped the uncharge from happening to sk->sk_omem_alloc.
      
      This missing uncharge only happens when the sk is going away (during
      __sk_destruct).
      
      This patch fixes it by always passing "uncharge_mem = true". It is a
      noop to the task/inode/cgroup storage because they do not have the
      map_local_storage_(un)charge enabled in the map_ops. A followup patch
      will be done in bpf-next to remove the uncharge_mem argument.
      
      A selftest is added in the next patch.
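
      The effect of the flag can be illustrated with a toy accounting model.
      This is only a sketch: struct toy_sock, toy_selem_link(),
      toy_storage_destroy() and toy_leftover_charge() are hypothetical
      stand-ins, not kernel functions. Only the uncharge_mem flag and the
      sk_omem_alloc counter are taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket; only the accounting matters here. */
struct toy_sock {
	long sk_omem_alloc;	/* bytes currently charged to the socket */
	size_t nr_elems;	/* storage elements still linked */
	size_t elem_size;	/* bytes charged per element */
};

/* Linking an element charges its size to the socket. */
static void toy_selem_link(struct toy_sock *sk)
{
	sk->sk_omem_alloc += (long)sk->elem_size;
	sk->nr_elems++;
}

/* Unlink every element; uncharge only if the caller asks for it. */
static void toy_storage_destroy(struct toy_sock *sk, bool uncharge_mem)
{
	while (sk->nr_elems > 0) {
		if (uncharge_mem)
			sk->sk_omem_alloc -= (long)sk->elem_size;
		sk->nr_elems--;
	}
}

/* Returns the bytes left charged after linking and destroying n elements. */
static long toy_leftover_charge(size_t n, size_t elem_size, bool uncharge_mem)
{
	struct toy_sock sk = { .sk_omem_alloc = 0, .nr_elems = 0,
			       .elem_size = elem_size };

	assert(elem_size > 0);
	for (size_t i = 0; i < n; i++)
		toy_selem_link(&sk);
	toy_storage_destroy(&sk, uncharge_mem);
	return sk.sk_omem_alloc;
}
```

      In this model, destroying with uncharge_mem = false leaves the full
      charge behind, while uncharge_mem = true brings the counter back to 0.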
      
      Fixes: c83597fa ("bpf: Refactor some inode/task/sk storage functions for reuse")
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230901231129.578493-3-martin.lau@linux.dev
      55d49f75
    • bpf: bpf_sk_storage: Fix invalid wait context lockdep report · a96a44ab
      Martin KaFai Lau authored
      './test_progs -t test_local_storage' reported a splat:
      
      [   27.137569] =============================
      [   27.138122] [ BUG: Invalid wait context ]
      [   27.138650] 6.5.0-03980-gd11ae1b1 #247 Tainted: G           O
      [   27.139542] -----------------------------
      [   27.140106] test_progs/1729 is trying to lock:
      [   27.140713] ffff8883ef047b88 (stock_lock){-.-.}-{3:3}, at: local_lock_acquire+0x9/0x130
      [   27.141834] other info that might help us debug this:
      [   27.142437] context-{5:5}
      [   27.142856] 2 locks held by test_progs/1729:
      [   27.143352]  #0: ffffffff84bcd9c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x40
      [   27.144492]  #1: ffff888107deb2c0 (&storage->lock){..-.}-{2:2}, at: bpf_local_storage_update+0x39e/0x8e0
      [   27.145855] stack backtrace:
      [   27.146274] CPU: 0 PID: 1729 Comm: test_progs Tainted: G           O       6.5.0-03980-gd11ae1b1 #247
      [   27.147550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [   27.149127] Call Trace:
      [   27.149490]  <TASK>
      [   27.149867]  dump_stack_lvl+0x130/0x1d0
      [   27.152609]  dump_stack+0x14/0x20
      [   27.153131]  __lock_acquire+0x1657/0x2220
      [   27.153677]  lock_acquire+0x1b8/0x510
      [   27.157908]  local_lock_acquire+0x29/0x130
      [   27.159048]  obj_cgroup_charge+0xf4/0x3c0
      [   27.160794]  slab_pre_alloc_hook+0x28e/0x2b0
      [   27.161931]  __kmem_cache_alloc_node+0x51/0x210
      [   27.163557]  __kmalloc+0xaa/0x210
      [   27.164593]  bpf_map_kzalloc+0xbc/0x170
      [   27.165147]  bpf_selem_alloc+0x130/0x510
      [   27.166295]  bpf_local_storage_update+0x5aa/0x8e0
      [   27.167042]  bpf_fd_sk_storage_update_elem+0xdb/0x1a0
      [   27.169199]  bpf_map_update_value+0x415/0x4f0
      [   27.169871]  map_update_elem+0x413/0x550
      [   27.170330]  __sys_bpf+0x5e9/0x640
      [   27.174065]  __x64_sys_bpf+0x80/0x90
      [   27.174568]  do_syscall_64+0x48/0xa0
      [   27.175201]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      [   27.175932] RIP: 0033:0x7effb40e41ad
      [   27.176357] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d8
      [   27.179028] RSP: 002b:00007ffe64c21fc8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
      [   27.180088] RAX: ffffffffffffffda RBX: 00007ffe64c22768 RCX: 00007effb40e41ad
      [   27.181082] RDX: 0000000000000020 RSI: 00007ffe64c22008 RDI: 0000000000000002
      [   27.182030] RBP: 00007ffe64c21ff0 R08: 0000000000000000 R09: 00007ffe64c22788
      [   27.183038] R10: 0000000000000064 R11: 0000000000000202 R12: 0000000000000000
      [   27.184006] R13: 00007ffe64c22788 R14: 00007effb42a1000 R15: 0000000000000000
      [   27.184958]  </TASK>
      
      It complains about acquiring a local_lock while holding a raw_spin_lock.
      It means it should not allocate memory while holding a raw_spin_lock
      since it is not safe for RT.
      
      raw_spin_lock is needed because bpf_local_storage supports tracing
      context. In particular for task local storage, it is easy to
      get a "current" task PTR_TO_BTF_ID in tracing bpf prog.
      However, task (and cgroup) local storage has already been moved to
      bpf mem allocator which can be used after raw_spin_lock.
      
      The splat is for the sk storage. The sk (and inode) storage
      has not been moved to the bpf mem allocator. With or without raw_spin_lock,
      kzalloc(GFP_ATOMIC) could theoretically be unsafe in tracing context.
      However, the local storage helper requires a verifier-accepted
      sk pointer (PTR_TO_BTF_ID), so it is hypothetical whether that (meaning
      running a bpf prog in a kzalloc-unsafe context while also holding a
      verifier-accepted sk pointer) could happen.
      
      This patch avoids kzalloc after raw_spin_lock to silence the splat.
      There is an existing kzalloc before the raw_spin_lock. At that point,
      a kzalloc is very likely required because a lookup has just been done
      before. Thus, this patch always does the kzalloc before acquiring
      the raw_spin_lock and removes the later kzalloc usage after the
      raw_spin_lock. After this change, there will be a charge and then an
      uncharge during the syscall bpf_map_update_elem() code path.
      This patch opts for simplicity and does not continue the old
      optimization to save one charge and uncharge.
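
      The "allocate before locking, release after unlocking" pattern described
      above can be sketched in plain C. This is an illustration under stated
      assumptions, not the kernel code: struct toy_storage, toy_alloc() and
      toy_update() are hypothetical, and a boolean flag stands in for the
      raw_spin_lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins: a flag models the raw_spin_lock. */
struct toy_storage {
	bool locked;
	int *slot;	/* NULL until an element is installed */
};

/* Allocation must never run while the toy lock is held. */
static int *toy_alloc(const struct toy_storage *st, int value)
{
	int *elem;

	assert(!st->locked);
	elem = malloc(sizeof(*elem));
	if (elem)
		*elem = value;
	return elem;
}

/* Returns 0 on success, -1 on allocation failure. */
static int toy_update(struct toy_storage *st, int value, bool in_place)
{
	/* Allocate up front, before the lock, even if it ends up unused. */
	int *new_elem = toy_alloc(st, value);
	int *unused;

	if (!new_elem)
		return -1;

	st->locked = true;			/* "raw_spin_lock" */
	if (in_place && st->slot) {
		*st->slot = value;		/* reuse the old element */
		unused = new_elem;		/* pre-allocation not needed */
	} else {
		unused = st->slot;		/* old element gets replaced */
		st->slot = new_elem;
	}
	st->locked = false;			/* "raw_spin_unlock" */

	free(unused);				/* release outside the lock */
	return 0;
}
```

      The in_place branch corresponds to the case where the up-front
      allocation turns out to be unnecessary and is released again, which is
      the extra charge and uncharge the commit message accepts for simplicity.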
      
      This issue dates back to the very first commit of bpf_sk_storage,
      which has been refactored multiple times to create task, inode, and
      cgroup storage. This patch uses a Fixes tag with a more recent
      commit that should be easier to backport.
      
      Fixes: b00fa38a ("bpf: Enable non-atomic allocations in local storage")
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230901231129.578493-2-martin.lau@linux.dev
      a96a44ab
    • s390/bpf: Pass through tail call counter in trampolines · a192103a
      Ilya Leoshkevich authored
      s390x eBPF programs use the following extension to the s390x calling
      convention: tail call counter is passed on stack at offset
      STK_OFF_TCCNT, which callees otherwise use as scratch space.
      
      Currently the trampoline does not respect this and clobbers the tail call
      counter. This breaks the enforcement of tail call limits in eBPF programs
      that have trampolines attached to them.
      
      Fix by forwarding a copy of the tail call counter to the original eBPF
      program in the trampoline (for fexit), and by restoring it at the end
      of the trampoline (for fentry).
      
      Fixes: 528eb2cb ("s390/bpf: Implement arch_prepare_bpf_trampoline()")
      Reported-by: Leon Hwang <hffilwlqm@gmail.com>
      Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230906004448.111674-1-iii@linux.ibm.com
      a192103a
    • bpf: Assign bpf_tramp_run_ctx::saved_run_ctx before recursion check. · 6764e767
      Sebastian Andrzej Siewior authored
      __bpf_prog_enter_recur() assigns bpf_tramp_run_ctx::saved_run_ctx before
      performing the recursion check which means in case of a recursion
      __bpf_prog_exit_recur() uses the previously set bpf_tramp_run_ctx::saved_run_ctx
      value.
      
      __bpf_prog_enter_sleepable_recur() assigns bpf_tramp_run_ctx::saved_run_ctx
      after the recursion check which means in case of a recursion
      __bpf_prog_exit_sleepable_recur() uses an uninitialized value. This does not
      look right. If I read the entry trampoline code right, then bpf_tramp_run_ctx
      isn't initialized upfront.
      
      Align __bpf_prog_enter_sleepable_recur() with __bpf_prog_enter_recur() and
      set bpf_tramp_run_ctx::saved_run_ctx before the recursion check is made.
      Remove the assignment of saved_run_ctx in kern_sys_bpf() since it happens
      a few cycles later.
      
      Fixes: e384c7b7 ("bpf, x86: Create bpf_tramp_run_ctx on the caller thread's stack")
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/bpf/20230830080405.251926-3-bigeasy@linutronix.de
      6764e767
    • bpf: Invoke __bpf_prog_exit_sleepable_recur() on recursion in kern_sys_bpf(). · 7645629f
      Sebastian Andrzej Siewior authored
      If __bpf_prog_enter_sleepable_recur() detects recursion then it returns
      0 without undoing rcu_read_lock_trace(), migrate_disable() or
      decrementing the recursion counter. This is fine in the JIT case because
      the JIT code will jump in the 0 case to the end and invoke the matching
      exit trampoline (__bpf_prog_exit_sleepable_recur()).
      
      This is not the case in kern_sys_bpf() which returns directly to the
      caller with an error code.
      
      Add __bpf_prog_exit_sleepable_recur() as clean up in the recursion case.
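
      The imbalance can be modelled with a pair of counters. The sketch below
      is not the kernel implementation: struct toy_prog, toy_enter(),
      toy_exit() and toy_run() are hypothetical names, and the -1 return value
      is a placeholder for the error code. It only shows why the exit handler
      has to run even when the enter handler reports recursion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state touched by the enter/exit pair. */
struct toy_prog {
	int active;		/* recursion counter */
	int migrate_disabled;	/* nesting depth of "migrate_disable()" */
};

/* Returns false on recursion, but leaves its state changes in place. */
static bool toy_enter(struct toy_prog *p)
{
	p->migrate_disabled++;
	if (++p->active != 1)
		return false;	/* recursion: caller must still call toy_exit() */
	return true;
}

/* Undoes everything toy_enter() did, on both the normal and recursion path. */
static void toy_exit(struct toy_prog *p)
{
	assert(p->active > 0 && p->migrate_disabled > 0);
	p->active--;
	p->migrate_disabled--;
}

/* Caller in the style of the fixed code path: exit is invoked on both paths. */
static int toy_run(struct toy_prog *p)
{
	if (!toy_enter(p)) {
		toy_exit(p);	/* the cleanup that the fix adds */
		return -1;	/* placeholder error code */
	}
	/* ... the program body would run here ... */
	toy_exit(p);
	return 0;
}
```

      Without the toy_exit() call in the recursion branch, both counters would
      stay elevated after every rejected call.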
      
      Fixes: b1d18a75 ("bpf: Extend sys_bpf commands for bpf_syscall programs.")
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/bpf/20230830080405.251926-2-bigeasy@linutronix.de
      7645629f
    • net: phylink: fix sphinx complaint about invalid literal · 1a961e74
      Jakub Kicinski authored
      sphinx complains about the use of "%PHYLINK_PCS_NEG_*":
      
      Documentation/networking/kapi:144: ./include/linux/phylink.h:601: WARNING: Inline literal start-string without end-string.
      Documentation/networking/kapi:144: ./include/linux/phylink.h:633: WARNING: Inline literal start-string without end-string.
      
      These are not valid symbols so drop the '%' prefix.
      
      Alternatively we could use %PHYLINK_PCS_NEG_\* (escape the *)
      or use normal literal ``PHYLINK_PCS_NEG_*`` but there is already
      a handful of un-adorned DEFINE_* in this file.
      
      Fixes: f99d471a ("net: phylink: add PCS negotiation mode")
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: https://lore.kernel.org/all/20230626162908.2f149f98@canb.auug.org.au/
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1a961e74
    • Merge branch 'sja1105-fixes' · f8fdd54e
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      tc-cbs offload fixes for SJA1105 DSA
      
      Yanan Yang has pointed out to me that certain tc-cbs offloaded
      configurations do not appear to do any shaping on the LS1021A-TSN board
      (SJA1105T).
      
      This is due to an apparent documentation error that also made its way
      into the driver, which patch 1/3 now fixes.
      
      While investigating and then testing, I've found 2 more bugs, which are
      patches 2/3 and 3/3.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f8fdd54e
    • net: dsa: sja1105: complete tc-cbs offload support on SJA1110 · 180a7419
      Vladimir Oltean authored
      The blamed commit left this delta behind:
      
        struct sja1105_cbs_entry {
       -	u64 port;
       -	u64 prio;
       +	u64 port; /* Not used for SJA1110 */
       +	u64 prio; /* Not used for SJA1110 */
        	u64 credit_hi;
        	u64 credit_lo;
        	u64 send_slope;
        	u64 idle_slope;
        };
      
      but did not actually implement tc-cbs offload fully for the new switch.
      The offload is accepted, but it doesn't work.
      
      The difference compared to earlier switch generations is that now, the
      table of CBS shapers is sparse, because there are many more shapers, so
      the mapping between a {port, prio} and a table index is static, rather
      than requiring us to store the port and prio into the sja1105_cbs_entry.
      
      So, the problem is that the code programs the CBS shaper parameters at a
      dynamic table index which is incorrect.
      
      All that needs to be done for SJA1110 CBS shapers to work is to bypass
      the logic which allocates shapers in a dense manner, as for SJA1105, and
      use the fixed mapping instead.
      
      Fixes: 3e77e59b ("net: dsa: sja1105: add support for the SJA1110 switch family")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      180a7419
    • net: dsa: sja1105: fix -ENOSPC when replacing the same tc-cbs too many times · 894cafc5
      Vladimir Oltean authored
      After running command [2] too many times in a row:
      
      [1] $ tc qdisc add dev sw2p0 root handle 1: mqprio num_tc 8 \
      	map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
      [2] $ tc qdisc replace dev sw2p0 parent 1:1 cbs offload 1 \
      	idleslope 120000 sendslope -880000 locredit -1320 hicredit 180
      
      (aka more than priv->info->num_cbs_shapers times)
      
      we start seeing the following error message:
      
      Error: Specified device failed to setup cbs hardware offload.
      
      This comes from the fact that ndo_setup_tc(TC_SETUP_QDISC_CBS) presents
      the same API for the qdisc create and replace cases, and the sja1105
      driver fails to distinguish between the 2. Thus, it always thinks that
      it must allocate the same shaper for a {port, queue} pair, when it may
      instead have to replace an existing one.
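
      A minimal sketch of the "look up before allocating" idea follows. It is
      not the driver code: struct toy_cbs_entry, TOY_NUM_SHAPERS and the
      toy_*() helpers are hypothetical, and -1 stands in for -ENOSPC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_SHAPERS 4	/* hypothetical table size */

struct toy_cbs_entry {
	bool in_use;
	int port;
	int prio;
	long idle_slope;
};

/* Index of the shaper already serving {port, prio}, or -1 if none. */
static int toy_find_shaper(const struct toy_cbs_entry *tbl, int port, int prio)
{
	for (int i = 0; i < TOY_NUM_SHAPERS; i++)
		if (tbl[i].in_use && tbl[i].port == port && tbl[i].prio == prio)
			return i;
	return -1;
}

/* Index of the first free shaper, or -1 if the table is full. */
static int toy_find_unused(const struct toy_cbs_entry *tbl)
{
	for (int i = 0; i < TOY_NUM_SHAPERS; i++)
		if (!tbl[i].in_use)
			return i;
	return -1;
}

/* Handles both "add" and "replace": reuse a match before allocating. */
static int toy_setup_cbs(struct toy_cbs_entry *tbl, int port, int prio,
			 long idle_slope)
{
	int i;

	assert(tbl != NULL);
	i = toy_find_shaper(tbl, port, prio);
	if (i < 0)
		i = toy_find_unused(tbl);
	if (i < 0)
		return -1;	/* table full, stands in for -ENOSPC */

	tbl[i].in_use = true;
	tbl[i].port = port;
	tbl[i].prio = prio;
	tbl[i].idle_slope = idle_slope;
	return i;
}
```

      With the lookup in place, repeating the same {port, prio} request any
      number of times keeps landing on the same table index instead of
      consuming a new entry each time.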
      
      Fixes: 4d752508 ("net: dsa: sja1105: offload the Credit-Based Shaper qdisc")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      894cafc5
    • net: dsa: sja1105: fix bandwidth discrepancy between tc-cbs software and offload · 954ad9bf
      Vladimir Oltean authored
      More careful measurement of the tc-cbs bandwidth shows that as the stream
      bandwidth (effectively idleslope) increases, there is a larger and
      larger discrepancy between the rate limit obtained by the software
      Qdisc and the rate limit obtained by its offloaded counterpart.
      
      The discrepancy becomes so large, that e.g. at an idleslope of 40000
      (40Mbps), the offloaded cbs does not actually rate limit anything, and
      traffic will pass at line rate through a 100 Mbps port.
      
      The reason for the discrepancy is that the hardware documentation I've
      been following is incorrect. UM11040.pdf (for SJA1105P/Q/R/S) states
      about IDLE_SLOPE that it is "the rate (in unit of bytes/sec) at which
      the credit counter is increased".
      
      Cross-checking with UM10944.pdf (for SJA1105E/T) and UM11107.pdf
      (for SJA1110), the wording is different: "This field specifies the
      value, in bytes per second times link speed, by which the credit counter
      is increased".
      
      So there's an extra scaling for link speed that the driver is currently
      not accounting for, and apparently (empirically), that link speed is
      expressed in Kbps.
      
      I've pondered whether to pollute the sja1105_mac_link_up()
      implementation with CBS shaper reprogramming, but I don't think it is
      worth it. IMO, the UAPI exposed by tc-cbs requires user space to
      recalculate the sendslope anyway, since the formula for that depends on
      port_transmit_rate (see man tc-cbs), which is not an invariant from tc's
      perspective.
      
      So we use the offload->sendslope and offload->idleslope to deduce the
      original port_transmit_rate from the CBS formula, and use that value to
      scale the offload->sendslope and offload->idleslope to values that the
      hardware understands.
      
      Some numerical data points:
      
       40Mbps stream, max interfering frame size 1500, port speed 100M
       ---------------------------------------------------------------
      
       tc-cbs parameters:
       idleslope 40000 sendslope -60000 locredit -900 hicredit 600
      
       which result in hardware values:
      
       Before (doesn't work)           After (works)
       credit_hi    600                600
       credit_lo    900                900
       send_slope   7500000            75
       idle_slope   5000000            50
      
       40Mbps stream, max interfering frame size 1500, port speed 1G
       -------------------------------------------------------------
      
       tc-cbs parameters:
       idleslope 40000 sendslope -960000 locredit -1440 hicredit 60
      
       which result in hardware values:
      
       Before (doesn't work)           After (works)
       credit_hi    60                 60
       credit_lo    1440               1440
       send_slope   120000000          120
       idle_slope   5000000            5
      
       5.12Mbps stream, max interfering frame size 1522, port speed 100M
       -----------------------------------------------------------------
      
       tc-cbs parameters:
       idleslope 5120 sendslope -94880 locredit -1444 hicredit 77
      
       which result in hardware values:
      
       Before (doesn't work)           After (works)
       credit_hi    77                 77
       credit_lo    1444               1444
       send_slope   11860000           118
       idle_slope   640000             6
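
      The arithmetic behind the "After" columns can be reproduced with a few
      lines of C. This is a sketch under assumptions rather than the driver
      code: it relies on the tc-cbs relation sendslope = idleslope -
      port_transmit_rate (slopes in kbit/s), converts kbit/s to bytes/s with a
      factor of 125, and truncates on division. The names
      toy_scale_slopes(), struct toy_hw_slopes and TOY_BYTES_PER_KBIT are
      hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_BYTES_PER_KBIT (1000 / 8)	/* 125 bytes/s per kbit/s */

struct toy_hw_slopes {
	int64_t idle_slope;
	int64_t send_slope;
};

/* Scale tc-cbs slopes (kbit/s) into the link-speed-relative hardware units. */
static struct toy_hw_slopes toy_scale_slopes(int64_t idleslope_kbps,
					      int64_t sendslope_kbps)
{
	/* tc-cbs: sendslope = idleslope - port_transmit_rate */
	int64_t port_transmit_rate_kbps = idleslope_kbps - sendslope_kbps;
	struct toy_hw_slopes hw;

	assert(port_transmit_rate_kbps > 0);
	hw.idle_slope = idleslope_kbps * TOY_BYTES_PER_KBIT /
			port_transmit_rate_kbps;
	hw.send_slope = llabs(sendslope_kbps * TOY_BYTES_PER_KBIT) /
			port_transmit_rate_kbps;
	return hw;
}
```

      For the first data point, toy_scale_slopes(40000, -60000) deduces a port
      rate of 100000 kbit/s and yields idle_slope 50 and send_slope 75. The
      "Before" values are the same numerators (5000000 and 7500000) without
      the division by the port rate.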
      
      Tested on SJA1105T, SJA1105S and SJA1110A, at 1Gbps and 100Mbps.
      
      Fixes: 4d752508 ("net: dsa: sja1105: offload the Credit-Based Shaper qdisc")
      Reported-by: Yanan Yang <yanan.yang@nxp.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      954ad9bf
    • Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · ca7cfd73
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      Change MIN_TXD and MIN_RXD to allow set rx/tx value between 64 and 80
      
      Olga Zaborska says:
      
      Change the minimum value of RX/TX descriptors to 64 to enable setting the rx/tx
      value between 64 and 80. All igb, igbvf and igc devices can use as low as 64
      descriptors.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ca7cfd73
    • mlx5/core: E-Switch, Create ACL FT for eswitch manager in switchdev mode · 34413460
      Bodong Wang authored
      An ACL flow table is required in switchdev mode when metadata is enabled,
      and the driver creates such a table when loading each vport. However, not
      every vport is loaded in switchdev mode, for example the ECPF if it is the
      eswitch manager. In this case, the ACL flow table is still needed.
      
      To make it modularized, create ACL flow table for eswitch manager as
      default and skip such operations when loading manager vport.
      
      Also, there is no need to load the eswitch manager vport in switchdev mode.
      This means there is no need to load it on regular connect-x HCAs where
      the PF is the eswitch manager. This will avoid creating duplicate ACL
      flow table for host PF vport.
      
      Fixes: 29bcb6e4 ("net/mlx5e: E-Switch, Use metadata for vport matching in send-to-vport rules")
      Fixes: eb8e9fae ("mlx5/core: E-Switch, Allocate ECPF vport if it's an eswitch manager")
      Fixes: 5019833d ("net/mlx5: E-switch, Introduce helper function to enable/disable vports")
      Signed-off-by: Bodong Wang <bodong@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      34413460
    • net/mlx5e: Clear mirred devices array if the rule is split · b7558a77
      Jianbo Liu authored
      In the cited commit, the mirred devices are recorded and checked while
      parsing the actions. To avoid a system crash, a duplicate
      action in a single rule is not allowed.

      But the rule is actually broken down into several FTEs in different
      tables, either for mirroring or for the specified types of actions which
      use the post action infrastructure.

      As a result, certain action lists are rejected by mistake, for example:
          actions:enp8s0f0_1,set(ipv4(ttl=63)),enp8s0f0_0,enp8s0f0_1.
      Here the rule is split into two FTEs because of the pedit action.

      To fix this issue, when parsing the rule actions, reset if_count to
      clear the mirred devices array if the rule is split into multiple
      FTEs, so that the duplicate checking is restarted.
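
      The duplicate check and its reset can be sketched as follows. This is a
      toy model, not the mlx5e parser: struct toy_action, TOY_MAX_DEVS and
      toy_parse_actions() are hypothetical, and a NULL device name stands in
      for an action (such as pedit) that splits the rule. Only the if_count
      name comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_DEVS 8	/* hypothetical limit on recorded devices */

/* An action is an output device, or a rule-splitting action (NULL device). */
struct toy_action {
	const char *out_dev;
};

/* Returns true if the action list is accepted. */
static bool toy_parse_actions(const struct toy_action *acts, int n,
			      bool reset_on_split)
{
	const char *seen[TOY_MAX_DEVS];
	int if_count = 0;

	for (int i = 0; i < n; i++) {
		if (!acts[i].out_dev) {
			/* The rule continues in a new FTE: restart the check. */
			if (reset_on_split)
				if_count = 0;
			continue;
		}
		for (int j = 0; j < if_count; j++)
			if (strcmp(seen[j], acts[i].out_dev) == 0)
				return false;	/* duplicate within one FTE */
		assert(if_count < TOY_MAX_DEVS);
		seen[if_count++] = acts[i].out_dev;
	}
	return true;
}
```

      Feeding the example action list from the commit message through this
      model, the list is rejected when reset_on_split is false and accepted
      when it is true, while a genuine duplicate inside one FTE is still
      rejected.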
      
      Fixes: 554fe75c ("net/mlx5e: Avoid duplicating rule destinations")
      Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
      Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b7558a77
    • ip_tunnels: use DEV_STATS_INC() · 9b271eba
      Eric Dumazet authored
      syzbot/KCSAN reported data-races in iptunnel_xmit_stats() [1]
      
      This can run from multiple cpus without mutual exclusion.
      
      Adopt SMP safe DEV_STATS_INC() to update dev->stats fields.
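
      The idea of an SMP-safe counter update can be shown in userspace with
      C11 atomics. This is an analogy only, not the kernel macro: struct
      toy_dev_stats, TOY_STATS_INC(), TOY_STATS_READ(), toy_stats_init() and
      toy_xmit_stats() are hypothetical names, and the split between
      tx_packets and tx_errors is a simplification chosen for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical counters, each updated with an atomic read-modify-write. */
struct toy_dev_stats {
	atomic_long tx_packets;
	atomic_long tx_errors;
};

/* Atomic increment: concurrent callers cannot lose an update. */
#define TOY_STATS_INC(stats, field) \
	atomic_fetch_add_explicit(&(stats)->field, 1, memory_order_relaxed)

#define TOY_STATS_READ(stats, field) \
	atomic_load_explicit(&(stats)->field, memory_order_relaxed)

static void toy_stats_init(struct toy_dev_stats *stats)
{
	assert(stats != NULL);
	atomic_init(&stats->tx_packets, 0);
	atomic_init(&stats->tx_errors, 0);
}

/* Account one transmit attempt; pkt_len <= 0 models a failed transmit. */
static void toy_xmit_stats(struct toy_dev_stats *stats, int pkt_len)
{
	if (pkt_len > 0)
		TOY_STATS_INC(stats, tx_packets);
	else
		TOY_STATS_INC(stats, tx_errors);
}
```

      A plain "stats->tx_errors++" is a separate load, add and store, so two
      CPUs running it at the same time can overwrite each other's result. The
      atomic form performs the update as one indivisible operation.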
      
      [1]
      BUG: KCSAN: data-race in iptunnel_xmit / iptunnel_xmit
      
      read-write to 0xffff8881353df170 of 8 bytes by task 30263 on cpu 1:
      iptunnel_xmit_stats include/net/ip_tunnels.h:493 [inline]
      iptunnel_xmit+0x432/0x4a0 net/ipv4/ip_tunnel_core.c:87
      ip_tunnel_xmit+0x1477/0x1750 net/ipv4/ip_tunnel.c:831
      __gre_xmit net/ipv4/ip_gre.c:469 [inline]
      ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:662
      __netdev_start_xmit include/linux/netdevice.h:4889 [inline]
      netdev_start_xmit include/linux/netdevice.h:4903 [inline]
      xmit_one net/core/dev.c:3544 [inline]
      dev_hard_start_xmit+0x11b/0x3f0 net/core/dev.c:3560
      __dev_queue_xmit+0xeee/0x1de0 net/core/dev.c:4340
      dev_queue_xmit include/linux/netdevice.h:3082 [inline]
      __bpf_tx_skb net/core/filter.c:2129 [inline]
      __bpf_redirect_no_mac net/core/filter.c:2159 [inline]
      __bpf_redirect+0x723/0x9c0 net/core/filter.c:2182
      ____bpf_clone_redirect net/core/filter.c:2453 [inline]
      bpf_clone_redirect+0x16c/0x1d0 net/core/filter.c:2425
      ___bpf_prog_run+0xd7d/0x41e0 kernel/bpf/core.c:1954
      __bpf_prog_run512+0x74/0xa0 kernel/bpf/core.c:2195
      bpf_dispatcher_nop_func include/linux/bpf.h:1181 [inline]
      __bpf_prog_run include/linux/filter.h:609 [inline]
      bpf_prog_run include/linux/filter.h:616 [inline]
      bpf_test_run+0x15d/0x3d0 net/bpf/test_run.c:423
      bpf_prog_test_run_skb+0x77b/0xa00 net/bpf/test_run.c:1045
      bpf_prog_test_run+0x265/0x3d0 kernel/bpf/syscall.c:3996
      __sys_bpf+0x3af/0x780 kernel/bpf/syscall.c:5353
      __do_sys_bpf kernel/bpf/syscall.c:5439 [inline]
      __se_sys_bpf kernel/bpf/syscall.c:5437 [inline]
      __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5437
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      read-write to 0xffff8881353df170 of 8 bytes by task 30249 on cpu 0:
      iptunnel_xmit_stats include/net/ip_tunnels.h:493 [inline]
      iptunnel_xmit+0x432/0x4a0 net/ipv4/ip_tunnel_core.c:87
      ip_tunnel_xmit+0x1477/0x1750 net/ipv4/ip_tunnel.c:831
      __gre_xmit net/ipv4/ip_gre.c:469 [inline]
      ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:662
      __netdev_start_xmit include/linux/netdevice.h:4889 [inline]
      netdev_start_xmit include/linux/netdevice.h:4903 [inline]
      xmit_one net/core/dev.c:3544 [inline]
      dev_hard_start_xmit+0x11b/0x3f0 net/core/dev.c:3560
      __dev_queue_xmit+0xeee/0x1de0 net/core/dev.c:4340
      dev_queue_xmit include/linux/netdevice.h:3082 [inline]
      __bpf_tx_skb net/core/filter.c:2129 [inline]
      __bpf_redirect_no_mac net/core/filter.c:2159 [inline]
      __bpf_redirect+0x723/0x9c0 net/core/filter.c:2182
      ____bpf_clone_redirect net/core/filter.c:2453 [inline]
      bpf_clone_redirect+0x16c/0x1d0 net/core/filter.c:2425
      ___bpf_prog_run+0xd7d/0x41e0 kernel/bpf/core.c:1954
      __bpf_prog_run512+0x74/0xa0 kernel/bpf/core.c:2195
      bpf_dispatcher_nop_func include/linux/bpf.h:1181 [inline]
      __bpf_prog_run include/linux/filter.h:609 [inline]
      bpf_prog_run include/linux/filter.h:616 [inline]
      bpf_test_run+0x15d/0x3d0 net/bpf/test_run.c:423
      bpf_prog_test_run_skb+0x77b/0xa00 net/bpf/test_run.c:1045
      bpf_prog_test_run+0x265/0x3d0 kernel/bpf/syscall.c:3996
      __sys_bpf+0x3af/0x780 kernel/bpf/syscall.c:5353
      __do_sys_bpf kernel/bpf/syscall.c:5439 [inline]
      __se_sys_bpf kernel/bpf/syscall.c:5437 [inline]
      __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5437
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      value changed: 0x0000000000018830 -> 0x0000000000018831
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 30249 Comm: syz-executor.4 Not tainted 6.5.0-syzkaller-11704-g3f86ed6e #0
      
      Fixes: 039f5062 ("ip_tunnel: Move stats update to iptunnel_xmit()")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9b271eba
    • net: team: do not use dynamic lockdep key · 39285e12
      Taehee Yoo authored
      team interface has used a dynamic lockdep key to avoid false-positive
      lockdep deadlock detection. Virtual interfaces such as team usually
      have their own lock for protecting private data.
      These interfaces can be nested.
      team0
        |
      team1
      
      Each interface's lock is actually different (team0->lock and team1->lock).
      So,
      mutex_lock(&team0->lock);
      mutex_lock(&team1->lock);
      mutex_unlock(&team1->lock);
      mutex_unlock(&team0->lock);
      The above case is absolutely safe, but lockdep warns about a deadlock
      because it treats these two locks as the same. This is a
      false-positive lockdep warning.
      
      So, in order to avoid this problem, the team interfaces started to use
      a dynamic lockdep key. The false-positive problem was fixed, but it
      introduced a new problem.

      When a new team virtual interface is created, it registers (creates) a
      dynamic lockdep key and uses it. But the number of lockdep keys is
      limited.
      So, if too many team interfaces are created, they consume all lockdep keys.
      Then lockdep stops working and warns about it.

      In order to fix this problem, team interfaces use the subclass instead
      of the dynamic key. So, when a new team interface is created, it doesn't
      register (create) a new lockdep key, but uses an existing subclass key instead.
      The subclass is already used by the bonding interface for a similar case.
      
      As the bonding interface does, the subclass variable is the same as
      the 'dev->nested_level'. This variable indicates the depth in the stacked
      interface graph.
      
      The 'dev->nested_level' is protected by RTNL and RCU.
      So, 'mutex_lock_nested()' for 'team->lock' requires RTNL or RCU.
      In the current code, 'team->lock' is usually acquired under RTNL, so there is
      no problem with using 'dev->nested_level'.

      The 'team_nl_team_get()' and 'lb_stats_refresh()' functions acquire
      'team->lock' without RTNL.
      But they don't iterate over their own ports in a nested fashion, so they
      don't need the nested lock.
      
      Reproducer:
         for i in {0..1000}
         do
                 ip link add team$i type team
                 ip link add dummy$i master team$i type dummy
                 ip link set dummy$i up
                 ip link set team$i up
         done
      
      Splat looks like:
         BUG: MAX_LOCKDEP_ENTRIES too low!
         turning off the locking correctness validator.
         Please attach the output of /proc/lock_stat to the bug report
         CPU: 0 PID: 4104 Comm: ip Not tainted 6.5.0-rc7+ #45
         Call Trace:
          <TASK>
         dump_stack_lvl+0x64/0xb0
         add_lock_to_list+0x30d/0x5e0
         check_prev_add+0x73a/0x23a0
         ...
         sock_def_readable+0xfe/0x4f0
         netlink_broadcast+0x76b/0xac0
         nlmsg_notify+0x69/0x1d0
         dev_open+0xed/0x130
         ...
      
      Reported-by: syzbot+9bbbacfbf1e04d5221f7@syzkaller.appspotmail.com
      Fixes: 369f61be ("team: fix nested locking lockdep warning")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      39285e12
    • net/ipv6: SKB symmetric hash should incorporate transport ports · a5e2151f
      Quan Tian authored
      __skb_get_hash_symmetric() was added to compute a symmetric hash over
      the protocol, addresses and transport ports, by commit eb70db87
      ("packet: Use symmetric hash for PACKET_FANOUT_HASH."). It uses
      flow_keys_dissector_symmetric_keys as the flow_dissector to incorporate
      IPv4 addresses, IPv6 addresses and ports. However, it should not specify
      the flag FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL, which stops further
      dissection when an IPv6 flow label is encountered, so that transport
      ports are not incorporated in that case.
      
      As a consequence, the symmetric hash is based on the 5-tuple for IPv4
      but on the 3-tuple for IPv6 when a flow label is present. This causes
      problems when, for example, nft symhash and openvswitch l4_sym rely on
      the symmetric hash for load balancing: different L4 flows between two
      given IPv6 addresses always get the same symmetric hash, leading to
      uneven traffic distribution.
      
      Removing the use of FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL makes sure the
      symmetric hash is based on 5-tuple for both IPv4 and IPv6 consistently.
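
A minimal sketch of why leaving out the ports matters for load balancing. This is a toy direction-independent hash, not the kernel's flow dissector or hash function; `symmetric_hash`, the `include_ports` switch and the sample addresses are assumptions made for illustration only.

```python
import zlib

def symmetric_hash(src, dst, sport, dport, proto, include_ports=True):
    """Toy direction-independent flow hash (not the kernel's algorithm)."""
    if not include_ports:
        # Models dissection stopping before the transport header.
        sport = dport = 0
    # Sort the two endpoints so A->B and B->A build the same key.
    lo, hi = sorted([(src, sport), (dst, dport)])
    key = repr((proto, lo, hi)).encode()
    return zlib.crc32(key)

a, b = "2001:db8::1", "2001:db8::2"
flow1 = (a, b, 40000, 443, 6)  # two TCP flows between the same hosts
flow2 = (a, b, 40001, 443, 6)

# 3-tuple behaviour: ports are ignored, so both flows collide.
h3_1 = symmetric_hash(*flow1, include_ports=False)
h3_2 = symmetric_hash(*flow2, include_ports=False)

# 5-tuple behaviour: ports are included, so the flows are told apart.
h5_1 = symmetric_hash(*flow1)
h5_2 = symmetric_hash(*flow2)

# The hash stays symmetric: the reverse direction of flow1 matches it.
h5_1_rev = symmetric_hash(b, a, 443, 40000, 6)
```

With the 3-tuple variant every flow between the two hosts lands in the same bucket, which is the uneven distribution described above.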
      
      Fixes: eb70db87 ("packet: Use symmetric hash for PACKET_FANOUT_HASH.")
      Reported-by: Lars Ekman <uablrek@gmail.com>
      Closes: https://github.com/antrea-io/antrea/issues/5457
      Signed-off-by: Quan Tian <qtian@vmware.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a5e2151f
  3. 05 Sep, 2023 8 commits
    • Olga Zaborska's avatar
      igb: Change IGB_MIN to allow set rx/tx value between 64 and 80 · 6319685b
      Olga Zaborska authored
      Change the minimum value of RX/TX descriptors to 64 to enable setting the rx/tx
      value between 64 and 80. All igb devices can use as low as 64 descriptors.
      This change will unify igb with other drivers.
      Based on commit 7b1be198 ("e1000e: lower ring minimum size to 64")
      
      Fixes: 9d5c8243 ("igb: PCI-Express 82575 Gigabit Ethernet driver")
      Signed-off-by: Olga Zaborska <olga.zaborska@intel.com>
      Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      6319685b
    • Olga Zaborska's avatar
      igbvf: Change IGBVF_MIN to allow set rx/tx value between 64 and 80 · 83607175
      Olga Zaborska authored
      Change the minimum value of RX/TX descriptors to 64 to enable setting the rx/tx
      value between 64 and 80. All igbvf devices can use as low as 64 descriptors.
      This change will unify igbvf with other drivers.
      Based on commit 7b1be198 ("e1000e: lower ring minimum size to 64")
      
      Fixes: d4e0fe01 ("igbvf: add new driver to support 82576 virtual functions")
      Signed-off-by: Olga Zaborska <olga.zaborska@intel.com>
      Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      83607175
    • Olga Zaborska's avatar
      igc: Change IGC_MIN to allow set rx/tx value between 64 and 80 · 5aa48279
      Olga Zaborska authored
      Change the minimum value of RX/TX descriptors to 64 to enable setting the rx/tx
      value between 64 and 80. All igc devices can use as low as 64 descriptors.
      This change will unify igc with other drivers.
      Based on commit 7b1be198 ("e1000e: lower ring minimum size to 64")
      
      Fixes: 0507ef8a ("igc: Add transmit and receive fastpath and interrupt handlers")
      Signed-off-by: Olga Zaborska <olga.zaborska@intel.com>
      Tested-by: Naama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      5aa48279
    • Geetha sowjanya's avatar
      octeontx2-af: Fix truncation of smq in CN10K NIX AQ enqueue mbox handler · 29fe7a1b
      Geetha sowjanya authored
      The smq value used in the CN10K NIX AQ instruction enqueue mailbox
      handler was truncated from a 10-bit value to a 9-bit value because the
      CN10K mbox request structure was typecast to the CN9K structure.
      This has not caused any problems when programming the NIX SQ context
      to the HW, because the context structure is the same size.
      However, it causes a problem when accessing the structure parameters.
      This patch reads the right smq value for each platform.
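
A small arithmetic sketch of the truncation, assuming only the field widths stated above (9 bits on CN9K, 10 bits on CN10K). `read_smq` and the sample value are hypothetical and do not reflect the driver's structures.

```python
SMQ_BITS_CN9K = 9    # field width named in the commit message
SMQ_BITS_CN10K = 10

def read_smq(raw, bits):
    """Keep only the low `bits` bits, as a bitfield of that width would."""
    return raw & ((1 << bits) - 1)

smq = 0x2A5  # 677: needs all 10 bits

as_cn9k = read_smq(smq, SMQ_BITS_CN9K)    # top bit is lost
as_cn10k = read_smq(smq, SMQ_BITS_CN10K)  # value is preserved
```

Any smq of 512 or above reads back as a different queue index through the 9-bit view, while values below 512 are unaffected.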
      
      Fixes: 30077d21 ("octeontx2-af: cn10k: Update NIX/NPA context structure")
      Signed-off-by: Geetha sowjanya <gakula@marvell.com>
      Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      29fe7a1b
    • Eric Dumazet's avatar
      igmp: limit igmpv3_newpack() packet size to IP_MAX_MTU · c3b704d4
      Eric Dumazet authored
      This is a follow-up to commit 915d975b ("net: deal with integer
      overflows in kmalloc_reserve()") based on David Laight's feedback.
      
      Back in 2010, I failed to realize malicious users could set dev->mtu
      to arbitrary values. This mtu has since been limited to 0x7fffffff, but
      regardless of how big dev->mtu is, it makes no sense for igmpv3_newpack()
      to allocate more than IP_MAX_MTU and risk various skb fields overflows.
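
The clamp described above can be sketched as follows. This is an illustration rather than the kernel code: `igmp_alloc_size` is a hypothetical helper, and `IP_MAX_MTU` is taken to be 0xFFFF, the largest size an IPv4 total-length field can express.

```python
IP_MAX_MTU = 0xFFFF  # assumed value: 16-bit IPv4 total length limit

def igmp_alloc_size(dev_mtu):
    """Never size the packet above IP_MAX_MTU, whatever dev->mtu says."""
    return min(dev_mtu, IP_MAX_MTU)

normal = igmp_alloc_size(1500)        # ordinary Ethernet MTU is unchanged
hostile = igmp_alloc_size(0x7FFFFFFF)  # huge MTU is clamped
```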
      
      Fixes: 57e1ab6e ("igmp: refine skb allocations")
      Link: https://lore.kernel.org/netdev/d273628df80f45428e739274ab9ecb72@AcuMS.aculab.com/
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: David Laight <David.Laight@ACULAB.COM>
      Cc: Kyle Zeng <zengyhkyle@gmail.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c3b704d4
    • Sabrina Dubroca's avatar
      Revert "net: macsec: preserve ingress frame ordering" · d3287e40
      Sabrina Dubroca authored
      This reverts commit ab046a5d.
      
      It was trying to work around an issue at the crypto layer by excluding
      ASYNC implementations of gcm(aes), because a bug in the AESNI version
      caused reordering when some requests bypassed the cryptd queue while
      older requests were still pending on the queue.
      
      This was fixed by commit 38b2f68b ("crypto: aesni - Fix cryptd
      reordering problem on gcm"), which pre-dates ab046a5d.
      
      Herbert Xu confirmed that all ASYNC implementations are expected to
      maintain the ordering of completions wrt requests, so we can use them
      in MACsec.
      
      On my test machine, this restores the performance of a single netperf
      instance, from 1.4Gbps to 4.4Gbps.
      
      Link: https://lore.kernel.org/netdev/9328d206c5d9f9239cae27e62e74de40b258471d.1692279161.git.sd@queasysnail.net/T/
      Link: https://lore.kernel.org/netdev/1b0cec71-d084-8153-2ba4-72ce71abeb65@byu.edu/
      Link: https://lore.kernel.org/netdev/d335ddaa-18dc-f9f0-17ee-9783d3b2ca29@mailbox.tu-dresden.de/
      Fixes: ab046a5d ("net: macsec: preserve ingress frame ordering")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/11c952469d114db6fb29242e1d9545e61f52f512.1693757159.git.sd@queasysnail.net
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      d3287e40
    • Shigeru Yoshida's avatar
      kcm: Destroy mutex in kcm_exit_net() · 6ad40b36
      Shigeru Yoshida authored
      kcm_exit_net() should call mutex_destroy() on knet->mutex. This is especially
      needed if CONFIG_DEBUG_MUTEXES is enabled.
      
      Fixes: ab7ac4eb ("kcm: Kernel Connection Multiplexor module")
      Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
      Link: https://lore.kernel.org/r/20230902170708.1727999-1-syoshida@redhat.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      6ad40b36
    • valis's avatar
      net: sched: sch_qfq: Fix UAF in qfq_dequeue() · 8fc134fe
      valis authored
      When the plug qdisc is used as a class of the qfq qdisc, it can trigger a
      UAF. This issue can be reproduced with the following commands:
      
        tc qdisc add dev lo root handle 1: qfq
        tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
        tc qdisc add dev lo parent 1:1 handle 2: plug
        tc filter add dev lo parent 1: basic classid 1:1
        ping -c1 127.0.0.1
      
      and boom:
      
      [  285.353793] BUG: KASAN: slab-use-after-free in qfq_dequeue+0xa7/0x7f0
      [  285.354910] Read of size 4 at addr ffff8880bad312a8 by task ping/144
      [  285.355903]
      [  285.356165] CPU: 1 PID: 144 Comm: ping Not tainted 6.5.0-rc3+ #4
      [  285.357112] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
      [  285.358376] Call Trace:
      [  285.358773]  <IRQ>
      [  285.359109]  dump_stack_lvl+0x44/0x60
      [  285.359708]  print_address_description.constprop.0+0x2c/0x3c0
      [  285.360611]  kasan_report+0x10c/0x120
      [  285.361195]  ? qfq_dequeue+0xa7/0x7f0
      [  285.361780]  qfq_dequeue+0xa7/0x7f0
      [  285.362342]  __qdisc_run+0xf1/0x970
      [  285.362903]  net_tx_action+0x28e/0x460
      [  285.363502]  __do_softirq+0x11b/0x3de
      [  285.364097]  do_softirq.part.0+0x72/0x90
      [  285.364721]  </IRQ>
      [  285.365072]  <TASK>
      [  285.365422]  __local_bh_enable_ip+0x77/0x90
      [  285.366079]  __dev_queue_xmit+0x95f/0x1550
      [  285.366732]  ? __pfx_csum_and_copy_from_iter+0x10/0x10
      [  285.367526]  ? __pfx___dev_queue_xmit+0x10/0x10
      [  285.368259]  ? __build_skb_around+0x129/0x190
      [  285.368960]  ? ip_generic_getfrag+0x12c/0x170
      [  285.369653]  ? __pfx_ip_generic_getfrag+0x10/0x10
      [  285.370390]  ? csum_partial+0x8/0x20
      [  285.370961]  ? raw_getfrag+0xe5/0x140
      [  285.371559]  ip_finish_output2+0x539/0xa40
      [  285.372222]  ? __pfx_ip_finish_output2+0x10/0x10
      [  285.372954]  ip_output+0x113/0x1e0
      [  285.373512]  ? __pfx_ip_output+0x10/0x10
      [  285.374130]  ? icmp_out_count+0x49/0x60
      [  285.374739]  ? __pfx_ip_finish_output+0x10/0x10
      [  285.375457]  ip_push_pending_frames+0xf3/0x100
      [  285.376173]  raw_sendmsg+0xef5/0x12d0
      [  285.376760]  ? do_syscall_64+0x40/0x90
      [  285.377359]  ? __static_call_text_end+0x136578/0x136578
      [  285.378173]  ? do_syscall_64+0x40/0x90
      [  285.378772]  ? kasan_enable_current+0x11/0x20
      [  285.379469]  ? __pfx_raw_sendmsg+0x10/0x10
      [  285.380137]  ? __sock_create+0x13e/0x270
      [  285.380673]  ? __sys_socket+0xf3/0x180
      [  285.381174]  ? __x64_sys_socket+0x3d/0x50
      [  285.381725]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      [  285.382425]  ? __rcu_read_unlock+0x48/0x70
      [  285.382975]  ? ip4_datagram_release_cb+0xd8/0x380
      [  285.383608]  ? __pfx_ip4_datagram_release_cb+0x10/0x10
      [  285.384295]  ? preempt_count_sub+0x14/0xc0
      [  285.384844]  ? __list_del_entry_valid+0x76/0x140
      [  285.385467]  ? _raw_spin_lock_bh+0x87/0xe0
      [  285.386014]  ? __pfx__raw_spin_lock_bh+0x10/0x10
      [  285.386645]  ? release_sock+0xa0/0xd0
      [  285.387148]  ? preempt_count_sub+0x14/0xc0
      [  285.387712]  ? freeze_secondary_cpus+0x348/0x3c0
      [  285.388341]  ? aa_sk_perm+0x177/0x390
      [  285.388856]  ? __pfx_aa_sk_perm+0x10/0x10
      [  285.389441]  ? check_stack_object+0x22/0x70
      [  285.390032]  ? inet_send_prepare+0x2f/0x120
      [  285.390603]  ? __pfx_inet_sendmsg+0x10/0x10
      [  285.391172]  sock_sendmsg+0xcc/0xe0
      [  285.391667]  __sys_sendto+0x190/0x230
      [  285.392168]  ? __pfx___sys_sendto+0x10/0x10
      [  285.392727]  ? kvm_clock_get_cycles+0x14/0x30
      [  285.393328]  ? set_normalized_timespec64+0x57/0x70
      [  285.393980]  ? _raw_spin_unlock_irq+0x1b/0x40
      [  285.394578]  ? __x64_sys_clock_gettime+0x11c/0x160
      [  285.395225]  ? __pfx___x64_sys_clock_gettime+0x10/0x10
      [  285.395908]  ? _copy_to_user+0x3e/0x60
      [  285.396432]  ? exit_to_user_mode_prepare+0x1a/0x120
      [  285.397086]  ? syscall_exit_to_user_mode+0x22/0x50
      [  285.397734]  ? do_syscall_64+0x71/0x90
      [  285.398258]  __x64_sys_sendto+0x74/0x90
      [  285.398786]  do_syscall_64+0x64/0x90
      [  285.399273]  ? exit_to_user_mode_prepare+0x1a/0x120
      [  285.399949]  ? syscall_exit_to_user_mode+0x22/0x50
      [  285.400605]  ? do_syscall_64+0x71/0x90
      [  285.401124]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      [  285.401807] RIP: 0033:0x495726
      [  285.402233] Code: ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 09
      [  285.404683] RSP: 002b:00007ffcc25fb618 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      [  285.405677] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 0000000000495726
      [  285.406628] RDX: 0000000000000040 RSI: 0000000002518750 RDI: 0000000000000000
      [  285.407565] RBP: 00000000005205ef R08: 00000000005f8838 R09: 000000000000001c
      [  285.408523] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000002517634
      [  285.409460] R13: 00007ffcc25fb6f0 R14: 0000000000000003 R15: 0000000000000000
      [  285.410403]  </TASK>
      [  285.410704]
      [  285.410929] Allocated by task 144:
      [  285.411402]  kasan_save_stack+0x1e/0x40
      [  285.411926]  kasan_set_track+0x21/0x30
      [  285.412442]  __kasan_slab_alloc+0x55/0x70
      [  285.412973]  kmem_cache_alloc_node+0x187/0x3d0
      [  285.413567]  __alloc_skb+0x1b4/0x230
      [  285.414060]  __ip_append_data+0x17f7/0x1b60
      [  285.414633]  ip_append_data+0x97/0xf0
      [  285.415144]  raw_sendmsg+0x5a8/0x12d0
      [  285.415640]  sock_sendmsg+0xcc/0xe0
      [  285.416117]  __sys_sendto+0x190/0x230
      [  285.416626]  __x64_sys_sendto+0x74/0x90
      [  285.417145]  do_syscall_64+0x64/0x90
      [  285.417624]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      [  285.418306]
      [  285.418531] Freed by task 144:
      [  285.418960]  kasan_save_stack+0x1e/0x40
      [  285.419469]  kasan_set_track+0x21/0x30
      [  285.419988]  kasan_save_free_info+0x27/0x40
      [  285.420556]  ____kasan_slab_free+0x109/0x1a0
      [  285.421146]  kmem_cache_free+0x1c2/0x450
      [  285.421680]  __netif_receive_skb_core+0x2ce/0x1870
      [  285.422333]  __netif_receive_skb_one_core+0x97/0x140
      [  285.423003]  process_backlog+0x100/0x2f0
      [  285.423537]  __napi_poll+0x5c/0x2d0
      [  285.424023]  net_rx_action+0x2be/0x560
      [  285.424510]  __do_softirq+0x11b/0x3de
      [  285.425034]
      [  285.425254] The buggy address belongs to the object at ffff8880bad31280
      [  285.425254]  which belongs to the cache skbuff_head_cache of size 224
      [  285.426993] The buggy address is located 40 bytes inside of
      [  285.426993]  freed 224-byte region [ffff8880bad31280, ffff8880bad31360)
      [  285.428572]
      [  285.428798] The buggy address belongs to the physical page:
      [  285.429540] page:00000000f4b77674 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbad31
      [  285.430758] flags: 0x100000000000200(slab|node=0|zone=1)
      [  285.431447] page_type: 0xffffffff()
      [  285.431934] raw: 0100000000000200 ffff88810094a8c0 dead000000000122 0000000000000000
      [  285.432757] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
      [  285.433562] page dumped because: kasan: bad access detected
      [  285.434144]
      [  285.434320] Memory state around the buggy address:
      [  285.434828]  ffff8880bad31180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  285.435580]  ffff8880bad31200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  285.436264] >ffff8880bad31280: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  285.436777]                                   ^
      [  285.437106]  ffff8880bad31300: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
      [  285.437616]  ffff8880bad31380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  285.438126] ==================================================================
      [  285.438662] Disabling lock debugging due to kernel taint
      
      Fix this by:
      1. Changing sch_plug's .peek handler to qdisc_peek_dequeued(), a
      function compatible with non-work-conserving qdiscs
      2. Checking the return value of qdisc_dequeue_peeked() in sch_qfq.
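
The two steps above can be illustrated with a toy model. It is a sketch of the idea only, not the sch_plug or sch_qfq code: `ToyPlug`, `parent_dequeue` and the `stash` field are hypothetical names. The assumption is that a peek implemented as "dequeue and remember" cannot hand out a packet that a later dequeue would refuse to return.

```python
from collections import deque

class ToyPlug:
    """Toy non-work-conserving child: holds packets while plugged."""

    def __init__(self):
        self.queue = deque()
        self.plugged = True
        self.stash = None  # packet already taken out by peek()

    def _dequeue_raw(self):
        # A plugged child returns nothing even if packets are queued.
        if self.plugged or not self.queue:
            return None
        return self.queue.popleft()

    def peek(self):
        # Peek by dequeueing and stashing, so peek and dequeue agree.
        if self.stash is None:
            self.stash = self._dequeue_raw()
        return self.stash

    def dequeue_peeked(self):
        # Hand out the stashed packet first, else dequeue normally.
        pkt, self.stash = self.stash, None
        return pkt if pkt is not None else self._dequeue_raw()

def parent_dequeue(child):
    """Parent side: check the child's return value before using it."""
    pkt = child.dequeue_peeked()
    if pkt is None:
        return None
    return pkt

plug = ToyPlug()
plug.queue.append("pkt0")
while_plugged = (plug.peek(), parent_dequeue(plug))  # nothing is released
plug.plugged = False
after_unplug = (plug.peek(), parent_dequeue(plug))   # same packet both times
```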
      
      Fixes: 462dbc91 ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
      Reported-by: valis <sec@valis.email>
      Signed-off-by: valis <sec@valis.email>
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Link: https://lore.kernel.org/r/20230901162237.11525-1-jhs@mojatatu.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      8fc134fe
  4. 04 Sep, 2023 6 commits
    • David S. Miller's avatar
      Merge branch 'af_unix-data-races' · 2861f09c
      David S. Miller authored
      Kuniyuki Iwashima says:
      
      ====================
      af_unix: Fix four data-races.
      
      While running syzkaller, KCSAN reported 3 data-races with
      systemd-coredump using AF_UNIX sockets.
      
      This series fixes the three and another one inspired by
      one of the reports.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2861f09c
    • Kuniyuki Iwashima's avatar
      af_unix: Fix data race around sk->sk_err. · b1928129
      Kuniyuki Iwashima authored
      As with sk->sk_shutdown shown in the previous patch, sk->sk_err can be
      read locklessly by unix_dgram_sendmsg().
      
      Let's use READ_ONCE() for sk_err as well.
      
      Note that the writer side is marked by commit cc04410a ("af_unix:
      annotate lockless accesses to sk->sk_err").
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b1928129
    • Kuniyuki Iwashima's avatar
      af_unix: Fix data-races around sk->sk_shutdown. · afe8764f
      Kuniyuki Iwashima authored
      sk->sk_shutdown is changed under unix_state_lock(sk), but
      unix_dgram_sendmsg() calls two functions to read sk_shutdown locklessly.
      
        sock_alloc_send_pskb
        `- sock_wait_for_wmem
      
      Let's use READ_ONCE() there.
      
      Note that the writer side was marked by commit e1d09c2c ("af_unix:
      Fix data races around sk->sk_shutdown.").
      
      BUG: KCSAN: data-race in sock_alloc_send_pskb / unix_release_sock
      
      write (marked) to 0xffff8880069af12c of 1 bytes by task 1 on cpu 1:
       unix_release_sock+0x75c/0x910 net/unix/af_unix.c:631
       unix_release+0x59/0x80 net/unix/af_unix.c:1053
       __sock_release+0x7d/0x170 net/socket.c:654
       sock_close+0x19/0x30 net/socket.c:1386
       __fput+0x2a3/0x680 fs/file_table.c:384
       ____fput+0x15/0x20 fs/file_table.c:412
       task_work_run+0x116/0x1a0 kernel/task_work.c:179
       resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
       exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204
       __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
       syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297
       do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86
       entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      
      read to 0xffff8880069af12c of 1 bytes by task 28650 on cpu 0:
       sock_alloc_send_pskb+0xd2/0x620 net/core/sock.c:2767
       unix_dgram_sendmsg+0x2f8/0x14f0 net/unix/af_unix.c:1944
       unix_seqpacket_sendmsg net/unix/af_unix.c:2308 [inline]
       unix_seqpacket_sendmsg+0xba/0x130 net/unix/af_unix.c:2292
       sock_sendmsg_nosec net/socket.c:725 [inline]
       sock_sendmsg+0x148/0x160 net/socket.c:748
       ____sys_sendmsg+0x4e4/0x610 net/socket.c:2494
       ___sys_sendmsg+0xc6/0x140 net/socket.c:2548
       __sys_sendmsg+0x94/0x140 net/socket.c:2577
       __do_sys_sendmsg net/socket.c:2586 [inline]
       __se_sys_sendmsg net/socket.c:2584 [inline]
       __x64_sys_sendmsg+0x45/0x50 net/socket.c:2584
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      
      value changed: 0x00 -> 0x03
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 28650 Comm: systemd-coredum Not tainted 6.4.0-11989-g68433066 #6
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: syzkaller <syzkaller@googlegroups.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      afe8764f
    • Kuniyuki Iwashima's avatar
      af_unix: Fix data-race around unix_tot_inflight. · ade32bd8
      Kuniyuki Iwashima authored
      unix_tot_inflight is changed under spin_lock(unix_gc_lock), but
      unix_release_sock() reads it locklessly.
      
      Let's use READ_ONCE() for unix_tot_inflight.
      
      Note that the writer side was marked by commit 9d6d7f1c ("af_unix:
      annote lockless accesses to unix_tot_inflight & gc_in_progress")
      
      BUG: KCSAN: data-race in unix_inflight / unix_release_sock
      
      write (marked) to 0xffffffff871852b8 of 4 bytes by task 123 on cpu 1:
       unix_inflight+0x130/0x180 net/unix/scm.c:64
       unix_attach_fds+0x137/0x1b0 net/unix/scm.c:123
       unix_scm_to_skb net/unix/af_unix.c:1832 [inline]
       unix_dgram_sendmsg+0x46a/0x14f0 net/unix/af_unix.c:1955
       sock_sendmsg_nosec net/socket.c:724 [inline]
       sock_sendmsg+0x148/0x160 net/socket.c:747
       ____sys_sendmsg+0x4e4/0x610 net/socket.c:2493
       ___sys_sendmsg+0xc6/0x140 net/socket.c:2547
       __sys_sendmsg+0x94/0x140 net/socket.c:2576
       __do_sys_sendmsg net/socket.c:2585 [inline]
       __se_sys_sendmsg net/socket.c:2583 [inline]
       __x64_sys_sendmsg+0x45/0x50 net/socket.c:2583
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      read to 0xffffffff871852b8 of 4 bytes by task 4891 on cpu 0:
       unix_release_sock+0x608/0x910 net/unix/af_unix.c:671
       unix_release+0x59/0x80 net/unix/af_unix.c:1058
       __sock_release+0x7d/0x170 net/socket.c:653
       sock_close+0x19/0x30 net/socket.c:1385
       __fput+0x179/0x5e0 fs/file_table.c:321
       ____fput+0x15/0x20 fs/file_table.c:349
       task_work_run+0x116/0x1a0 kernel/task_work.c:179
       resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
       exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204
       __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
       syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297
       do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      value changed: 0x00000000 -> 0x00000001
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 4891 Comm: systemd-coredum Not tainted 6.4.0-rc5-01219-gfa0e21fa #5
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      
      Fixes: 9305cfa4 ("[AF_UNIX]: Make unix_tot_inflight counter non-atomic")
      Reported-by: syzkaller <syzkaller@googlegroups.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ade32bd8
    • Kuniyuki Iwashima's avatar
      af_unix: Fix data-races around user->unix_inflight. · 0bc36c06
      Kuniyuki Iwashima authored
      user->unix_inflight is changed under spin_lock(unix_gc_lock),
      but too_many_unix_fds() reads it locklessly.
      
      Let's annotate the write/read accesses to user->unix_inflight.
      
      BUG: KCSAN: data-race in unix_attach_fds / unix_inflight
      
      write to 0xffffffff8546f2d0 of 8 bytes by task 44798 on cpu 1:
       unix_inflight+0x157/0x180 net/unix/scm.c:66
       unix_attach_fds+0x147/0x1e0 net/unix/scm.c:123
       unix_scm_to_skb net/unix/af_unix.c:1827 [inline]
       unix_dgram_sendmsg+0x46a/0x14f0 net/unix/af_unix.c:1950
       unix_seqpacket_sendmsg net/unix/af_unix.c:2308 [inline]
       unix_seqpacket_sendmsg+0xba/0x130 net/unix/af_unix.c:2292
       sock_sendmsg_nosec net/socket.c:725 [inline]
       sock_sendmsg+0x148/0x160 net/socket.c:748
       ____sys_sendmsg+0x4e4/0x610 net/socket.c:2494
       ___sys_sendmsg+0xc6/0x140 net/socket.c:2548
       __sys_sendmsg+0x94/0x140 net/socket.c:2577
       __do_sys_sendmsg net/socket.c:2586 [inline]
       __se_sys_sendmsg net/socket.c:2584 [inline]
       __x64_sys_sendmsg+0x45/0x50 net/socket.c:2584
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      
      read to 0xffffffff8546f2d0 of 8 bytes by task 44814 on cpu 0:
       too_many_unix_fds net/unix/scm.c:101 [inline]
       unix_attach_fds+0x54/0x1e0 net/unix/scm.c:110
       unix_scm_to_skb net/unix/af_unix.c:1827 [inline]
       unix_dgram_sendmsg+0x46a/0x14f0 net/unix/af_unix.c:1950
       unix_seqpacket_sendmsg net/unix/af_unix.c:2308 [inline]
       unix_seqpacket_sendmsg+0xba/0x130 net/unix/af_unix.c:2292
       sock_sendmsg_nosec net/socket.c:725 [inline]
       sock_sendmsg+0x148/0x160 net/socket.c:748
       ____sys_sendmsg+0x4e4/0x610 net/socket.c:2494
       ___sys_sendmsg+0xc6/0x140 net/socket.c:2548
       __sys_sendmsg+0x94/0x140 net/socket.c:2577
       __do_sys_sendmsg net/socket.c:2586 [inline]
       __se_sys_sendmsg net/socket.c:2584 [inline]
       __x64_sys_sendmsg+0x45/0x50 net/socket.c:2584
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      
      value changed: 0x000000000000000c -> 0x000000000000000d
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 44814 Comm: systemd-coredum Not tainted 6.4.0-11989-g68433066 #6
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      
      Fixes: 712f4aad ("unix: properly account for FDs passed over unix sockets")
      Reported-by: syzkaller <syzkaller@googlegroups.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Acked-by: Willy Tarreau <w@1wt.eu>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0bc36c06
    • Kuniyuki Iwashima's avatar
      af_unix: Fix msg_controllen test in scm_pidfd_recv() for MSG_CMSG_COMPAT. · 718e6b51
      Kuniyuki Iwashima authored
      Heiko Carstens reported that SCM_PIDFD does not work with MSG_CMSG_COMPAT
      because scm_pidfd_recv() always checks msg_controllen against sizeof(struct
      cmsghdr).
      
      We need to use sizeof(struct compat_cmsghdr) for the compat case.
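
A rough sketch of why the header size matters, under stated assumptions: a 64-bit kernel where the native header is an 8-byte length plus two 4-byte ints, a 32-bit compat header of three 4-byte fields, and a check that requires room for the header plus one int (the pidfd). `pidfd_fits` is a hypothetical helper, not the kernel function.

```python
import struct

# Assumed layouts: 64-bit native caller vs 32-bit (compat) caller.
NATIVE_CMSGHDR = struct.calcsize("=QII")  # size_t len, int level, int type
COMPAT_CMSGHDR = struct.calcsize("=III")  # 32-bit len, int level, int type
INT_SIZE = struct.calcsize("=i")          # payload: one file descriptor

def pidfd_fits(controllen, compat):
    """True if the control buffer can hold the header plus one int."""
    hdr = COMPAT_CMSGHDR if compat else NATIVE_CMSGHDR
    return controllen >= hdr + INT_SIZE

# A compat caller sizing its buffer for exactly one pidfd message.
compat_buf = COMPAT_CMSGHDR + INT_SIZE

rejected_with_native_size = not pidfd_fits(compat_buf, compat=False)
accepted_with_compat_size = pidfd_fits(compat_buf, compat=True)
```

Under these assumptions a correctly sized compat buffer is 16 bytes, which is too small when measured against the 16-byte native header plus payload.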
      
      Fixes: 5e2ff670 ("scm: add SO_PASSPIDFD and SCM_PIDFD")
      Reported-by: Heiko Carstens <hca@linux.ibm.com>
      Closes: https://lore.kernel.org/netdev/20230901200517.8742-A-hca@linux.ibm.com/
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Tested-by: Heiko Carstens <hca@linux.ibm.com>
      Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
      Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      718e6b51