1. 06 Sep, 2023 7 commits
    • Vladimir Oltean's avatar
      net: dsa: sja1105: fix bandwidth discrepancy between tc-cbs software and offload · 954ad9bf
      Vladimir Oltean authored
      More careful measurement of the tc-cbs bandwidth shows that the stream
      bandwidth (effectively idleslope) increases, there is a larger and
      larger discrepancy between the rate limit obtained by the software
      Qdisc, and the rate limit obtained by its offloaded counterpart.
      
      The discrepancy becomes so large, that e.g. at an idleslope of 40000
      (40Mbps), the offloaded cbs does not actually rate limit anything, and
      traffic will pass at line rate through a 100 Mbps port.
      
      The reason for the discrepancy is that the hardware documentation I've
      been following is incorrect. UM11040.pdf (for SJA1105P/Q/R/S) states
      about IDLE_SLOPE that it is "the rate (in unit of bytes/sec) at which
      the credit counter is increased".
      
      Cross-checking with UM10944.pdf (for SJA1105E/T) and UM11107.pdf
      (for SJA1110), the wording is different: "This field specifies the
      value, in bytes per second times link speed, by which the credit counter
      is increased".
      
      So there's an extra scaling for link speed that the driver is currently
      not accounting for, and apparently (empirically), that link speed is
      expressed in Kbps.
      
      I've pondered whether to pollute the sja1105_mac_link_up()
      implementation with CBS shaper reprogramming, but I don't think it is
      worth it. IMO, the UAPI exposed by tc-cbs requires user space to
      recalculate the sendslope anyway, since the formula for that depends on
      port_transmit_rate (see man tc-cbs), which is not an invariant from tc's
      perspective.
      
      So we use the offload->sendslope and offload->idleslope to deduce the
      original port_transmit_rate from the CBS formula, and use that value to
      scale the offload->sendslope and offload->idleslope to values that the
      hardware understands.
      
      Some numerical data points:
      
       40Mbps stream, max interfering frame size 1500, port speed 100M
       ---------------------------------------------------------------
      
       tc-cbs parameters:
       idleslope 40000 sendslope -60000 locredit -900 hicredit 600
      
       which result in hardware values:
      
       Before (doesn't work)           After (works)
       credit_hi    600                600
       credit_lo    900                900
       send_slope   7500000            75
       idle_slope   5000000            50
      
       40Mbps stream, max interfering frame size 1500, port speed 1G
       -------------------------------------------------------------
      
       tc-cbs parameters:
       idleslope 40000 sendslope -960000 locredit -1440 hicredit 60
      
       which result in hardware values:
      
       Before (doesn't work)           After (works)
       credit_hi    60                 60
       credit_lo    1440               1440
       send_slope   120000000          120
       idle_slope   5000000            5
      
       5.12Mbps stream, max interfering frame size 1522, port speed 100M
       -----------------------------------------------------------------
      
       tc-cbs parameters:
       idleslope 5120 sendslope -94880 locredit -1444 hicredit 77
      
       which result in hardware values:
      
       Before (doesn't work)           After (works)
       credit_hi    77                 77
       credit_lo    1444               1444
       send_slope   11860000           118
       idle_slope   640000             6
      
      Tested on SJA1105T, SJA1105S and SJA1110A, at 1Gbps and 100Mbps.
      
      Fixes: 4d752508 ("net: dsa: sja1105: offload the Credit-Based Shaper qdisc")
      Reported-by: default avatarYanan Yang <yanan.yang@nxp.com>
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      954ad9bf
    • David S. Miller's avatar
      Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · ca7cfd73
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      Change MIN_TXD and MIN_RXD to allow set rx/tx value between 64 and 80
      
      Olga Zaborska says:
      
      Change the minimum value of RX/TX descriptors to 64 to enable setting the rx/tx
      value between 64 and 80. All igb, igbvf and igc devices can use as low as 64
      descriptors.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ca7cfd73
    • Bodong Wang's avatar
      mlx5/core: E-Switch, Create ACL FT for eswitch manager in switchdev mode · 34413460
      Bodong Wang authored
      ACL flow table is required in switchdev mode when metadata is enabled,
      driver creates such table when loading each vport. However, not every
      vport is loaded in switchdev mode. Such as ECPF if it's the eswitch manager.
      In this case, ACL flow table is still needed.
      
      To make it modularized, create ACL flow table for eswitch manager as
      default and skip such operations when loading manager vport.
      
      Also, there is no need to load the eswitch manager vport in switchdev mode.
      This means there is no need to load it on regular connect-x HCAs where
      the PF is the eswitch manager. This will avoid creating duplicate ACL
      flow table for host PF vport.
      
      Fixes: 29bcb6e4 ("net/mlx5e: E-Switch, Use metadata for vport matching in send-to-vport rules")
      Fixes: eb8e9fae ("mlx5/core: E-Switch, Allocate ECPF vport if it's an eswitch manager")
      Fixes: 5019833d ("net/mlx5: E-switch, Introduce helper function to enable/disable vports")
      Signed-off-by: default avatarBodong Wang <bodong@nvidia.com>
      Reviewed-by: default avatarMark Bloch <mbloch@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      34413460
    • Jianbo Liu's avatar
      net/mlx5e: Clear mirred devices array if the rule is split · b7558a77
      Jianbo Liu authored
      In the cited commit, the mirred devices are recorded and checked while
      parsing the actions. In order to avoid system crash, the duplicate
      action in a single rule is not allowed.
      
      But the rule is actually break down into several FTEs in different
      tables, for either mirroring, or the specified types of actions which
      use post action infrastructure.
      
      It will reject certain action list by mistake, for example:
          actions:enp8s0f0_1,set(ipv4(ttl=63)),enp8s0f0_0,enp8s0f0_1.
      Here the rule is split to two FTEs because of pedit action.
      
      To fix this issue, when parsing the rule actions, reset if_count to
      clear the mirred devices array if the rule is split to multiple
      FTEs, and then the duplicate checking is restarted.
      
      Fixes: 554fe75c ("net/mlx5e: Avoid duplicating rule destinations")
      Signed-off-by: default avatarJianbo Liu <jianbol@nvidia.com>
      Reviewed-by: default avatarVlad Buslov <vladbu@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b7558a77
    • Eric Dumazet's avatar
      ip_tunnels: use DEV_STATS_INC() · 9b271eba
      Eric Dumazet authored
      syzbot/KCSAN reported data-races in iptunnel_xmit_stats() [1]
      
      This can run from multiple cpus without mutual exclusion.
      
      Adopt SMP safe DEV_STATS_INC() to update dev->stats fields.
      
      [1]
      BUG: KCSAN: data-race in iptunnel_xmit / iptunnel_xmit
      
      read-write to 0xffff8881353df170 of 8 bytes by task 30263 on cpu 1:
      iptunnel_xmit_stats include/net/ip_tunnels.h:493 [inline]
      iptunnel_xmit+0x432/0x4a0 net/ipv4/ip_tunnel_core.c:87
      ip_tunnel_xmit+0x1477/0x1750 net/ipv4/ip_tunnel.c:831
      __gre_xmit net/ipv4/ip_gre.c:469 [inline]
      ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:662
      __netdev_start_xmit include/linux/netdevice.h:4889 [inline]
      netdev_start_xmit include/linux/netdevice.h:4903 [inline]
      xmit_one net/core/dev.c:3544 [inline]
      dev_hard_start_xmit+0x11b/0x3f0 net/core/dev.c:3560
      __dev_queue_xmit+0xeee/0x1de0 net/core/dev.c:4340
      dev_queue_xmit include/linux/netdevice.h:3082 [inline]
      __bpf_tx_skb net/core/filter.c:2129 [inline]
      __bpf_redirect_no_mac net/core/filter.c:2159 [inline]
      __bpf_redirect+0x723/0x9c0 net/core/filter.c:2182
      ____bpf_clone_redirect net/core/filter.c:2453 [inline]
      bpf_clone_redirect+0x16c/0x1d0 net/core/filter.c:2425
      ___bpf_prog_run+0xd7d/0x41e0 kernel/bpf/core.c:1954
      __bpf_prog_run512+0x74/0xa0 kernel/bpf/core.c:2195
      bpf_dispatcher_nop_func include/linux/bpf.h:1181 [inline]
      __bpf_prog_run include/linux/filter.h:609 [inline]
      bpf_prog_run include/linux/filter.h:616 [inline]
      bpf_test_run+0x15d/0x3d0 net/bpf/test_run.c:423
      bpf_prog_test_run_skb+0x77b/0xa00 net/bpf/test_run.c:1045
      bpf_prog_test_run+0x265/0x3d0 kernel/bpf/syscall.c:3996
      __sys_bpf+0x3af/0x780 kernel/bpf/syscall.c:5353
      __do_sys_bpf kernel/bpf/syscall.c:5439 [inline]
      __se_sys_bpf kernel/bpf/syscall.c:5437 [inline]
      __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5437
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      read-write to 0xffff8881353df170 of 8 bytes by task 30249 on cpu 0:
      iptunnel_xmit_stats include/net/ip_tunnels.h:493 [inline]
      iptunnel_xmit+0x432/0x4a0 net/ipv4/ip_tunnel_core.c:87
      ip_tunnel_xmit+0x1477/0x1750 net/ipv4/ip_tunnel.c:831
      __gre_xmit net/ipv4/ip_gre.c:469 [inline]
      ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:662
      __netdev_start_xmit include/linux/netdevice.h:4889 [inline]
      netdev_start_xmit include/linux/netdevice.h:4903 [inline]
      xmit_one net/core/dev.c:3544 [inline]
      dev_hard_start_xmit+0x11b/0x3f0 net/core/dev.c:3560
      __dev_queue_xmit+0xeee/0x1de0 net/core/dev.c:4340
      dev_queue_xmit include/linux/netdevice.h:3082 [inline]
      __bpf_tx_skb net/core/filter.c:2129 [inline]
      __bpf_redirect_no_mac net/core/filter.c:2159 [inline]
      __bpf_redirect+0x723/0x9c0 net/core/filter.c:2182
      ____bpf_clone_redirect net/core/filter.c:2453 [inline]
      bpf_clone_redirect+0x16c/0x1d0 net/core/filter.c:2425
      ___bpf_prog_run+0xd7d/0x41e0 kernel/bpf/core.c:1954
      __bpf_prog_run512+0x74/0xa0 kernel/bpf/core.c:2195
      bpf_dispatcher_nop_func include/linux/bpf.h:1181 [inline]
      __bpf_prog_run include/linux/filter.h:609 [inline]
      bpf_prog_run include/linux/filter.h:616 [inline]
      bpf_test_run+0x15d/0x3d0 net/bpf/test_run.c:423
      bpf_prog_test_run_skb+0x77b/0xa00 net/bpf/test_run.c:1045
      bpf_prog_test_run+0x265/0x3d0 kernel/bpf/syscall.c:3996
      __sys_bpf+0x3af/0x780 kernel/bpf/syscall.c:5353
      __do_sys_bpf kernel/bpf/syscall.c:5439 [inline]
      __se_sys_bpf kernel/bpf/syscall.c:5437 [inline]
      __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5437
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      value changed: 0x0000000000018830 -> 0x0000000000018831
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 30249 Comm: syz-executor.4 Not tainted 6.5.0-syzkaller-11704-g3f86ed6e #0
      
      Fixes: 039f5062 ("ip_tunnel: Move stats update to iptunnel_xmit()")
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9b271eba
    • Taehee Yoo's avatar
      net: team: do not use dynamic lockdep key · 39285e12
      Taehee Yoo authored
      team interface has used a dynamic lockdep key to avoid false-positive
      lockdep deadlock detection. Virtual interfaces such as team usually
      have their own lock for protecting private data.
      These interfaces can be nested.
      team0
        |
      team1
      
      Each interface's lock is actually different(team0->lock and team1->lock).
      So,
      mutex_lock(&team0->lock);
      mutex_lock(&team1->lock);
      mutex_unlock(&team1->lock);
      mutex_unlock(&team0->lock);
      The above case is absolutely safe. But lockdep warns about deadlock.
      Because the lockdep understands these two locks are same. This is a
      false-positive lockdep warning.
      
      So, in order to avoid this problem, the team interfaces started to use
      dynamic lockdep key. The false-positive problem was fixed, but it
      introduced a new problem.
      
      When the new team virtual interface is created, it registers a dynamic
      lockdep key(creates dynamic lockdep key) and uses it. But there is the
      limitation of the number of lockdep keys.
      So, If so many team interfaces are created, it consumes all lockdep keys.
      Then, the lockdep stops to work and warns about it.
      
      In order to fix this problem, team interfaces use the subclass instead
      of the dynamic key. So, when a new team interface is created, it doesn't
      register(create) a new lockdep, but uses existed subclass key instead.
      It is already used by the bonding interface for a similar case.
      
      As the bonding interface does, the subclass variable is the same as
      the 'dev->nested_level'. This variable indicates the depth in the stacked
      interface graph.
      
      The 'dev->nested_level' is protected by RTNL and RCU.
      So, 'mutex_lock_nested()' for 'team->lock' requires RTNL or RCU.
      In the current code, 'team->lock' is usually acquired under RTNL, there is
      no problem with using 'dev->nested_level'.
      
      The 'team_nl_team_get()' and The 'lb_stats_refresh()' functions acquire
      'team->lock' without RTNL.
      But these don't iterate their own ports nested so they don't need nested
      lock.
      
      Reproducer:
         for i in {0..1000}
         do
                 ip link add team$i type team
                 ip link add dummy$i master team$i type dummy
                 ip link set dummy$i up
                 ip link set team$i up
         done
      
      Splat looks like:
         BUG: MAX_LOCKDEP_ENTRIES too low!
         turning off the locking correctness validator.
         Please attach the output of /proc/lock_stat to the bug report
         CPU: 0 PID: 4104 Comm: ip Not tainted 6.5.0-rc7+ #45
         Call Trace:
          <TASK>
         dump_stack_lvl+0x64/0xb0
         add_lock_to_list+0x30d/0x5e0
         check_prev_add+0x73a/0x23a0
         ...
         sock_def_readable+0xfe/0x4f0
         netlink_broadcast+0x76b/0xac0
         nlmsg_notify+0x69/0x1d0
         dev_open+0xed/0x130
         ...
      
      Reported-by: syzbot+9bbbacfbf1e04d5221f7@syzkaller.appspotmail.com
      Fixes: 369f61be ("team: fix nested locking lockdep warning")
      Signed-off-by: default avatarTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      39285e12
    • Quan Tian's avatar
      net/ipv6: SKB symmetric hash should incorporate transport ports · a5e2151f
      Quan Tian authored
      __skb_get_hash_symmetric() was added to compute a symmetric hash over
      the protocol, addresses and transport ports, by commit eb70db87
      ("packet: Use symmetric hash for PACKET_FANOUT_HASH."). It uses
      flow_keys_dissector_symmetric_keys as the flow_dissector to incorporate
      IPv4 addresses, IPv6 addresses and ports. However, it should not specify
      the flag as FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL, which stops further
      dissection when an IPv6 flow label is encountered, making transport
      ports not being incorporated in such case.
      
      As a consequence, the symmetric hash is based on 5-tuple for IPv4 but
      3-tuple for IPv6 when flow label is present. It caused a few problems,
      e.g. when nft symhash and openvswitch l4_sym rely on the symmetric hash
      to perform load balancing as different L4 flows between two given IPv6
      addresses would always get the same symmetric hash, leading to uneven
      traffic distribution.
      
      Removing the use of FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL makes sure the
      symmetric hash is based on 5-tuple for both IPv4 and IPv6 consistently.
      
      Fixes: eb70db87 ("packet: Use symmetric hash for PACKET_FANOUT_HASH.")
      Reported-by: default avatarLars Ekman <uablrek@gmail.com>
      Closes: https://github.com/antrea-io/antrea/issues/5457Signed-off-by: default avatarQuan Tian <qtian@vmware.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a5e2151f
  2. 05 Sep, 2023 8 commits
    • Olga Zaborska's avatar
      igb: Change IGB_MIN to allow set rx/tx value between 64 and 80 · 6319685b
      Olga Zaborska authored
      Change the minimum value of RX/TX descriptors to 64 to enable setting the rx/tx
      value between 64 and 80. All igb devices can use as low as 64 descriptors.
      This change will unify igb with other drivers.
      Based on commit 7b1be198 ("e1000e: lower ring minimum size to 64")
      
      Fixes: 9d5c8243 ("igb: PCI-Express 82575 Gigabit Ethernet driver")
      Signed-off-by: default avatarOlga Zaborska <olga.zaborska@intel.com>
      Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      6319685b
    • Olga Zaborska's avatar
      igbvf: Change IGBVF_MIN to allow set rx/tx value between 64 and 80 · 83607175
      Olga Zaborska authored
      Change the minimum value of RX/TX descriptors to 64 to enable setting the rx/tx
      value between 64 and 80. All igbvf devices can use as low as 64 descriptors.
      This change will unify igbvf with other drivers.
      Based on commit 7b1be198 ("e1000e: lower ring minimum size to 64")
      
      Fixes: d4e0fe01 ("igbvf: add new driver to support 82576 virtual functions")
      Signed-off-by: default avatarOlga Zaborska <olga.zaborska@intel.com>
      Tested-by: default avatarRafal Romanowski <rafal.romanowski@intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      83607175
    • Olga Zaborska's avatar
      igc: Change IGC_MIN to allow set rx/tx value between 64 and 80 · 5aa48279
      Olga Zaborska authored
      Change the minimum value of RX/TX descriptors to 64 to enable setting the rx/tx
      value between 64 and 80. All igc devices can use as low as 64 descriptors.
      This change will unify igc with other drivers.
      Based on commit 7b1be198 ("e1000e: lower ring minimum size to 64")
      
      Fixes: 0507ef8a ("igc: Add transmit and receive fastpath and interrupt handlers")
      Signed-off-by: default avatarOlga Zaborska <olga.zaborska@intel.com>
      Tested-by: default avatarNaama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      5aa48279
    • Geetha sowjanya's avatar
      octeontx2-af: Fix truncation of smq in CN10K NIX AQ enqueue mbox handler · 29fe7a1b
      Geetha sowjanya authored
      The smq value used in the CN10K NIX AQ instruction enqueue mailbox
      handler was truncated to 9-bit value from 10-bit value because of
      typecasting the CN10K mbox request structure to the CN9K structure.
      Though this hasn't caused any problems when programming the NIX SQ
      context to the HW because the context structure is the same size.
      However, this causes a problem when accessing the structure parameters.
      This patch reads the right smq value for each platform.
      
      Fixes: 30077d21 ("octeontx2-af: cn10k: Update NIX/NPA context structure")
      Signed-off-by: default avatarGeetha sowjanya <gakula@marvell.com>
      Signed-off-by: default avatarSunil Kovvuri Goutham <sgoutham@marvell.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      29fe7a1b
    • Eric Dumazet's avatar
      igmp: limit igmpv3_newpack() packet size to IP_MAX_MTU · c3b704d4
      Eric Dumazet authored
      This is a follow up of commit 915d975b ("net: deal with integer
      overflows in kmalloc_reserve()") based on David Laight feedback.
      
      Back in 2010, I failed to realize malicious users could set dev->mtu
      to arbitrary values. This mtu has been since limited to 0x7fffffff but
      regardless of how big dev->mtu is, it makes no sense for igmpv3_newpack()
      to allocate more than IP_MAX_MTU and risk various skb fields overflows.
      
      Fixes: 57e1ab6e ("igmp: refine skb allocations")
      Link: https://lore.kernel.org/netdev/d273628df80f45428e739274ab9ecb72@AcuMS.aculab.com/Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarDavid Laight <David.Laight@ACULAB.COM>
      Cc: Kyle Zeng <zengyhkyle@gmail.com>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c3b704d4
    • Sabrina Dubroca's avatar
      Revert "net: macsec: preserve ingress frame ordering" · d3287e40
      Sabrina Dubroca authored
      This reverts commit ab046a5d.
      
      It was trying to work around an issue at the crypto layer by excluding
      ASYNC implementations of gcm(aes), because a bug in the AESNI version
      caused reordering when some requests bypassed the cryptd queue while
      older requests were still pending on the queue.
      
      This was fixed by commit 38b2f68b ("crypto: aesni - Fix cryptd
      reordering problem on gcm"), which pre-dates ab046a5d.
      
      Herbert Xu confirmed that all ASYNC implementations are expected to
      maintain the ordering of completions wrt requests, so we can use them
      in MACsec.
      
      On my test machine, this restores the performance of a single netperf
      instance, from 1.4Gbps to 4.4Gbps.
      
      Link: https://lore.kernel.org/netdev/9328d206c5d9f9239cae27e62e74de40b258471d.1692279161.git.sd@queasysnail.net/T/
      Link: https://lore.kernel.org/netdev/1b0cec71-d084-8153-2ba4-72ce71abeb65@byu.edu/
      Link: https://lore.kernel.org/netdev/d335ddaa-18dc-f9f0-17ee-9783d3b2ca29@mailbox.tu-dresden.de/
      Fixes: ab046a5d ("net: macsec: preserve ingress frame ordering")
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/11c952469d114db6fb29242e1d9545e61f52f512.1693757159.git.sd@queasysnail.netSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      d3287e40
    • Shigeru Yoshida's avatar
      kcm: Destroy mutex in kcm_exit_net() · 6ad40b36
      Shigeru Yoshida authored
      kcm_exit_net() should call mutex_destroy() on knet->mutex. This is especially
      needed if CONFIG_DEBUG_MUTEXES is enabled.
      
      Fixes: ab7ac4eb ("kcm: Kernel Connection Multiplexor module")
      Signed-off-by: default avatarShigeru Yoshida <syoshida@redhat.com>
      Link: https://lore.kernel.org/r/20230902170708.1727999-1-syoshida@redhat.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      6ad40b36
    • valis's avatar
      net: sched: sch_qfq: Fix UAF in qfq_dequeue() · 8fc134fe
      valis authored
      When the plug qdisc is used as a class of the qfq qdisc it could trigger a
      UAF. This issue can be reproduced with following commands:
      
        tc qdisc add dev lo root handle 1: qfq
        tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
        tc qdisc add dev lo parent 1:1 handle 2: plug
        tc filter add dev lo parent 1: basic classid 1:1
        ping -c1 127.0.0.1
      
      and boom:
      
      [  285.353793] BUG: KASAN: slab-use-after-free in qfq_dequeue+0xa7/0x7f0
      [  285.354910] Read of size 4 at addr ffff8880bad312a8 by task ping/144
      [  285.355903]
      [  285.356165] CPU: 1 PID: 144 Comm: ping Not tainted 6.5.0-rc3+ #4
      [  285.357112] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
      [  285.358376] Call Trace:
      [  285.358773]  <IRQ>
      [  285.359109]  dump_stack_lvl+0x44/0x60
      [  285.359708]  print_address_description.constprop.0+0x2c/0x3c0
      [  285.360611]  kasan_report+0x10c/0x120
      [  285.361195]  ? qfq_dequeue+0xa7/0x7f0
      [  285.361780]  qfq_dequeue+0xa7/0x7f0
      [  285.362342]  __qdisc_run+0xf1/0x970
      [  285.362903]  net_tx_action+0x28e/0x460
      [  285.363502]  __do_softirq+0x11b/0x3de
      [  285.364097]  do_softirq.part.0+0x72/0x90
      [  285.364721]  </IRQ>
      [  285.365072]  <TASK>
      [  285.365422]  __local_bh_enable_ip+0x77/0x90
      [  285.366079]  __dev_queue_xmit+0x95f/0x1550
      [  285.366732]  ? __pfx_csum_and_copy_from_iter+0x10/0x10
      [  285.367526]  ? __pfx___dev_queue_xmit+0x10/0x10
      [  285.368259]  ? __build_skb_around+0x129/0x190
      [  285.368960]  ? ip_generic_getfrag+0x12c/0x170
      [  285.369653]  ? __pfx_ip_generic_getfrag+0x10/0x10
      [  285.370390]  ? csum_partial+0x8/0x20
      [  285.370961]  ? raw_getfrag+0xe5/0x140
      [  285.371559]  ip_finish_output2+0x539/0xa40
      [  285.372222]  ? __pfx_ip_finish_output2+0x10/0x10
      [  285.372954]  ip_output+0x113/0x1e0
      [  285.373512]  ? __pfx_ip_output+0x10/0x10
      [  285.374130]  ? icmp_out_count+0x49/0x60
      [  285.374739]  ? __pfx_ip_finish_output+0x10/0x10
      [  285.375457]  ip_push_pending_frames+0xf3/0x100
      [  285.376173]  raw_sendmsg+0xef5/0x12d0
      [  285.376760]  ? do_syscall_64+0x40/0x90
      [  285.377359]  ? __static_call_text_end+0x136578/0x136578
      [  285.378173]  ? do_syscall_64+0x40/0x90
      [  285.378772]  ? kasan_enable_current+0x11/0x20
      [  285.379469]  ? __pfx_raw_sendmsg+0x10/0x10
      [  285.380137]  ? __sock_create+0x13e/0x270
      [  285.380673]  ? __sys_socket+0xf3/0x180
      [  285.381174]  ? __x64_sys_socket+0x3d/0x50
      [  285.381725]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      [  285.382425]  ? __rcu_read_unlock+0x48/0x70
      [  285.382975]  ? ip4_datagram_release_cb+0xd8/0x380
      [  285.383608]  ? __pfx_ip4_datagram_release_cb+0x10/0x10
      [  285.384295]  ? preempt_count_sub+0x14/0xc0
      [  285.384844]  ? __list_del_entry_valid+0x76/0x140
      [  285.385467]  ? _raw_spin_lock_bh+0x87/0xe0
      [  285.386014]  ? __pfx__raw_spin_lock_bh+0x10/0x10
      [  285.386645]  ? release_sock+0xa0/0xd0
      [  285.387148]  ? preempt_count_sub+0x14/0xc0
      [  285.387712]  ? freeze_secondary_cpus+0x348/0x3c0
      [  285.388341]  ? aa_sk_perm+0x177/0x390
      [  285.388856]  ? __pfx_aa_sk_perm+0x10/0x10
      [  285.389441]  ? check_stack_object+0x22/0x70
      [  285.390032]  ? inet_send_prepare+0x2f/0x120
      [  285.390603]  ? __pfx_inet_sendmsg+0x10/0x10
      [  285.391172]  sock_sendmsg+0xcc/0xe0
      [  285.391667]  __sys_sendto+0x190/0x230
      [  285.392168]  ? __pfx___sys_sendto+0x10/0x10
      [  285.392727]  ? kvm_clock_get_cycles+0x14/0x30
      [  285.393328]  ? set_normalized_timespec64+0x57/0x70
      [  285.393980]  ? _raw_spin_unlock_irq+0x1b/0x40
      [  285.394578]  ? __x64_sys_clock_gettime+0x11c/0x160
      [  285.395225]  ? __pfx___x64_sys_clock_gettime+0x10/0x10
      [  285.395908]  ? _copy_to_user+0x3e/0x60
      [  285.396432]  ? exit_to_user_mode_prepare+0x1a/0x120
      [  285.397086]  ? syscall_exit_to_user_mode+0x22/0x50
      [  285.397734]  ? do_syscall_64+0x71/0x90
      [  285.398258]  __x64_sys_sendto+0x74/0x90
      [  285.398786]  do_syscall_64+0x64/0x90
      [  285.399273]  ? exit_to_user_mode_prepare+0x1a/0x120
      [  285.399949]  ? syscall_exit_to_user_mode+0x22/0x50
      [  285.400605]  ? do_syscall_64+0x71/0x90
      [  285.401124]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      [  285.401807] RIP: 0033:0x495726
      [  285.402233] Code: ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 09
      [  285.404683] RSP: 002b:00007ffcc25fb618 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      [  285.405677] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 0000000000495726
      [  285.406628] RDX: 0000000000000040 RSI: 0000000002518750 RDI: 0000000000000000
      [  285.407565] RBP: 00000000005205ef R08: 00000000005f8838 R09: 000000000000001c
      [  285.408523] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000002517634
      [  285.409460] R13: 00007ffcc25fb6f0 R14: 0000000000000003 R15: 0000000000000000
      [  285.410403]  </TASK>
      [  285.410704]
      [  285.410929] Allocated by task 144:
      [  285.411402]  kasan_save_stack+0x1e/0x40
      [  285.411926]  kasan_set_track+0x21/0x30
      [  285.412442]  __kasan_slab_alloc+0x55/0x70
      [  285.412973]  kmem_cache_alloc_node+0x187/0x3d0
      [  285.413567]  __alloc_skb+0x1b4/0x230
      [  285.414060]  __ip_append_data+0x17f7/0x1b60
      [  285.414633]  ip_append_data+0x97/0xf0
      [  285.415144]  raw_sendmsg+0x5a8/0x12d0
      [  285.415640]  sock_sendmsg+0xcc/0xe0
      [  285.416117]  __sys_sendto+0x190/0x230
      [  285.416626]  __x64_sys_sendto+0x74/0x90
      [  285.417145]  do_syscall_64+0x64/0x90
      [  285.417624]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      [  285.418306]
      [  285.418531] Freed by task 144:
      [  285.418960]  kasan_save_stack+0x1e/0x40
      [  285.419469]  kasan_set_track+0x21/0x30
      [  285.419988]  kasan_save_free_info+0x27/0x40
      [  285.420556]  ____kasan_slab_free+0x109/0x1a0
      [  285.421146]  kmem_cache_free+0x1c2/0x450
      [  285.421680]  __netif_receive_skb_core+0x2ce/0x1870
      [  285.422333]  __netif_receive_skb_one_core+0x97/0x140
      [  285.423003]  process_backlog+0x100/0x2f0
      [  285.423537]  __napi_poll+0x5c/0x2d0
      [  285.424023]  net_rx_action+0x2be/0x560
      [  285.424510]  __do_softirq+0x11b/0x3de
      [  285.425034]
      [  285.425254] The buggy address belongs to the object at ffff8880bad31280
      [  285.425254]  which belongs to the cache skbuff_head_cache of size 224
      [  285.426993] The buggy address is located 40 bytes inside of
      [  285.426993]  freed 224-byte region [ffff8880bad31280, ffff8880bad31360)
      [  285.428572]
      [  285.428798] The buggy address belongs to the physical page:
      [  285.429540] page:00000000f4b77674 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbad31
      [  285.430758] flags: 0x100000000000200(slab|node=0|zone=1)
      [  285.431447] page_type: 0xffffffff()
      [  285.431934] raw: 0100000000000200 ffff88810094a8c0 dead000000000122 0000000000000000
      [  285.432757] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
      [  285.433562] page dumped because: kasan: bad access detected
      [  285.434144]
      [  285.434320] Memory state around the buggy address:
      [  285.434828]  ffff8880bad31180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  285.435580]  ffff8880bad31200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  285.436264] >ffff8880bad31280: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  285.436777]                                   ^
      [  285.437106]  ffff8880bad31300: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
      [  285.437616]  ffff8880bad31380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  285.438126] ==================================================================
      [  285.438662] Disabling lock debugging due to kernel taint
      
      Fix this by:
      1. Changing sch_plug's .peek handler to qdisc_peek_dequeued(), a
      function compatible with non-work-conserving qdiscs
      2. Checking the return value of qdisc_dequeue_peeked() in sch_qfq.
      
      Fixes: 462dbc91 ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
      Reported-by: default avatarvalis <sec@valis.email>
      Signed-off-by: default avatarvalis <sec@valis.email>
      Signed-off-by: default avatarJamal Hadi Salim <jhs@mojatatu.com>
      Link: https://lore.kernel.org/r/20230901162237.11525-1-jhs@mojatatu.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      8fc134fe
  3. 04 Sep, 2023 13 commits
    • David S. Miller's avatar
      Merge branch 'af_unix-data-races' · 2861f09c
      David S. Miller authored
      Kuniyuki Iwashima says:
      
      ====================
      af_unix: Fix four data-races.
      
      While running syzkaller, KCSAN reported 3 data-races with
      systemd-coredump using AF_UNIX sockets.
      
      This series fixes the three and another one inspiered by
      one of the reports.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2861f09c
    • Kuniyuki Iwashima's avatar
      af_unix: Fix data race around sk->sk_err. · b1928129
      Kuniyuki Iwashima authored
      As with sk->sk_shutdown shown in the previous patch, sk->sk_err can be
      read locklessly by unix_dgram_sendmsg().
      
      Let's use READ_ONCE() for sk_err as well.
      
      Note that the writer side is marked by commit cc04410a ("af_unix:
      annotate lockless accesses to sk->sk_err").
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b1928129
    • Kuniyuki Iwashima's avatar
      af_unix: Fix data-races around sk->sk_shutdown. · afe8764f
      Kuniyuki Iwashima authored
      sk->sk_shutdown is changed under unix_state_lock(sk), but
      unix_dgram_sendmsg() calls two functions to read sk_shutdown locklessly.
      
        sock_alloc_send_pskb
        `- sock_wait_for_wmem
      
      Let's use READ_ONCE() there.
      
      Note that the writer side was marked by commit e1d09c2c ("af_unix:
      Fix data races around sk->sk_shutdown.").
      
      BUG: KCSAN: data-race in sock_alloc_send_pskb / unix_release_sock
      
      write (marked) to 0xffff8880069af12c of 1 bytes by task 1 on cpu 1:
       unix_release_sock+0x75c/0x910 net/unix/af_unix.c:631
       unix_release+0x59/0x80 net/unix/af_unix.c:1053
       __sock_release+0x7d/0x170 net/socket.c:654
       sock_close+0x19/0x30 net/socket.c:1386
       __fput+0x2a3/0x680 fs/file_table.c:384
       ____fput+0x15/0x20 fs/file_table.c:412
       task_work_run+0x116/0x1a0 kernel/task_work.c:179
       resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
       exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204
       __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
       syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297
       do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86
       entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      
      read to 0xffff8880069af12c of 1 bytes by task 28650 on cpu 0:
       sock_alloc_send_pskb+0xd2/0x620 net/core/sock.c:2767
       unix_dgram_sendmsg+0x2f8/0x14f0 net/unix/af_unix.c:1944
       unix_seqpacket_sendmsg net/unix/af_unix.c:2308 [inline]
       unix_seqpacket_sendmsg+0xba/0x130 net/unix/af_unix.c:2292
       sock_sendmsg_nosec net/socket.c:725 [inline]
       sock_sendmsg+0x148/0x160 net/socket.c:748
       ____sys_sendmsg+0x4e4/0x610 net/socket.c:2494
       ___sys_sendmsg+0xc6/0x140 net/socket.c:2548
       __sys_sendmsg+0x94/0x140 net/socket.c:2577
       __do_sys_sendmsg net/socket.c:2586 [inline]
       __se_sys_sendmsg net/socket.c:2584 [inline]
       __x64_sys_sendmsg+0x45/0x50 net/socket.c:2584
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      
      value changed: 0x00 -> 0x03
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 28650 Comm: systemd-coredum Not tainted 6.4.0-11989-g68433066 #6
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: default avatarsyzkaller <syzkaller@googlegroups.com>
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      afe8764f
    • Kuniyuki Iwashima's avatar
      af_unix: Fix data-race around unix_tot_inflight. · ade32bd8
      Kuniyuki Iwashima authored
      unix_tot_inflight is changed under spin_lock(unix_gc_lock), but
      unix_release_sock() reads it locklessly.
      
      Let's use READ_ONCE() for unix_tot_inflight.
      
      Note that the writer side was marked by commit 9d6d7f1c ("af_unix:
      annote lockless accesses to unix_tot_inflight & gc_in_progress")
      
      BUG: KCSAN: data-race in unix_inflight / unix_release_sock
      
      write (marked) to 0xffffffff871852b8 of 4 bytes by task 123 on cpu 1:
       unix_inflight+0x130/0x180 net/unix/scm.c:64
       unix_attach_fds+0x137/0x1b0 net/unix/scm.c:123
       unix_scm_to_skb net/unix/af_unix.c:1832 [inline]
       unix_dgram_sendmsg+0x46a/0x14f0 net/unix/af_unix.c:1955
       sock_sendmsg_nosec net/socket.c:724 [inline]
       sock_sendmsg+0x148/0x160 net/socket.c:747
       ____sys_sendmsg+0x4e4/0x610 net/socket.c:2493
       ___sys_sendmsg+0xc6/0x140 net/socket.c:2547
       __sys_sendmsg+0x94/0x140 net/socket.c:2576
       __do_sys_sendmsg net/socket.c:2585 [inline]
       __se_sys_sendmsg net/socket.c:2583 [inline]
       __x64_sys_sendmsg+0x45/0x50 net/socket.c:2583
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      read to 0xffffffff871852b8 of 4 bytes by task 4891 on cpu 0:
       unix_release_sock+0x608/0x910 net/unix/af_unix.c:671
       unix_release+0x59/0x80 net/unix/af_unix.c:1058
       __sock_release+0x7d/0x170 net/socket.c:653
       sock_close+0x19/0x30 net/socket.c:1385
       __fput+0x179/0x5e0 fs/file_table.c:321
       ____fput+0x15/0x20 fs/file_table.c:349
       task_work_run+0x116/0x1a0 kernel/task_work.c:179
       resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
       exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204
       __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
       syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297
       do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      value changed: 0x00000000 -> 0x00000001
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 4891 Comm: systemd-coredum Not tainted 6.4.0-rc5-01219-gfa0e21fa #5
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      
      Fixes: 9305cfa4 ("[AF_UNIX]: Make unix_tot_inflight counter non-atomic")
      Reported-by: default avatarsyzkaller <syzkaller@googlegroups.com>
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ade32bd8
    • Kuniyuki Iwashima's avatar
      af_unix: Fix data-races around user->unix_inflight. · 0bc36c06
      Kuniyuki Iwashima authored
      user->unix_inflight is changed under spin_lock(unix_gc_lock),
      but too_many_unix_fds() reads it locklessly.
      
      Let's annotate the write/read accesses to user->unix_inflight.
      
      BUG: KCSAN: data-race in unix_attach_fds / unix_inflight
      
      write to 0xffffffff8546f2d0 of 8 bytes by task 44798 on cpu 1:
       unix_inflight+0x157/0x180 net/unix/scm.c:66
       unix_attach_fds+0x147/0x1e0 net/unix/scm.c:123
       unix_scm_to_skb net/unix/af_unix.c:1827 [inline]
       unix_dgram_sendmsg+0x46a/0x14f0 net/unix/af_unix.c:1950
       unix_seqpacket_sendmsg net/unix/af_unix.c:2308 [inline]
       unix_seqpacket_sendmsg+0xba/0x130 net/unix/af_unix.c:2292
       sock_sendmsg_nosec net/socket.c:725 [inline]
       sock_sendmsg+0x148/0x160 net/socket.c:748
       ____sys_sendmsg+0x4e4/0x610 net/socket.c:2494
       ___sys_sendmsg+0xc6/0x140 net/socket.c:2548
       __sys_sendmsg+0x94/0x140 net/socket.c:2577
       __do_sys_sendmsg net/socket.c:2586 [inline]
       __se_sys_sendmsg net/socket.c:2584 [inline]
       __x64_sys_sendmsg+0x45/0x50 net/socket.c:2584
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      
      read to 0xffffffff8546f2d0 of 8 bytes by task 44814 on cpu 0:
       too_many_unix_fds net/unix/scm.c:101 [inline]
       unix_attach_fds+0x54/0x1e0 net/unix/scm.c:110
       unix_scm_to_skb net/unix/af_unix.c:1827 [inline]
       unix_dgram_sendmsg+0x46a/0x14f0 net/unix/af_unix.c:1950
       unix_seqpacket_sendmsg net/unix/af_unix.c:2308 [inline]
       unix_seqpacket_sendmsg+0xba/0x130 net/unix/af_unix.c:2292
       sock_sendmsg_nosec net/socket.c:725 [inline]
       sock_sendmsg+0x148/0x160 net/socket.c:748
       ____sys_sendmsg+0x4e4/0x610 net/socket.c:2494
       ___sys_sendmsg+0xc6/0x140 net/socket.c:2548
       __sys_sendmsg+0x94/0x140 net/socket.c:2577
       __do_sys_sendmsg net/socket.c:2586 [inline]
       __se_sys_sendmsg net/socket.c:2584 [inline]
       __x64_sys_sendmsg+0x45/0x50 net/socket.c:2584
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      
      value changed: 0x000000000000000c -> 0x000000000000000d
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 44814 Comm: systemd-coredum Not tainted 6.4.0-11989-g68433066 #6
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      
      Fixes: 712f4aad ("unix: properly account for FDs passed over unix sockets")
      Reported-by: default avatarsyzkaller <syzkaller@googlegroups.com>
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Acked-by: default avatarWilly Tarreau <w@1wt.eu>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0bc36c06
    • Kuniyuki Iwashima's avatar
      af_unix: Fix msg_controllen test in scm_pidfd_recv() for MSG_CMSG_COMPAT. · 718e6b51
      Kuniyuki Iwashima authored
      Heiko Carstens reported that SCM_PIDFD does not work with MSG_CMSG_COMPAT
      because scm_pidfd_recv() always checks msg_controllen against sizeof(struct
      cmsghdr).
      
      We need to use sizeof(struct compat_cmsghdr) for the compat case.
      
      Fixes: 5e2ff670 ("scm: add SO_PASSPIDFD and SCM_PIDFD")
      Reported-by: default avatarHeiko Carstens <hca@linux.ibm.com>
      Closes: https://lore.kernel.org/netdev/20230901200517.8742-A-hca@linux.ibm.com/Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Tested-by: default avatarHeiko Carstens <hca@linux.ibm.com>
      Reviewed-by: default avatarAlexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
      Reviewed-by: default avatarMichal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Acked-by: default avatarChristian Brauner <brauner@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      718e6b51
    • Jakub Kicinski's avatar
      docs: netdev: update the netdev infra URLs · 52450087
      Jakub Kicinski authored
      Some corporate proxies block our current NIPA URLs because
      they use a free / shady DNS domain. As suggested by Jesse
      we got a new DNS entry from Konstantin - netdev.bots.linux.dev,
      use it.
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      52450087
    • Jakub Kicinski's avatar
      docs: netdev: document patchwork patch states · ee8ab74a
      Jakub Kicinski authored
      The patchwork states are largely self-explanatory but small
      ambiguities may still come up. Document how we interpret
      the states in networking.
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ee8ab74a
    • Oleksij Rempel's avatar
      net: phy: micrel: Correct bit assignments for phy_device flags · 719c5e37
      Oleksij Rempel authored
      Previously, the defines for phy_device flags in the Micrel driver were
      ambiguous in their representation. They were intended to be bit masks
      but were mistakenly defined as bit positions. This led to the following
      issues:
      
      - MICREL_KSZ8_P1_ERRATA, designated for KSZ88xx switches, overlapped
        with MICREL_PHY_FXEN and MICREL_PHY_50MHZ_CLK.
      - Due to this overlap, the code path for MICREL_PHY_FXEN, tailored for
        the KSZ8041 PHY, was not executed for KSZ88xx PHYs.
      - Similarly, the code associated with MICREL_PHY_50MHZ_CLK wasn't
        triggered for KSZ88xx.
      
      To rectify this, all three flags have now been explicitly converted to
      use the `BIT()` macro, ensuring they are defined as bit masks and
      preventing potential overlaps in the future.
      
      Fixes: 49011e0c ("net: phy: micrel: ksz886x/ksz8081: add cabletest support")
      Signed-off-by: default avatarOleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: default avatarRussell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      719c5e37
    • Alex Henrie's avatar
      net: ipv6/addrconf: avoid integer underflow in ipv6_create_tempaddr · f31867d0
      Alex Henrie authored
      The existing code incorrectly casted a negative value (the result of a
      subtraction) to an unsigned value without checking. For example, if
      /proc/sys/net/ipv6/conf/*/temp_prefered_lft was set to 1, the preferred
      lifetime would jump to 4 billion seconds. On my machine and network the
      shortest lifetime that avoided underflow was 3 seconds.
      
      Fixes: 76506a98 ("IPv6: fix DESYNC_FACTOR")
      Signed-off-by: default avatarAlex Henrie <alexhenrie24@gmail.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f31867d0
    • Liang Chen's avatar
      veth: Fixing transmit return status for dropped packets · 151e887d
      Liang Chen authored
      The veth_xmit function returns NETDEV_TX_OK even when packets are dropped.
      This behavior leads to incorrect calculations of statistics counts, as
      well as things like txq->trans_start updates.
      
      Fixes: e314dbdc ("[NET]: Virtual ethernet device driver.")
      Signed-off-by: default avatarLiang Chen <liangchen.linux@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      151e887d
    • Eric Dumazet's avatar
      gve: fix frag_list chaining · 817c7cd2
      Eric Dumazet authored
      gve_rx_append_frags() is able to build skbs chained with frag_list,
      like GRO engine.
      
      Problem is that shinfo->frag_list should only be used
      for the head of the chain.
      
      All other links should use skb->next pointer.
      
      Otherwise, built skbs are not valid and can cause crashes.
      
      Equivalent code in GRO (skb_gro_receive()) is:
      
          if (NAPI_GRO_CB(p)->last == p)
              skb_shinfo(p)->frag_list = skb;
          else
              NAPI_GRO_CB(p)->last->next = skb;
          NAPI_GRO_CB(p)->last = skb;
      
      Fixes: 9b8dd5e5 ("gve: DQO: Add RX path")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Bailey Forrest <bcf@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Catherine Sullivan <csully@google.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      817c7cd2
    • Eric Dumazet's avatar
      net: deal with integer overflows in kmalloc_reserve() · 915d975b
      Eric Dumazet authored
      Blamed commit changed:
          ptr = kmalloc(size);
          if (ptr)
            size = ksize(ptr);
      
      to:
          size = kmalloc_size_roundup(size);
          ptr = kmalloc(size);
      
      This allowed various crash as reported by syzbot [1]
      and Kyle Zeng.
      
      Problem is that if @size is bigger than 0x80000001,
      kmalloc_size_roundup(size) returns 2^32.
      
      kmalloc_reserve() uses a 32bit variable (obj_size),
      so 2^32 is truncated to 0.
      
      kmalloc(0) returns ZERO_SIZE_PTR which is not handled by
      skb allocations.
      
      Following trace can be triggered if a netdev->mtu is set
      close to 0x7fffffff
      
      We might in the future limit netdev->mtu to more sensible
      limit (like KMALLOC_MAX_SIZE).
      
      This patch is based on a syzbot report, and also a report
      and tentative fix from Kyle Zeng.
      
      [1]
      BUG: KASAN: user-memory-access in __build_skb_around net/core/skbuff.c:294 [inline]
      BUG: KASAN: user-memory-access in __alloc_skb+0x3c4/0x6e8 net/core/skbuff.c:527
      Write of size 32 at addr 00000000fffffd10 by task syz-executor.4/22554
      
      CPU: 1 PID: 22554 Comm: syz-executor.4 Not tainted 6.1.39-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023
      Call trace:
      dump_backtrace+0x1c8/0x1f4 arch/arm64/kernel/stacktrace.c:279
      show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:286
      __dump_stack lib/dump_stack.c:88 [inline]
      dump_stack_lvl+0x120/0x1a0 lib/dump_stack.c:106
      print_report+0xe4/0x4b4 mm/kasan/report.c:398
      kasan_report+0x150/0x1ac mm/kasan/report.c:495
      kasan_check_range+0x264/0x2a4 mm/kasan/generic.c:189
      memset+0x40/0x70 mm/kasan/shadow.c:44
      __build_skb_around net/core/skbuff.c:294 [inline]
      __alloc_skb+0x3c4/0x6e8 net/core/skbuff.c:527
      alloc_skb include/linux/skbuff.h:1316 [inline]
      igmpv3_newpack+0x104/0x1088 net/ipv4/igmp.c:359
      add_grec+0x81c/0x1124 net/ipv4/igmp.c:534
      igmpv3_send_cr net/ipv4/igmp.c:667 [inline]
      igmp_ifc_timer_expire+0x1b0/0x1008 net/ipv4/igmp.c:810
      call_timer_fn+0x1c0/0x9f0 kernel/time/timer.c:1474
      expire_timers kernel/time/timer.c:1519 [inline]
      __run_timers+0x54c/0x710 kernel/time/timer.c:1790
      run_timer_softirq+0x28/0x4c kernel/time/timer.c:1803
      _stext+0x380/0xfbc
      ____do_softirq+0x14/0x20 arch/arm64/kernel/irq.c:79
      call_on_irq_stack+0x24/0x4c arch/arm64/kernel/entry.S:891
      do_softirq_own_stack+0x20/0x2c arch/arm64/kernel/irq.c:84
      invoke_softirq kernel/softirq.c:437 [inline]
      __irq_exit_rcu+0x1c0/0x4cc kernel/softirq.c:683
      irq_exit_rcu+0x14/0x78 kernel/softirq.c:695
      el0_interrupt+0x7c/0x2e0 arch/arm64/kernel/entry-common.c:717
      __el0_irq_handler_common+0x18/0x24 arch/arm64/kernel/entry-common.c:724
      el0t_64_irq_handler+0x10/0x1c arch/arm64/kernel/entry-common.c:729
      el0t_64_irq+0x1a0/0x1a4 arch/arm64/kernel/entry.S:584
      
      Fixes: 12d6c1d3 ("skbuff: Proactively round up to kmalloc bucket size")
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Reported-by: default avatarKyle Zeng <zengyhkyle@gmail.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      915d975b
  4. 03 Sep, 2023 1 commit
  5. 01 Sep, 2023 11 commits
    • Edward Cree's avatar
      sfc: check for zero length in EF10 RX prefix · ae074e2b
      Edward Cree authored
      When EF10 RXDP firmware is operating in cut-through mode, packet length
       is not known at the time the RX prefix is generated, so it is left as
       zero and RX event merging is inhibited to ensure that the length is
       available in the RX event.  However, it has been found that in certain
       circumstances the RX events for these packets still get merged,
       meaning the driver cannot read the length from the RX event, and tries
       to use the length from the prefix.
      The resulting zero-length SKBs cause crashes in GRO since commit
       1d11fa69 ("net-gro: remove GRO_DROP"), so add a check to the driver
       to detect these zero-length RX events and discard the packet.
      Signed-off-by: default avatarEdward Cree <ecree.xilinx@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ae074e2b
    • David S. Miller's avatar
      Merge branch 'dst-hint-multipath' · d8a30706
      David S. Miller authored
      Sriram Yagnaraman says:
      
      ====================
      Avoid TCP resets when using ECMP for load-balancing between multiple servers.
      
      All packets in the same flow (L3/L4 depending on multipath hash policy)
      should be directed to the same target, but after [0]/[1] we see stray
      packets directed towards other targets. This, for instance, causes RST
      to be sent on TCP connections.
      
      The first two patches solve the problem by ignoring route hints for
      destinations that are part of multipath group, by using new SKB flags
      for IPv4 and IPv6. The third patch is a selftest that tests the
      scenario.
      
      Thanks to Ido, for reviewing and suggesting a way forward in [2] and
      also suggesting how to write a selftest for this.
      
      v4->v5:
      - Fixed review comments from Ido
      v3->v4:
      - Remove single path test
      - Rebase to latest
      v2->v3:
      - Add NULL check for skb in fib6_select_path (Ido Schimmel)
      - Use fib_tests.sh for selftest instead of the forwarding suite (Ido
        Schimmel)
      v1->v2:
      - Update to commit messages describing the solution (Ido Schimmel)
      - Use perf stat to count fib table lookups in selftest (Ido Schimmel)
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d8a30706
    • Sriram Yagnaraman's avatar
      selftests: fib_tests: Add multipath list receive tests · 8ae9efb8
      Sriram Yagnaraman authored
      The test uses perf stat to count the number of fib:fib_table_lookup
      tracepoint hits for IPv4 and the number of fib6:fib6_table_lookup for
      IPv6. The measured count is checked to be within 5% of the total number
      of packets sent via veth1.
      Signed-off-by: default avatarSriram Yagnaraman <sriram.yagnaraman@est.tech>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8ae9efb8
    • Sriram Yagnaraman's avatar
      ipv6: ignore dst hint for multipath routes · 8423be89
      Sriram Yagnaraman authored
      Route hints when the nexthop is part of a multipath group causes packets
      in the same receive batch to be sent to the same nexthop irrespective of
      the multipath hash of the packet. So, do not extract route hint for
      packets whose destination is part of a multipath group.
      
      A new SKB flag IP6SKB_MULTIPATH is introduced for this purpose, set the
      flag when route is looked up in fib6_select_path() and use it in
      ip6_can_use_hint() to check for the existence of the flag.
      
      Fixes: 197dbf24 ("ipv6: introduce and uses route look hints for list input.")
      Signed-off-by: default avatarSriram Yagnaraman <sriram.yagnaraman@est.tech>
      Reviewed-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8423be89
    • Sriram Yagnaraman's avatar
      ipv4: ignore dst hint for multipath routes · 6ac66cb0
      Sriram Yagnaraman authored
      Route hints when the nexthop is part of a multipath group causes packets
      in the same receive batch to be sent to the same nexthop irrespective of
      the multipath hash of the packet. So, do not extract route hint for
      packets whose destination is part of a multipath group.
      
      A new SKB flag IPSKB_MULTIPATH is introduced for this purpose, set the
      flag when route is looked up in ip_mkroute_input() and use it in
      ip_extract_route_hint() to check for the existence of the flag.
      
      Fixes: 02b24941 ("ipv4: use dst hint for ipv4 list receive")
      Signed-off-by: default avatarSriram Yagnaraman <sriram.yagnaraman@est.tech>
      Reviewed-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6ac66cb0
    • Mohamed Khalfella's avatar
      skbuff: skb_segment, Call zero copy functions before using skbuff frags · 2ea35288
      Mohamed Khalfella authored
      Commit bf5c25d6 ("skbuff: in skb_segment, call zerocopy functions
      once per nskb") added the call to zero copy functions in skb_segment().
      The change introduced a bug in skb_segment() because skb_orphan_frags()
      may possibly change the number of fragments or allocate new fragments
      altogether leaving nrfrags and frag to point to the old values. This can
      cause a panic with stacktrace like the one below.
      
      [  193.894380] BUG: kernel NULL pointer dereference, address: 00000000000000bc
      [  193.895273] CPU: 13 PID: 18164 Comm: vh-net-17428 Kdump: loaded Tainted: G           O      5.15.123+ #26
      [  193.903919] RIP: 0010:skb_segment+0xb0e/0x12f0
      [  194.021892] Call Trace:
      [  194.027422]  <TASK>
      [  194.072861]  tcp_gso_segment+0x107/0x540
      [  194.082031]  inet_gso_segment+0x15c/0x3d0
      [  194.090783]  skb_mac_gso_segment+0x9f/0x110
      [  194.095016]  __skb_gso_segment+0xc1/0x190
      [  194.103131]  netem_enqueue+0x290/0xb10 [sch_netem]
      [  194.107071]  dev_qdisc_enqueue+0x16/0x70
      [  194.110884]  __dev_queue_xmit+0x63b/0xb30
      [  194.121670]  bond_start_xmit+0x159/0x380 [bonding]
      [  194.128506]  dev_hard_start_xmit+0xc3/0x1e0
      [  194.131787]  __dev_queue_xmit+0x8a0/0xb30
      [  194.138225]  macvlan_start_xmit+0x4f/0x100 [macvlan]
      [  194.141477]  dev_hard_start_xmit+0xc3/0x1e0
      [  194.144622]  sch_direct_xmit+0xe3/0x280
      [  194.147748]  __dev_queue_xmit+0x54a/0xb30
      [  194.154131]  tap_get_user+0x2a8/0x9c0 [tap]
      [  194.157358]  tap_sendmsg+0x52/0x8e0 [tap]
      [  194.167049]  handle_tx_zerocopy+0x14e/0x4c0 [vhost_net]
      [  194.173631]  handle_tx+0xcd/0xe0 [vhost_net]
      [  194.176959]  vhost_worker+0x76/0xb0 [vhost]
      [  194.183667]  kthread+0x118/0x140
      [  194.190358]  ret_from_fork+0x1f/0x30
      [  194.193670]  </TASK>
      
      In this case calling skb_orphan_frags() updated nr_frags leaving nrfrags
      local variable in skb_segment() stale. This resulted in the code hitting
      i >= nrfrags prematurely and trying to move to next frag_skb using
      list_skb pointer, which was NULL, and caused kernel panic. Move the call
      to zero copy functions before using frags and nr_frags.
      
      Fixes: bf5c25d6 ("skbuff: in skb_segment, call zerocopy functions once per nskb")
      Signed-off-by: default avatarMohamed Khalfella <mkhalfella@purestorage.com>
      Reported-by: default avatarAmit Goyal <agoyal@purestorage.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2ea35288
    • David S. Miller's avatar
      Merge branch 'net-data-race-annotations' · f2e977f3
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      net: another round of data-race annotations
      
      Series inspired by some syzbot reports, taking care
      of 4 socket fields that can be read locklessly.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f2e977f3
    • Eric Dumazet's avatar
      net: annotate data-races around sk->sk_bind_phc · 251cd405
      Eric Dumazet authored
      sk->sk_bind_phc is read locklessly. Add corresponding annotations.
      
      Fixes: d463126e ("net: sock: extend SO_TIMESTAMPING for PHC binding")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Yangbo Lu <yangbo.lu@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      251cd405
    • Eric Dumazet's avatar
      net: annotate data-races around sk->sk_tsflags · e3390b30
      Eric Dumazet authored
      sk->sk_tsflags can be read locklessly, add corresponding annotations.
      
      Fixes: b9f40e21 ("net-timestamp: move timestamp flags out of sk_flags")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e3390b30
    • Eric Dumazet's avatar
      mptcp: annotate data-races around msk->rmem_fwd_alloc · 9531e4a8
      Eric Dumazet authored
      msk->rmem_fwd_alloc can be read locklessly.
      
      Add mptcp_rmem_fwd_alloc_add(), similar to sk_forward_alloc_add(),
      and appropriate READ_ONCE()/WRITE_ONCE() annotations.
      
      Fixes: 6511882c ("mptcp: allocate fwd memory separately on the rx and tx path")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9531e4a8
    • Eric Dumazet's avatar
      net: annotate data-races around sk->sk_forward_alloc · 5e6300e7
      Eric Dumazet authored
      Every time sk->sk_forward_alloc is read locklessly,
      add a READ_ONCE().
      
      Add sk_forward_alloc_add() helper to centralize updates,
      to reduce number of WRITE_ONCE().
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5e6300e7