1. 08 Aug, 2023 9 commits
    • net: hns3: fix deadlock issue when externel_lb and reset are executed together · ac6257a3
      Yonglong Liu authored
      When externel_lb and reset are executed together, a deadlock may
      occur:
      [ 3147.217009] INFO: task kworker/u321:0:7 blocked for more than 120 seconds.
      [ 3147.230483] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 3147.238999] task:kworker/u321:0  state:D stack:    0 pid:    7 ppid:     2 flags:0x00000008
      [ 3147.248045] Workqueue: hclge hclge_service_task [hclge]
      [ 3147.253957] Call trace:
      [ 3147.257093]  __switch_to+0x7c/0xbc
      [ 3147.261183]  __schedule+0x338/0x6f0
      [ 3147.265357]  schedule+0x50/0xe0
      [ 3147.269185]  schedule_preempt_disabled+0x18/0x24
      [ 3147.274488]  __mutex_lock.constprop.0+0x1d4/0x5dc
      [ 3147.279880]  __mutex_lock_slowpath+0x1c/0x30
      [ 3147.284839]  mutex_lock+0x50/0x60
      [ 3147.288841]  rtnl_lock+0x20/0x2c
      [ 3147.292759]  hclge_reset_prepare+0x68/0x90 [hclge]
      [ 3147.298239]  hclge_reset_subtask+0x88/0xe0 [hclge]
      [ 3147.303718]  hclge_reset_service_task+0x84/0x120 [hclge]
      [ 3147.309718]  hclge_service_task+0x2c/0x70 [hclge]
      [ 3147.315109]  process_one_work+0x1d0/0x490
      [ 3147.319805]  worker_thread+0x158/0x3d0
      [ 3147.324240]  kthread+0x108/0x13c
      [ 3147.328154]  ret_from_fork+0x10/0x18
      
      In the externel_lb process, the hns3 driver calls napi_disable()
      first. If a reset happens at that point, the restore step of
      externel_lb fails and napi_enable() is never called. When
      externel_lb is run again, napi_disable() is called a second time,
      which causes a deadlock on rtnl_lock().

      This patch uses the HNS3_NIC_STATE_DOWN state to protect the
      calls to napi_disable() and napi_enable() in the externel_lb
      process, just as ndo_stop() and ndo_start() do.
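
      The guarding idea can be illustrated with a minimal user-space
      sketch. It is not the hns3 code: struct nic_sketch, nic_stop() and
      nic_start() are hypothetical names, and a plain counter stands in
      for napi_disable()/napi_enable(). A "down" flag makes the second
      stop (or start) a no-op instead of a second disable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a NIC private struct; not the hns3 layout. */
struct nic_sketch {
    atomic_bool down;        /* plays the role of a "state down" bit */
    int napi_disable_depth;  /* disable calls minus enable calls */
};

/* Returns true only for the caller that moved the NIC from up to down. */
static bool nic_stop(struct nic_sketch *n)
{
    if (atomic_exchange(&n->down, true))
        return false;          /* already down: skip a second disable */
    n->napi_disable_depth++;   /* stands in for napi_disable() */
    return true;
}

/* Returns true only for the caller that moved the NIC from down to up. */
static bool nic_start(struct nic_sketch *n)
{
    bool expected = true;

    if (!atomic_compare_exchange_strong(&n->down, &expected, false))
        return false;          /* already up: nothing to re-enable */
    n->napi_disable_depth--;   /* stands in for napi_enable() */
    return true;
}
```

      With this guard, calling nic_stop() twice in a row leaves
      napi_disable_depth at 1, so the disable is never doubled.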
      
      Fixes: 04b6ba14 ("net: hns3: add support for external loopback test")
      Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
      Signed-off-by: Jijie Shao <shaojijie@huawei.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Link: https://lore.kernel.org/r/20230807113452.474224-5-shaojijie@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: hns3: add wait until mac link down · 6265e242
      Jie Wang authored
      In some configuration flows of the hns3 driver, for example changing
      the MTU, the driver disables the MAC through firmware before
      configuring. But firmware disables the MAC asynchronously, so RX
      traffic may not have stopped yet in this case.

      Fix it by waiting until the MAC link is down.
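
      Waiting for an asynchronous link-down can be sketched as a bounded
      poll loop. This is an illustration under assumptions, not the driver
      code: wait_link_down(), link_up_fn and fake_link_up() are
      hypothetical names, and a real driver would sleep between polls.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical callback type: returns true while the MAC link is still up. */
typedef bool (*link_up_fn)(void *ctx);

/*
 * Poll until the link reports down or max_polls attempts are used up.
 * Returns 0 once the link is down, -1 if it never went down.
 * A real driver would sleep between polls; this sketch does not.
 */
static int wait_link_down(link_up_fn link_up, void *ctx, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (!link_up(ctx))
            return 0;
    }
    return -1;
}

/* Fake firmware for illustration: link stays up for *remaining more polls. */
static bool fake_link_up(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return true;
    }
    return false;
}
```

      The bound matters: if firmware never reports link down, the caller
      gets an error back instead of blocking forever.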
      
      Fixes: a9775bb6 ("net: hns3: fix set and get link ksettings issue")
      Signed-off-by: Jie Wang <wangjie125@huawei.com>
      Signed-off-by: Jijie Shao <shaojijie@huawei.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Link: https://lore.kernel.org/r/20230807113452.474224-4-shaojijie@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: hns3: refactor hclge_mac_link_status_wait for interface reuse · 08469dac
      Jie Wang authored
      Some NIC configurations can only be performed after the link is down,
      so this patch refactors this API for reuse.
      Signed-off-by: Jie Wang <wangjie125@huawei.com>
      Signed-off-by: Jijie Shao <shaojijie@huawei.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Link: https://lore.kernel.org/r/20230807113452.474224-3-shaojijie@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: hns3: restore user pause configure when disable autoneg · 15159ec0
      Jian Shen authored
      Restore the MAC pause state to the user configuration when autoneg is disabled.
      Signed-off-by: Jian Shen <shenjian15@huawei.com>
      Signed-off-by: Peiyang Wang <wangpeiyang1@huawei.com>
      Signed-off-by: Jijie Shao <shaojijie@huawei.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Link: https://lore.kernel.org/r/20230807113452.474224-2-shaojijie@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/unix: use consistent error code in SO_PEERPIDFD · b6f79e82
      David Rheinsberg authored
      Change the new (unreleased) SO_PEERPIDFD sockopt to return ENODATA
      rather than ESRCH if a socket type does not support remote peer-PID
      queries.
      
      Currently, SO_PEERPIDFD returns ESRCH when the socket in question is
      not an AF_UNIX socket. This is quite unexpected, given that one would
      assume ESRCH means the peer process already exited and thus cannot be
      found. However, in that case the sockopt actually returns EINVAL (via
      pidfd_prepare()). This is rather inconsistent with other syscalls, which
      usually return ESRCH if a given PID refers to a nonexistent process.
      
      This changes SO_PEERPIDFD to return ENODATA instead. This is also what
      SO_PEERGROUPS returns, and thus keeps a consistent behavior across
      sockopts.
      
      Note that this code is returned in 2 cases: First, if the socket type is
      not AF_UNIX, and secondly if the socket was not yet connected. In both
      cases ENODATA seems suitable.
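
      From user space, the sockopt could be wrapped roughly as below. This
      is a sketch under assumptions: get_peer_pidfd() is a hypothetical
      helper, and the fallback value 77 for SO_PEERPIDFD is the
      asm-generic number, which may differ on some architectures. With
      this change applied, a result of -ENODATA would mean the socket is
      not AF_UNIX or is not yet connected.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

#ifndef SO_PEERPIDFD
#define SO_PEERPIDFD 77 /* assumption: asm-generic value, for older headers */
#endif

/* Returns 0 and stores the pidfd in *pidfd, or a negative errno value. */
static int get_peer_pidfd(int sock, int *pidfd)
{
    socklen_t len = sizeof(*pidfd);

    assert(pidfd != NULL);
    if (getsockopt(sock, SOL_SOCKET, SO_PEERPIDFD, pidfd, &len) < 0)
        return -errno;
    return 0;
}
```

      A caller can then treat -ENODATA as "no peer PID available for this
      socket" and handle other errors separately.
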
      Signed-off-by: David Rheinsberg <david@readahead.eu>
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Acked-by: Luca Boccassi <bluca@debian.org>
      Fixes: 7b26952a ("net: core: add getsockopt SO_PEERPIDFD")
      Link: https://lore.kernel.org/r/20230807081225.816199-1-david@readahead.eu
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • MAINTAINERS: update Claudiu Beznea's email address · fa40ea27
      Claudiu Beznea authored
      Update MAINTAINERS entries with a valid email address as the Microchip
      one is no longer valid.
      Acked-by: Conor Dooley <conor.dooley@microchip.com>
      Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
      Signed-off-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
      Acked-by: Sebastian Reichel <sre@kernel.org>
      Link: https://lore.kernel.org/r/20230804050007.235799-1-claudiu.beznea@tuxon.dev
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: marvell: prestera: fix handling IPv4 routes with nhid · 2aa71b4b
      Jonas Gorski authored
      Fix handling IPv4 routes referencing a nexthop via its id by replacing
      calls to fib_info_nh() with fib_info_nhc().
      
      Trying to add an IPv4 route referencing a nexthop via nhid:
      
          $ ip link set up swp5
          $ ip a a 10.0.0.1/24 dev swp5
          $ ip nexthop add dev swp5 id 20 via 10.0.0.2
          $ ip route add 10.0.1.0/24 nhid 20
      
      triggers warnings when trying to handle the route:
      
      [  528.805763] ------------[ cut here ]------------
      [  528.810437] WARNING: CPU: 3 PID: 53 at include/net/nexthop.h:468 __prestera_fi_is_direct+0x2c/0x68 [prestera]
      [  528.820434] Modules linked in: prestera_pci act_gact act_police sch_ingress cls_u32 cls_flower prestera arm64_delta_tn48m_dn_led(O) arm64_delta_tn48m_dn_cpld(O) [last unloaded: prestera_pci]
      [  528.837485] CPU: 3 PID: 53 Comm: kworker/u8:3 Tainted: G           O       6.4.5 #1
      [  528.845178] Hardware name: delta,tn48m-dn (DT)
      [  528.849641] Workqueue: prestera_ordered __prestera_router_fib_event_work [prestera]
      [  528.857352] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      [  528.864347] pc : __prestera_fi_is_direct+0x2c/0x68 [prestera]
      [  528.870135] lr : prestera_k_arb_fib_evt+0xb20/0xd50 [prestera]
      [  528.876007] sp : ffff80000b20bc90
      [  528.879336] x29: ffff80000b20bc90 x28: 0000000000000000 x27: ffff0001374d3a48
      [  528.886510] x26: ffff000105604000 x25: ffff000134af8a28 x24: ffff0001374d3800
      [  528.893683] x23: ffff000101c89148 x22: ffff000101c89000 x21: ffff000101c89200
      [  528.900855] x20: ffff00013641fda0 x19: ffff800009d01088 x18: 0000000000000059
      [  528.908027] x17: 0000000000000277 x16: 0000000000000000 x15: 0000000000000000
      [  528.915198] x14: 0000000000000003 x13: 00000000000fe400 x12: 0000000000000000
      [  528.922371] x11: 0000000000000002 x10: 0000000000000aa0 x9 : ffff8000013d2020
      [  528.929543] x8 : 0000000000000018 x7 : 000000007b1703f8 x6 : 000000001ca72f86
      [  528.936715] x5 : 0000000033399ea7 x4 : 0000000000000000 x3 : ffff0001374d3acc
      [  528.943886] x2 : 0000000000000000 x1 : ffff00010200de00 x0 : ffff000134ae3f80
      [  528.951058] Call trace:
      [  528.953516]  __prestera_fi_is_direct+0x2c/0x68 [prestera]
      [  528.958952]  __prestera_router_fib_event_work+0x100/0x158 [prestera]
      [  528.965348]  process_one_work+0x208/0x488
      [  528.969387]  worker_thread+0x4c/0x430
      [  528.973068]  kthread+0x120/0x138
      [  528.976313]  ret_from_fork+0x10/0x20
      [  528.979909] ---[ end trace 0000000000000000 ]---
      [  528.984998] ------------[ cut here ]------------
      [  528.989645] WARNING: CPU: 3 PID: 53 at include/net/nexthop.h:468 __prestera_fi_is_direct+0x2c/0x68 [prestera]
      [  528.999628] Modules linked in: prestera_pci act_gact act_police sch_ingress cls_u32 cls_flower prestera arm64_delta_tn48m_dn_led(O) arm64_delta_tn48m_dn_cpld(O) [last unloaded: prestera_pci]
      [  529.016676] CPU: 3 PID: 53 Comm: kworker/u8:3 Tainted: G        W  O       6.4.5 #1
      [  529.024368] Hardware name: delta,tn48m-dn (DT)
      [  529.028830] Workqueue: prestera_ordered __prestera_router_fib_event_work [prestera]
      [  529.036539] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      [  529.043533] pc : __prestera_fi_is_direct+0x2c/0x68 [prestera]
      [  529.049318] lr : __prestera_k_arb_fc_apply+0x280/0x2f8 [prestera]
      [  529.055452] sp : ffff80000b20bc60
      [  529.058781] x29: ffff80000b20bc60 x28: 0000000000000000 x27: ffff0001374d3a48
      [  529.065953] x26: ffff000105604000 x25: ffff000134af8a28 x24: ffff0001374d3800
      [  529.073126] x23: ffff000101c89148 x22: ffff000101c89148 x21: ffff00013641fda0
      [  529.080299] x20: ffff000101c89000 x19: ffff000101c89020 x18: 0000000000000059
      [  529.087471] x17: 0000000000000277 x16: 0000000000000000 x15: 0000000000000000
      [  529.094642] x14: 0000000000000003 x13: 00000000000fe400 x12: 0000000000000000
      [  529.101814] x11: 0000000000000002 x10: 0000000000000aa0 x9 : ffff8000013cee80
      [  529.108985] x8 : 0000000000000018 x7 : 000000007b1703f8 x6 : 0000000000000018
      [  529.116157] x5 : 00000000d3497eb6 x4 : ffff000105604081 x3 : 000000008e979557
      [  529.123329] x2 : 0000000000000000 x1 : ffff00010200de00 x0 : ffff000134ae3f80
      [  529.130501] Call trace:
      [  529.132958]  __prestera_fi_is_direct+0x2c/0x68 [prestera]
      [  529.138394]  prestera_k_arb_fib_evt+0x6b8/0xd50 [prestera]
      [  529.143918]  __prestera_router_fib_event_work+0x100/0x158 [prestera]
      [  529.150313]  process_one_work+0x208/0x488
      [  529.154348]  worker_thread+0x4c/0x430
      [  529.158030]  kthread+0x120/0x138
      [  529.161274]  ret_from_fork+0x10/0x20
      [  529.164867] ---[ end trace 0000000000000000 ]---
      
      and results in a non-offloaded route:
      
          $ ip route
          10.0.0.0/24 dev swp5 proto kernel scope link src 10.0.0.1 rt_trap
          10.0.1.0/24 nhid 20 via 10.0.0.2 dev swp5 rt_trap
      
      When creating a route referencing a nexthop via its ID, the nexthop will
      be stored in a separate nh pointer instead of the array of nexthops in
      the fib_info struct. This causes issues since fib_info_nh() only handles
      the nexthops array, but not the separate nh pointer, and will loudly
      WARN about it.
      
      In contrast fib_info_nhc() handles both, but returns a fib_nh_common
      pointer instead of a fib_nh pointer. Luckily we only ever access fields
      from the fib_nh_common parts, so we can just replace all instances of
      fib_info_nh() with fib_info_nhc() and access the fields via their
      fib_nh_common names.
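
      The difference between the two accessors can be illustrated with
      reduced stand-in types. None of the names below are kernel names:
      nh_common_sketch, nh_full_sketch, route_nh() and route_nhc() are
      hypothetical, and returning NULL in the array-only accessor is this
      sketch's choice (the text above says the kernel helper WARNs).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structs. */
struct nh_common_sketch { int oif; unsigned char scope; };
struct nh_full_sketch { struct nh_common_sketch common; int extra; };

struct route_sketch {
    struct nh_full_sketch *ext_nh;     /* set when a nexthop object is referenced */
    struct nh_full_sketch nh_array[2]; /* used otherwise */
};

/* Array-only accessor: cannot serve routes that use ext_nh. */
static struct nh_full_sketch *route_nh(struct route_sketch *r, int i)
{
    if (r->ext_nh)
        return NULL;   /* sketch's choice; see the lead-in */
    return &r->nh_array[i];
}

/* Accessor that handles both storage forms and returns the common part. */
static struct nh_common_sketch *route_nhc(struct route_sketch *r, int i)
{
    if (r->ext_nh)
        return &r->ext_nh->common;
    return &r->nh_array[i].common;
}
```

      Code that only reads fields of the common part works unchanged for
      both storage forms once it goes through the second accessor.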
      
      This allows handling IPv4 routes with an external nexthop, and they now
      get offloaded as expected:
      
          $ ip route
          10.0.0.0/24 dev swp5 proto kernel scope link src 10.0.0.1 rt_trap
          10.0.1.0/24 nhid 20 via 10.0.0.2 dev swp5 offload rt_offload
      
      Fixes: 396b80cb ("net: marvell: prestera: Add neighbour cache accounting")
      Signed-off-by: Jonas Gorski <jonas.gorski@bisdn.de>
      Acked-by: Elad Nachman <enachman@marvell.com>
      Link: https://lore.kernel.org/r/20230804101220.247515-1-jonas.gorski@bisdn.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: core: remove unnecessary frame_sz check in bpf_xdp_adjust_tail() · d14eea09
      Andrew Kanner authored
      Syzkaller reported the following issue:
      =======================================
      Too BIG xdp->frame_sz = 131072
      WARNING: CPU: 0 PID: 5020 at net/core/filter.c:4121
        ____bpf_xdp_adjust_tail net/core/filter.c:4121 [inline]
      WARNING: CPU: 0 PID: 5020 at net/core/filter.c:4121
        bpf_xdp_adjust_tail+0x466/0xa10 net/core/filter.c:4103
      ...
      Call Trace:
       <TASK>
       bpf_prog_4add87e5301a4105+0x1a/0x1c
       __bpf_prog_run include/linux/filter.h:600 [inline]
       bpf_prog_run_xdp include/linux/filter.h:775 [inline]
       bpf_prog_run_generic_xdp+0x57e/0x11e0 net/core/dev.c:4721
       netif_receive_generic_xdp net/core/dev.c:4807 [inline]
       do_xdp_generic+0x35c/0x770 net/core/dev.c:4866
       tun_get_user+0x2340/0x3ca0 drivers/net/tun.c:1919
       tun_chr_write_iter+0xe8/0x210 drivers/net/tun.c:2043
       call_write_iter include/linux/fs.h:1871 [inline]
       new_sync_write fs/read_write.c:491 [inline]
       vfs_write+0x650/0xe40 fs/read_write.c:584
       ksys_write+0x12f/0x250 fs/read_write.c:637
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      The xdp->frame_sz > PAGE_SIZE check was introduced in commit c8741e2b
      ("xdp: Allow bpf_xdp_adjust_tail() to grow packet size"). But Jesper
      Dangaard Brouer <jbrouer@redhat.com> noted that since the introduction
      of xdp_init_buff(), which all XDP drivers use, it is safe to remove
      this check. The original intent was to catch cases where XDP drivers
      had not been updated to use xdp.frame_sz, but that is no longer a
      concern (since xdp_init_buff).
      
      Running the initial syzkaller repro it was discovered that the
      contiguous physical memory allocation is used for both xdp paths in
      tun_get_user(), e.g. tun_build_skb() and tun_alloc_skb(). It was also
      stated by Jesper Dangaard Brouer <jbrouer@redhat.com> that XDP can
      work on higher order pages, as long as this is contiguous physical
      memory (e.g. a page).
      
      Reported-and-tested-by: syzbot+f817490f5bd20541b90a@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/all/000000000000774b9205f1d8a80d@google.com/T/
      Link: https://syzkaller.appspot.com/bug?extid=f817490f5bd20541b90a
      Link: https://lore.kernel.org/all/20230725155403.796-1-andrew.kanner@gmail.com/T/
      Fixes: 43b5169d ("net, xdp: Introduce xdp_init_buff utility routine")
      Signed-off-by: Andrew Kanner <andrew.kanner@gmail.com>
      Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/r/20230803190316.2380231-1-andrew.kanner@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • drivers: net: prevent tun_build_skb() to exceed the packet size limit · 59eeb232
      Andrew Kanner authored
      Using the syzkaller repro with reduced packet size it was discovered
      that XDP_PACKET_HEADROOM is not checked in tun_can_build_skb(),
      although pad may be incremented in tun_build_skb(). This may end up
      with exceeding the PAGE_SIZE limit in tun_build_skb().
      
      Jason Wang <jasowang@redhat.com> proposed to count XDP_PACKET_HEADROOM
      always (i.e. without rcu_access_pointer(tun->xdp_prog)) in
      tun_can_build_skb() since there's a window during which an XDP program
      might be attached between tun_can_build_skb() and tun_build_skb().
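
      The effect of always counting the headroom can be shown with a
      small size check. This is a sketch, not the tun driver code: every
      SK_* constant and can_build_in_page() are hypothetical, and the
      real sizes depend on architecture and kernel configuration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative constants only; not the kernel's values. */
#define SK_PAGE_SIZE    4096u
#define SK_XDP_HEADROOM 256u
#define SK_RX_PAD       64u   /* hypothetical receive padding */
#define SK_SHINFO_SIZE  320u  /* hypothetical trailing metadata size */

/* Round x up to a multiple of 64 bytes. */
static size_t sk_align(size_t x)
{
    return (x + 63) & ~(size_t)63;
}

/*
 * Reserve the XDP headroom unconditionally, so the answer stays valid
 * even if a program is attached after this check has run.
 */
static bool can_build_in_page(size_t len)
{
    return sk_align(len + SK_RX_PAD + SK_XDP_HEADROOM) +
           sk_align(SK_SHINFO_SIZE) <= SK_PAGE_SIZE;
}
```

      With these example numbers the largest accepted length is 3456
      bytes. A check that skipped the headroom would accept longer
      packets that no longer fit once the headroom is added.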
      
      Fixes: 7df13219 ("tun: reserve extra headroom only when XDP is set")
      Link: https://syzkaller.appspot.com/bug?extid=f817490f5bd20541b90a
      Signed-off-by: Andrew Kanner <andrew.kanner@gmail.com>
      Link: https://lore.kernel.org/r/20230803185947.2379988-1-andrew.kanner@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 07 Aug, 2023 3 commits
    • Merge branch 'wireguard-fixes-for-6-5-rc6' · fa41884c
      Jakub Kicinski authored
      Jason A. Donenfeld says:
      
      ====================
      wireguard fixes for 6.5-rc6
      
      Just one patch this time, somewhat late in the cycle:
      
      1) Fix an off-by-one calculation for the maximum node depth size in the
         allowedips trie data structure, and also adjust the self-tests to hit
         this case so it doesn't regress again in the future.
      ====================
      
      Link: https://lore.kernel.org/r/20230807132146.2191597-1-Jason@zx2c4.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • wireguard: allowedips: expand maximum node depth · 46622219
      Jason A. Donenfeld authored
      In the allowedips self-test, nodes are inserted into the tree, but the
      test generated an even number of nodes. For checking the maximum node
      depth there is of course also the root node, which makes the total
      number necessarily odd. With too few nodes added, it never triggered
      the maximum depth check like it should have. So, add 129 nodes instead
      of 128 nodes, and do so with a more straightforward scheme, starting
      with all the bits set, and shifting over one each time. Then increase
      the maximum depth to 129, and choose a better name for that variable
      to make it clear that it represents depth as opposed to bits.
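
      One way to read "all the bits set, and shifting over one each time"
      is sketched below. It is not the self-test's code: prefix128 and
      make_prefixes() are hypothetical names, and a 128-bit address is
      modelled as two 64-bit halves. The loop yields 129 nested prefixes,
      from /128 down to /0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit prefix: hi holds the most significant 64 bits. */
struct prefix128 {
    uint64_t hi, lo;
    unsigned cidr;
};

/* Fill out[0..128]: all bits set at /128, then shift left one bit per step. */
static void make_prefixes(struct prefix128 out[129])
{
    uint64_t hi = UINT64_MAX, lo = UINT64_MAX;

    for (unsigned i = 0; i <= 128; i++) {
        out[i].hi = hi;
        out[i].lo = lo;
        out[i].cidr = 128 - i;
        /* 128-bit left shift by one: carry the top bit of lo into hi. */
        hi = (hi << 1) | (lo >> 63);
        lo <<= 1;
    }
}
```

      Each entry keeps exactly cidr leading one bits, so every prefix
      contains the previous one and the 129 entries form a single chain.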
      
      Cc: stable@vger.kernel.org
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Link: https://lore.kernel.org/r/20230807132146.2191597-2-Jason@zx2c4.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bonding: Fix incorrect deletion of ETH_P_8021AD protocol vid from slaves · 01f4fd27
      Ziyang Xuan authored
      BUG_ON(!vlan_info) is triggered in unregister_vlan_dev() with
      following testcase:
      
        # ip netns add ns1
        # ip netns exec ns1 ip link add bond0 type bond mode 0
        # ip netns exec ns1 ip link add bond_slave_1 type veth peer veth2
        # ip netns exec ns1 ip link set bond_slave_1 master bond0
        # ip netns exec ns1 ip link add link bond_slave_1 name vlan10 type vlan id 10 protocol 802.1ad
        # ip netns exec ns1 ip link add link bond0 name bond0_vlan10 type vlan id 10 protocol 802.1ad
        # ip netns exec ns1 ip link set bond_slave_1 nomaster
        # ip netns del ns1
      
      The logical analysis of the problem is as follows:
      
      1. create ETH_P_8021AD protocol vlan10 for bond_slave_1:
      register_vlan_dev()
        vlan_vid_add()
          vlan_info_alloc()
          __vlan_vid_add() // add [ETH_P_8021AD, 10] vid to bond_slave_1
      
      2. create ETH_P_8021AD protocol bond0_vlan10 for bond0:
      register_vlan_dev()
        vlan_vid_add()
          __vlan_vid_add()
            vlan_add_rx_filter_info()
                if (!vlan_hw_filter_capable(dev, proto)) // true because bond0 lacks NETIF_F_HW_VLAN_STAG_FILTER
                    return 0;

                if (netif_device_present(dev))
                    return dev->netdev_ops->ndo_vlan_rx_add_vid(dev, proto, vid); // never called
                    // So the slaves of bond0 never reference the [ETH_P_8021AD, 10] vid.
      
      3. detach bond_slave_1 from bond0:
      __bond_release_one()
        vlan_vids_del_by_dev()
          list_for_each_entry(vid_info, &vlan_info->vid_list, list)
              vlan_vid_del(dev, vid_info->proto, vid_info->vid);
              // bond_slave_1 [ETH_P_8021AD, 10] vid will be deleted.
              // bond_slave_1->vlan_info will be assigned NULL.
      
      4. delete vlan10 while deleting ns1:
      default_device_exit_batch()
        dev->rtnl_link_ops->dellink() // unregister_vlan_dev() for vlan10
          vlan_info = rtnl_dereference(real_dev->vlan_info); // real_dev of vlan10 is bond_slave_1
          BUG_ON(!vlan_info); // bond_slave_1->vlan_info is NULL now, so the bug is triggered
      
      Add support for S-VLAN tag related features to the bond driver, so that
      the bond driver always propagates the VLAN info to its slaves.
      
      Fixes: 8ad227ff ("net: vlan: add 802.1ad support")
      Suggested-by: Ido Schimmel <idosch@idosch.org>
      Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Link: https://lore.kernel.org/r/20230802114320.4156068-1-william.xuanziyang@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  3. 06 Aug, 2023 4 commits
    • ionic: Add missing err handling for queue reconfig · 52417a95
      Nitya Sunkad authored
      ionic_start_queues_reconfig returns an error code if txrx_init fails.
      Handle this error code in the relevant places.
      
      This fixes a corner case where the device could get left in a detached
      state if the CMB reconfig fails and the attempt to clean up the mess
      also fails. Note that calling netif_device_attach when the netdev is
      already attached does not lead to unexpected behavior.
      
      Change goto name "errout" to "err_out" to maintain consistency across
      goto statements.
      
      Fixes: 40bc471d ("ionic: add tx/rx-push support with device Component Memory Buffers")
      Fixes: 6f7d6f0f ("ionic: pull reset_queues into tx_timeout handler")
      Signed-off-by: Nitya Sunkad <nitya.sunkad@amd.com>
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drivers: vxlan: vnifilter: free percpu vni stats on error path · b1c936e9
      Fedor Pchelkin authored
      In case rhashtable_lookup_insert_fast() fails inside vxlan_vni_add(), the
      allocated percpu vni stats are not freed on the error path.
      
      Introduce vxlan_vni_free() which would work as a nice wrapper to free
      vxlan_vni_node resources properly.
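
      The pattern of a single free helper shared by the error path and
      the normal teardown can be sketched as follows. This is a
      user-space illustration with hypothetical names (vni_node_sketch,
      vni_node_free(), vni_add(), insert_ok(), insert_fail()); calloc()
      and free() stand in for the kernel allocators.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node that owns a separately allocated stats buffer. */
struct vni_node_sketch {
    unsigned vni;
    long *stats;
};

/* Single place that releases everything the node owns. */
static void vni_node_free(struct vni_node_sketch *n)
{
    if (!n)
        return;
    free(n->stats);
    free(n);
}

static struct vni_node_sketch *vni_node_alloc(unsigned vni)
{
    struct vni_node_sketch *n = calloc(1, sizeof(*n));

    if (!n)
        return NULL;
    n->stats = calloc(4, sizeof(*n->stats));
    if (!n->stats) {
        free(n);
        return NULL;
    }
    n->vni = vni;
    return n;
}

/* insert() is a caller-supplied hook that returns 0 on success. */
static struct vni_node_sketch *vni_add(unsigned vni,
                                       int (*insert)(struct vni_node_sketch *))
{
    struct vni_node_sketch *n = vni_node_alloc(vni);

    if (!n)
        return NULL;
    if (insert(n) != 0) {
        vni_node_free(n);  /* frees the stats too, so nothing leaks */
        return NULL;
    }
    return n;
}

/* Example hooks for illustration. */
static int insert_ok(struct vni_node_sketch *n) { (void)n; return 0; }
static int insert_fail(struct vni_node_sketch *n) { (void)n; return -1; }
```

      Routing every release through one helper means a newly added
      resource only has to be freed in one place.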
      
      Found by Linux Verification Center (linuxtesting.org).
      
      Fixes: 4095e0e1 ("drivers: vxlan: vnifilter: per vni stats")
      Suggested-by: Ido Schimmel <idosch@idosch.org>
      Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • macsec: use DEV_STATS_INC() · 32d0a49d
      Eric Dumazet authored
      syzbot/KCSAN reported data-races in macsec whenever dev->stats fields
      are updated.
      
      It appears all of these updates can happen from multiple CPUs.

      Adopt the SMP-safe DEV_STATS_INC() to update dev->stats fields.
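
      In user-space terms, the idea is to replace plain increments on
      shared counters with atomic ones. The sketch below uses C11 atomics
      and hypothetical names (dev_stats_sketch, STATS_INC, STATS_READ);
      it is an analogy, not the kernel macro.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical counters that several threads may bump at once. */
struct dev_stats_sketch {
    atomic_ulong rx_dropped;
    atomic_ulong tx_dropped;
};

/* Atomic increment: no update is lost when callers race. */
#define STATS_INC(s, field) \
    atomic_fetch_add_explicit(&(s)->field, 1UL, memory_order_relaxed)

/* Atomic read of the current counter value. */
#define STATS_READ(s, field) \
    atomic_load_explicit(&(s)->field, memory_order_relaxed)
```

      Relaxed ordering is enough here because the counters are pure
      statistics and do not order any other memory accesses. After three
      STATS_INC(&s, rx_dropped) calls on a zeroed struct,
      STATS_READ(&s, rx_dropped) yields 3.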
      
      Fixes: c09440f7 ("macsec: introduce IEEE 802.1AE driver")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: tls: avoid discarding data on record close · 6b47808f
      Jakub Kicinski authored
      TLS records end with a 16B tag. For TLS device offload we only
      need to make space for this tag in the stream, the device will
      generate and replace it with the actual calculated tag.
      
      A long time ago the code would just re-reference the head frag,
      which mostly worked but was suboptimal because it prevented TCP
      from combining the record into a single skb frag. I'm not sure
      if it was correct as the first frag may be shorter than the tag.

      The commit under Fixes tried to replace that with using the page
      frag and, if the allocation failed, rolling back the data if the
      record was long enough. It achieves better fragment coalescing but
      is also buggy.
      
      We don't roll back the iterator, so unless we're at the end of
      send we'll skip the data we designated as tag and start the
      next record as if the rollback never happened.
      There's also the possibility that the record was constructed
      with MSG_MORE and the data came from a different syscall and
      we already told the user space that we "got it".
      
      Allocate a single dummy page and use it as fallback.
      
      Found by code inspection, and proven by forcing allocation
      failures.
      
      Fixes: e7b159a4 ("net/tls: remove the record tail optimization")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 05 Aug, 2023 10 commits
    • dccp: fix data-race around dp->dccps_mss_cache · a47e598f
      Eric Dumazet authored
      dccp_sendmsg() reads dp->dccps_mss_cache before locking the socket.
      Same thing in do_dccp_getsockopt().
      
      Add READ_ONCE()/WRITE_ONCE() annotations,
      and change dccp_sendmsg() to check again dccps_mss_cache
      after socket is locked.
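
      The "check before the lock, then check again under the lock"
      pattern can be sketched in user space. This is an analogy with
      hypothetical names (sock_sketch, send_sketch()); an atomic load
      stands in for READ_ONCE(), a pthread mutex for the socket lock,
      and returning -EMSGSIZE is this sketch's choice.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical socket: mss_cache may change while the lock is not held. */
struct sock_sketch {
    pthread_mutex_t lock;
    atomic_uint mss_cache;
};

static int send_sketch(struct sock_sketch *sk, size_t len)
{
    /* Cheap early reject using one untorn read of the shared value. */
    if (len > atomic_load_explicit(&sk->mss_cache, memory_order_relaxed))
        return -EMSGSIZE;

    pthread_mutex_lock(&sk->lock);
    /* The value may have shrunk before the lock was taken: check again. */
    if (len > atomic_load_explicit(&sk->mss_cache, memory_order_relaxed)) {
        pthread_mutex_unlock(&sk->lock);
        return -EMSGSIZE;
    }
    /* ... a real sender would build and queue the packet here ... */
    pthread_mutex_unlock(&sk->lock);
    return 0;
}
```

      The first check only avoids needless work; the second one is the
      check that the rest of the locked section can rely on.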
      
      Fixes: 7c657876 ("[DCCP]: Initial implementation")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20230803163021.2958262-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'mptcp-more-fixes-for-v6-5' · fc2ea6ab
      Jakub Kicinski authored
      Matthieu Baerts says:
      
      ====================
      mptcp: more fixes for v6.5
      
      Here is a new batch of fixes related to MPTCP for v6.5 and older.
      
      Patches 1 and 2 fix issues with MPTCP Join selftest when manually
      launched with '-i' parameter to use 'ip mptcp' tool instead of the
      dedicated one (pm_nl_ctl). The issues have been there since v5.18.
      
      Thank you Andrea for your first contributions to MPTCP code in the
      upstream kernel!
      
      Patch 3 avoids corrupting the data stream when trying to reset
      connections that have fallen back to TCP. This can happen from v6.1.
      
      Patch 4 fixes a race when doing a disconnect() and an accept() in
      parallel on a listener socket. The issue only happens in rare cases,
      if the user is really unlucky, since a fix that landed in v6.3 but
      was backported up to v6.1.
      ====================
      
      Link: https://lore.kernel.org/r/20230803-upstream-net-20230803-misc-fixes-6-5-v1-0-6671b1ab11cc@tessares.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: fix disconnect vs accept race · 511b90e3
      Paolo Abeni authored
      Despite commit 0ad529d9 ("mptcp: fix possible divide by zero in
      recvmsg()"), the mptcp protocol is still prone to a race between
      disconnect() (or shutdown) and accept.
      
      The root cause is that the mentioned commit checks the msk-level
      flag, but mptcp_stream_accept() does not acquire the msk-level lock,
      as it can rely directly on the first subflow lock.

      As reported by Christoph, that can lead to a race where an msk
      socket is accepted after mptcp_subflow_queue_clean() releases
      the listener socket lock and just before it takes destructive
      actions, leading to the following splat:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000012
      PGD 5a4ca067 P4D 5a4ca067 PUD 37d4c067 PMD 0
      Oops: 0000 [#1] PREEMPT SMP
      CPU: 2 PID: 10955 Comm: syz-executor.5 Not tainted 6.5.0-rc1-gdc7b257ee5dd #37
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      RIP: 0010:mptcp_stream_accept+0x1ee/0x2f0 include/net/inet_sock.h:330
      Code: 0a 09 00 48 8b 1b 4c 39 e3 74 07 e8 bc 7c 7f fe eb a1 e8 b5 7c 7f fe 4c 8b 6c 24 08 eb 05 e8 a9 7c 7f fe 49 8b 85 d8 09 00 00 <0f> b6 40 12 88 44 24 07 0f b6 6c 24 07 bf 07 00 00 00 89 ee e8 89
      RSP: 0018:ffffc90000d07dc0 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: ffff888037e8d020 RCX: ffff88803b093300
      RDX: 0000000000000000 RSI: ffffffff833822c5 RDI: ffffffff8333896a
      RBP: 0000607f82031520 R08: ffff88803b093300 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000003e83 R12: ffff888037e8d020
      R13: ffff888037e8c680 R14: ffff888009af7900 R15: ffff888009af6880
      FS:  00007fc26d708640(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000012 CR3: 0000000066bc5001 CR4: 0000000000370ee0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       do_accept+0x1ae/0x260 net/socket.c:1872
       __sys_accept4+0x9b/0x110 net/socket.c:1913
       __do_sys_accept4 net/socket.c:1954 [inline]
       __se_sys_accept4 net/socket.c:1951 [inline]
       __x64_sys_accept4+0x20/0x30 net/socket.c:1951
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x47/0xa0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      
      Address the issue by temporarily removing the pending request sockets
      from the accept queue, so that a racing accept() can't touch them.

      After depleting the msk, the ssks still exist as plain TCP sockets;
      re-insert them into the accept queue, so that a later
      inet_csk_listen_stop() will complete the TCP socket disposal.
      
      Fixes: 2a6a870e ("mptcp: stops worker on unaccepted sockets at listener close")
      Cc: stable@vger.kernel.org
      Reported-by: Christoph Paasch <cpaasch@apple.com>
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/423
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Link: https://lore.kernel.org/r/20230803-upstream-net-20230803-misc-fixes-6-5-v1-4-6671b1ab11cc@tessares.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: avoid bogus reset on fallback close · ff18f9ef
      Paolo Abeni authored
      Since the blamed commit, the MPTCP protocol unconditionally sends
      TCP resets on all the subflows on disconnect().
      
      That fits full-blown MPTCP sockets - to implement the fastclose
      mechanism - but on fallback sockets it causes unexpected corruption of
      the data stream, caught as sporadic self-test failures.
      
      Fixes: d21f8348 ("mptcp: use fastclose on more edge scenarios")
      Cc: stable@vger.kernel.org
      Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/419
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Link: https://lore.kernel.org/r/20230803-upstream-net-20230803-misc-fixes-6-5-v1-3-6671b1ab11cc@tessares.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ff18f9ef
    • selftests: mptcp: join: fix 'implicit EP' test · c8c101ae
      Andrea Claudi authored
      mptcp_join 'implicit EP' test currently fails when using ip mptcp:
      
        $ ./mptcp_join.sh -iI
        <snip>
        001 implicit EP    creation[fail] expected '10.0.2.2 10.0.2.2 id 1 implicit' found '10.0.2.2 id 1 rawflags 10 '
        Error: too many addresses or duplicate one: -22.
                           ID change is prevented[fail] expected '10.0.2.2 10.0.2.2 id 1 implicit' found '10.0.2.2 id 1 rawflags 10 '
                           modif is allowed[fail] expected '10.0.2.2 10.0.2.2 id 1 signal' found '10.0.2.2 id 1 signal '
      
      This happens for two reasons:
      - iproute v6.3.0 does not support the implicit flag, fixed with
        iproute2-next commit 3a2535a41854 ("mptcp: add support for implicit
        flag")
      - pm_nl_check_endpoint wrongly expects the ip address to be repeated
        twice in iproute output, and does not account for trailing
        whitespace in it.
      
      Fix the issue by trimming the whitespace in the output string and
      removing the double address in the expected string.
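
The two adjustments described above (trim the output, list the address once in the expected string) can be illustrated with a small comparison helper. This is a sketch only: the selftest itself is a shell script, and `normalize` and `endpoint_matches` are hypothetical names used here for illustration. The sample strings are taken from the failure log quoted in this commit message.

```python
def normalize(line: str) -> str:
    """Strip trailing whitespace so a final blank does not break the match."""
    return line.rstrip()


def endpoint_matches(expected: str, found: str) -> bool:
    """Compare the expected endpoint description with the trimmed output."""
    return normalize(found) == expected


# Output as quoted in the log above: note the trailing space.
found = "10.0.2.2 id 1 signal "
# Old expectation: address repeated twice, so it can never match.
old_expected = "10.0.2.2 10.0.2.2 id 1 signal"
# Adjusted expectation: address listed once.
new_expected = "10.0.2.2 id 1 signal"
```

With these values, `endpoint_matches(new_expected, found)` holds, while a plain `found == new_expected` comparison would still fail because of the trailing space.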
      
      Fixes: 69c6ce7b ("selftests: mptcp: add implicit endpoint test case")
      Cc: stable@vger.kernel.org
      Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Link: https://lore.kernel.org/r/20230803-upstream-net-20230803-misc-fixes-6-5-v1-2-6671b1ab11cc@tessares.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c8c101ae
    • selftests: mptcp: join: fix 'delete and re-add' test · aaf2123a
      Andrea Claudi authored
      mptcp_join 'delete and re-add' test fails when using ip mptcp:
      
        $ ./mptcp_join.sh -iI
        <snip>
        002 delete and re-add                    before delete[ ok ]
                                                 mptcp_info subflows=1         [ ok ]
        Error: argument "ADDRESS" is wrong: invalid for non-zero id address
                                                 after delete[fail] got 2:2 subflows expected 1
      
      This happens because endpoint delete includes an ip address while id is
      not 0, contrary to what is indicated in the ip mptcp man page:
      
      "When used with the delete id operation, an IFADDR is only included when
      the ID is 0."
      
      Fix the issue by using the $addr variable in pm_nl_del_endpoint()
      only when id is 0.
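
The rule quoted from the man page can be sketched as an argument builder. This is an illustration rather than the selftest code, which is a shell function: `endpoint_delete_args` is a hypothetical helper, and the command shape `ip mptcp endpoint delete id <id> [<addr>]` is assumed from the man page sentence quoted above.

```python
def endpoint_delete_args(endpoint_id: int, addr: str = "") -> list:
    """Build the argument list for deleting an endpoint by id.

    The address is appended only when the id is 0, following the
    man page rule quoted in the commit message.
    """
    args = ["ip", "mptcp", "endpoint", "delete", "id", str(endpoint_id)]
    if endpoint_id == 0 and addr:
        args.append(addr)
    return args
```

For a non-zero id the address is silently dropped, which avoids the 'invalid for non-zero id address' error shown in the log; for id 0 the address is passed through.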
      
      Fixes: 34aa6e3b ("selftests: mptcp: add ip mptcp wrappers")
      Cc: stable@vger.kernel.org
      Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Link: https://lore.kernel.org/r/20230803-upstream-net-20230803-misc-fixes-6-5-v1-1-6671b1ab11cc@tessares.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      aaf2123a
    • Merge branch 'tunnels-fix-ipv4-pmtu-icmp-checksum' · ec935188
      Jakub Kicinski authored
      Florian Westphal says:
      
      ====================
      tunnels: fix ipv4 pmtu icmp checksum
      
      The checksum of the generated ipv4 icmp pmtud message is
      only correct if the skb that causes the icmp error generation
      is linear.
      
      Fix this and add a selftest for it.
      ====================
      
      Link: https://lore.kernel.org/r/20230803152653.29535-1-fw@strlen.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ec935188
    • selftests: net: test vxlan pmtu exceptions with tcp · 136a1b43
      Florian Westphal authored
      TCP might get stuck if a nonlinear skb exceeds the path MTU, because
      the icmp error contains an incorrect icmp checksum in that case.
      
      Extend the existing test for vxlan to also send at least 1MB worth of
      data via TCP in addition to the existing 'large icmp packet adds
      route exception'.
      
      On my test VM this fails due to 0-size output file without
      "tunnels: fix kasan splat when generating ipv4 pmtu error".
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Link: https://lore.kernel.org/r/20230803152653.29535-3-fw@strlen.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      136a1b43
    • tunnels: fix kasan splat when generating ipv4 pmtu error · 6a7ac3d2
      Florian Westphal authored
      If we try to emit an icmp error in response to a nonlinear skb, we get:
      
      BUG: KASAN: slab-out-of-bounds in ip_compute_csum+0x134/0x220
      Read of size 4 at addr ffff88811c50db00 by task iperf3/1691
      CPU: 2 PID: 1691 Comm: iperf3 Not tainted 6.5.0-rc3+ #309
      [..]
       kasan_report+0x105/0x140
       ip_compute_csum+0x134/0x220
       iptunnel_pmtud_build_icmp+0x554/0x1020
       skb_tunnel_check_pmtu+0x513/0xb80
       vxlan_xmit_one+0x139e/0x2ef0
       vxlan_xmit+0x1867/0x2760
       dev_hard_start_xmit+0x1ee/0x4f0
       br_dev_queue_push_xmit+0x4d1/0x660
       [..]
      
      ip_compute_csum() cannot deal with nonlinear skbs, so avoid it.
      After this change, the splat is gone and iperf3 is no longer stuck.
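
Why a checksum routine that assumes one contiguous buffer breaks on nonlinear data can be shown with a toy model. This is a sketch under stated assumptions and not the kernel fix: the packet is modelled as a `head` byte string plus a list of fragment byte strings, and `inet_checksum`, `checksum_linear_only` and `checksum_all_fragments` are hypothetical helpers. The sum itself follows the one's-complement Internet checksum of RFC 1071.

```python
def inet_checksum(data: bytes) -> int:
    """One's-complement Internet checksum over a contiguous buffer."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def checksum_linear_only(head: bytes, length: int) -> int:
    """Only valid when all `length` bytes live in the linear area."""
    if length > len(head):
        # In this toy model we refuse; reading past the head is the
        # kind of out-of-bounds access the report above describes.
        raise ValueError("length exceeds the linear area")
    return inet_checksum(head[:length])


def checksum_all_fragments(head: bytes, frags: list, length: int) -> int:
    """Gather the head and every fragment before summing."""
    payload = b"".join([head, *frags])[:length]
    return inet_checksum(payload)
```

In the toy model the linear-only helper raises as soon as the requested length reaches into the fragments, whereas the fragment-aware helper yields the same value as checksumming the equivalent contiguous buffer.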
      
      Fixes: 4cb47a86 ("tunnels: PMTU discovery support for directly bridged IP packets")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Link: https://lore.kernel.org/r/20230803152653.29535-2-fw@strlen.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6a7ac3d2
    • net/packet: annotate data-races around tp->status · 8a989617
      Eric Dumazet authored
      Another syzbot report [1] is about tp->status lockless reads
      from __packet_get_status().
      
      [1]
      BUG: KCSAN: data-race in __packet_rcv_has_room / __packet_set_status
      
      write to 0xffff888117d7c080 of 8 bytes by interrupt on cpu 0:
      __packet_set_status+0x78/0xa0 net/packet/af_packet.c:407
      tpacket_rcv+0x18bb/0x1a60 net/packet/af_packet.c:2483
      deliver_skb net/core/dev.c:2173 [inline]
      __netif_receive_skb_core+0x408/0x1e80 net/core/dev.c:5337
      __netif_receive_skb_one_core net/core/dev.c:5491 [inline]
      __netif_receive_skb+0x57/0x1b0 net/core/dev.c:5607
      process_backlog+0x21f/0x380 net/core/dev.c:5935
      __napi_poll+0x60/0x3b0 net/core/dev.c:6498
      napi_poll net/core/dev.c:6565 [inline]
      net_rx_action+0x32b/0x750 net/core/dev.c:6698
      __do_softirq+0xc1/0x265 kernel/softirq.c:571
      invoke_softirq kernel/softirq.c:445 [inline]
      __irq_exit_rcu+0x57/0xa0 kernel/softirq.c:650
      sysvec_apic_timer_interrupt+0x6d/0x80 arch/x86/kernel/apic/apic.c:1106
      asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:645
      smpboot_thread_fn+0x33c/0x4a0 kernel/smpboot.c:112
      kthread+0x1d7/0x210 kernel/kthread.c:379
      ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
      
      read to 0xffff888117d7c080 of 8 bytes by interrupt on cpu 1:
      __packet_get_status net/packet/af_packet.c:436 [inline]
      packet_lookup_frame net/packet/af_packet.c:524 [inline]
      __tpacket_has_room net/packet/af_packet.c:1255 [inline]
      __packet_rcv_has_room+0x3f9/0x450 net/packet/af_packet.c:1298
      tpacket_rcv+0x275/0x1a60 net/packet/af_packet.c:2285
      deliver_skb net/core/dev.c:2173 [inline]
      dev_queue_xmit_nit+0x38a/0x5e0 net/core/dev.c:2243
      xmit_one net/core/dev.c:3574 [inline]
      dev_hard_start_xmit+0xcf/0x3f0 net/core/dev.c:3594
      __dev_queue_xmit+0xefb/0x1d10 net/core/dev.c:4244
      dev_queue_xmit include/linux/netdevice.h:3088 [inline]
      can_send+0x4eb/0x5d0 net/can/af_can.c:276
      bcm_can_tx+0x314/0x410 net/can/bcm.c:302
      bcm_tx_timeout_handler+0xdb/0x260
      __run_hrtimer kernel/time/hrtimer.c:1685 [inline]
      __hrtimer_run_queues+0x217/0x700 kernel/time/hrtimer.c:1749
      hrtimer_run_softirq+0xd6/0x120 kernel/time/hrtimer.c:1766
      __do_softirq+0xc1/0x265 kernel/softirq.c:571
      run_ksoftirqd+0x17/0x20 kernel/softirq.c:939
      smpboot_thread_fn+0x30a/0x4a0 kernel/smpboot.c:164
      kthread+0x1d7/0x210 kernel/kthread.c:379
      ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
      
      value changed: 0x0000000000000000 -> 0x0000000020000081
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 19 Comm: ksoftirqd/1 Not tainted 6.4.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
      
      Fixes: 69e3c75f ("net: TX_RING and packet mmap")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/r/20230803145600.2937518-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8a989617
  5. 04 Aug, 2023 6 commits
  6. 03 Aug, 2023 8 commits