1. 25 May, 2023 3 commits
    • net/mlx5e: Consider internal buffers size in port buffer calculations · 81fe2be0
      Maher Sanalla authored
      Currently, when a user triggers a change in port buffer headroom
      (buffers 0-7), the driver checks that the requested headroom does
      not exceed the total port buffer size. However, this check does not
      take into account the internal buffers (buffers 8-9), which are also
      part of the total port buffer. This can result in treating invalid port
      buffer change requests as valid, causing unintended changes to the shared
      buffer.
      
      To address this, include the size of the internal buffers in the
      calculation of available port buffer space, which ensures that port
      buffer requests do not exceed the correct limit.
      
      Furthermore, exclude the size of the internal buffers (8-9) from the
      total_size calculation, as these buffers are reserved for internal use
      and are not exposed to the user.
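      The corrected limit can be illustrated with a small model: user headroom
      (buffers 0-7) may only use what is left of the port buffer after the
      internal buffers (8-9) are accounted for. This is a minimal sketch under
      stated assumptions; `struct port_buffer_model` and both helper functions
      are hypothetical and do not mirror the mlx5 driver's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model (not mlx5 code): 8 user-visible buffers (0-7)
 * plus 2 internal buffers (8-9) share one physical port buffer. */
#define NUM_USER_BUFS     8
#define NUM_INTERNAL_BUFS 2

struct port_buffer_model {
	uint32_t port_buffer_size;                 /* whole physical buffer */
	uint32_t internal_size[NUM_INTERNAL_BUFS]; /* sizes of buffers 8-9 */
};

/* Space user headroom may occupy: total minus the internal buffers. */
static uint32_t available_for_headroom(const struct port_buffer_model *pb)
{
	uint32_t internal = 0;

	for (int i = 0; i < NUM_INTERNAL_BUFS; i++)
		internal += pb->internal_size[i];
	return internal >= pb->port_buffer_size ?
	       0 : pb->port_buffer_size - internal;
}

/* True if the requested per-buffer headroom fits in the available space. */
static bool headroom_request_valid(const struct port_buffer_model *pb,
				   const uint32_t req[NUM_USER_BUFS])
{
	uint64_t total = 0;

	for (int i = 0; i < NUM_USER_BUFS; i++)
		total += req[i];
	return total <= available_for_headroom(pb);
}
```

      With a 1000-unit buffer and two 100-unit internal buffers, a request
      totalling 850 is rejected even though it is below the raw 1000-unit
      total, which is exactly the case the old check let through.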
      
      While at it, make the debug prints in mlx5e_port_query_buffer() more
      verbose to ease future debugging.
      
      Fixes: ecdf2dad ("net/mlx5e: Receive buffer support for DCBX")
      Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Prevent encap offload when neigh update is running · 37c3b9fa
      Chris Mi authored
      The cited commit adds a completion to remove the dependency on the rtnl
      lock, but it causes a deadlock when a flow has multiple encapsulations:
      
       crash> bt ffff8aece8a64000
       PID: 1514557  TASK: ffff8aece8a64000  CPU: 3    COMMAND: "tc"
        #0 [ffffa6d14183f368] __schedule at ffffffffb8ba7f45
        #1 [ffffa6d14183f3f8] schedule at ffffffffb8ba8418
        #2 [ffffa6d14183f418] schedule_preempt_disabled at ffffffffb8ba8898
        #3 [ffffa6d14183f428] __mutex_lock at ffffffffb8baa7f8
        #4 [ffffa6d14183f4d0] mutex_lock_nested at ffffffffb8baabeb
        #5 [ffffa6d14183f4e0] mlx5e_attach_encap at ffffffffc0f48c17 [mlx5_core]
        #6 [ffffa6d14183f628] mlx5e_tc_add_fdb_flow at ffffffffc0f39680 [mlx5_core]
        #7 [ffffa6d14183f688] __mlx5e_add_fdb_flow at ffffffffc0f3b636 [mlx5_core]
        #8 [ffffa6d14183f6f0] mlx5e_tc_add_flow at ffffffffc0f3bcdf [mlx5_core]
        #9 [ffffa6d14183f728] mlx5e_configure_flower at ffffffffc0f3c1d1 [mlx5_core]
       #10 [ffffa6d14183f790] mlx5e_rep_setup_tc_cls_flower at ffffffffc0f3d529 [mlx5_core]
       #11 [ffffa6d14183f7a0] mlx5e_rep_setup_tc_cb at ffffffffc0f3d714 [mlx5_core]
       #12 [ffffa6d14183f7b0] tc_setup_cb_add at ffffffffb8931bb8
       #13 [ffffa6d14183f810] fl_hw_replace_filter at ffffffffc0dae901 [cls_flower]
       #14 [ffffa6d14183f8d8] fl_change at ffffffffc0db5c57 [cls_flower]
       #15 [ffffa6d14183f970] tc_new_tfilter at ffffffffb8936047
       #16 [ffffa6d14183fac8] rtnetlink_rcv_msg at ffffffffb88c7c31
       #17 [ffffa6d14183fb50] netlink_rcv_skb at ffffffffb8942853
       #18 [ffffa6d14183fbc0] rtnetlink_rcv at ffffffffb88c1835
       #19 [ffffa6d14183fbd0] netlink_unicast at ffffffffb8941f27
       #20 [ffffa6d14183fc18] netlink_sendmsg at ffffffffb8942245
       #21 [ffffa6d14183fc98] sock_sendmsg at ffffffffb887d482
       #22 [ffffa6d14183fcb8] ____sys_sendmsg at ffffffffb887d81a
       #23 [ffffa6d14183fd38] ___sys_sendmsg at ffffffffb88806e2
       #24 [ffffa6d14183fe90] __sys_sendmsg at ffffffffb88807a2
       #25 [ffffa6d14183ff28] __x64_sys_sendmsg at ffffffffb888080f
       #26 [ffffa6d14183ff38] do_syscall_64 at ffffffffb8b9b6a8
       #27 [ffffa6d14183ff50] entry_SYSCALL_64_after_hwframe at ffffffffb8c0007c
       crash> bt 0xffff8aeb07544000
       PID: 1110766  TASK: ffff8aeb07544000  CPU: 0    COMMAND: "kworker/u20:9"
        #0 [ffffa6d14e6b7bd8] __schedule at ffffffffb8ba7f45
        #1 [ffffa6d14e6b7c68] schedule at ffffffffb8ba8418
        #2 [ffffa6d14e6b7c88] schedule_timeout at ffffffffb8baef88
        #3 [ffffa6d14e6b7d10] wait_for_completion at ffffffffb8ba968b
        #4 [ffffa6d14e6b7d60] mlx5e_take_all_encap_flows at ffffffffc0f47ec4 [mlx5_core]
        #5 [ffffa6d14e6b7da0] mlx5e_rep_update_flows at ffffffffc0f3e734 [mlx5_core]
        #6 [ffffa6d14e6b7df8] mlx5e_rep_neigh_update at ffffffffc0f400bb [mlx5_core]
        #7 [ffffa6d14e6b7e50] process_one_work at ffffffffb80acc9c
        #8 [ffffa6d14e6b7ed0] worker_thread at ffffffffb80ad012
        #9 [ffffa6d14e6b7f10] kthread at ffffffffb80b615d
       #10 [ffffa6d14e6b7f50] ret_from_fork at ffffffffb8001b2f
      
      After the first encap is attached, the flow is added to the encap
      entry's flows list. If a neigh update is running at this time, the
      subsequent encaps of the flow cannot take encap_tbl_lock and sleep.
      If the neigh update thread is waiting for that flow's init_done, a
      deadlock happens.
      
      Fix it by holding the lock outside of the for loop. If a neigh update
      is running, prevent encap flows from being offloaded. Since the lock is
      held outside of the for loop, concurrent creation of encap entries is
      not allowed, so remove the unnecessary wait_for_completion() call for
      res_ready.
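      The locking pattern described above (take the table lock once, outside
      the per-encap loop, and bail out instead of sleeping mid-loop when a
      neigh update is in progress) can be modelled in user space. This is a
      minimal sketch, assuming a pthread mutex in place of the kernel mutex;
      `struct encap_table`, `attach_all_encaps()` and the choice of -EAGAIN
      as the refusal code are hypothetical, not the driver's code.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model: one lock protects all encap entries, and a flag
 * records that a neighbour update is currently walking them. */
struct encap_table {
	pthread_mutex_t lock;
	bool neigh_update_running;
	int attached;                 /* encap entries attached so far */
};

/* Attach all n encaps of a flow under a single lock acquisition. If a
 * neigh update is running, refuse the offload up front rather than
 * blocking on the lock between iterations. */
static int attach_all_encaps(struct encap_table *t, int n)
{
	int err = 0;

	pthread_mutex_lock(&t->lock);         /* held outside the loop */
	if (t->neigh_update_running) {
		err = -EAGAIN;                /* illustrative error code */
		goto out;
	}
	for (int i = 0; i < n; i++)
		t->attached++;                /* stand-in for per-encap work */
out:
	pthread_mutex_unlock(&t->lock);
	return err;
}
```

      Because the whole loop runs under one acquisition, no flow can be left
      half-attached while the neigh update waits on it.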
      
      Fixes: 95435ad7 ("net/mlx5e: Only access fully initialized flows in neigh update")
      Signed-off-by: Chris Mi <cmi@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Extract remaining tunnel encap code to dedicated file · e2ab5aa1
      Chris Mi authored
      Move set_encap_dests() and clean_encap_dests() to the dedicated tunnel
      encap file, and rename them to mlx5e_tc_tun_encap_dests_set() and
      mlx5e_tc_tun_encap_dests_unset().
      
      No functional change in this patch; it is needed by the next patch.

      Signed-off-by: Chris Mi <cmi@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
  2. 24 May, 2023 7 commits
  3. 23 May, 2023 21 commits
    • ipv{4,6}/raw: fix output xfrm lookup wrt protocol · 3632679d
      Nicolas Dichtel authored
      With a raw socket bound to IPPROTO_RAW (i.e. with hdrincl enabled), the
      protocol field of the flow structure built by raw_sendmsg() /
      rawv6_sendmsg() is set to IPPROTO_RAW. This breaks the IPsec policy
      lookup when some policies are defined with a protocol in the selector.
      
      For IPv6, the sin6_port field of 'struct sockaddr_in6' can be used to
      specify the protocol. Just accept all values for IPPROTO_RAW sockets.
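      The IPv6 side can be illustrated from both ends: userland puts the
      upper-layer protocol into sin6_port, and the kernel-side check accepts
      any value on an IPPROTO_RAW socket. This is a minimal sketch under
      stated assumptions; `fill_raw6_dest()` and `raw6_flow_proto()` are
      hypothetical helpers, and the second one is a simplified model of the
      rule described in the commit text, not the kernel source. It assumes
      the protocol is carried in network byte order.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Userland: destination for an IPv6 raw socket, with the protocol
 * carried in sin6_port (network byte order assumed). */
static void fill_raw6_dest(struct sockaddr_in6 *sa,
			   const struct in6_addr *dst, int proto)
{
	memset(sa, 0, sizeof(*sa));
	sa->sin6_family = AF_INET6;
	sa->sin6_addr = *dst;
	sa->sin6_port = htons((uint16_t)proto);
}

/* Simplified model of the check: 0 means "use the socket protocol";
 * an IPPROTO_RAW socket accepts any value; others must match. */
static int raw6_flow_proto(int sock_proto, const struct sockaddr_in6 *sa)
{
	int proto = ntohs(sa->sin6_port);

	if (proto == 0)
		return sock_proto;
	if (sock_proto == IPPROTO_RAW || proto == sock_proto)
		return proto;
	return -EINVAL;
}
```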
      
      For IPv4, the sin_port field of 'struct sockaddr_in' cannot be used
      without breaking backward compatibility (the value of this field was
      never checked). Add a new kind of control message so that userland can
      specify which protocol is used.
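      On the IPv4 side, userland attaches the protocol as an IPPROTO_IP-level
      ancillary message on sendmsg(). The commit text does not name the new
      control message type, so this minimal sketch takes the cmsg type as a
      parameter rather than assuming a constant; `add_proto_cmsg()` is a
      hypothetical helper built only on the standard CMSG_* macros.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Attach one IPPROTO_IP-level control message carrying an int protocol
 * number to msg. cmsg_type is a parameter: the constant introduced by
 * the patch is not assumed here. Returns -1 if buf is too small. */
static int add_proto_cmsg(struct msghdr *msg, void *buf, size_t buflen,
			  int cmsg_type, int proto)
{
	struct cmsghdr *cm;

	if (buflen < CMSG_SPACE(sizeof(int)))
		return -1;
	memset(buf, 0, buflen);
	msg->msg_control = buf;
	msg->msg_controllen = CMSG_SPACE(sizeof(int));
	cm = CMSG_FIRSTHDR(msg);
	cm->cmsg_level = IPPROTO_IP;
	cm->cmsg_type = cmsg_type;
	cm->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cm), &proto, sizeof(proto));
	return 0;
}
```

      The caller should pass a buffer aligned for struct cmsghdr, for example
      a union of a char array and a struct cmsghdr, before handing msg to
      sendmsg() on the raw socket.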
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      CC: stable@vger.kernel.org
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Link: https://lore.kernel.org/r/20230522120820.1319391-1-nicolas.dichtel@6wind.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • lan966x: Fix unloading/loading of the driver · 60076124
      Horatiu Vultur authored
      It was noticed that after unloading/loading the driver and sending
      traffic through the switch, the switch would stop working after a
      while. It would stop forwarding any traffic, and the only way to
      recover was to power-cycle the board. The root cause seems to be that
      the switch core is initialized twice. Apparently, initializing the
      switch core twice disturbs the pointers in the HW queue system, so
      after a while it stops sending traffic.
      Unfortunately, it is not possible to reset the switch here, because
      the reset line is connected to multiple devices such as MDIO, SGPIO,
      FAN, etc., so all of those devices would be reset when the network
      driver is loaded.
      So the fix is to check whether the core is already initialized and, if
      so, not initialize it again.
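      The fix amounts to a guard at the top of the core init routine. This is
      a minimal model under stated assumptions: the switch is represented by
      a hypothetical `struct switch_core` whose status bit survives a driver
      reload (no reset is issued). The real lan966x register names and the
      way the driver detects prior initialization are not given in the
      commit text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register model; not lan966x definitions. */
struct switch_core {
	uint32_t status;   /* bit 0: core already initialized */
	int init_count;    /* how many times the init sequence ran */
};

#define CORE_INIT_DONE 0x1u

static bool core_is_initialized(const struct switch_core *c)
{
	return c->status & CORE_INIT_DONE;
}

/* Called on every probe; the init sequence runs only once. */
static void switch_core_init(struct switch_core *c)
{
	if (core_is_initialized(c))
		return;            /* set up by a previous driver load */
	c->init_count++;           /* stand-in for the real init sequence */
	c->status |= CORE_INIT_DONE;
}
```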
      
      Fixes: db8bcaad ("net: lan966x: add the basic lan966x driver")
      Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/20230522120038.3749026-1-horatiu.vultur@microchip.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net/mlx5: Fix indexing of mlx5_irq · 1da438c0
      Shay Drory authored
      After the cited patch, the mlx5_irq xarray index can be different from
      the mlx5_irq MSI-X table index.
      Fix it by storing both the mlx5_irq xarray index and the MSI-X table
      index.
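      The point of the fix is that one number can no longer serve both
      purposes once the two index spaces diverge. This minimal sketch models
      it with a fixed-size lookup table standing in for the xarray; all names
      (`struct irq_model`, `pool_insert()`, `pool_remove()`) are hypothetical
      and not the mlx5 implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: the lookup-table slot and the MSI-X vector are
 * independent numbers, so both are stored. */
struct irq_model {
	int pool_index;   /* slot in the lookup table (xarray stand-in) */
	int msix_index;   /* vector index in the MSI-X table */
};

#define POOL_SLOTS 8

struct irq_pool_model {
	struct irq_model *slots[POOL_SLOTS];
};

static int pool_insert(struct irq_pool_model *p, struct irq_model *irq,
		       int slot, int msix)
{
	if (slot < 0 || slot >= POOL_SLOTS || p->slots[slot])
		return -1;
	irq->pool_index = slot;
	irq->msix_index = msix;
	p->slots[slot] = irq;
	return 0;
}

/* Removal goes by table slot, never by MSI-X vector. */
static struct irq_model *pool_remove(struct irq_pool_model *p,
				     struct irq_model *irq)
{
	struct irq_model *found = p->slots[irq->pool_index];

	p->slots[irq->pool_index] = NULL;
	return found;
}
```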
      
      Fixes: 3354822c ("net/mlx5: Use dynamic msix vectors allocation")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Eli Cohen <elic@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Fix irq affinity management · ef8c063c
      Shay Drory authored
      The cited patch prevents the user from changing the affinity of mlx5
      irqs, which breaks backward compatibility.
      Hence, allow the user to change the affinity of mlx5 irqs.
      
      Fixes: bbac70c7 ("net/mlx5: Use newer affinity descriptor")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Eli Cohen <elic@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Free irqs only on shutdown callback · 9c2d0801
      Shay Drory authored
      Whenever a shutdown is invoked, free only the irqs and keep the mlx5_irq
      synthetic wrapper intact in order to avoid a use-after-free on system
      shutdown.
      
      For example:
      ==================================================================
      BUG: KASAN: use-after-free in _find_first_bit+0x66/0x80
      Read of size 8 at addr ffff88823fc0d318 by task kworker/u192:0/13608
      
      CPU: 25 PID: 13608 Comm: kworker/u192:0 Tainted: G    B   W  O  6.1.21-cloudflare-kasan-2023.3.21 #1
      Hardware name: GIGABYTE R162-R2-GEN0/MZ12-HD2-CD, BIOS R14 05/03/2021
      Workqueue: mlx5e mlx5e_tx_timeout_work [mlx5_core]
      Call Trace:
        <TASK>
        dump_stack_lvl+0x34/0x48
        print_report+0x170/0x473
        ? _find_first_bit+0x66/0x80
        kasan_report+0xad/0x130
        ? _find_first_bit+0x66/0x80
        _find_first_bit+0x66/0x80
        mlx5e_open_channels+0x3c5/0x3a10 [mlx5_core]
        ? console_unlock+0x2fa/0x430
        ? _raw_spin_lock_irqsave+0x8d/0xf0
        ? _raw_spin_unlock_irqrestore+0x42/0x80
        ? preempt_count_add+0x7d/0x150
        ? __wake_up_klogd.part.0+0x7d/0xc0
        ? vprintk_emit+0xfe/0x2c0
        ? mlx5e_trigger_napi_sched+0x40/0x40 [mlx5_core]
        ? dev_attr_show.cold+0x35/0x35
        ? devlink_health_do_dump.part.0+0x174/0x340
        ? devlink_health_report+0x504/0x810
        ? mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
        ? mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
        ? process_one_work+0x680/0x1050
        mlx5e_safe_switch_params+0x156/0x220 [mlx5_core]
        ? mlx5e_switch_priv_channels+0x310/0x310 [mlx5_core]
        ? mlx5_eq_poll_irq_disabled+0xb6/0x100 [mlx5_core]
        mlx5e_tx_reporter_timeout_recover+0x123/0x240 [mlx5_core]
        ? __mutex_unlock_slowpath.constprop.0+0x2b0/0x2b0
        devlink_health_reporter_recover+0xa6/0x1f0
        devlink_health_report+0x2f7/0x810
        ? vsnprintf+0x854/0x15e0
        mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
        ? mlx5e_reporter_tx_err_cqe+0x1a0/0x1a0 [mlx5_core]
        ? mlx5e_tx_reporter_timeout_dump+0x50/0x50 [mlx5_core]
        ? mlx5e_tx_reporter_dump_sq+0x260/0x260 [mlx5_core]
        ? newidle_balance+0x9b7/0xe30
        ? psi_group_change+0x6a7/0xb80
        ? mutex_lock+0x96/0xf0
        ? __mutex_lock_slowpath+0x10/0x10
        mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
        process_one_work+0x680/0x1050
        worker_thread+0x5a0/0xeb0
        ? process_one_work+0x1050/0x1050
        kthread+0x2a2/0x340
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x22/0x30
        </TASK>
      
      Freed by task 1:
        kasan_save_stack+0x23/0x50
        kasan_set_track+0x21/0x30
        kasan_save_free_info+0x2a/0x40
        ____kasan_slab_free+0x169/0x1d0
        slab_free_freelist_hook+0xd2/0x190
        __kmem_cache_free+0x1a1/0x2f0
        irq_pool_free+0x138/0x200 [mlx5_core]
        mlx5_irq_table_destroy+0xf6/0x170 [mlx5_core]
        mlx5_core_eq_free_irqs+0x74/0xf0 [mlx5_core]
        shutdown+0x194/0x1aa [mlx5_core]
        pci_device_shutdown+0x75/0x120
        device_shutdown+0x35c/0x620
        kernel_restart+0x60/0xa0
        __do_sys_reboot+0x1cb/0x2c0
        do_syscall_64+0x3b/0x90
        entry_SYSCALL_64_after_hwframe+0x4b/0xb5
      
      The buggy address belongs to the object at ffff88823fc0d300
        which belongs to the cache kmalloc-192 of size 192
      The buggy address is located 24 bytes inside of
        192-byte region [ffff88823fc0d300, ffff88823fc0d3c0)
      
      The buggy address belongs to the physical page:
      page:0000000010139587 refcount:1 mapcount:0 mapping:0000000000000000
      index:0x0 pfn:0x23fc0c
      head:0000000010139587 order:1 compound_mapcount:0 compound_pincount:0
      flags: 0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff)
      raw: 002ffff800010200 0000000000000000 dead000000000122 ffff88810004ca00
      raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
        ffff88823fc0d200: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff88823fc0d280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
       >ffff88823fc0d300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                   ^
        ffff88823fc0d380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
        ffff88823fc0d400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      ==================================================================
      general protection fault, probably for non-canonical address
      0xdffffc005c40d7ac: 0000 [#1] PREEMPT SMP KASAN NOPTI
      KASAN: probably user-memory-access in range [0x00000002e206bd60-0x00000002e206bd67]
      CPU: 25 PID: 13608 Comm: kworker/u192:0 Tainted: G    B   W  O  6.1.21-cloudflare-kasan-2023.3.21 #1
      Hardware name: GIGABYTE R162-R2-GEN0/MZ12-HD2-CD, BIOS R14 05/03/2021
      Workqueue: mlx5e mlx5e_tx_timeout_work [mlx5_core]
      RIP: 0010:__alloc_pages+0x141/0x5c0
      Call Trace:
        <TASK>
        ? sysvec_apic_timer_interrupt+0xa0/0xc0
        ? asm_sysvec_apic_timer_interrupt+0x16/0x20
        ? __alloc_pages_slowpath.constprop.0+0x1ec0/0x1ec0
        ? _raw_spin_unlock_irqrestore+0x3d/0x80
        __kmalloc_large_node+0x80/0x120
        ? kvmalloc_node+0x4e/0x170
        __kmalloc_node+0xd4/0x150
        kvmalloc_node+0x4e/0x170
        mlx5e_open_channels+0x631/0x3a10 [mlx5_core]
        ? console_unlock+0x2fa/0x430
        ? _raw_spin_lock_irqsave+0x8d/0xf0
        ? _raw_spin_unlock_irqrestore+0x42/0x80
        ? preempt_count_add+0x7d/0x150
        ? __wake_up_klogd.part.0+0x7d/0xc0
        ? vprintk_emit+0xfe/0x2c0
        ? mlx5e_trigger_napi_sched+0x40/0x40 [mlx5_core]
        ? dev_attr_show.cold+0x35/0x35
        ? devlink_health_do_dump.part.0+0x174/0x340
        ? devlink_health_report+0x504/0x810
        ? mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
        ? mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
        ? process_one_work+0x680/0x1050
        mlx5e_safe_switch_params+0x156/0x220 [mlx5_core]
        ? mlx5e_switch_priv_channels+0x310/0x310 [mlx5_core]
        ? mlx5_eq_poll_irq_disabled+0xb6/0x100 [mlx5_core]
        mlx5e_tx_reporter_timeout_recover+0x123/0x240 [mlx5_core]
        ? __mutex_unlock_slowpath.constprop.0+0x2b0/0x2b0
        devlink_health_reporter_recover+0xa6/0x1f0
        devlink_health_report+0x2f7/0x810
        ? vsnprintf+0x854/0x15e0
        mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
        ? mlx5e_reporter_tx_err_cqe+0x1a0/0x1a0 [mlx5_core]
        ? mlx5e_tx_reporter_timeout_dump+0x50/0x50 [mlx5_core]
        ? mlx5e_tx_reporter_dump_sq+0x260/0x260 [mlx5_core]
        ? newidle_balance+0x9b7/0xe30
        ? psi_group_change+0x6a7/0xb80
        ? mutex_lock+0x96/0xf0
        ? __mutex_lock_slowpath+0x10/0x10
        mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
        process_one_work+0x680/0x1050
        worker_thread+0x5a0/0xeb0
        ? process_one_work+0x1050/0x1050
        kthread+0x2a2/0x340
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x22/0x30
        </TASK>
      ---[ end trace 0000000000000000  ]---
      RIP: 0010:__alloc_pages+0x141/0x5c0
      Code: e0 39 a3 96 89 e9 b8 22 01 32 01 83 e1 0f 48 89 fa 01 c9 48 c1 ea
      03 d3 f8 83 e0 03 89 44 24 6c 48 b8 00 00 00 00 00 fc ff df <80> 3c 02
      00 0f 85 fc 03 00 00 89 e8 4a 8b 14 f5 e0 39 a3 96 4c 89
      RSP: 0018:ffff888251f0f438 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 1ffff1104a3e1e8b RCX: 0000000000000000
      RDX: 000000005c40d7ac RSI: 0000000000000003 RDI: 00000002e206bd60
      RBP: 0000000000052dc0 R08: ffff8882b0044218 R09: ffff8882b0045e8a
      R10: fffffbfff300fefc R11: ffff888167af4000 R12: 0000000000000003
      R13: 0000000000000000 R14: 00000000696c7070 R15: ffff8882373f4380
      FS:  0000000000000000(0000) GS:ffff88bf2be80000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00005641d031eee8 CR3: 0000002e7ca14000 CR4: 0000000000350ee0
      Kernel panic - not syncing: Fatal exception
      Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range:
      0xffffffff80000000-0xffffffffbfffffff)
      ---[ end Kernel panic - not syncing: Fatal exception  ]---]
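      The trace shows a TX-timeout worker reading IRQ wrapper memory that the
      shutdown path had already freed. The split the commit describes can be
      modelled as two teardown paths. This is a minimal sketch under stated
      assumptions: `struct irq_wrapper` and both functions are hypothetical,
      and a boolean stands in for the OS-level interrupt registration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model; not mlx5 structures. */
struct irq_wrapper {
	bool os_irq_requested;   /* stands in for the OS-level irq */
	unsigned long cpu_mask;  /* data a late worker may still read */
};

/* Shutdown path: release the OS interrupt but keep the wrapper, so a
 * late reader still sees valid memory. */
static void shutdown_free_irq_only(struct irq_wrapper *w)
{
	w->os_irq_requested = false;
}

/* Normal teardown path: release the interrupt and the wrapper. */
static void teardown_irq(struct irq_wrapper *w)
{
	w->os_irq_requested = false;
	free(w);
}
```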
      Reported-by: Frederick Lawler <fred@cloudflare.com>
      Link: https://lore.kernel.org/netdev/be5b9271-7507-19c5-ded1-fa78f1980e69@cloudflare.com
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Devcom, serialize devcom registration · 1f893f57
      Shay Drory authored
      On one hand, the mlx5 driver allows PFs to be probed in parallel.
      On the other hand, devcom, which is a shared resource between PFs, is
      registered without any lock. This might result in memory problems.
      
      Hence, use the global mlx5_dev_list_lock in order to serialize devcom
      registration.
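      Serializing registration means the "look up or create the shared
      object" step runs under one global lock, so two PFs probing in parallel
      cannot both create it. This minimal sketch uses a pthread mutex in the
      role the commit gives to mlx5_dev_list_lock; `struct devcom_model` and
      `devcom_register()` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shared object, created by the first PF to register. */
struct devcom_model {
	int refcount;
};

static pthread_mutex_t dev_list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct devcom_model shared_storage;
static struct devcom_model *shared_devcom;

static struct devcom_model *devcom_register(void)
{
	struct devcom_model *d;

	pthread_mutex_lock(&dev_list_lock);   /* serialize lookup + create */
	if (!shared_devcom) {
		shared_devcom = &shared_storage;
		shared_devcom->refcount = 0;
	}
	shared_devcom->refcount++;
	d = shared_devcom;
	pthread_mutex_unlock(&dev_list_lock);
	return d;
}
```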
      
      Fixes: fadd59fc ("net/mlx5: Introduce inter-device communication mechanism")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Devcom, fix error flow in mlx5_devcom_register_device · af871943
      Shay Drory authored
      If devcom allocation fails, mlx5 always frees the priv.
      However, this priv might have been allocated by a different thread,
      and freeing it might lead to use-after-free bugs.
      Fix it by freeing the priv only if it was allocated by the running
      thread.
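      The fix is the usual "only undo what this call did" error-path pattern.
      This minimal sketch records whether the current call created the shared
      priv and frees it on failure only in that case; `register_device()`,
      its `fail_alloc` switch and `struct devcom_priv_model` are hypothetical
      stand-ins, not mlx5_devcom_register_device().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical shared priv, possibly created by another PF. */
struct devcom_priv_model {
	int users;
};

static struct devcom_priv_model *global_priv;

/* fail_alloc simulates the later per-device allocation failing. */
static int register_device(bool fail_alloc)
{
	bool new_priv = false;

	if (!global_priv) {
		global_priv = calloc(1, sizeof(*global_priv));
		if (!global_priv)
			return -1;
		new_priv = true;
	}
	if (fail_alloc) {
		if (new_priv) {          /* free only what this call created */
			free(global_priv);
			global_priv = NULL;
		}
		return -1;
	}
	global_priv->users++;
	return 0;
}
```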
      
      Fixes: fadd59fc ("net/mlx5: Introduce inter-device communication mechanism")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: E-switch, Devcom, sync devcom events and devcom comp register · 8c253dfc
      Shay Drory authored
      Devcom events are sent to all registered components. Following the
      cited patch, it is possible for two components, e.g. two eswitches,
      to send devcom events while both components are registered. This
      means the eswitch layer will do double un/pairing, i.e. double
      allocation and freeing of resources, even though only one un/pairing
      is needed. Flow example:
      
      	cpu0					cpu1
      	----					----
      
       mlx5_devlink_eswitch_mode_set(dev0)
        esw_offloads_devcom_init()
         mlx5_devcom_register_component(esw0)
                                               mlx5_devlink_eswitch_mode_set(dev1)
                                                esw_offloads_devcom_init()
                                                 mlx5_devcom_register_component(esw1)
                                                 mlx5_devcom_send_event()
         mlx5_devcom_send_event()
      
      Hence, check whether the eswitches are already un/paired before
      free/allocation of resources.
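      Checking the current pairing state makes pair and unpair idempotent, so
      the duplicate event in the flow above becomes a no-op. This minimal
      sketch omits locking and assumes the caller serializes the two events;
      `struct esw_model`, `esw_pair()` and `esw_unpair()` are hypothetical
      names, not the eswitch offloads code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: pairing allocates peer resources, unpairing
 * frees them. */
struct esw_model {
	bool paired;
	int peer_resources;   /* allocations currently held */
};

static void esw_pair(struct esw_model *e)
{
	if (e->paired)
		return;           /* already paired by the other event */
	e->peer_resources++;
	e->paired = true;
}

static void esw_unpair(struct esw_model *e)
{
	if (!e->paired)
		return;           /* avoid a double free */
	e->peer_resources--;
	e->paired = false;
}
```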
      
      Fixes: 09b27846 ("net: devlink: enable parallel ops on netlink interface")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: TC, Fix using eswitch mapping in nic mode · dfa1e46d
      Paul Blakey authored
      The cited patch uses the eswitch object mapping pool while in NIC
      mode, where it isn't initialized. This results in the trace below
      [0].
      
      Fix that by using either the NIC or the eswitch object mapping pool,
      depending on whether the eswitch is enabled.
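      The selection the fix describes is a single branch on the eswitch
      state. This minimal sketch assumes each mode owns its own pool pointer
      and that the eswitch pool must not be touched in NIC mode;
      `struct priv_model` and `select_mapping_pool()` are hypothetical names,
      not the mlx5e TC code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: NIC and eswitch paths each own a mapping pool;
 * the eswitch pool is valid only when the eswitch is enabled. */
struct mapping_pool {
	const char *name;
};

struct priv_model {
	bool eswitch_enabled;
	struct mapping_pool *nic_pool;
	struct mapping_pool *esw_pool;   /* not valid in NIC mode */
};

static struct mapping_pool *select_mapping_pool(const struct priv_model *p)
{
	return p->eswitch_enabled ? p->esw_pool : p->nic_pool;
}
```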
      
      [0]:
      [  826.446057] ==================================================================
      [  826.446729] BUG: KASAN: slab-use-after-free in mlx5_add_flow_rules+0x30/0x490 [mlx5_core]
      [  826.447515] Read of size 8 at addr ffff888194485830 by task tc/6233
      
      [  826.448243] CPU: 16 PID: 6233 Comm: tc Tainted: G        W          6.3.0-rc6+ #1
      [  826.448890] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      [  826.449785] Call Trace:
      [  826.450052]  <TASK>
      [  826.450302]  dump_stack_lvl+0x33/0x50
      [  826.450650]  print_report+0xc2/0x610
      [  826.450998]  ? __virt_addr_valid+0xb1/0x130
      [  826.451385]  ? mlx5_add_flow_rules+0x30/0x490 [mlx5_core]
      [  826.451935]  kasan_report+0xae/0xe0
      [  826.452276]  ? mlx5_add_flow_rules+0x30/0x490 [mlx5_core]
      [  826.452829]  mlx5_add_flow_rules+0x30/0x490 [mlx5_core]
      [  826.453368]  ? __kmalloc_node+0x5a/0x120
      [  826.453733]  esw_add_restore_rule+0x20f/0x270 [mlx5_core]
      [  826.454288]  ? mlx5_eswitch_add_send_to_vport_meta_rule+0x260/0x260 [mlx5_core]
      [  826.455011]  ? mutex_unlock+0x80/0xd0
      [  826.455361]  ? __mutex_unlock_slowpath.constprop.0+0x210/0x210
      [  826.455862]  ? mapping_add+0x2cb/0x440 [mlx5_core]
      [  826.456425]  mlx5e_tc_action_miss_mapping_get+0x139/0x180 [mlx5_core]
      [  826.457058]  ? mlx5e_tc_update_skb_nic+0xb0/0xb0 [mlx5_core]
      [  826.457636]  ? __kasan_kmalloc+0x77/0x90
      [  826.458000]  ? __kmalloc+0x57/0x120
      [  826.458336]  mlx5_tc_ct_flow_offload+0x325/0xe40 [mlx5_core]
      [  826.458916]  ? ct_kernel_enter.constprop.0+0x48/0xa0
      [  826.459360]  ? mlx5_tc_ct_parse_action+0xf0/0xf0 [mlx5_core]
      [  826.459933]  ? mlx5e_mod_hdr_attach+0x491/0x520 [mlx5_core]
      [  826.460507]  ? mlx5e_mod_hdr_get+0x12/0x20 [mlx5_core]
      [  826.461046]  ? mlx5e_tc_attach_mod_hdr+0x154/0x170 [mlx5_core]
      [  826.461635]  mlx5e_configure_flower+0x969/0x2110 [mlx5_core]
      [  826.462217]  ? _raw_spin_lock_bh+0x85/0xe0
      [  826.462597]  ? __mlx5e_add_fdb_flow+0x750/0x750 [mlx5_core]
      [  826.463163]  ? kasan_save_stack+0x2e/0x40
      [  826.463534]  ? down_read+0x115/0x1b0
      [  826.463878]  ? down_write_killable+0x110/0x110
      [  826.464288]  ? tc_setup_action.part.0+0x9f/0x3b0
      [  826.464701]  ? mlx5e_is_uplink_rep+0x4c/0x90 [mlx5_core]
      [  826.465253]  ? mlx5e_tc_reoffload_flows_work+0x130/0x130 [mlx5_core]
      [  826.465878]  tc_setup_cb_add+0x112/0x250
      [  826.466247]  fl_hw_replace_filter+0x230/0x310 [cls_flower]
      [  826.466724]  ? fl_hw_destroy_filter+0x1a0/0x1a0 [cls_flower]
      [  826.467212]  fl_change+0x14e1/0x2030 [cls_flower]
      [  826.467636]  ? sock_def_readable+0x89/0x120
      [  826.468019]  ? fl_tmplt_create+0x2d0/0x2d0 [cls_flower]
      [  826.468509]  ? kasan_unpoison+0x23/0x50
      [  826.468873]  ? get_random_u16+0x180/0x180
      [  826.469244]  ? __radix_tree_lookup+0x2b/0x130
      [  826.469640]  ? fl_get+0x7b/0x140 [cls_flower]
      [  826.470042]  ? fl_mask_put+0x200/0x200 [cls_flower]
      [  826.470478]  ? __mutex_unlock_slowpath.constprop.0+0x210/0x210
      [  826.470973]  ? fl_tmplt_create+0x2d0/0x2d0 [cls_flower]
      [  826.471427]  tc_new_tfilter+0x644/0x1050
      [  826.471795]  ? tc_get_tfilter+0x860/0x860
      [  826.472170]  ? __thaw_task+0x130/0x130
      [  826.472525]  ? arch_stack_walk+0x98/0xf0
      [  826.472892]  ? cap_capable+0x9f/0xd0
      [  826.473235]  ? security_capable+0x47/0x60
      [  826.473608]  rtnetlink_rcv_msg+0x1d5/0x550
      [  826.473985]  ? rtnl_calcit.isra.0+0x1f0/0x1f0
      [  826.474383]  ? __stack_depot_save+0x35/0x4c0
      [  826.474779]  ? kasan_save_stack+0x2e/0x40
      [  826.475149]  ? kasan_save_stack+0x1e/0x40
      [  826.475518]  ? __kasan_record_aux_stack+0x9f/0xb0
      [  826.475939]  ? task_work_add+0x77/0x1c0
      [  826.476305]  netlink_rcv_skb+0xe0/0x210
      [  826.476661]  ? rtnl_calcit.isra.0+0x1f0/0x1f0
      [  826.477057]  ? netlink_ack+0x7c0/0x7c0
      [  826.477412]  ? rhashtable_jhash2+0xef/0x150
      [  826.477796]  ? _copy_from_iter+0x105/0x770
      [  826.484386]  netlink_unicast+0x346/0x490
      [  826.484755]  ? netlink_attachskb+0x400/0x400
      [  826.485145]  ? kernel_text_address+0xc2/0xd0
      [  826.485535]  netlink_sendmsg+0x3b0/0x6c0
      [  826.485902]  ? kernel_text_address+0xc2/0xd0
      [  826.486296]  ? netlink_unicast+0x490/0x490
      [  826.486671]  ? iovec_from_user.part.0+0x7a/0x1a0
      [  826.487083]  ? netlink_unicast+0x490/0x490
      [  826.487461]  sock_sendmsg+0x73/0xc0
      [  826.487803]  ____sys_sendmsg+0x364/0x380
      [  826.488186]  ? import_iovec+0x7/0x10
      [  826.488531]  ? kernel_sendmsg+0x30/0x30
      [  826.488893]  ? __copy_msghdr+0x180/0x180
      [  826.489258]  ? kasan_save_stack+0x2e/0x40
      [  826.489629]  ? kasan_save_stack+0x1e/0x40
      [  826.490002]  ? __kasan_record_aux_stack+0x9f/0xb0
      [  826.490424]  ? __call_rcu_common.constprop.0+0x46/0x580
      [  826.490876]  ___sys_sendmsg+0xdf/0x140
      [  826.491231]  ? copy_msghdr_from_user+0x110/0x110
      [  826.491649]  ? fget_raw+0x120/0x120
      [  826.491988]  ? ___sys_recvmsg+0xd9/0x130
      [  826.492355]  ? folio_batch_add_and_move+0x80/0xa0
      [  826.492776]  ? _raw_spin_lock+0x7a/0xd0
      [  826.493137]  ? _raw_spin_lock+0x7a/0xd0
      [  826.493500]  ? _raw_read_lock_irq+0x30/0x30
      [  826.493880]  ? kasan_set_track+0x21/0x30
      [  826.494249]  ? kasan_save_free_info+0x2a/0x40
      [  826.494650]  ? do_sys_openat2+0xff/0x270
      [  826.495016]  ? __fget_light+0x1b5/0x200
      [  826.495377]  ? __virt_addr_valid+0xb1/0x130
      [  826.495763]  __sys_sendmsg+0xb2/0x130
      [  826.496118]  ? __sys_sendmsg_sock+0x20/0x20
      [  826.496501]  ? __x64_sys_rseq+0x2e0/0x2e0
      [  826.496874]  ? do_user_addr_fault+0x276/0x820
      [  826.497273]  ? fpregs_assert_state_consistent+0x52/0x60
      [  826.497727]  ? exit_to_user_mode_prepare+0x30/0x120
      [  826.498158]  do_syscall_64+0x3d/0x90
      [  826.498502]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
      [  826.498949] RIP: 0033:0x7f9b67f4f887
      [  826.499294] Code: 0a 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
      [  826.500742] RSP: 002b:00007fff5d1a5498 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      [  826.501395] RAX: ffffffffffffffda RBX: 0000000064413ce6 RCX: 00007f9b67f4f887
      [  826.501975] RDX: 0000000000000000 RSI: 00007fff5d1a5500 RDI: 0000000000000003
      [  826.502556] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001
      [  826.503135] R10: 00007f9b67e08708 R11: 0000000000000246 R12: 0000000000000001
      [  826.503714] R13: 0000000000000001 R14: 00007fff5d1a9800 R15: 0000000000485400
      [  826.504304]  </TASK>
      
      [  826.504753] Allocated by task 3764:
      [  826.505090]  kasan_save_stack+0x1e/0x40
      [  826.505453]  kasan_set_track+0x21/0x30
      [  826.505810]  __kasan_kmalloc+0x77/0x90
      [  826.506164]  __mlx5_create_flow_table+0x16d/0xbb0 [mlx5_core]
      [  826.506742]  esw_offloads_enable+0x60d/0xfb0 [mlx5_core]
      [  826.507292]  mlx5_eswitch_enable_locked+0x4d3/0x680 [mlx5_core]
      [  826.507885]  mlx5_devlink_eswitch_mode_set+0x2a3/0x580 [mlx5_core]
      [  826.508513]  devlink_nl_cmd_eswitch_set_doit+0xdf/0x1f0
      [  826.508969]  genl_family_rcv_msg_doit.isra.0+0x146/0x1c0
      [  826.509427]  genl_rcv_msg+0x28d/0x3e0
      [  826.509772]  netlink_rcv_skb+0xe0/0x210
      [  826.510133]  genl_rcv+0x24/0x40
      [  826.510448]  netlink_unicast+0x346/0x490
      [  826.510810]  netlink_sendmsg+0x3b0/0x6c0
      [  826.511179]  sock_sendmsg+0x73/0xc0
      [  826.511519]  __sys_sendto+0x18d/0x220
      [  826.511867]  __x64_sys_sendto+0x72/0x80
      [  826.512232]  do_syscall_64+0x3d/0x90
      [  826.512576]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      [  826.513220] Freed by task 5674:
      [  826.513535]  kasan_save_stack+0x1e/0x40
      [  826.513893]  kasan_set_track+0x21/0x30
      [  826.514245]  kasan_save_free_info+0x2a/0x40
      [  826.514629]  ____kasan_slab_free+0x11a/0x1b0
      [  826.515021]  __kmem_cache_free+0x14d/0x280
      [  826.515399]  tree_put_node+0x109/0x1c0 [mlx5_core]
      [  826.515907]  mlx5_destroy_flow_table+0x119/0x630 [mlx5_core]
      [  826.516481]  esw_offloads_steering_cleanup+0xe7/0x150 [mlx5_core]
      [  826.517084]  esw_offloads_disable+0xe0/0x160 [mlx5_core]
      [  826.517632]  mlx5_eswitch_disable_locked+0x26c/0x290 [mlx5_core]
      [  826.518225]  mlx5_devlink_eswitch_mode_set+0x128/0x580 [mlx5_core]
      [  826.518834]  devlink_nl_cmd_eswitch_set_doit+0xdf/0x1f0
      [  826.519286]  genl_family_rcv_msg_doit.isra.0+0x146/0x1c0
      [  826.519748]  genl_rcv_msg+0x28d/0x3e0
      [  826.520101]  netlink_rcv_skb+0xe0/0x210
      [  826.520458]  genl_rcv+0x24/0x40
      [  826.520771]  netlink_unicast+0x346/0x490
      [  826.521137]  netlink_sendmsg+0x3b0/0x6c0
      [  826.521505]  sock_sendmsg+0x73/0xc0
      [  826.521842]  __sys_sendto+0x18d/0x220
      [  826.522191]  __x64_sys_sendto+0x72/0x80
      [  826.522554]  do_syscall_64+0x3d/0x90
      [  826.522894]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      [  826.523540] Last potentially related work creation:
      [  826.523969]  kasan_save_stack+0x1e/0x40
      [  826.524331]  __kasan_record_aux_stack+0x9f/0xb0
      [  826.524739]  insert_work+0x30/0x130
      [  826.525078]  __queue_work+0x34b/0x690
      [  826.525426]  queue_work_on+0x48/0x50
      [  826.525766]  __rhashtable_remove_fast_one+0x4af/0x4d0 [mlx5_core]
      [  826.526365]  del_sw_flow_group+0x1b5/0x270 [mlx5_core]
      [  826.526898]  tree_put_node+0x109/0x1c0 [mlx5_core]
      [  826.527407]  esw_offloads_steering_cleanup+0xd3/0x150 [mlx5_core]
      [  826.528009]  esw_offloads_disable+0xe0/0x160 [mlx5_core]
      [  826.528616]  mlx5_eswitch_disable_locked+0x26c/0x290 [mlx5_core]
      [  826.529218]  mlx5_devlink_eswitch_mode_set+0x128/0x580 [mlx5_core]
      [  826.529823]  devlink_nl_cmd_eswitch_set_doit+0xdf/0x1f0
      [  826.530276]  genl_family_rcv_msg_doit.isra.0+0x146/0x1c0
      [  826.530733]  genl_rcv_msg+0x28d/0x3e0
      [  826.531079]  netlink_rcv_skb+0xe0/0x210
      [  826.531439]  genl_rcv+0x24/0x40
      [  826.531755]  netlink_unicast+0x346/0x490
      [  826.532123]  netlink_sendmsg+0x3b0/0x6c0
      [  826.532487]  sock_sendmsg+0x73/0xc0
      [  826.532825]  __sys_sendto+0x18d/0x220
      [  826.533175]  __x64_sys_sendto+0x72/0x80
      [  826.533533]  do_syscall_64+0x3d/0x90
      [  826.533877]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      [  826.534521] The buggy address belongs to the object at ffff888194485800
                      which belongs to the cache kmalloc-512 of size 512
      [  826.535506] The buggy address is located 48 bytes inside of
                      freed 512-byte region [ffff888194485800, ffff888194485a00)
      
      [  826.536666] The buggy address belongs to the physical page:
      [  826.537138] page:00000000d75841dd refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x194480
      [  826.537915] head:00000000d75841dd order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
      [  826.538595] flags: 0x200000000010200(slab|head|node=0|zone=2)
      [  826.539089] raw: 0200000000010200 ffff888100042c80 ffffea0004523800 dead000000000002
      [  826.539755] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
      [  826.540417] page dumped because: kasan: bad access detected
      
      [  826.541095] Memory state around the buggy address:
      [  826.541519]  ffff888194485700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  826.542149]  ffff888194485780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  826.542773] >ffff888194485800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  826.543400]                                      ^
      [  826.543822]  ffff888194485880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  826.544452]  ffff888194485900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  826.545079] ==================================================================
      
      Fixes: 67027828 ("net/mlx5e: TC, Set CT miss to the specific ct action instance")
      Signed-off-by: Paul Blakey <paulb@nvidia.com>
      Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • Rahul Rameshbabu's avatar
      net/mlx5e: Fix SQ wake logic in ptp napi_poll context · 7aa50380
      Rahul Rameshbabu authored
      Check in the mlx5e_ptp_poll_ts_cq context whether the ptp tx sq should
      be woken up. Before this change, the ptp tx sq might never be woken up
      if the ptp tx ts skb fifo was full at the time mlx5e_poll_tx_cq checked
      whether the queue should be woken up.
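The wake condition can be pictured with a small userspace model. This is a sketch under stated assumptions: struct ptp_txq, its counters, and the helpers below are hypothetical and only loosely mirror mlx5e_poll_tx_cq and mlx5e_ptp_poll_ts_cq. The point it shows is that a queue stopped on two resources must be rechecked from both completion paths.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: the queue is stopped when either the SQ ring or
 * the timestamp skb FIFO is full, and may be woken only when both have room. */
struct ptp_txq {
	int sq_free;
	int ts_fifo_free;
	bool stopped;
};

static bool can_wake(const struct ptp_txq *q)
{
	return q->sq_free > 0 && q->ts_fifo_free > 0;
}

static void maybe_wake(struct ptp_txq *q)
{
	if (q->stopped && can_wake(q))
		q->stopped = false;
}

/* TX completion path (cf. mlx5e_poll_tx_cq): frees SQ space. */
static void poll_tx_cq(struct ptp_txq *q, int completed)
{
	q->sq_free += completed;
	maybe_wake(q);
}

/* Timestamp completion path (cf. mlx5e_ptp_poll_ts_cq): frees FIFO space.
 * Without a wake check here, a queue whose FIFO was full when
 * poll_tx_cq() ran could stay stopped forever. */
static void poll_ts_cq(struct ptp_txq *q, int completed)
{
	q->ts_fifo_free += completed;
	maybe_wake(q);
}
```

In the model, SQ space alone does not wake the queue; the later timestamp completion does.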
      
      Fixes: 1880bc4e ("net/mlx5e: Add TX port timestamp support")
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • Vlad Buslov's avatar
      net/mlx5e: Fix deadlock in tc route query code · 691c041b
      Vlad Buslov authored
      The cited commit causes an ABBA deadlock[0] when peer flows are created
      while holding the devcom rw semaphore. Due to the peer flows offload
      implementation, the lock is taken much higher up the call chain and
      there is no obvious way to easily fix the deadlock. Instead, since the
      tc route query code needs the peer eswitch structure only to perform a
      lookup in an xarray and doesn't perform any sleeping operations with
      it, refactor the code for lockless execution in the following ways:
      
      - RCUify the devcom 'data' pointer. When resetting the pointer,
      synchronously wait for an RCU grace period before returning. This is
      fine since devcom is currently only used for synchronization of
      pairing/unpairing of eswitches, which is rare and already expensive
      as-is.
      
      - Wrap all usages of the 'paired' boolean in {READ|WRITE}_ONCE(). The
      flag has already been used in some unlocked contexts without proper
      annotations (e.g. users of the mlx5_devcom_is_paired() function), but
      it wasn't an issue since all relevant code paths checked it again after
      obtaining the devcom semaphore. Now it is also used by
      mlx5_devcom_get_peer_data_rcu() as a "best effort" check to return NULL
      when devcom is being unpaired. Note that while the RCU read lock
      doesn't prevent the 'paired' flag from being changed concurrently, it
      still guarantees that the reader can continue to use 'data'.
      
      - Refactor the mlx5e_tc_query_route_vport() function to use the new
      mlx5_devcom_get_peer_data_rcu() API, which fixes the deadlock.
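The "best effort" lockless lookup described above can be approximated in userspace with C11 atomics. This is only an analogy under stated assumptions: struct devcom_comp and the helpers are hypothetical, atomic loads/stores stand in for READ_ONCE()/WRITE_ONCE() and rcu_dereference()/rcu_assign_pointer(), and RCU grace periods are not modeled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a devcom component. */
struct devcom_comp {
	atomic_bool paired;
	_Atomic(void *) data;
};

static void devcom_init(struct devcom_comp *comp)
{
	atomic_init(&comp->paired, false);
	atomic_init(&comp->data, NULL);
}

/* Lockless reader: returns either NULL or a fully published pointer.
 * The 'paired' check is best effort and may race with unpairing. */
static void *get_peer_data_lockless(struct devcom_comp *comp)
{
	if (!atomic_load_explicit(&comp->paired, memory_order_relaxed))
		return NULL;
	return atomic_load_explicit(&comp->data, memory_order_acquire);
}

static void devcom_pair(struct devcom_comp *comp, void *peer)
{
	/* Publish the pointer before advertising the pairing. */
	atomic_store_explicit(&comp->data, peer, memory_order_release);
	atomic_store_explicit(&comp->paired, true, memory_order_relaxed);
}

static void devcom_unpair(struct devcom_comp *comp)
{
	atomic_store_explicit(&comp->paired, false, memory_order_relaxed);
	atomic_store_explicit(&comp->data, NULL, memory_order_release);
	/* In the kernel, a synchronize_rcu() here would ensure no reader
	 * still uses the old 'data' once this function returns. */
}
```

The acquire/release pair only guarantees that a non-NULL pointer is seen with its contents published; keeping the old object alive until readers finish is what the RCU grace period adds in the real code.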
      
      [0]:
      
      [  164.599612] ======================================================
      [  164.600142] WARNING: possible circular locking dependency detected
      [  164.600667] 6.3.0-rc3+ #1 Not tainted
      [  164.601021] ------------------------------------------------------
      [  164.601557] handler1/3456 is trying to acquire lock:
      [  164.601998] ffff88811f1714b0 (&esw->offloads.encap_tbl_lock){+.+.}-{3:3}, at: mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
      [  164.603078]
                     but task is already holding lock:
      [  164.603617] ffff88810137fc98 (&comp->sem){++++}-{3:3}, at: mlx5_devcom_get_peer_data+0x37/0x80 [mlx5_core]
      [  164.604459]
                     which lock already depends on the new lock.
      
      [  164.605190]
                     the existing dependency chain (in reverse order) is:
      [  164.605848]
                     -> #1 (&comp->sem){++++}-{3:3}:
      [  164.606380]        down_read+0x39/0x50
      [  164.606772]        mlx5_devcom_get_peer_data+0x37/0x80 [mlx5_core]
      [  164.607336]        mlx5e_tc_query_route_vport+0x86/0xc0 [mlx5_core]
      [  164.607914]        mlx5e_tc_tun_route_lookup+0x1a4/0x1d0 [mlx5_core]
      [  164.608495]        mlx5e_attach_decap_route+0xc6/0x1e0 [mlx5_core]
      [  164.609063]        mlx5e_tc_add_fdb_flow+0x1ea/0x360 [mlx5_core]
      [  164.609627]        __mlx5e_add_fdb_flow+0x2d2/0x430 [mlx5_core]
      [  164.610175]        mlx5e_configure_flower+0x952/0x1a20 [mlx5_core]
      [  164.610741]        tc_setup_cb_add+0xd4/0x200
      [  164.611146]        fl_hw_replace_filter+0x14c/0x1f0 [cls_flower]
      [  164.611661]        fl_change+0xc95/0x18a0 [cls_flower]
      [  164.612116]        tc_new_tfilter+0x3fc/0xd20
      [  164.612516]        rtnetlink_rcv_msg+0x418/0x5b0
      [  164.612936]        netlink_rcv_skb+0x54/0x100
      [  164.613339]        netlink_unicast+0x190/0x250
      [  164.613746]        netlink_sendmsg+0x245/0x4a0
      [  164.614150]        sock_sendmsg+0x38/0x60
      [  164.614522]        ____sys_sendmsg+0x1d0/0x1e0
      [  164.614934]        ___sys_sendmsg+0x80/0xc0
      [  164.615320]        __sys_sendmsg+0x51/0x90
      [  164.615701]        do_syscall_64+0x3d/0x90
      [  164.616083]        entry_SYSCALL_64_after_hwframe+0x46/0xb0
      [  164.616568]
                     -> #0 (&esw->offloads.encap_tbl_lock){+.+.}-{3:3}:
      [  164.617210]        __lock_acquire+0x159e/0x26e0
      [  164.617638]        lock_acquire+0xc2/0x2a0
      [  164.618018]        __mutex_lock+0x92/0xcd0
      [  164.618401]        mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
      [  164.618943]        post_process_attr+0x153/0x2d0 [mlx5_core]
      [  164.619471]        mlx5e_tc_add_fdb_flow+0x164/0x360 [mlx5_core]
      [  164.620021]        __mlx5e_add_fdb_flow+0x2d2/0x430 [mlx5_core]
      [  164.620564]        mlx5e_configure_flower+0xe33/0x1a20 [mlx5_core]
      [  164.621125]        tc_setup_cb_add+0xd4/0x200
      [  164.621531]        fl_hw_replace_filter+0x14c/0x1f0 [cls_flower]
      [  164.622047]        fl_change+0xc95/0x18a0 [cls_flower]
      [  164.622500]        tc_new_tfilter+0x3fc/0xd20
      [  164.622906]        rtnetlink_rcv_msg+0x418/0x5b0
      [  164.623324]        netlink_rcv_skb+0x54/0x100
      [  164.623727]        netlink_unicast+0x190/0x250
      [  164.624138]        netlink_sendmsg+0x245/0x4a0
      [  164.624544]        sock_sendmsg+0x38/0x60
      [  164.624919]        ____sys_sendmsg+0x1d0/0x1e0
      [  164.625340]        ___sys_sendmsg+0x80/0xc0
      [  164.625731]        __sys_sendmsg+0x51/0x90
      [  164.626117]        do_syscall_64+0x3d/0x90
      [  164.626502]        entry_SYSCALL_64_after_hwframe+0x46/0xb0
      [  164.626995]
                     other info that might help us debug this:
      
      [  164.627725]  Possible unsafe locking scenario:
      
      [  164.628268]        CPU0                    CPU1
      [  164.628683]        ----                    ----
      [  164.629098]   lock(&comp->sem);
      [  164.629421]                                lock(&esw->offloads.encap_tbl_lock);
      [  164.630066]                                lock(&comp->sem);
      [  164.630555]   lock(&esw->offloads.encap_tbl_lock);
      [  164.630993]
                      *** DEADLOCK ***
      
      [  164.631575] 3 locks held by handler1/3456:
      [  164.631962]  #0: ffff888124b75130 (&block->cb_lock){++++}-{3:3}, at: tc_setup_cb_add+0x5b/0x200
      [  164.632703]  #1: ffff888116e512b8 (&esw->mode_lock){++++}-{3:3}, at: mlx5_esw_hold+0x39/0x50 [mlx5_core]
      [  164.633552]  #2: ffff88810137fc98 (&comp->sem){++++}-{3:3}, at: mlx5_devcom_get_peer_data+0x37/0x80 [mlx5_core]
      [  164.634435]
                     stack backtrace:
      [  164.634883] CPU: 17 PID: 3456 Comm: handler1 Not tainted 6.3.0-rc3+ #1
      [  164.635431] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      [  164.636340] Call Trace:
      [  164.636616]  <TASK>
      [  164.636863]  dump_stack_lvl+0x47/0x70
      [  164.637217]  check_noncircular+0xfe/0x110
      [  164.637601]  __lock_acquire+0x159e/0x26e0
      [  164.637977]  ? mlx5_cmd_set_fte+0x5b0/0x830 [mlx5_core]
      [  164.638472]  lock_acquire+0xc2/0x2a0
      [  164.638828]  ? mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
      [  164.639339]  ? lock_is_held_type+0x98/0x110
      [  164.639728]  __mutex_lock+0x92/0xcd0
      [  164.640074]  ? mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
      [  164.640576]  ? __lock_acquire+0x382/0x26e0
      [  164.640958]  ? mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
      [  164.641468]  ? mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
      [  164.641965]  mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
      [  164.642454]  ? lock_release+0xbf/0x240
      [  164.642819]  post_process_attr+0x153/0x2d0 [mlx5_core]
      [  164.643318]  mlx5e_tc_add_fdb_flow+0x164/0x360 [mlx5_core]
      [  164.643835]  __mlx5e_add_fdb_flow+0x2d2/0x430 [mlx5_core]
      [  164.644340]  mlx5e_configure_flower+0xe33/0x1a20 [mlx5_core]
      [  164.644862]  ? lock_acquire+0xc2/0x2a0
      [  164.645219]  tc_setup_cb_add+0xd4/0x200
      [  164.645588]  fl_hw_replace_filter+0x14c/0x1f0 [cls_flower]
      [  164.646067]  fl_change+0xc95/0x18a0 [cls_flower]
      [  164.646488]  tc_new_tfilter+0x3fc/0xd20
      [  164.646861]  ? tc_del_tfilter+0x810/0x810
      [  164.647236]  rtnetlink_rcv_msg+0x418/0x5b0
      [  164.647621]  ? rtnl_setlink+0x160/0x160
      [  164.647982]  netlink_rcv_skb+0x54/0x100
      [  164.648348]  netlink_unicast+0x190/0x250
      [  164.648722]  netlink_sendmsg+0x245/0x4a0
      [  164.649090]  sock_sendmsg+0x38/0x60
      [  164.649434]  ____sys_sendmsg+0x1d0/0x1e0
      [  164.649804]  ? copy_msghdr_from_user+0x6d/0xa0
      [  164.650213]  ___sys_sendmsg+0x80/0xc0
      [  164.650563]  ? lock_acquire+0xc2/0x2a0
      [  164.650926]  ? lock_acquire+0xc2/0x2a0
      [  164.651286]  ? __fget_files+0x5/0x190
      [  164.651644]  ? find_held_lock+0x2b/0x80
      [  164.652006]  ? __fget_files+0xb9/0x190
      [  164.652365]  ? lock_release+0xbf/0x240
      [  164.652723]  ? __fget_files+0xd3/0x190
      [  164.653079]  __sys_sendmsg+0x51/0x90
      [  164.653435]  do_syscall_64+0x3d/0x90
      [  164.653784]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
      [  164.654229] RIP: 0033:0x7f378054f8bd
      [  164.654577] Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 6a c3 f4 ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 be c3 f4 ff 48
      [  164.656041] RSP: 002b:00007f377fa114b0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
      [  164.656701] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f378054f8bd
      [  164.657297] RDX: 0000000000000000 RSI: 00007f377fa11540 RDI: 0000000000000014
      [  164.657885] RBP: 00007f377fa12278 R08: 0000000000000000 R09: 000000000000015c
      [  164.658472] R10: 00007f377fa123d0 R11: 0000000000000293 R12: 0000560962d99bd0
      [  164.665317] R13: 0000000000000000 R14: 0000560962d99bd0 R15: 00007f377fa11540
      
      Fixes: f9d196bd ("net/mlx5e: Use correct eswitch for stack devices with lag")
      Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Reviewed-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • Roi Dayan's avatar
      net/mlx5: Fix error message when failing to allocate device memory · a6573514
      Roi Dayan authored
      Fix the spacing in the error message and report the error code from
      the correct pointer.
      
      Fixes: c9b9dcb4 ("net/mlx5: Move device memory management to mlx5_core")
      Signed-off-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • Vlad Buslov's avatar
      net/mlx5e: Use correct encap attribute during invalidation · be071cdb
      Vlad Buslov authored
      With the introduction of the post action infrastructure, most of the
      users of the encap attribute were modified to obtain the correct
      attribute by calling the mlx5e_tc_get_encap_attr() helper instead of
      assuming that the encap action is always on the default attribute.
      However, the cited commit didn't modify mlx5e_invalidate_encap(), so it
      fails to destroy the correct modify header action, which leads to a
      warning [0]. Fix the issue by using the correct attribute.
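The idea of "look up the attribute that carries the encap action instead of assuming the default one" can be sketched with a hypothetical attribute chain. struct flow_attr, its fields, and get_encap_attr() are illustrative assumptions, not the driver's structures; the fallback to the default attribute is a choice of this sketch and may not match mlx5e_tc_get_encap_attr().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flow attribute: a flow may own a chain of attributes
 * (default attr first, then attrs for multi-table actions). */
struct flow_attr {
	int has_encap;          /* nonzero if this attr holds the encap action */
	int mod_hdr_id;         /* modify header owned by this attr */
	struct flow_attr *next; /* next attr in the flow's chain */
};

/* Return the attr carrying the encap action; this sketch falls back to
 * the default (first) attr when none is marked. */
static struct flow_attr *get_encap_attr(struct flow_attr *default_attr)
{
	assert(default_attr != NULL);
	for (struct flow_attr *a = default_attr; a; a = a->next)
		if (a->has_encap)
			return a;
	return default_attr;
}
```

Invalidation code that uses the returned attribute then tears down the modify header that actually belongs to the encap action.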
      
      [0]:
      
      Feb 21 09:47:35 c-237-177-40-045 kernel: WARNING: CPU: 17 PID: 654 at drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:684 mlx5e_tc_attach_mod_hdr+0x1cc/0x230 [mlx5_core]
      Feb 21 09:47:35 c-237-177-40-045 kernel: RIP: 0010:mlx5e_tc_attach_mod_hdr+0x1cc/0x230 [mlx5_core]
      Feb 21 09:47:35 c-237-177-40-045 kernel: Call Trace:
      Feb 21 09:47:35 c-237-177-40-045 kernel:  <TASK>
      Feb 21 09:47:35 c-237-177-40-045 kernel:  mlx5e_tc_fib_event_work+0x8e3/0x1f60 [mlx5_core]
      Feb 21 09:47:35 c-237-177-40-045 kernel:  ? mlx5e_take_all_encap_flows+0xe0/0xe0 [mlx5_core]
      Feb 21 09:47:35 c-237-177-40-045 kernel:  ? lock_downgrade+0x6d0/0x6d0
      Feb 21 09:47:35 c-237-177-40-045 kernel:  ? lockdep_hardirqs_on_prepare+0x273/0x3f0
      Feb 21 09:47:35 c-237-177-40-045 kernel:  ? lockdep_hardirqs_on_prepare+0x273/0x3f0
      Feb 21 09:47:35 c-237-177-40-045 kernel:  process_one_work+0x7c2/0x1310
      Feb 21 09:47:35 c-237-177-40-045 kernel:  ? lockdep_hardirqs_on_prepare+0x3f0/0x3f0
      Feb 21 09:47:35 c-237-177-40-045 kernel:  ? pwq_dec_nr_in_flight+0x230/0x230
      Feb 21 09:47:35 c-237-177-40-045 kernel:  ? rwlock_bug.part.0+0x90/0x90
      Feb 21 09:47:35 c-237-177-40-045 kernel:  worker_thread+0x59d/0xec0
      Feb 21 09:47:35 c-237-177-40-045 kernel:  ? __kthread_parkme+0xd9/0x1d0
      
      Fixes: 8300f225 ("net/mlx5e: Create new flow attr for multi table actions")
      Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • Yevgeny Kliteynik's avatar
      net/mlx5: DR, Check force-loopback RC QP capability independently from RoCE · c7dd225b
      Yevgeny Kliteynik authored
      SW Steering uses an RC QP for writing STEs to ICM. This writing is done
      in LB (loopback), and an FL (force-loopback) QP is preferred for
      performance. FL can be available both when RoCE is enabled and when it
      is disabled, but the driver read the FL capability only from the RoCE
      caps. This patch also reads the FL capability from the HCA caps, thus
      fixing the case where we didn't have loopback enabled when RoCE was
      disabled.
      
      Fixes: 7304d603 ("net/mlx5: DR, Add support for force-loopback QP")
      Signed-off-by: Itamar Gozlan <igozlan@nvidia.com>
      Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • Erez Shitrit's avatar
      net/mlx5: DR, Fix crc32 calculation to work on big-endian (BE) CPUs · 1e5daf55
      Erez Shitrit authored
      When calculating the crc for the hash index, we use the crc32()
      function, which computes the result in little-endian (LE) order.
      We then convert it to network endianness using htonl(), but doing this
      conversion is wrong on BE archs, since htonl() is a no-op there while
      the crc32 value is already LE.
      
      The solution is to swap the bytes of the crc result on all types of
      arch.
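The difference between htonl() and an unconditional byte swap can be shown with a standalone sketch. It assumes the standard reflected CRC-32 (polynomial 0xEDB88320, initial value and final XOR 0xFFFFFFFF), which is not necessarily the exact seed the driver uses; swap32() is a hypothetical helper equivalent in spirit to the kernel's swab32().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise reflected CRC-32 (IEEE polynomial). Computed with shifts and
 * XORs on a uint32_t, so the numeric result is the same on LE and BE hosts. */
static uint32_t crc32_ieee(const uint8_t *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	assert(buf != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
	}
	return crc ^ 0xFFFFFFFFu;
}

/* Unconditional byte swap. Unlike htonl(), which only swaps on LE hosts,
 * this yields the same value on every architecture. */
static uint32_t swap32(uint32_t v)
{
	return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
	       ((v & 0x00FF0000u) >> 8) | ((v & 0xFF000000u) >> 24);
}
```

For the standard check input "123456789" the CRC is 0xCBF43926, and swapping it gives 0x2639F4CB regardless of host endianness.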
      
      Fixes: 40416d8e ("net/mlx5: DR, Replace CRC32 implementation to use kernel lib")
      Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
      Reviewed-by: Alex Vesker <valex@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • Shay Drory's avatar
      net/mlx5: Handle pairing of E-switch via uplink un/load APIs · 2be5bd42
      Shay Drory authored
      If a user switches a device from switchdev mode to legacy mode, mlx5
      first unpairs the E-switch and afterwards unloads the uplink vport.
      On the other hand, if a user removes or reloads a device, mlx5 first
      unloads the uplink vport and afterwards unpairs the E-switch.
      
      The latter causes a bug[1]; hence, handle the pairing of the E-switch
      as part of the uplink un/load APIs.
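Moving pairing into the uplink un/load helpers makes every teardown path share one ordering. The sketch below checks that ordering with an event trace; all names (struct esw_trace, uplink_load(), switch_to_legacy(), remove_or_reload()) are hypothetical and do not correspond to mlx5 functions.

```c
#include <assert.h>

/* Hypothetical event trace used to check teardown ordering. */
enum esw_event { EV_UPLINK_LOAD, EV_PAIR, EV_UNPAIR, EV_UPLINK_UNLOAD };

struct esw_trace {
	enum esw_event ev[16];
	int n;
};

static void record(struct esw_trace *t, enum esw_event e)
{
	assert(t->n < 16);
	t->ev[t->n++] = e;
}

/* Pairing lives inside the uplink load/unload helpers, so every caller
 * gets the same order: pair after load, unpair before unload. */
static void uplink_load(struct esw_trace *t)
{
	record(t, EV_UPLINK_LOAD);
	record(t, EV_PAIR);
}

static void uplink_unload(struct esw_trace *t)
{
	record(t, EV_UNPAIR);
	record(t, EV_UPLINK_UNLOAD);
}

/* Both teardown paths go through uplink_unload(). */
static void switch_to_legacy(struct esw_trace *t) { uplink_unload(t); }
static void remove_or_reload(struct esw_trace *t) { uplink_unload(t); }
```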
      
      [1]
      If VF_LAG is used, every tc fdb flow is duplicated to the peer esw.
      However, the original esw, not the peer esw, keeps a pointer to this
      duplicated flow.
      E.g., if a user creates a tc fdb flow over esw0, the flow is duplicated
      over esw1 in FW/HW, but in SW, esw0 keeps a pointer to the duplicated
      flow.
      During module unload, while a peer tc fdb flow is still offloaded, if
      the first device to be removed is the peer device (esw1 in the example
      above), the peer net-dev is destroyed, and so its mlx5e_priv is
      memset to 0.
      Afterwards, the peer device tries to unpair itself from the original
      device (esw0 in the example above). The unpair API invokes the
      original device to clear the peer flow from its eswitch (esw0), but
      the peer flow, which is stored on the original eswitch (esw0), tries
      to use the peer mlx5e_priv, which has been memset to 0, resulting in
      the kernel oops below.
      
      [  157.964081 ] BUG: unable to handle page fault for address: 000000000002ce60
      [  157.964662 ] #PF: supervisor read access in kernel mode
      [  157.965123 ] #PF: error_code(0x0000) - not-present page
      [  157.965582 ] PGD 0 P4D 0
      [  157.965866 ] Oops: 0000 [#1] SMP
      [  157.967670 ] RIP: 0010:mlx5e_tc_del_fdb_flow+0x48/0x460 [mlx5_core]
      [  157.976164 ] Call Trace:
      [  157.976437 ]  <TASK>
      [  157.976690 ]  __mlx5e_tc_del_fdb_peer_flow+0xe6/0x100 [mlx5_core]
      [  157.977230 ]  mlx5e_tc_clean_fdb_peer_flows+0x67/0x90 [mlx5_core]
      [  157.977767 ]  mlx5_esw_offloads_unpair+0x2d/0x1e0 [mlx5_core]
      [  157.984653 ]  mlx5_esw_offloads_devcom_event+0xbf/0x130 [mlx5_core]
      [  157.985212 ]  mlx5_devcom_send_event+0xa3/0xb0 [mlx5_core]
      [  157.985714 ]  esw_offloads_disable+0x5a/0x110 [mlx5_core]
      [  157.986209 ]  mlx5_eswitch_disable_locked+0x152/0x170 [mlx5_core]
      [  157.986757 ]  mlx5_eswitch_disable+0x51/0x80 [mlx5_core]
      [  157.987248 ]  mlx5_unload+0x2a/0xb0 [mlx5_core]
      [  157.987678 ]  mlx5_uninit_one+0x5f/0xd0 [mlx5_core]
      [  157.988127 ]  remove_one+0x64/0xe0 [mlx5_core]
      [  157.988549 ]  pci_device_remove+0x31/0xa0
      [  157.988933 ]  device_release_driver_internal+0x18f/0x1f0
      [  157.989402 ]  driver_detach+0x3f/0x80
      [  157.989754 ]  bus_remove_driver+0x70/0xf0
      [  157.990129 ]  pci_unregister_driver+0x34/0x90
      [  157.990537 ]  mlx5_cleanup+0xc/0x1c [mlx5_core]
      [  157.990972 ]  __x64_sys_delete_module+0x15a/0x250
      [  157.991398 ]  ? exit_to_user_mode_prepare+0xea/0x110
      [  157.991840 ]  do_syscall_64+0x3d/0x90
      [  157.992198 ]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      Fixes: 04de7dda ("net/mlx5e: Infrastructure for duplicated offloading of TC flows")
      Fixes: 1418ddd9 ("net/mlx5e: Duplicate offloaded TC eswitch rules under uplink LAG")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • Shay Drory's avatar
      net/mlx5: Collect command failures data only for known commands · 2a0a935f
      Shay Drory authored
      DEVX can issue a general command that is not used by the mlx5 driver.
      If such a command fails, mlx5 tries to collect the failure data;
      however, mlx5 doesn't create storage for this command, since mlx5
      doesn't use it. This leads to an array-index-out-of-bounds error.
      
      Fix it by checking whether the command is known before collecting the
      failure data.
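The fix amounts to a bounds check before indexing per-command storage. A minimal sketch, assuming a hypothetical fixed-size stats array indexed by opcode; KNOWN_CMD_MAX, struct cmd_stats, and the helper names are illustrative, not the mlx5 definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: failure stats exist only for opcodes the driver knows. */
#define KNOWN_CMD_MAX 4

struct cmd_stats {
	unsigned int failed;
	int last_failed_errno;
};

static struct cmd_stats stats[KNOWN_CMD_MAX];

static bool cmd_is_known(uint16_t opcode)
{
	return opcode < KNOWN_CMD_MAX;
}

/* Record a failure only when storage exists for this opcode; commands
 * issued on behalf of DEVX that the driver never uses are skipped
 * instead of indexing past the end of stats[]. */
static void collect_failure(uint16_t opcode, int err)
{
	if (!cmd_is_known(opcode))
		return;
	stats[opcode].failed++;
	stats[opcode].last_failed_errno = err;
}
```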
      
      Fixes: 34f46ae0 ("net/mlx5: Add command failures data to debugfs")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • Chuck Lever's avatar
      net/handshake: Fix sock->file allocation · 18c40a1c
      Chuck Lever authored
      	sock->file = sock_alloc_file(sock, O_NONBLOCK, NULL);
      	^^^^                         ^^^^
      
      sock_alloc_file() calls sock_release() on error, but the left-hand
      side of the assignment dereferences "sock". This isn't a real bug,
      and I didn't report it earlier, because there is an assert that the
      call doesn't fail.
      
      net/handshake/handshake-test.c:221 handshake_req_submit_test4() error: dereferencing freed memory 'sock'
      net/handshake/handshake-test.c:233 handshake_req_submit_test4() warn: 'req' was already freed.
      net/handshake/handshake-test.c:254 handshake_req_submit_test5() error: dereferencing freed memory 'sock'
      net/handshake/handshake-test.c:290 handshake_req_submit_test6() error: dereferencing freed memory 'sock'
      net/handshake/handshake-test.c:321 handshake_req_cancel_test1() error: dereferencing freed memory 'sock'
      net/handshake/handshake-test.c:355 handshake_req_cancel_test2() error: dereferencing freed memory 'sock'
      net/handshake/handshake-test.c:367 handshake_req_cancel_test2() warn: 'req' was already freed.
      net/handshake/handshake-test.c:395 handshake_req_cancel_test3() error: dereferencing freed memory 'sock'
      net/handshake/handshake-test.c:407 handshake_req_cancel_test3() warn: 'req' was already freed.
      net/handshake/handshake-test.c:451 handshake_req_destroy_test1() error: dereferencing freed memory 'sock'
      net/handshake/handshake-test.c:463 handshake_req_destroy_test1() warn: 'req' was already freed.
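The safe pattern is to keep the result in a local variable and store it through "sock" only once the call has succeeded. A userspace sketch under stated assumptions: struct fake_socket, struct fake_file, and fake_alloc_file() are hypothetical stand-ins that mimic only the "frees sock on failure" behavior described above.

```c
#include <assert.h>
#include <stdlib.h>

struct fake_file { int dummy; };
struct fake_socket { struct fake_file *file; };

/* Hypothetical stand-in for sock_alloc_file(): on any failure it frees
 * 'sock' before returning NULL, so the caller must not touch 'sock' then. */
static struct fake_file *fake_alloc_file(struct fake_socket *sock, int force_fail)
{
	struct fake_file *f = force_fail ? NULL : calloc(1, sizeof(*f));

	if (!f) {
		free(sock);
		return NULL;
	}
	return f;
}

/* Writing "sock->file = fake_alloc_file(sock, ...)" would store into
 * freed memory on the failure path; check the result first instead. */
static int attach_file(struct fake_socket *sock, int force_fail)
{
	struct fake_file *file = fake_alloc_file(sock, force_fail);

	if (!file)
		return -1;
	sock->file = file;
	return 0;
}
```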
      
      Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
      Fixes: 88232ec1 ("net/handshake: Add Kunit tests for the handshake consumer API")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Link: https://lore.kernel.org/r/168451609436.45209.15407022385441542980.stgit@oracle-102.nfsv4bat.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Chuck Lever's avatar
      net/handshake: Squelch allocation warning during Kunit test · b21c7ba6
      Chuck Lever authored
      The "handshake_req_alloc excessive privsize" kunit test is intended
      to check what happens when the maximum privsize is exceeded. The
      WARN_ON_ONCE_GFP at mm/page_alloc.c:4744 can be disabled safely for
      this test.
      
      Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
      Fixes: 88232ec1 ("net/handshake: Add Kunit tests for the handshake consumer API")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Link: https://lore.kernel.org/r/168451636052.47152.9600443326570457947.stgit@oracle-102.nfsv4bat.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Christophe JAILLET's avatar
      3c589_cs: Fix an error handling path in tc589_probe() · 640bf95b
      Christophe JAILLET authored
      Should tc589_config() fail, some resources need to be released as already
      done in the remove function.
      
      Fixes: 15b99ac1 ("[PATCH] pcmcia: add return value to _config() functions")
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/d8593ae867b24c79063646e36f9b18b0790107cb.1684575975.git.christophe.jaillet@wanadoo.fr
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Christophe JAILLET's avatar
      forcedeth: Fix an error handling path in nv_probe() · 5b17a497
      Christophe JAILLET authored
      If an error occurs after calling nv_mgmt_acquire_sema(), it should be
      undone with a corresponding nv_mgmt_release_sema() call.
      
      Add it in the error handling path of the probe as already done in the
      remove function.
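The fix follows the usual goto-based unwinding idiom in probe functions: each step acquired before a failure is undone on the error path. A hedged sketch in which struct dev, mgmt_acquire_sema(), mgmt_release_sema(), and register_netdev_stub() are hypothetical stand-ins for the forcedeth calls and later probe steps.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device state for the sketch. */
struct dev {
	bool sema_held;
	bool fail_register;
};

static bool mgmt_acquire_sema(struct dev *d) { d->sema_held = true; return true; }
static void mgmt_release_sema(struct dev *d) { d->sema_held = false; }
static int register_netdev_stub(struct dev *d) { return d->fail_register ? -1 : 0; }

/* Probe with goto-based unwinding: anything acquired before a failing
 * step is released in reverse order before returning the error. */
static int probe(struct dev *d)
{
	int err;

	if (!mgmt_acquire_sema(d))
		return -1;

	err = register_netdev_stub(d);
	if (err)
		goto out_release_sema;

	return 0;

out_release_sema:
	mgmt_release_sema(d); /* mirrors what the remove path already does */
	return err;
}
```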
      
      Fixes: cac1c52c ("forcedeth: mgmt unit interface")
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>
      Link: https://lore.kernel.org/r/355e9a7d351b32ad897251b6f81b5886fcdc6766.1684571393.git.christophe.jaillet@wanadoo.fr
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  4. 22 May, 2023 1 commit
    • Xin Long's avatar
      sctp: fix an issue that plpmtu can never go to complete state · 6ca328e9
      Xin Long authored
      When doing a plpmtu probe, the probe size grows every time an ACK is
      received during the Search state, until a probe fails. When the
      failure occurs, pl.probe_high is set and the state moves to Complete.
      
      However, if the link pmtu is huge, like 65535 on loopback_dev, the
      probe eventually keeps using SCTP_MAX_PLPMTU as the probe size and
      never fails. Because of that, pl.probe_high is never set, and the
      plpmtu probe can never reach the Complete state.
      
      Fix it by setting pl.probe_high to SCTP_MAX_PLPMTU when the probe size
      grows to SCTP_MAX_PLPMTU in sctp_transport_pl_recv(). Also, do not
      allow the probe size to exceed SCTP_MAX_PLPMTU in the Complete state.
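A heavily simplified model of why the search needs an upper bound to terminate. This is a sketch under stated assumptions: MAX_PLPMTU, PROBE_STEP, struct pl, and pl_recv_ack() are hypothetical and do not reproduce the kernel's step sizes or its exact Search/Complete logic; they only show that capping the probe size and setting probe_high at the cap lets the loop finish.

```c
#include <assert.h>

#define MAX_PLPMTU 9000 /* stand-in for SCTP_MAX_PLPMTU */
#define PROBE_STEP 32   /* illustrative step size */

enum pl_state { PL_SEARCH, PL_COMPLETE };

struct pl {
	enum pl_state state;
	int probe_size;
	int probe_high; /* 0 means "not set yet" */
};

/* Called when a probe of pl->probe_size is ACKed. */
static void pl_recv_ack(struct pl *pl)
{
	if (pl->state != PL_SEARCH)
		return;
	if (pl->probe_size >= MAX_PLPMTU) {
		/* No probe will ever fail above the maximum, so use the
		 * maximum as the upper bound and finish the search. */
		pl->probe_high = MAX_PLPMTU;
		pl->state = PL_COMPLETE;
		return;
	}
	pl->probe_size += PROBE_STEP;
	if (pl->probe_size > MAX_PLPMTU)
		pl->probe_size = MAX_PLPMTU; /* never probe above the maximum */
}
```

Without the `probe_size >= MAX_PLPMTU` branch, a path that never drops probes would keep the model in PL_SEARCH forever.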
      
      Fixes: b87641af ("sctp: do state transition when a probe succeeds on HB ACK recv path")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 20 May, 2023 2 commits
    • Jakub Kicinski's avatar
      Merge tag 'for-net-2023-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth · 67caf26d
      Jakub Kicinski authored
      Luiz Augusto von Dentz says:
      
      ====================
      bluetooth pull request for net:
      
       - Fix compiler warnings on btnxpuart
       - Fix potential double free on hci_conn_unlink
       - Fix UAF on hci_conn_hash_flush
      
      * tag 'for-net-2023-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
        Bluetooth: btnxpuart: Fix compiler warnings
        Bluetooth: Unlink CISes when LE disconnects in hci_conn_del
        Bluetooth: Fix UAF in hci_conn_hash_flush again
        Bluetooth: Refcnt drop must be placed last in hci_conn_unlink
        Bluetooth: Fix potential double free caused by hci_conn_unlink
      ====================
      
      Link: https://lore.kernel.org/r/20230519233056.2024340-1-luiz.dentz@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Taehee Yoo's avatar
      net: fix stack overflow when LRO is disabled for virtual interfaces · ae9b15fb
      Taehee Yoo authored
      When a virtual interface's features are updated, it synchronizes the
      updated features to its own lower interfaces.
      This propagation is supposed to work iteratively, not recursively,
      but it unexpectedly works recursively due to netdev notifications.
      The problem occurs when LRO is disabled, and only for the team and
      bonding interface types.
      
             team0
               |
        +------+------+-----+-----+
        |      |      |     |     |
      team1  team2  team3  ...  team200
      
      If team0's LRO feature is updated, it generates the NETDEV_FEAT_CHANGE
      event for its own lower interfaces (team1 ~ team200).
      This is done by netdev_sync_lower_features(), so the
      NETDEV_FEAT_CHANGE notification logic of each lower interface works
      iteratively.
      But the generated NETDEV_FEAT_CHANGE event is also sent to the upper
      interface.
      The upper interface (team0) then generates the NETDEV_FEAT_CHANGE
      event for its own lower interfaces again.
      The lower and upper interfaces receive this event and generate it
      again and again, so the stack overflows.
      
      However, this is not an infinite loop, because
      netdev_sync_lower_features() updates the features before generating
      the NETDEV_FEAT_CHANGE event, and lower interfaces that are already
      synchronized skip the notification logic.
      So the only problem is that the iteration unexpectedly turns into
      recursion due to the notification mechanism.
      
      Reproducer:
      
      ip link add team0 type team
      ethtool -K team0 lro on
      for i in {1..200}
      do
              ip link add team$i master team0 type team
              ethtool -K team$i lro on
      done
      
      ethtool -K team0 lro off
      
      In order to fix it, the notifier_ctx member of bonding/team is introduced.
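The effect of a per-device "already handling a notification" flag can be demonstrated with a toy model that measures recursion depth. Everything here (struct upper, sync_lower(), notify_upper(), guard_enabled) is hypothetical and only mimics the shape of the problem; it is not the team or bonding code.

```c
#include <assert.h>
#include <stdbool.h>

#define NLOWER 3

/* Hypothetical upper device with NLOWER lower devices. */
struct upper {
	bool lro;
	bool lower_lro[NLOWER];
	bool guard_enabled; /* toggles the notifier_ctx-style guard */
	bool notifier_ctx;  /* set while handling a notification */
	int depth, max_depth;
};

static void notify_upper(struct upper *u);

/* Like netdev_sync_lower_features(): only lowers that actually change
 * emit a feature-change notification, which reaches the upper device. */
static void sync_lower(struct upper *u)
{
	for (int i = 0; i < NLOWER; i++) {
		if (u->lower_lro[i] != u->lro) {
			u->lower_lro[i] = u->lro;
			notify_upper(u);
		}
	}
}

/* Upper's handler: recomputes features by syncing lowers again. With the
 * guard, re-entrant notifications are ignored instead of recursing. */
static void notify_upper(struct upper *u)
{
	if (u->guard_enabled && u->notifier_ctx)
		return;
	u->notifier_ctx = true;
	if (++u->depth > u->max_depth)
		u->max_depth = u->depth;
	sync_lower(u);
	u->depth--;
	u->notifier_ctx = false;
}
```

Without the guard the nesting depth grows with the number of lower devices (200 in the reproducer above); with it, the depth stays at one and all lowers are still synchronized by the outer loop.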
      
      Reported-by: syzbot+60748c96cf5c6df8e581@syzkaller.appspotmail.com
      Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Link: https://lore.kernel.org/r/20230517143010.3596250-1-ap420073@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  6. 19 May, 2023 6 commits