1. 17 Jul, 2023 7 commits
    • RDMA/bnxt_re: Fix hang during driver unload · 29900bf3
      Selvin Xavier authored
      Driver unload hits a hang during stress testing of load/unload.
      
      Stack trace snippet:
      
      tasklet_kill at ffffffff9aabb8b2
      bnxt_qplib_nq_stop_irq at ffffffffc0a805fb [bnxt_re]
      bnxt_qplib_disable_nq at ffffffffc0a80c5b [bnxt_re]
      bnxt_re_dev_uninit at ffffffffc0a67d15 [bnxt_re]
      bnxt_re_remove_device at ffffffffc0a6af1d [bnxt_re]
      
      tasklet_kill can hang if the tasklet is scheduled after it is disabled.
      
      Modify the sequence to disable the interrupt first and synchronize
      the IRQ before disabling the tasklet.
      
      Fixes: 1ac5a404 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
      Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
      Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
      Link: https://lore.kernel.org/r/1689322969-25402-3-git-send-email-selvin.xavier@broadcom.com
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
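The ordering problem described in this commit can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct nq_model` and every function below are hypothetical stand-ins, not driver or kernel code, and `synchronize_irq` is modelled simply as "no handler is in flight once the interrupt is disabled". The one kernel behaviour it relies on is that a disabled tasklet that gets scheduled never runs, so a wait for its scheduled state to clear cannot finish.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a notification queue with one IRQ and one tasklet. */
struct nq_model {
    bool irq_enabled;
    bool tasklet_disabled;
    bool tasklet_scheduled;
};

/* Interrupt handler: schedules the tasklet only while the IRQ is enabled. */
static void irq_fire(struct nq_model *nq)
{
    if (nq->irq_enabled)
        nq->tasklet_scheduled = true;
}

/* Softirq pass: a disabled tasklet stays scheduled and is never run. */
static void softirq_run(struct nq_model *nq)
{
    if (nq->tasklet_scheduled && !nq->tasklet_disabled)
        nq->tasklet_scheduled = false;
}

/* Killing a tasklet waits for the scheduled state to clear; report whether
 * that wait could never end in this model. */
static bool kill_would_hang(const struct nq_model *nq)
{
    return nq->tasklet_scheduled && nq->tasklet_disabled;
}

/* Old order: tasklet disabled first, so a late interrupt can still schedule it. */
static bool teardown_old_order(struct nq_model *nq)
{
    nq->tasklet_disabled = true;
    irq_fire(nq);               /* late interrupt arrives */
    nq->irq_enabled = false;
    softirq_run(nq);
    return kill_would_hang(nq);
}

/* New order: interrupt disabled (and assumed synchronized) before the tasklet. */
static bool teardown_new_order(struct nq_model *nq)
{
    nq->irq_enabled = false;
    irq_fire(nq);               /* late interrupt is ignored */
    softirq_run(nq);            /* anything scheduled earlier runs to completion */
    nq->tasklet_disabled = true;
    return kill_would_hang(nq);
}
```

In this model, `teardown_old_order` leaves the tasklet scheduled while disabled, which is the state that makes the kill step wait forever, while `teardown_new_order` does not.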
    • RDMA/bnxt_re: Prevent handling any completions after qp destroy · b5bbc655
      Kashyap Desai authored
      HW may generate completions indicating that the QP is destroyed.
      The driver should not schedule any more completion handlers
      for this QP after the QP is destroyed. Since CQs are active
      during the QP destroy, the driver may still schedule completion
      handlers. This can cause a race where destroy_cq and poll_cq
      run simultaneously.
      
      Snippet of a kernel panic while loading and unloading the bnxt_re driver in a loop.
      This indicates a poll after the CQ is freed.
      
      [77786.481636] Call Trace:
      [77786.481640]  <TASK>
      [77786.481644]  bnxt_re_poll_cq+0x14a/0x620 [bnxt_re]
      [77786.481658]  ? kvm_clock_read+0x14/0x30
      [77786.481693]  __ib_process_cq+0x57/0x190 [ib_core]
      [77786.481728]  ib_cq_poll_work+0x26/0x80 [ib_core]
      [77786.481761]  process_one_work+0x1e5/0x3f0
      [77786.481768]  worker_thread+0x50/0x3a0
      [77786.481785]  ? __pfx_worker_thread+0x10/0x10
      [77786.481790]  kthread+0xe2/0x110
      [77786.481794]  ? __pfx_kthread+0x10/0x10
      [77786.481797]  ret_from_fork+0x2c/0x50
      
      To avoid this, complete all completion handlers before returning from
      destroy QP. If free_cq is called soon after destroy_qp, the IB stack
      cancels the CQ work before invoking the destroy_cq verb, which
      prevents the race described above.
      
      Fixes: 1ac5a404 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
      Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
      Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
      Link: https://lore.kernel.org/r/1689322969-25402-2-git-send-email-selvin.xavier@broadcom.com
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
    • RDMA/mthca: Fix crash when polling CQ for shared QPs · dc52aadb
      Thomas Bogendoerfer authored
      Commit 21c2fe94 ("RDMA/mthca: Combine special QP struct with mthca QP")
      introduced a new struct mthca_sqp which doesn't contain struct mthca_qp
      any longer. Placing a pointer of this new struct into qptable leads
      to crashes, because mthca_poll_one() expects a qp pointer. Fix this
      by putting the correct pointer into qptable.
      
      Fixes: 21c2fe94 ("RDMA/mthca: Combine special QP struct with mthca QP")
      Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
      Link: https://lore.kernel.org/r/20230713141658.9426-1-tbogendoerfer@suse.de
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
    • RDMA/core: Update CMA destination address on rdma_resolve_addr · 0e158630
      Shiraz Saleem authored
      8d037973 ("RDMA/core: Refactor rdma_bind_addr") introduces a regression
      on irdma devices in certain tests which use rdma CM, such as cmtime.
      
      No connections can be established, as the MAD QP experiences a fatal
      error on the active side.
      
      The CMA destination address is not updated with the dst_addr when the ULP
      on the active side calls rdma_bind_addr followed by rdma_resolve_addr.
      The id_priv state is 'bound' in resolve_prepare_src and the update is skipped.
      
      This leaves the dgid passed into the irdma driver to create an Address Handle
      (AH) for the MAD QP at 0. The create AH descriptor as well as the ARP cache
      entry are invalid and HW throws an asynchronous event as a result.
      
      [ 1207.656888] resolve_prepare_src caller: ucma_resolve_addr+0xff/0x170 [rdma_ucm] daddr=200.0.4.28 id_priv->state=7
      [....]
      [ 1207.680362] ice 0000:07:00.1 rocep7s0f1: caller: irdma_create_ah+0x3e/0x70 [irdma] ah_id=0 arp_idx=0 dest_ip=0.0.0.0
      destMAC=00:00:64:ca:b7:52 ipvalid=1 raw=0000:0000:0000:0000:0000:ffff:0000:0000
      [ 1207.682077] ice 0000:07:00.1 rocep7s0f1: abnormal ae_id = 0x401 bool qp=1 qp_id = 1, ae_src=5
      [ 1207.691657] infiniband rocep7s0f1: Fatal error (1) on MAD QP (1)
      
      Fix this by updating the CMA destination address when the ULP calls
      a resolve address with the CM state already bound.
      
      Fixes: 8d037973 ("RDMA/core: Refactor rdma_bind_addr")
      Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
      Link: https://lore.kernel.org/r/20230712234133.1343-1-shiraz.saleem@intel.com
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
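The bind-then-resolve sequence described in this commit can be sketched as a tiny state model. Everything here is an assumption made for illustration: `struct cm_id_model`, the enum values, and both functions are hypothetical and do not reproduce the RDMA CM code. The sketch only shows the intended behaviour, namely that the destination address is recorded even when the ID is already in the bound state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced CM states; not the kernel's enum. */
enum cm_state_model { CM_IDLE, CM_ADDR_BOUND, CM_ADDR_QUERY };

/* Hypothetical CM ID holding IPv4 addresses as plain integers. */
struct cm_id_model {
    enum cm_state_model state;
    uint32_t src_addr;
    uint32_t dst_addr;
};

/* Models a ULP binding a source address before resolving. */
static void bind_addr_model(struct cm_id_model *id, uint32_t src)
{
    id->src_addr = src;
    id->state = CM_ADDR_BOUND;
}

/* Models address resolution with the fixed behaviour. */
static void resolve_addr_model(struct cm_id_model *id, uint32_t src, uint32_t dst)
{
    if (id->state == CM_ADDR_BOUND) {
        /* Already bound: keep the source, but still record the destination. */
        id->dst_addr = dst;
    } else {
        /* Not bound yet: record both source and destination. */
        id->src_addr = src;
        id->dst_addr = dst;
    }
    id->state = CM_ADDR_QUERY;
}
```

With the destination stored on both paths, a later consumer of `dst_addr` (the AH creation in the log above) would no longer see an all-zero address after a bind followed by a resolve.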
    • RDMA/irdma: Fix data race on CQP request done · f0842bb3
      Shiraz Saleem authored
      KCSAN detects a data race on cqp_request->request_done memory location
      which is accessed locklessly in irdma_handle_cqp_op while being
      updated in irdma_cqp_ce_handler.
      
      Annotate lockless intent with READ_ONCE/WRITE_ONCE to avoid any
      compiler optimizations like load fusing and/or KCSAN warning.
      
      [222808.417128] BUG: KCSAN: data-race in irdma_cqp_ce_handler [irdma] / irdma_wait_event [irdma]
      
      [222808.417532] write to 0xffff8e44107019dc of 1 bytes by task 29658 on cpu 5:
      [222808.417610]  irdma_cqp_ce_handler+0x21e/0x270 [irdma]
      [222808.417725]  cqp_compl_worker+0x1b/0x20 [irdma]
      [222808.417827]  process_one_work+0x4d1/0xa40
      [222808.417835]  worker_thread+0x319/0x700
      [222808.417842]  kthread+0x180/0x1b0
      [222808.417852]  ret_from_fork+0x22/0x30
      
      [222808.417918] read to 0xffff8e44107019dc of 1 bytes by task 29688 on cpu 1:
      [222808.417995]  irdma_wait_event+0x1e2/0x2c0 [irdma]
      [222808.418099]  irdma_handle_cqp_op+0xae/0x170 [irdma]
      [222808.418202]  irdma_cqp_cq_destroy_cmd+0x70/0x90 [irdma]
      [222808.418308]  irdma_puda_dele_rsrc+0x46d/0x4d0 [irdma]
      [222808.418411]  irdma_rt_deinit_hw+0x179/0x1d0 [irdma]
      [222808.418514]  irdma_ib_dealloc_device+0x11/0x40 [irdma]
      [222808.418618]  ib_dealloc_device+0x2a/0x120 [ib_core]
      [222808.418823]  __ib_unregister_device+0xde/0x100 [ib_core]
      [222808.418981]  ib_unregister_device+0x22/0x40 [ib_core]
      [222808.419142]  irdma_ib_unregister_device+0x70/0x90 [irdma]
      [222808.419248]  i40iw_close+0x6f/0xc0 [irdma]
      [222808.419352]  i40e_client_device_unregister+0x14a/0x180 [i40e]
      [222808.419450]  i40iw_remove+0x21/0x30 [irdma]
      [222808.419554]  auxiliary_bus_remove+0x31/0x50
      [222808.419563]  device_remove+0x69/0xb0
      [222808.419572]  device_release_driver_internal+0x293/0x360
      [222808.419582]  driver_detach+0x7c/0xf0
      [222808.419592]  bus_remove_driver+0x8c/0x150
      [222808.419600]  driver_unregister+0x45/0x70
      [222808.419610]  auxiliary_driver_unregister+0x16/0x30
      [222808.419618]  irdma_exit_module+0x18/0x1e [irdma]
      [222808.419733]  __do_sys_delete_module.constprop.0+0x1e2/0x310
      [222808.419745]  __x64_sys_delete_module+0x1b/0x30
      [222808.419755]  do_syscall_64+0x39/0x90
      [222808.419763]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      [222808.419829] value changed: 0x01 -> 0x03
      
      Fixes: 915cc7ac ("RDMA/irdma: Add miscellaneous utility definitions")
      Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
      Link: https://lore.kernel.org/r/20230711175253.1289-4-shiraz.saleem@intel.com
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
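The READ_ONCE/WRITE_ONCE annotation described in this commit can be sketched in userspace. The macros below are simplified stand-ins for the kernel ones and assume a compiler that supports `__typeof__` (GCC or Clang). `struct cqp_request_model` and the two helpers are hypothetical and only keep the one flag the commit message mentions.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified userspace stand-ins for the kernel macros: a volatile access
 * forces the compiler to emit exactly one load or store for each use. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical, reduced request structure; only the flag matters here. */
struct cqp_request_model {
    bool request_done;
};

/* Completion side: mark the request as done with a single marked store. */
static void complete_request(struct cqp_request_model *req)
{
    WRITE_ONCE(req->request_done, true);
}

/* Waiter side: every call performs a fresh marked load of the flag. */
static bool request_is_done(struct cqp_request_model *req)
{
    return READ_ONCE(req->request_done);
}
```

Marking both sides documents that the lockless access is intentional and keeps the compiler from fusing repeated loads in a wait loop. It does not by itself order other memory accesses around the flag.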
    • RDMA/irdma: Fix data race on CQP completion stats · f2c30378
      Shiraz Saleem authored
      CQP completion statistics are read locklessly in irdma_wait_event and
      irdma_check_cqp_progress while they can be updated in the completion
      thread irdma_sc_ccq_get_cqe_info on another CPU, as KCSAN reports.
      
      Make completion statistics an atomic variable to reflect coherent updates
      to it. This also avoids load/store tearing bugs that compiler
      optimizations could otherwise introduce.
      
      [77346.170861] BUG: KCSAN: data-race in irdma_handle_cqp_op [irdma] / irdma_sc_ccq_get_cqe_info [irdma]
      
      [77346.171383] write to 0xffff8a3250b108e0 of 8 bytes by task 9544 on cpu 4:
      [77346.171483]  irdma_sc_ccq_get_cqe_info+0x27a/0x370 [irdma]
      [77346.171658]  irdma_cqp_ce_handler+0x164/0x270 [irdma]
      [77346.171835]  cqp_compl_worker+0x1b/0x20 [irdma]
      [77346.172009]  process_one_work+0x4d1/0xa40
      [77346.172024]  worker_thread+0x319/0x700
      [77346.172037]  kthread+0x180/0x1b0
      [77346.172054]  ret_from_fork+0x22/0x30
      
      [77346.172136] read to 0xffff8a3250b108e0 of 8 bytes by task 9838 on cpu 2:
      [77346.172234]  irdma_handle_cqp_op+0xf4/0x4b0 [irdma]
      [77346.172413]  irdma_cqp_aeq_cmd+0x75/0xa0 [irdma]
      [77346.172592]  irdma_create_aeq+0x390/0x45a [irdma]
      [77346.172769]  irdma_rt_init_hw.cold+0x212/0x85d [irdma]
      [77346.172944]  irdma_probe+0x54f/0x620 [irdma]
      [77346.173122]  auxiliary_bus_probe+0x66/0xa0
      [77346.173137]  really_probe+0x140/0x540
      [77346.173154]  __driver_probe_device+0xc7/0x220
      [77346.173173]  driver_probe_device+0x5f/0x140
      [77346.173190]  __driver_attach+0xf0/0x2c0
      [77346.173208]  bus_for_each_dev+0xa8/0xf0
      [77346.173225]  driver_attach+0x29/0x30
      [77346.173240]  bus_add_driver+0x29c/0x2f0
      [77346.173255]  driver_register+0x10f/0x1a0
      [77346.173272]  __auxiliary_driver_register+0xbc/0x140
      [77346.173287]  irdma_init_module+0x55/0x1000 [irdma]
      [77346.173460]  do_one_initcall+0x7d/0x410
      [77346.173475]  do_init_module+0x81/0x2c0
      [77346.173491]  load_module+0x1232/0x12c0
      [77346.173506]  __do_sys_finit_module+0x101/0x180
      [77346.173522]  __x64_sys_finit_module+0x3c/0x50
      [77346.173538]  do_syscall_64+0x39/0x90
      [77346.173553]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      [77346.173634] value changed: 0x0000000000000094 -> 0x0000000000000095
      
      Fixes: 915cc7ac ("RDMA/irdma: Add miscellaneous utility definitions")
      Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
      Link: https://lore.kernel.org/r/20230711175253.1289-3-shiraz.saleem@intel.com
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
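The idea of turning a plain counter into an atomic one can be sketched with C11 atomics in userspace. This is an analogue only: the kernel has its own atomic types, and `struct cqp_stats_model` with its helpers is hypothetical. The counter width of 8 bytes follows the KCSAN report above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical statistics block with a single 64-bit completion counter. */
struct cqp_stats_model {
    _Atomic uint64_t completed_ops;
};

/* Completion thread: increment the counter as one indivisible update. */
static void record_completion(struct cqp_stats_model *stats)
{
    atomic_fetch_add_explicit(&stats->completed_ops, 1, memory_order_relaxed);
}

/* Waiter: read the counter as one indivisible load, so no torn value is seen. */
static uint64_t read_completions(struct cqp_stats_model *stats)
{
    return atomic_load_explicit(&stats->completed_ops, memory_order_relaxed);
}
```

Relaxed ordering is used here because the sketch only needs each access to be indivisible. Whether stronger ordering is required depends on what else the reader infers from the counter.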
    • RDMA/irdma: Add missing read barriers · 4984eb51
      Shiraz Saleem authored
      On code inspection, there are many instances in the driver where
      CEQE and AEQE fields written to by HW are read without guaranteeing
      that the polarity bit has been read and checked first.
      
      Add a read barrier to avoid reordering of loads on the CEQE/AEQE fields
      prior to checking the polarity bit.
      
      Fixes: 3f49d684 ("RDMA/irdma: Implement HW Admin Queue OPs")
      Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
      Link: https://lore.kernel.org/r/20230711175253.1289-2-shiraz.saleem@intel.com
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
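The "check the polarity bit, then read the other fields" pattern can be sketched in userspace with a C11 acquire fence standing in for the kernel read barrier. The entry layout is an assumption: `struct ceqe_model`, the position of the polarity bit (bit 63), and `poll_ceqe` are hypothetical and do not describe the real CEQE/AEQE format.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical event queue entry written by a producer (HW in the driver). */
struct ceqe_model {
    _Atomic uint64_t hdr;   /* bit 63 is assumed to be the polarity bit */
    uint64_t payload;       /* only meaningful once the polarity matches */
};

/* Returns true and copies the payload only when the entry is valid for the
 * expected polarity. */
static bool poll_ceqe(struct ceqe_model *entry, bool expected_polarity,
                      uint64_t *payload_out)
{
    uint64_t hdr = atomic_load_explicit(&entry->hdr, memory_order_relaxed);
    bool polarity = (hdr >> 63) & 1;

    if (polarity != expected_polarity)
        return false;

    /* Barrier: loads below must not be reordered before the polarity check. */
    atomic_thread_fence(memory_order_acquire);

    *payload_out = entry->payload;
    return true;
}
```

Without the fence, the compiler or CPU could load the payload before the polarity check and return stale data from an entry that was not yet valid at the time of that load.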
  2. 12 Jul, 2023 1 commit
  3. 09 Jul, 2023 10 commits
  4. 08 Jul, 2023 22 commits