1. 20 Oct, 2021 1 commit
  2. 18 Oct, 2021 2 commits
  3. 14 Sep, 2021 1 commit
  4. 16 Aug, 2021 2 commits
  5. 30 Jun, 2021 1 commit
  6. 03 Jun, 2021 1 commit
  7. 31 May, 2021 1 commit
  8. 04 May, 2021 1 commit
  9. 02 Apr, 2021 2 commits
  10. 18 Mar, 2021 2 commits
  11. 10 Feb, 2021 1 commit
  12. 02 Feb, 2021 2 commits
  13. 25 Jan, 2021 1 commit
  14. 18 Jan, 2021 1 commit
    • Chao Leng's avatar
      nvme-rdma: avoid request double completion for concurrent nvme_rdma_timeout · 7674073b
      Chao Leng authored
      
      A crash happens when inject completing request long time(nearly 30s).
      Each name space has a request queue, when inject completing request long
      time, multi request queues may have time out requests at the same time,
      nvme_rdma_timeout will execute concurrently. Multi requests in different
      request queues may be queued in the same rdma queue, multi
      nvme_rdma_timeout may call nvme_rdma_stop_queue at the same time.
      The first nvme_rdma_timeout will clear NVME_RDMA_Q_LIVE and continue
      stopping the rdma queue(drain qp), but the others check NVME_RDMA_Q_LIVE
      is already cleared, and then directly complete the requests, complete
      request before the qp is fully drained may lead to a use-after-free
      condition.
      
      Add a multex lock to serialize nvme_rdma_stop_queue.
      Signed-off-by: default avatarChao Leng <lengchao@huawei.com>
      Tested-by: default avatarIsrael Rukshin <israelr@nvidia.com>
      Reviewed-by: default avatarIsrael Rukshin <israelr@nvidia.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      7674073b
  15. 01 Dec, 2020 1 commit
  16. 12 Nov, 2020 1 commit
  17. 03 Nov, 2020 2 commits
  18. 28 Oct, 2020 1 commit
  19. 27 Oct, 2020 1 commit
  20. 22 Oct, 2020 2 commits
    • Chao Leng's avatar
      nvme-rdma: fix crash due to incorrect cqe · a87da50f
      Chao Leng authored
      
      A crash happened due to injecting error test.
      When a CQE has incorrect command id due do an error injection, the host
      may find a request which is already freed.  Dereferencing req->mr->rkey
      causes a crash in nvme_rdma_process_nvme_rsp because the mr is already
      freed.
      
      Add a check for the mr to fix it.
      Signed-off-by: default avatarChao Leng <lengchao@huawei.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      a87da50f
    • Chao Leng's avatar
      nvme-rdma: fix crash when connect rejected · 43efdb8e
      Chao Leng authored
      
      A crash can happened when a connect is rejected.   The host establishes
      the connection after received ConnectReply, and then continues to send
      the fabrics Connect command.  If the controller does not receive the
      ReadyToUse capsule, host may receive a ConnectReject reply.
      
      Call nvme_rdma_destroy_queue_ib after the host received the
      RDMA_CM_EVENT_REJECTED event.  Then when the fabrics Connect command
      times out, nvme_rdma_timeout calls nvme_rdma_complete_rq to fail the
      request.  A crash happenes due to use after free in
      nvme_rdma_complete_rq.
      
      nvme_rdma_destroy_queue_ib is redundant when handling the
      RDMA_CM_EVENT_REJECTED event as nvme_rdma_destroy_queue_ib is already
      called in connection failure handler.
      Signed-off-by: default avatarChao Leng <lengchao@huawei.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      43efdb8e
  21. 08 Sep, 2020 1 commit
  22. 28 Aug, 2020 3 commits
    • Sagi Grimberg's avatar
      nvme-rdma: fix reset hang if controller died in the middle of a reset · 2362acb6
      Sagi Grimberg authored
      
      If the controller becomes unresponsive in the middle of a reset, we
      will hang because we are waiting for the freeze to complete, but that
      cannot happen since we have commands that are inflight holding the
      q_usage_counter, and we can't blindly fail requests that times out.
      
      So give a timeout and if we cannot wait for queue freeze before
      unfreezing, fail and have the error handling take care how to
      proceed (either schedule a reconnect of remove the controller).
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      2362acb6
    • Sagi Grimberg's avatar
      nvme-rdma: fix timeout handler · 0475a8dc
      Sagi Grimberg authored
      
      When a request times out in a LIVE state, we simply trigger error
      recovery and let the error recovery handle the request cancellation,
      however when a request times out in a non LIVE state, we make sure to
      complete it immediately as it might block controller setup or teardown
      and prevent forward progress.
      
      However tearing down the entire set of I/O and admin queues causes
      freeze/unfreeze imbalance (q->mq_freeze_depth) because and is really
      an overkill to what we actually need, which is to just fence controller
      teardown that may be running, stop the queue, and cancel the request if
      it is not already completed.
      
      Now that we have the controller teardown_lock, we can safely serialize
      request cancellation. This addresses a hang caused by calling extra
      queue freeze on controller namespaces, causing unfreeze to not complete
      correctly.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJames Smart <james.smart@broadcom.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      0475a8dc
    • Sagi Grimberg's avatar
      nvme-rdma: serialize controller teardown sequences · 5110f402
      Sagi Grimberg authored
      
      In the timeout handler we may need to complete a request because the
      request that timed out may be an I/O that is a part of a serial sequence
      of controller teardown or initialization. In order to complete the
      request, we need to fence any other context that may compete with us
      and complete the request that is timing out.
      
      In this case, we could have a potential double completion in case
      a hard-irq or a different competing context triggered error recovery
      and is running inflight request cancellation concurrently with the
      timeout handler.
      
      Protect using a ctrl teardown_lock to serialize contexts that may
      complete a cancelled request due to error recovery or a reset.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJames Smart <james.smart@broadcom.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      5110f402
  23. 23 Aug, 2020 1 commit
  24. 21 Aug, 2020 1 commit
  25. 29 Jul, 2020 3 commits
    • Sagi Grimberg's avatar
      nvme-rdma: fix controller reset hang during traffic · 9f98772b
      Sagi Grimberg authored
      commit fe35ec58 ("block: update hctx map when use multiple maps")
      exposed an issue where we may hang trying to wait for queue freeze
      during I/O. We call blk_mq_update_nr_hw_queues which in case of multiple
      queue maps (which we have now for default/read/poll) is attempting to
      freeze the queue. However we never started queue freeze when starting the
      reset, which means that we have inflight pending requests that entered the
      queue that we will not complete once the queue is quiesced.
      
      So start a freeze before we quiesce the queue, and unfreeze the queue
      after we successfully connected the I/O queues (and make sure to call
      blk_mq_update_nr_hw_queues only after we are sure that the queue was
      already frozen).
      
      This follows to how the pci driver handles resets.
      
      Fixes: fe35ec58
      
       ("block: update hctx map when use multiple maps")
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      9f98772b
    • Sagi Grimberg's avatar
      nvme: fix deadlock in disconnect during scan_work and/or ana_work · ecca390e
      Sagi Grimberg authored
      A deadlock happens in the following scenario with multipath:
      1) scan_work(nvme0) detects a new nsid while nvme0
          is an optimized path to it, path nvme1 happens to be
          inaccessible.
      
      2) Before scan_work is complete nvme0 disconnect is initiated
          nvme_delete_ctrl_sync() sets nvme0 state to NVME_CTRL_DELETING
      
      3) scan_work(1) attempts to submit IO,
          but nvme_path_is_optimized() observes nvme0 is not LIVE.
          Since nvme1 is a possible path IO is requeued and scan_work hangs.
      
      --
      Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  efi_partition+0x1e6/0x708
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      --
      
      4) Delete also hangs in flush_work(ctrl->scan_work)
          from nvme_remove_namespaces().
      
      Similiarly a deadlock with ana_work may happen: if ana_work has started
      and calls nvme_mpath_set_live and device_add_disk, it will
      trigger I/O. When we trigger disconnect I/O will block because
      our accessible (optimized) path is disconnecting, but the alternate
      path is inaccessible, so I/O blocks. Then disconnect tries to flush
      the ana_work and hangs.
      
      [  605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core]
      [  605.552087] Call Trace:
      [  605.552683]  __schedule+0x2b9/0x6c0
      [  605.553507]  schedule+0x42/0xb0
      [  605.554201]  io_schedule+0x16/0x40
      [  605.555012]  do_read_cache_page+0x438/0x830
      [  605.556925]  read_cache_page+0x12/0x20
      [  605.557757]  read_dev_sector+0x27/0xc0
      [  605.558587]  amiga_partition+0x4d/0x4c5
      [  605.561278]  check_partition+0x154/0x244
      [  605.562138]  rescan_partitions+0xae/0x280
      [  605.563076]  __blkdev_get+0x40f/0x560
      [  605.563830]  blkdev_get+0x3d/0x140
      [  605.564500]  __device_add_disk+0x388/0x480
      [  605.565316]  device_add_disk+0x13/0x20
      [  605.566070]  nvme_mpath_set_live+0x5e/0x130 [nvme_core]
      [  605.567114]  nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
      [  605.568197]  nvme_update_ana_state+0xca/0xe0 [nvme_core]
      [  605.569360]  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      [  605.571385]  nvme_read_ana_log+0x76/0x100 [nvme_core]
      [  605.572376]  nvme_ana_work+0x15/0x20 [nvme_core]
      [  605.573330]  process_one_work+0x1db/0x380
      [  605.574144]  worker_thread+0x4d/0x400
      [  605.574896]  kthread+0x104/0x140
      [  605.577205]  ret_from_fork+0x35/0x40
      [  605.577955] INFO: task nvme:14044 blocked for more than 120 seconds.
      [  605.579239]       Tainted: G           OE     5.3.5-050305-generic #201910071830
      [  605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  605.582320] nvme            D    0 14044  14043 0x00000000
      [  605.583424] Call Trace:
      [  605.583935]  __schedule+0x2b9/0x6c0
      [  605.584625]  schedule+0x42/0xb0
      [  605.585290]  schedule_timeout+0x203/0x2f0
      [  605.588493]  wait_for_completion+0xb1/0x120
      [  605.590066]  __flush_work+0x123/0x1d0
      [  605.591758]  __cancel_work_timer+0x10e/0x190
      [  605.593542]  cancel_work_sync+0x10/0x20
      [  605.594347]  nvme_mpath_stop+0x2f/0x40 [nvme_core]
      [  605.595328]  nvme_stop_ctrl+0x12/0x50 [nvme_core]
      [  605.596262]  nvme_do_delete_ctrl+0x3f/0x90 [nvme_core]
      [  605.597333]  nvme_sysfs_delete+0x5c/0x70 [nvme_core]
      [  605.598320]  dev_attr_store+0x17/0x30
      
      Fix this by introducing a new state: NVME_CTRL_DELETE_NOIO, which will
      indicate the phase of controller deletion where I/O cannot be allowed
      to access the namespace. NVME_CTRL_DELETING still allows mpath I/O to
      be issued to the bottom device, and only after we flush the ana_work
      and scan_work (after nvme_stop_ctrl and nvme_prep_remove_namespaces)
      we change the state to NVME_CTRL_DELETING_NOIO. Also we prevent ana_work
      from re-firing by aborting early if we are not LIVE, so we should be safe
      here.
      
      In addition, change the transport drivers to follow the updated state
      machine.
      
      Fixes: 0d0b660f
      
       ("nvme: add ANA support")
      Reported-by: default avatarAnton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      ecca390e
    • Yamin Friedman's avatar
      nvme-rdma: use new shared CQ mechanism · 287f329e
      Yamin Friedman authored
      
      Has the driver use shared CQs providing ~10%-20% improvement as seen in
      the patch introducing shared CQs. Instead of opening a CQ for each QP
      per controller connected, a CQ for each QP will be provided by the RDMA
      core driver that will be shared between the QPs on that core reducing
      interrupt overhead.
      Signed-off-by: default avatarYamin Friedman <yaminf@mellanox.com>
      Signed-off-by: default avatarMax Gurtovoy <maxg@mellanox.com>
      Reviewed-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      287f329e
  26. 24 Jun, 2020 4 commits