    nvme: fix regression when disconnect a recovering ctrl · f7f70f4a
    Ruozhu Li authored
    We encountered a problem where the disconnect command hangs.
    After analyzing the log and stack, we found that the triggering
    process is as follows:
    CPU0                          CPU1
                                    nvme_rdma_error_recovery_work
                                      nvme_rdma_teardown_io_queues
    nvme_do_delete_ctrl                 nvme_stop_queues
      nvme_remove_namespaces
      --clear ctrl->namespaces
                                        nvme_start_queues
                                        --no ns in ctrl->namespaces
        nvme_ns_remove                  return(because ctrl is deleting)
          blk_freeze_queue
            blk_mq_freeze_queue_wait
            --wait for ns to unquiesce to clean inflight IO, hang forever
    
    This problem was not found in older kernels because they flush the
    err work in nvme_stop_ctrl before calling nvme_remove_namespaces. That
    behavior does not seem to have been changed for functional reasons, so
    the patch can be reverted to solve the problem.
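
The ordering problem above can be illustrated with a small userspace model. This is only a sketch: the toy_* types and helpers are hypothetical stand-ins, not the kernel structures or the code touched by this patch, and the two CPUs are replayed as one sequential interleaving rather than as real threads. It assumes, as the trace describes, that the stop/start helpers only visit namespaces still on ctrl->namespaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_NS 2 /* hypothetical namespace count for this model */

/* Toy stand-ins for the kernel objects; not the real structures. */
struct toy_ns {
	bool listed;   /* still linked on ctrl->namespaces */
	bool quiesced; /* queue stopped by error recovery */
};

struct toy_ctrl {
	struct toy_ns ns[NR_NS];
};

static void toy_init(struct toy_ctrl *c)
{
	assert(c != NULL);
	for (size_t i = 0; i < NR_NS; i++) {
		c->ns[i].listed = true;
		c->ns[i].quiesced = false;
	}
}

/* Models nvme_stop_queues(): only visits namespaces still on the list. */
static void toy_stop_queues(struct toy_ctrl *c)
{
	for (size_t i = 0; i < NR_NS; i++)
		if (c->ns[i].listed)
			c->ns[i].quiesced = true;
}

/* Models nvme_start_queues(): same list walk, so it can miss namespaces. */
static void toy_start_queues(struct toy_ctrl *c)
{
	for (size_t i = 0; i < NR_NS; i++)
		if (c->ns[i].listed)
			c->ns[i].quiesced = false;
}

/* Models the list clearing done early in nvme_remove_namespaces(). */
static void toy_clear_namespaces(struct toy_ctrl *c)
{
	for (size_t i = 0; i < NR_NS; i++)
		c->ns[i].listed = false;
}

/* Models the freeze wait: a still-quiesced queue would never drain. */
static int toy_count_hung(const struct toy_ctrl *c)
{
	int hung = 0;

	for (size_t i = 0; i < NR_NS; i++)
		if (c->ns[i].quiesced)
			hung++;
	return hung;
}

/* Interleaving from the trace: removal races with the err work. */
int simulate_without_flush(void)
{
	struct toy_ctrl c;

	toy_init(&c);
	toy_stop_queues(&c);       /* CPU1: teardown quiesces the queues */
	toy_clear_namespaces(&c);  /* CPU0: list is emptied */
	toy_start_queues(&c);      /* CPU1: finds no namespaces */
	return toy_count_hung(&c); /* CPU0: freeze wait in ns removal */
}

/* Older ordering: the err work runs to completion before removal. */
int simulate_with_flush(void)
{
	struct toy_ctrl c;

	toy_init(&c);
	toy_stop_queues(&c);       /* err work, flushed first */
	toy_start_queues(&c);
	toy_clear_namespaces(&c);  /* removal starts only afterwards */
	return toy_count_hung(&c);
}
```

In this model simulate_without_flush() leaves every namespace quiesced, while simulate_with_flush() leaves none, which matches the reasoning for restoring the flush before nvme_remove_namespaces.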
    
    Revert commit 794a4cb3 ("nvme: remove the .stop_ctrl callout")
    Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
rdma.c 65.5 KB