    nvme-rdma: remove timeout for getting RDMA-CM established event · 0525af71
    Israel Rukshin authored
    
    
    When many controllers start error recovery at the same time (e.g.,
    when a port goes down and comes back up), they may never manage to
    reconnect. This is because the target can't handle all the connect
    requests within three seconds (the arbitrary value set today). Even
    if some of the connections are established, when a single queue fails
    to connect, all of the controller's queues are destroyed as well. So,
    on the following reconnection attempts, the number of connect requests
    may remain the same. To fix this, remove the timeout and wait for an
    RDMA-CM event to abort or complete the connect request. RDMA-CM sends
    an unreachable event when its own timeout of ~90 seconds expires. This
    approach is used by other RDMA-CM users such as SRP and iSER in
    blocking mode. The commit also renames NVME_RDMA_CONNECT_TIMEOUT_MS
    to NVME_RDMA_CM_TIMEOUT_MS.
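The reconnect storm described above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the driver code: the function names are hypothetical, the target is assumed to share its capacity evenly among outstanding connect requests, and the 50 ms per-request service time is made up for illustration. A negative timeout stands for waiting on the RDMA-CM event with no driver-side timeout.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model, not kernel code. Assumption: with `ctrls` controllers each
 * holding one outstanding connect request, the target shares its capacity
 * evenly, so every request takes ctrls * service_ms to complete.
 * timeout_ms < 0 models waiting with no driver-side timeout.
 */
static bool connect_completes(int ctrls, int service_ms, int timeout_ms)
{
	long latency_ms = (long)ctrls * service_ms;

	return timeout_ms < 0 || latency_ms <= timeout_ms;
}

/*
 * Returns how many controllers are fully connected after `attempts`
 * reconnect attempts. A controller that fails a queue connect loses all
 * of its queues and retries from scratch, so the load on the next
 * attempt is unchanged.
 */
static int connected_after(int ctrls, int service_ms, int timeout_ms,
			   int attempts)
{
	for (int i = 0; i < attempts; i++) {
		if (connect_completes(ctrls, service_ms, timeout_ms))
			return ctrls;	/* every controller's queues come up */
		/* all controllers time out, tear down, and retry together */
	}
	return 0;
}

/* Usage example with illustrative numbers (100 controllers, 50 ms each). */
static void model_self_check(void)
{
	/* 100 * 50 ms = 5000 ms > 3000 ms: no attempt ever succeeds. */
	assert(connected_after(100, 50, 3000, 10) == 0);
	/* No driver timeout: the first attempt connects everyone. */
	assert(connected_after(100, 50, -1, 1) == 100);
	/* Light load: 10 * 50 ms = 500 ms fits within the 3 s timeout. */
	assert(connected_after(10, 50, 3000, 1) == 10);
}
```

In this model the short timeout fails in the same way on every attempt because the retry load never shrinks, which is the behaviour the commit message attributes to the three-second value.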
    Signed-off-by: Israel Rukshin <israelr@nvidia.com>
    Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
    Acked-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
rdma.c 65.4 KB