- 20 Oct, 2021 1 commit
-
-
Ming Lei authored
Apply the two newly added APIs to quiesce/unquiesce the admin queue. Signed-off-by:
Ming Lei <ming.lei@redhat.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211014081710.1871747-3-ming.lei@redhat.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
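For context, a minimal sketch of the two helpers this patch applies, as added earlier in the same series (thin wrappers over the blk-mq quiesce API; shown here under that assumption, not as the full diff):

    void nvme_stop_admin_queue(struct nvme_ctrl *ctrl)
    {
        /* quiesce: stop dispatching new requests on the admin queue */
        blk_mq_quiesce_queue(ctrl->admin_q);
    }

    void nvme_start_admin_queue(struct nvme_ctrl *ctrl)
    {
        /* unquiesce: resume dispatch */
        blk_mq_unquiesce_queue(ctrl->admin_q);
    }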
-
- 18 Oct, 2021 2 commits
-
-
Jens Axboe authored
struct io_comp_batch contains a list head and a completion handler, which will allow batches of IO to be completed more efficiently. For now there are no functional changes in this patch; we just define the io_comp_batch structure and add the argument to the file_operations iopoll handler. Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
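Roughly, the new structure and the updated iopoll hook look like this (a sketch of what the patch describes, not the exact diff):

    struct io_comp_batch {
        struct request *req_list;   /* singly linked list of completed requests */
        bool need_ts;               /* whether completions need a timestamp */
        void (*complete)(struct io_comp_batch *);   /* flushes the batch */
    };

    /* file_operations iopoll gains the extra argument: */
    int (*iopoll)(struct kiocb *kiocb, struct io_comp_batch *iob,
                  unsigned int flags);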
-
Christoph Hellwig authored
Split the integrity/metadata handling definitions out into a new header. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210920123328.1399408-17-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- 14 Sep, 2021 1 commit
-
-
Ruozhu Li authored
We should always destroy the cm_id before destroying the qp, to avoid getting a cma event after the qp was destroyed, which may lead to a use after free. In the RDMA connection establishment error flow, don't destroy the qp in the cm event handler. Just report cm_error to the upper level; the qp will be destroyed in nvme_rdma_alloc_queue() after the cm id is destroyed. Signed-off-by:
Ruozhu Li <liruozhu@huawei.com> Reviewed-by:
Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
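A hypothetical sketch of the resulting teardown ordering (the helper name here is illustrative; nvme_rdma_destroy_queue_ib is the driver's ib resource teardown path):

    static void nvme_rdma_free_queue_sketch(struct nvme_rdma_queue *queue)
    {
        /* destroying the cm_id first guarantees no further CM events
         * can be delivered for this connection ... */
        rdma_destroy_id(queue->cm_id);

        /* ... so the qp can now be torn down without racing an event
         * handler that might touch it after the free */
        nvme_rdma_destroy_queue_ib(queue);
    }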
-
- 16 Aug, 2021 2 commits
-
-
Ruozhu Li authored
We update ctrl->queue_count and schedule another reconnect when the io queue count is zero. But we will never try to create any io queue in the next reconnection, because ctrl->queue_count has already been set to zero. We will end up having an admin-only session in the LIVE state, which is exactly what we tried to avoid in the original patch. Update ctrl->queue_count after the queue_count zero check to fix it. Signed-off-by:
Ruozhu Li <liruozhu@huawei.com> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
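The fix amounts to moving the assignment below the zero check; a sketch based on the description above (surrounding code elided):

    ret = nvme_set_queue_count(&ctrl->ctrl, &nr_io_queues);
    if (ret)
        return ret;

    if (nr_io_queues == 0) {
        dev_err(ctrl->ctrl.device, "unable to set any I/O queues\n");
        return -ENOMEM;
    }

    /* only now is it safe to record the count; doing this before the
     * check left a zero queue_count behind for the next reconnect */
    ctrl->ctrl.queue_count = nr_io_queues + 1;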
-
Sagi Grimberg authored
We cannot detect a (perhaps buggy) controller that is sending us a completion for a request that was already completed (for example sending a completion twice); this phenomenon was seen in the wild a few times. To protect against this, we use the upper 4 bits of the nvme sqe command_id as a 4-bit generation counter and verify it matches the existing request generation, which is incremented on every execution. The 16-bit command_id is now constructed as:

| xxxx | xxxxxxxxxxxx |
  gen    request tag

This means we give up some possible queue depth, as 12 bits allow for a maximum queue depth of 4095 instead of 65536; however, we never create such long queues anyway, so no real harm done. Suggested-by:
Keith Busch <kbusch@kernel.org> Signed-off-by:
Sagi Grimberg <sagi@grimberg.me> Acked-by:
Keith Busch <kbusch@kernel.org> Reviewed-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Daniel Wagner <dwagner@suse.de> Tested-by:
Daniel Wagner <dwagner@suse.de> Signed-off-by:
Christoph Hellwig <hch@lst.de>
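A sketch of the encode/decode this describes, modeled on the helpers the patch adds (exact macro names may differ):

    #define nvme_genctr_mask(gen)           ((gen) & 0xf)
    #define nvme_cid_install_genctr(gen)    (nvme_genctr_mask(gen) << 12)
    #define nvme_genctr_from_cid(cid)       (((cid) & 0xf000) >> 12)
    #define nvme_tag_from_cid(cid)          ((cid) & 0xfff)

    static inline u16 nvme_cid(struct request *rq)
    {
        /* upper 4 bits: per-request generation; lower 12: blk-mq tag */
        return nvme_cid_install_genctr(nvme_req(rq)->genctr) | rq->tag;
    }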
-
- 30 Jun, 2021 1 commit
-
-
Keith Busch authored
The generic blk_execute_rq() knows how to handle polled completions. Use that instead of implementing an nvme specific handler. Signed-off-by:
Keith Busch <kbusch@kernel.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Link: https://lore.kernel.org/r/20210610214437.641245-3-kbusch@kernel.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- 03 Jun, 2021 1 commit
-
-
Colin Ian King authored
The variable ret is being initialized with a value that is never read; it is updated later on. The assignment is redundant and can be removed. Signed-off-by:
Colin Ian King <colin.king@canonical.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
- 31 May, 2021 1 commit
-
-
Sagi Grimberg authored
We have only 2 inline sg entries and we allow 4 sg entries for the send wr sge. Larger sgls are chained. However, when we build the in-capsule send wr sge, we iterate without taking into account that the sgl may be chained and still fit in-capsule (which can happen if the sgl is bigger than 2 but less than or equal to 4). Fix in-capsule data mapping to correctly iterate chained sgls. Fixes: 38e18002 ("nvme-rdma: Avoid preallocating big SGL for data") Reported-by:
Walker, Benjamin <benjamin.walker@intel.com> Signed-off-by:
Sagi Grimberg <sagi@grimberg.me> Reviewed-by:
Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
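The fix boils down to walking the sgl with for_each_sg(), which follows chain pointers, instead of indexing a flat array; a sketch with field names per nvme-rdma (descriptor fill elided):

    static int nvme_rdma_map_sg_inline_sketch(struct nvme_rdma_queue *queue,
                                              struct nvme_rdma_request *req,
                                              int count)
    {
        struct ib_sge *sge = &req->sge[1];
        struct scatterlist *sgl;
        u32 len = 0;
        int i;

        /* for_each_sg() uses sg_next(), so chained entries are handled */
        for_each_sg(req->data_sgl.sg_table.sgl, sgl, count, i) {
            sge->addr = sg_dma_address(sgl);
            sge->length = sg_dma_len(sgl);
            sge->lkey = queue->device->pd->local_dma_lkey;
            len += sge->length;
            sge++;
        }

        /* ... set up the in-capsule data descriptor using len ... */
        return 0;
    }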
-
- 04 May, 2021 1 commit
-
-
Tao Chiu authored
queue_rq() in pci only checks whether the dispatched queue (nvmeq) is ready, e.g. not being suspended. Since nvme_alloc_admin_tags() in the reset flow restarts the admin queue, users are able to submit admin commands to a controller before reset_work() completes. Commands submitted under this condition may interfere with the identify and IO queue setup commands performed by reset_work(), and may result in the hang described in the following patch. As seen in the fabrics drivers, user commands are prevented from being executed under improper controller states. We may reuse this logic to maintain a clear admin queue during reset_work(). Signed-off-by:
Tao Chiu <taochiu@synology.com> Signed-off-by:
Cody Wong <codywong@synology.com> Reviewed-by:
Leon Chien <leonchien@synology.com> Reviewed-by:
Keith Busch <kbusch@kernel.org> Signed-off-by:
Christoph Hellwig <hch@lst.de>
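A sketch of what reusing that logic could look like in the pci queue_rq path, assuming the readiness helpers the fabrics side uses (nvme_check_ready()/nvme_fail_nonready_command()); the function body is illustrative, not the actual patch:

    static blk_status_t nvme_queue_rq_sketch(struct blk_mq_hw_ctx *hctx,
                                             const struct blk_mq_queue_data *bd)
    {
        struct nvme_queue *nvmeq = hctx->driver_data;
        struct nvme_dev *dev = nvmeq->dev;

        /* reject or requeue user commands while reset_work() owns the
         * admin queue, exactly as the fabrics drivers do */
        if (unlikely(!nvme_check_ready(&dev->ctrl, bd->rq, true)))
            return nvme_fail_nonready_command(&dev->ctrl, bd->rq);

        /* ... normal fast-path submission ... */
        return BLK_STS_OK;
    }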
-
- 02 Apr, 2021 2 commits
-
-
Keith Busch authored
All nvme transport drivers preallocate an nvme command for each request. Have nvme_setup_cmd() use that command instead of requiring drivers to pass a pointer to it. All nvme drivers must initialize the generic nvme_request 'cmd' to point to the transport's preallocated nvme_command. The generic nvme_request cmd pointer had previously been used only as a temporary copy for passthrough commands. Since it now points to the command that gets dispatched, passthrough commands must directly set it up prior to executing the request. Signed-off-by:
Keith Busch <kbusch@kernel.org> Reviewed-by:
Jens Axboe <axboe@kernel.dk> Reviewed-by:
Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Chaitanya Kulkarni authored
This is a prep patch so that we can move the identify data structure related initialization from nvme_init_identify() into a helper. Rename the function nvme_init_identify() to nvme_init_ctrl_finish(). The next patch will move the nvme_id_ctrl related initialization from the newly renamed nvme_init_ctrl_finish() into the nvme_init_identify() helper. Signed-off-by:
Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
- 18 Mar, 2021 2 commits
-
-
Sagi Grimberg authored
We only set up io queues for nvme controllers, and it makes absolutely no sense to allow a controller (re)connect without any I/O queues. If we happen to fail setting the queue count for any reason, we should not treat this as a successful reconnect, as I/O has no chance of going through. Instead, just fail and schedule another reconnect. Reported-by:
Chao Leng <lengchao@huawei.com> Fixes: 71102307 ("nvme-rdma: add a NVMe over Fabrics RDMA host driver") Signed-off-by:
Sagi Grimberg <sagi@grimberg.me> Reviewed-by:
Chao Leng <lengchao@huawei.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Fabrics drivers currently reserve two tags on the admin queue. But given that the connect command is only run on a freshly created queue or after all commands have been force aborted, we only need to reserve a single tag. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Reviewed-by:
Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Daniel Wagner <dwagner@suse.de>
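In tag_set terms the change is tiny; a sketch, assuming NVMF_RESERVED_TAGS is the constant this patch introduces:

    #define NVMF_RESERVED_TAGS  1   /* one slot for the connect command */

    static void nvme_fabrics_init_admin_tagset_sketch(struct blk_mq_tag_set *set)
    {
        memset(set, 0, sizeof(*set));
        set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
        /* one reserved slot is enough: connect runs on a fresh queue or
         * after all other commands were force-aborted (previously 2) */
        set->reserved_tags = NVMF_RESERVED_TAGS;
    }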
-
- 10 Feb, 2021 1 commit
-
-
Chao Leng authored
A failure of nvme_rdma_post_send is a path-related error, and the request should bounce to another path when using nvme-multipath. Call nvme_host_path_error when nvme_rdma_post_send returns -EIO to ensure nvme_complete_rq gets invoked to fail over to another path if there is one. Signed-off-by:
Chao Leng <lengchao@huawei.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
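A sketch of the error mapping; the helper name is hypothetical, but nvme_host_path_error() is the core helper added in this series:

    static blk_status_t nvme_rdma_map_post_send_err(struct request *rq, int err)
    {
        /* -EIO from nvme_rdma_post_send means the transport path failed:
         * report a host path error so nvme_complete_rq() fails the request
         * over to another path instead of erroring the I/O */
        if (err == -EIO)
            return nvme_host_path_error(rq);
        return errno_to_blk_status(err);
    }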
-
- 02 Feb, 2021 2 commits
-
-
Chao Leng authored
Use nvme_cancel_tagset and nvme_cancel_admin_tagset to clean up the code in the teardown process. Signed-off-by:
Chao Leng <lengchao@huawei.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Chao Leng authored
A crash happens when injecting a failed reconnection. If the reconnect fails after starting the io queues, the queues are unquiesced and new requests continue to be delivered. The reconnection error handling process directly frees the queues without cancelling the suspended requests. A suspended request will then time out, and crash due to use of the queue after free. Add queue syncing and cancellation of suspended requests to the reconnection error handling. Signed-off-by:
Chao Leng <lengchao@huawei.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
- 25 Jan, 2021 1 commit
-
-
Christoph Hellwig authored
Replace the gendisk pointer in struct bio with a pointer to the newly improved struct block device. From that, the gendisk can be trivially accessed with an extra indirection, but it also allows us to directly look up all information related to partition remapping. Signed-off-by:
Christoph Hellwig <hch@lst.de> Acked-by:
Tejun Heo <tj@kernel.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- 18 Jan, 2021 1 commit
-
-
Chao Leng authored
A crash happens when injecting a long request completion delay (nearly 30s). Each namespace has a request queue; when completions are delayed this long, multiple request queues may have timed-out requests at the same time, so nvme_rdma_timeout will execute concurrently. Requests in different request queues may be queued in the same rdma queue, so multiple nvme_rdma_timeout calls may invoke nvme_rdma_stop_queue at the same time. The first nvme_rdma_timeout will clear NVME_RDMA_Q_LIVE and continue stopping the rdma queue (draining the qp), but the others see NVME_RDMA_Q_LIVE already cleared and then directly complete the requests; completing a request before the qp is fully drained may lead to a use-after-free condition. Add a mutex to serialize nvme_rdma_stop_queue. Signed-off-by:
Chao Leng <lengchao@huawei.com> Tested-by:
Israel Rukshin <israelr@nvidia.com> Reviewed-by:
Israel Rukshin <israelr@nvidia.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
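The serialization, roughly, assuming queue_lock is the mutex this patch adds:

    static void nvme_rdma_stop_queue(struct nvme_rdma_queue *queue)
    {
        mutex_lock(&queue->queue_lock);
        /* only the first caller sees LIVE set and drains the qp; later
         * callers block here until the drain has finished */
        if (test_and_clear_bit(NVME_RDMA_Q_LIVE, &queue->flags))
            __nvme_rdma_stop_queue(queue);
        mutex_unlock(&queue->queue_lock);
    }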
-
- 01 Dec, 2020 1 commit
-
-
Chaitanya Kulkarni authored
This is purely a cleanup patch: add the NVME prefix to ADMIN_TIMEOUT to make it consistent with NVME_IO_TIMEOUT. Signed-off-by:
Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
- 12 Nov, 2020 1 commit
-
-
Christoph Hellwig authored
->dma_device is a private implementation detail of the RDMA core. Use the ibdev_to_node helper to get the NUMA node for an ib_device instead of poking into ->dma_device. Link: https://lore.kernel.org/r/20201106181941.1878556-5-hch@lst.de Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jason Gunthorpe <jgg@nvidia.com>
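The nvme-rdma side of the change is roughly a one-line before/after (a sketch of the conversion):

    /* before: poking at an RDMA core implementation detail */
    ctrl->ctrl.numa_node = dev_to_node(ctrl->device->dev->dma_device);

    /* after: public helper; returns NUMA_NO_NODE when unknown */
    ctrl->ctrl.numa_node = ibdev_to_node(ctrl->device->dev);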
-
- 03 Nov, 2020 2 commits
-
-
Sagi Grimberg authored
The request may be executed asynchronously, and rq->state may be changed to IDLE. To avoid repeated request completion, only MQ_RQ_COMPLETE of rq->state is checked in nvme_rdma_complete_timed_out. That is not safe, so we also need to check rq->state for IDLE. Signed-off-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Chao Leng <lengchao@huawei.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Chao Leng authored
teardown_lock is currently used to serialize timeout and teardown. This can cause an abnormal situation: teardown first cancels all requests, then a timeout may complete a request again, but the request may already have been freed or restarted. To avoid the race between timeout and teardown, in the teardown process we first quiesce the queue, and then delete the timer and cancel the timeout work for the queue. This also lets us delete teardown_lock. Signed-off-by:
Chao Leng <lengchao@huawei.com> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
- 28 Oct, 2020 1 commit
-
-
Jason Gunthorpe authored
There are two flows for handling RDMA_CM_EVENT_ROUTE_RESOLVED: either the handler triggers a completion and another thread does rdma_connect(), or the handler directly calls rdma_connect(). In all cases rdma_connect() needs to hold the handler_mutex, but when handlers are invoked this is already held by the core code. This causes ULPs using the second method to deadlock. Provide a rdma_connect_locked() and have all ULPs call it from their handlers. Link: https://lore.kernel.org/r/0-v2-53c22d5c1405+33-rdma_connect_locking_jgg@nvidia.com Reported-and-tested-by:
Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Fixes: 2a7cec53 ("RDMA/cma: Fix locking for the RDMA_CM_CONNECT state") Acked-by:
Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by:
Jack Wang <jinpu.wang@cloud.ionos.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Jason Gunthorpe <jgg@nvidia.com>
-
- 27 Oct, 2020 1 commit
-
-
zhenwei pi authored
Receiving a zero length message leads to the following warnings because the CQE is processed twice:

refcount_t: underflow; use-after-free.
WARNING: CPU: 0 PID: 0 at lib/refcount.c:28
RIP: 0010:refcount_warn_saturate+0xd9/0xe0
Call Trace:
 <IRQ>
 nvme_rdma_recv_done+0xf3/0x280 [nvme_rdma]
 __ib_process_cq+0x76/0x150 [ib_core]
 ...

Sanity check the received data length to avoid this. Thanks to Chao Leng & Sagi for suggestions. Signed-off-by:
zhenwei pi <pizhenwei@bytedance.com> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
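The check amounts to rejecting runt messages before interpreting them as completions; a sketch of the receive path (function name illustrative, error path abbreviated):

    static void nvme_rdma_recv_done_sketch(struct ib_cq *cq, struct ib_wc *wc)
    {
        struct nvme_rdma_queue *queue = wc->qp->qp_context;

        if (unlikely(wc->status != IB_WC_SUCCESS)) {
            /* ... existing WC error handling ... */
            return;
        }

        /* a zero/short-length message must not be parsed as a CQE */
        if (unlikely(wc->byte_len < sizeof(struct nvme_completion))) {
            dev_err(queue->ctrl->ctrl.device,
                    "Unexpected nvme completion length(%d)\n",
                    wc->byte_len);
            nvme_rdma_error_recovery(queue->ctrl);
            return;
        }

        /* ... normal completion processing ... */
    }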
-
- 22 Oct, 2020 2 commits
-
-
Chao Leng authored
A crash happened during error injection testing. When a CQE has an incorrect command id due to an error injection, the host may find a request which was already freed. Dereferencing req->mr->rkey then causes a crash in nvme_rdma_process_nvme_rsp because the mr is already freed. Add a check for the mr to fix it. Signed-off-by:
Chao Leng <lengchao@huawei.com> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
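A sketch of the guard in response processing (control flow per the description above; bodies abbreviated):

    if (wc->wc_flags & IB_WC_WITH_INVALIDATE) {
        /* remote invalidation path: validate the invalidated rkey, etc. */
    } else if (req->mr) {
        /* local invalidation: touch req->mr->rkey only behind this NULL
         * check; a bogus command id can land on a request whose MR is gone */
        ret = nvme_rdma_inv_rkey(queue, req);
    }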
-
Chao Leng authored
A crash can happen when a connect is rejected. The host establishes the connection after receiving the ConnectReply, and then continues to send the fabrics Connect command. If the controller does not receive the ReadyToUse capsule, the host may receive a ConnectReject reply. The code then calls nvme_rdma_destroy_queue_ib after the host receives the RDMA_CM_EVENT_REJECTED event, and when the fabrics Connect command times out, nvme_rdma_timeout calls nvme_rdma_complete_rq to fail the request. A crash happens due to use after free in nvme_rdma_complete_rq. nvme_rdma_destroy_queue_ib is redundant when handling the RDMA_CM_EVENT_REJECTED event, as it is already called in the connection failure handler. Signed-off-by:
Chao Leng <lengchao@huawei.com> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
- 08 Sep, 2020 1 commit
-
-
David Milburn authored
Cancel the async event work during teardown in case an async event has been queued up, so that nvme_rdma_submit_async_event() cannot run after the event has been freed. Signed-off-by:
David Milburn <dmilburn@redhat.com> Reviewed-by:
Keith Busch <kbusch@kernel.org> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
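The fix is essentially one call in the teardown path; a sketch (placement in the teardown sequence is an assumption):

    /* make sure a queued AER submission cannot run after the controller's
     * async event buffer has been freed */
    cancel_work_sync(&ctrl->ctrl.async_event_work);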
-
- 28 Aug, 2020 3 commits
-
-
Sagi Grimberg authored
If the controller becomes unresponsive in the middle of a reset, we will hang because we are waiting for the freeze to complete, but that cannot happen since we have commands inflight holding the q_usage_counter, and we can't blindly fail requests that time out. So give the freeze wait a timeout, and if we cannot wait for the queue freeze before unfreezing, fail and have the error handling take care of how to proceed (either schedule a reconnect or remove the controller). Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Sagi Grimberg <sagi@grimberg.me>
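A sketch of the bounded wait, assuming nvme_wait_freeze_timeout() returns 0 when the timeout elapses (label and error path abbreviated):

    nvme_start_queues(&ctrl->ctrl);
    if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
        /* controller stopped responding mid-reset: give up on the
         * freeze and let error handling decide whether to reconnect
         * or remove the controller */
        ret = -ENODEV;
        goto out_wait_freeze_timed_out;
    }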
-
Sagi Grimberg authored
When a request times out in a LIVE state, we simply trigger error recovery and let the error recovery handle the request cancellation. However, when a request times out in a non-LIVE state, we make sure to complete it immediately, as it might block controller setup or teardown and prevent forward progress. But tearing down the entire set of I/O and admin queues causes a freeze/unfreeze imbalance (q->mq_freeze_depth) and is really overkill compared to what we actually need, which is to just fence any controller teardown that may be running, stop the queue, and cancel the request if it is not already completed. Now that we have the controller teardown_lock, we can safely serialize request cancellation. This addresses a hang caused by calling an extra queue freeze on controller namespaces, causing unfreeze to not complete correctly. Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
James Smart <james.smart@broadcom.com> Signed-off-by:
Sagi Grimberg <sagi@grimberg.me>
-
Sagi Grimberg authored
In the timeout handler we may need to complete a request because the request that timed out may be an I/O that is a part of a serial sequence of controller teardown or initialization. In order to complete the request, we need to fence any other context that may compete with us and complete the request that is timing out. In this case, we could have a potential double completion in case a hard-irq or a different competing context triggered error recovery and is running inflight request cancellation concurrently with the timeout handler. Protect using a ctrl teardown_lock to serialize contexts that may complete a cancelled request due to error recovery or a reset. Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
James Smart <james.smart@broadcom.com> Signed-off-by:
Sagi Grimberg <sagi@grimberg.me>
-
- 23 Aug, 2020 1 commit
-
-
Gustavo A. R. Silva authored
Replace the existing /* fall through */ comments and their variants with the new pseudo-keyword macro fallthrough [1]. Also, remove fall-through markings where they are unnecessary. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by:
Gustavo A. R. Silva <gustavoars@kernel.org>
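For example (names hypothetical, shown only to illustrate the conversion):

    switch (event) {
    case EV_PREPARE:
        setup();
        fallthrough;    /* replaces the old "fall through" comment */
    case EV_RUN:
        run();
        break;
    default:
        break;
    }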
-
- 21 Aug, 2020 1 commit
-
-
Christoph Hellwig authored
nvme_end_request is a bit misnamed, as it wraps around the blk_mq_complete_* API. Its semantics also are non-trivial, so give it a more descriptive name and add a comment explaining the semantics. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Reviewed-by:
Mike Snitzer <snitzer@redhat.com> Signed-off-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- 29 Jul, 2020 3 commits
-
-
Sagi Grimberg authored
commit fe35ec58 ("block: update hctx map when use multiple maps") exposed an issue where we may hang trying to wait for queue freeze during I/O. We call blk_mq_update_nr_hw_queues, which in the case of multiple queue maps (which we now have for default/read/poll) attempts to freeze the queue. However, we never started a queue freeze when starting the reset, which means that we have inflight pending requests that entered the queue that we will not complete once the queue is quiesced. So start a freeze before we quiesce the queue, and unfreeze the queue after we successfully connected the I/O queues (and make sure to call blk_mq_update_nr_hw_queues only after we are sure that the queue was already frozen). This follows how the pci driver handles resets. Fixes: fe35ec58 ("block: update hctx map when use multiple maps") Signed-off-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
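The resulting ordering, roughly, mirroring the pci driver (a sketch; teardown, reconnect, and error paths elided):

    nvme_start_freeze(&ctrl->ctrl);    /* new: enter freeze before quiesce */
    nvme_stop_queues(&ctrl->ctrl);     /* quiesce */
    /* ... tear down and re-establish the I/O queues ... */
    blk_mq_update_nr_hw_queues(ctrl->ctrl.tagset,
                               ctrl->ctrl.queue_count - 1);  /* queue frozen */
    nvme_unfreeze(&ctrl->ctrl);        /* only after I/O queues reconnected */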
-
Sagi Grimberg authored
A deadlock happens in the following scenario with multipath:
1) scan_work(nvme0) detects a new nsid while nvme0 is an optimized path to it; path nvme1 happens to be inaccessible.
2) Before scan_work is complete, nvme0 disconnect is initiated: nvme_delete_ctrl_sync() sets nvme0 state to NVME_CTRL_DELETING.
3) scan_work(1) attempts to submit IO, but nvme_path_is_optimized() observes nvme0 is not LIVE. Since nvme1 is a possible path, IO is requeued and scan_work hangs.

Workqueue: nvme-wq nvme_scan_work [nvme_core]
Call Trace:
 __schedule+0x2b9/0x6c0
 schedule+0x42/0xb0
 io_schedule+0x16/0x40
 do_read_cache_page+0x438/0x830
 read_cache_page+0x12/0x20
 read_dev_sector+0x27/0xc0
 read_lba+0xc1/0x220
 efi_partition+0x1e6/0x708
 check_partition+0x154/0x244
 rescan_partitions+0xae/0x280
 __blkdev_get+0x40f/0x560
 blkdev_get+0x3d/0x140
 __device_add_disk+0x388/0x480
 device_add_disk+0x13/0x20
 nvme_mpath_set_live+0x119/0x140 [nvme_core]
 nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
 nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
 nvme_parse_ana_log+0xa1/0x180 [nvme_core]
 nvme_mpath_add_disk+0x47/0x90 [nvme_core]
 nvme_validate_ns+0x396/0x940 [nvme_core]
 nvme_scan_work+0x24f/0x380 [nvme_core]
 process_one_work+0x1db/0x380
 worker_thread+0x249/0x400
 kthread+0x104/0x140

4) Delete also hangs in flush_work(ctrl->scan_work) from nvme_remove_namespaces().

Similarly, a deadlock with ana_work may happen: if ana_work has started and calls nvme_mpath_set_live and device_add_disk, it will trigger I/O. When we trigger disconnect, I/O will block because our accessible (optimized) path is disconnecting, but the alternate path is inaccessible, so I/O blocks. Then disconnect tries to flush the ana_work and hangs.

[ 605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core]
[ 605.552087] Call Trace:
[ 605.552683]  __schedule+0x2b9/0x6c0
[ 605.553507]  schedule+0x42/0xb0
[ 605.554201]  io_schedule+0x16/0x40
[ 605.555012]  do_read_cache_page+0x438/0x830
[ 605.556925]  read_cache_page+0x12/0x20
[ 605.557757]  read_dev_sector+0x27/0xc0
[ 605.558587]  amiga_partition+0x4d/0x4c5
[ 605.561278]  check_partition+0x154/0x244
[ 605.562138]  rescan_partitions+0xae/0x280
[ 605.563076]  __blkdev_get+0x40f/0x560
[ 605.563830]  blkdev_get+0x3d/0x140
[ 605.564500]  __device_add_disk+0x388/0x480
[ 605.565316]  device_add_disk+0x13/0x20
[ 605.566070]  nvme_mpath_set_live+0x5e/0x130 [nvme_core]
[ 605.567114]  nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
[ 605.568197]  nvme_update_ana_state+0xca/0xe0 [nvme_core]
[ 605.569360]  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
[ 605.571385]  nvme_read_ana_log+0x76/0x100 [nvme_core]
[ 605.572376]  nvme_ana_work+0x15/0x20 [nvme_core]
[ 605.573330]  process_one_work+0x1db/0x380
[ 605.574144]  worker_thread+0x4d/0x400
[ 605.574896]  kthread+0x104/0x140
[ 605.577205]  ret_from_fork+0x35/0x40
[ 605.577955] INFO: task nvme:14044 blocked for more than 120 seconds.
[ 605.579239] Tainted: G OE 5.3.5-050305-generic #201910071830
[ 605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 605.582320] nvme D 0 14044 14043 0x00000000
[ 605.583424] Call Trace:
[ 605.583935]  __schedule+0x2b9/0x6c0
[ 605.584625]  schedule+0x42/0xb0
[ 605.585290]  schedule_timeout+0x203/0x2f0
[ 605.588493]  wait_for_completion+0xb1/0x120
[ 605.590066]  __flush_work+0x123/0x1d0
[ 605.591758]  __cancel_work_timer+0x10e/0x190
[ 605.593542]  cancel_work_sync+0x10/0x20
[ 605.594347]  nvme_mpath_stop+0x2f/0x40 [nvme_core]
[ 605.595328]  nvme_stop_ctrl+0x12/0x50 [nvme_core]
[ 605.596262]  nvme_do_delete_ctrl+0x3f/0x90 [nvme_core]
[ 605.597333]  nvme_sysfs_delete+0x5c/0x70 [nvme_core]
[ 605.598320]  dev_attr_store+0x17/0x30

Fix this by introducing a new state: NVME_CTRL_DELETING_NOIO, which indicates the phase of controller deletion where I/O cannot be allowed to access the namespace. NVME_CTRL_DELETING still allows mpath I/O to be issued to the bottom device, and only after we flush the ana_work and scan_work (after nvme_stop_ctrl and nvme_prep_remove_namespaces) do we change the state to NVME_CTRL_DELETING_NOIO. Also, we prevent ana_work from re-firing by aborting early if we are not LIVE, so we should be safe here. In addition, change the transport drivers to follow the updated state machine. Fixes: 0d0b660f ("nvme: add ANA support") Reported-by:
Anton Eidelman <anton@lightbitslabs.com> Signed-off-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Yamin Friedman authored
Have the driver use shared CQs, providing the ~10%-20% improvement seen in the patch introducing shared CQs. Instead of opening a CQ for each QP per controller connected, a CQ for each QP will be provided by the RDMA core driver that will be shared between the QPs on that core, reducing interrupt overhead. Signed-off-by:
Yamin Friedman <yaminf@mellanox.com> Signed-off-by:
Max Gurtovoy <maxg@mellanox.com> Reviewed-by:
Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
- 24 Jun, 2020 4 commits
-
-
Max Gurtovoy authored
The completion vector index that is given during CQ creation can't exceed the number of vectors supported by the underlying RDMA device. This violation can currently occur, for example, when one tries to connect with N regular read/write queues and M poll queues and the sum N + M > num_supported_vectors. This leads to a failure in establishing a connection to the remote target. Instead, in that case, share a completion vector between queues. Signed-off-by:
Max Gurtovoy <maxg@mellanox.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Revert an incorrect transformation that caused requests using remote invalidation to never complete. Fixes: 421147be863b ("nvme-rdma: factor out a nvme_rdma_end_request helper") Reported-by:
Bart Van Assche <bvanassche@acm.org> Signed-off-by:
Christoph Hellwig <hch@lst.de> Tested-by:
Bart Van Assche <bvanassche@acm.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Use the new blk_mq_complete_request_remote helper to avoid an indirect function call in the completion fast path. Reviewed-by:
Daniel Wagner <dwagner@suse.de> Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Factor a small snippet of duplicated code into a new helper in preparation for making this snippet a little bit less trivial. Reviewed-by:
Daniel Wagner <dwagner@suse.de> Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
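The helper, roughly as factored out (a sketch; nvme_end_request() is the core wrapper that the 21 Aug, 2020 entry above later renames to nvme_try_complete_req()):

    static void nvme_rdma_end_request(struct nvme_rdma_request *req)
    {
        struct request *rq = blk_mq_rq_from_pdu(req);

        /* last reference: hand the request back to the block layer */
        if (!refcount_dec_and_test(&req->ref))
            return;
        if (!nvme_end_request(rq, req->status, req->result))
            nvme_rdma_complete_rq(rq);
    }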
-