- 03 Aug, 2018 1 commit
Jianchao Wang authored
[ Upstream commit 2e050f00 ] For any failure after nvme_rdma_start_queue in nvme_rdma_configure_admin_queue, the admin queue will be freed with the NVME_RDMA_Q_LIVE flag still set. Once nvme_rdma_stop_queue is invoked, that will cause a use-after-free:
BUG: KASAN: use-after-free in rdma_disconnect+0x1f/0xe0 [rdma_cm]
To fix it, call nvme_rdma_stop_queue for all the failure cases after nvme_rdma_start_queue.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
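A minimal sketch of the unwind this fix enforces (illustrative names and simplified signatures; not the upstream diff):

```c
/*
 * Hypothetical sketch: once the queue is started and NVME_RDMA_Q_LIVE
 * is set, every later failure must go through nvme_rdma_stop_queue()
 * before the queue is freed, otherwise a subsequent stop touches
 * freed memory.
 */
static int configure_admin_queue_sketch(struct nvme_rdma_ctrl *ctrl)
{
	int error;

	error = nvme_rdma_start_queue(ctrl, 0);	/* sets NVME_RDMA_Q_LIVE */
	if (error)
		return error;

	error = nvme_enable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap);
	if (error)
		goto out_stop_queue;	/* the fix: unwind via stop_queue */

	return 0;

out_stop_queue:
	nvme_rdma_stop_queue(&ctrl->queues[0]);	/* clears Q_LIVE, drains */
	return error;
}
```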
-
- 09 Mar, 2018 1 commit
Sagi Grimberg authored
commit b4b591c8 upstream. The entire completion suppression mechanism is currently broken because the HCA might retry a send operation (due to a dropped ack) after the nvme transaction has completed. In order to handle this, we signal all send completions and introduce a separate done handler for async events, as they will be handled differently (they don't include in-capsule data by definition).
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
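In ib_verbs terms the change amounts to always posting sends signaled and routing async-event sends to their own completion handler; a hedged sketch (handler and helper names are illustrative):

```c
#include <rdma/ib_verbs.h>

/* Sketch: every send is posted signaled, so a HCA retry can no longer
 * generate a completion after the request context is recycled. */
static void send_done_sketch(struct ib_cq *cq, struct ib_wc *wc)
{
	/* regular command sends complete here */
}

static void async_event_done_sketch(struct ib_cq *cq, struct ib_wc *wc)
{
	/* async events carry no in-capsule data, so they get their own,
	 * simpler completion handler */
}

static int post_send_sketch(struct ib_qp *qp, struct ib_cqe *cqe,
			    struct ib_sge *sge)
{
	struct ib_send_wr wr = {}, *bad_wr;

	wr.wr_cqe     = cqe;			/* cqe->done routes to the
						 * right handler above */
	wr.sg_list    = sge;
	wr.num_sge    = 1;
	wr.opcode     = IB_WR_SEND;
	wr.send_flags = IB_SEND_SIGNALED;	/* no suppression anymore */

	return ib_post_send(qp, &wr, &bad_wr);
}
```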
-
- 03 Feb, 2018 2 commits
Sagi Grimberg authored
[ Upstream commit 4af7f7ff ] In order to guarantee that the HCA will never get an access violation (either from an invalidated rkey or from the iommu) when retrying a send operation, we must complete a request only when both the send completion and the nvme cqe have arrived. We need to set the send/recv completion flags atomically because more than a single context might access the request concurrently (one is the cq irq-poll context and the other is user polling used in IOCB_HIPRI). Only then is it safe to invalidate the rkey (if needed), unmap the host buffers, and complete the I/O.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
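One way to picture the "complete only when both events arrived" rule is a two-count reference dropped once per event; a sketch under that assumption (names illustrative, not the upstream diff):

```c
#include <linux/atomic.h>

/* Armed at submission with one count per expected event. */
struct req_sketch {
	atomic_t ref;	/* 2 = send completion + nvme cqe outstanding */
};

static void req_arm(struct req_sketch *req)
{
	atomic_set(&req->ref, 2);
}

/* Called from both the cq irq-poll context and the user-polling one;
 * only the context that drops the last reference may invalidate the
 * rkey, unmap the host buffers and complete the I/O. */
static bool req_event_done(struct req_sketch *req)
{
	return atomic_dec_and_test(&req->ref);	/* true exactly once */
}
```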
-
Sagi Grimberg authored
[ Upstream commit 48832f8d ] When the fabrics queue is not alive and fully functional, no command should be allowed to pass except connect (which moves the queue to a fully functional state). Any other command should be failed, with either the temporary status BLK_STS_RESOURCE or the permanent status BLK_STS_IOERR. This check is shared across all fabrics drivers, hence move it to the fabrics library.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
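The shared check might look roughly like this (a sketch; the helper name and exact placement in the fabrics library are assumptions):

```c
/* Only a fabrics connect may pass on a queue that is not LIVE. */
static blk_status_t not_live_check_sketch(struct nvme_ctrl *ctrl,
					  struct nvme_command *cmd)
{
	if (cmd->common.opcode == nvme_fabrics_command &&
	    cmd->fabrics.fctype == nvme_fabrics_type_connect)
		return BLK_STS_OK;

	/* temporary failure while reconnecting, permanent while deleting */
	return ctrl->state == NVME_CTRL_DELETING ?
		BLK_STS_IOERR : BLK_STS_RESOURCE;
}
```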
-
- 23 Oct, 2017 1 commit
Sagi Grimberg authored
nvme_rdma_queue_is_ready() fails requests in case a queue is not LIVE. If the controller is in RECONNECTING state, we might be in this state for a long time (until we successfully reconnect), and we are better off failing the request fast. Otherwise, we fail with BLK_STS_RESOURCE to have the block layer try again soon. In case we are removing the controller when the admin queue is not LIVE, we will terminate the request with BLK_STS_RESOURCE, but this happens before we call blk_mq_start_request(), so the request timeout never expires and the queue will never get back to LIVE (because we are removing the controller). This causes the removal operation to block indefinitely [1]. Thus, if we are removing (state DELETING) and the queue is not LIVE, we need to fail the request permanently as there is no chance for it to ever complete successfully.

[1]
--
sysrq: SysRq : Show Blocked State
 task                        PC stack   pid father
kworker/u66:2   D    0   440      2 0x80000000
Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
Call Trace:
 __schedule+0x3e9/0xb00
 schedule+0x40/0x90
 schedule_timeout+0x221/0x580
 io_schedule_timeout+0x1e/0x50
 wait_for_completion_io_timeout+0x118/0x180
 blk_execute_rq+0x86/0xc0
 __nvme_submit_sync_cmd+0x89/0xf0
 nvmf_reg_write32+0x4b/0x90 [nvme_fabrics]
 nvme_shutdown_ctrl+0x41/0xe0
 nvme_rdma_shutdown_ctrl+0xca/0xd0 [nvme_rdma]
 nvme_rdma_remove_ctrl+0x2b/0x40 [nvme_rdma]
 nvme_rdma_del_ctrl_work+0x25/0x30 [nvme_rdma]
 process_one_work+0x1fd/0x630
 worker_thread+0x1db/0x3b0
 kthread+0x11e/0x150
 ret_from_fork+0x27/0x40
01              D    0  2868   2862 0x00000000
Call Trace:
 __schedule+0x3e9/0xb00
 schedule+0x40/0x90
 schedule_timeout+0x260/0x580
 wait_for_completion+0x108/0x170
 flush_work+0x1e0/0x270
 nvme_rdma_del_ctrl+0x5a/0x80 [nvme_rdma]
 nvme_sysfs_delete+0x2a/0x40
 dev_attr_store+0x18/0x30
 sysfs_kf_write+0x45/0x60
 kernfs_fop_write+0x124/0x1c0
 __vfs_write+0x28/0x150
 vfs_write+0xc7/0x1b0
 SyS_write+0x49/0xa0
 entry_SYSCALL_64_fastpath+0x18/0xad
--
Reported-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
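The resulting decision table, sketched (field paths illustrative):

```c
static blk_status_t queue_is_ready_sketch(struct nvme_rdma_queue *queue)
{
	if (test_bit(NVME_RDMA_Q_LIVE, &queue->flags))
		return BLK_STS_OK;

	/* DELETING: the queue will never get back to LIVE, so fail the
	 * request permanently instead of blocking removal forever */
	if (queue->ctrl->ctrl.state == NVME_CTRL_DELETING)
		return BLK_STS_IOERR;

	/* e.g. RECONNECTING: let the block layer retry again soon */
	return BLK_STS_RESOURCE;
}
```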
-
- 19 Oct, 2017 2 commits
Sagi Grimberg authored
We should make sure to escalate allocation failures to prevent a use-after-free in nvmf_create_ctrl.
Fixes: b28a308e ("nvme-rdma: move tagset allocation to a dedicated routine")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
The fact that we free the async event buffer in nvme_rdma_destroy_admin_queue can cause us to free it more than once, because this happens in every reconnect attempt since commit 31fdf184. We rely on the queue DELETING state flag to avoid this for other resources. A more complete fix is to not destroy the admin/io queues unconditionally on every reconnect attempt, but it's a bit more extensive and will go in the next release.
Fixes: 31fdf184 ("nvme-rdma: reuse configure/destroy_admin_queue")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 25 Sep, 2017 2 commits
Sagi Grimberg authored
Calling nvme_stop_ctrl on an already failed controller will wait for the scan work to complete (it completes only on identify timeout expiration, which is 60 seconds). This is unnecessary when we already know that the controller has failed.
Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Sagi Grimberg authored
If we failed to transition to state LIVE after a successful reconnect, then controller deletion already started. In this case there is no point moving forward with reconnect.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 30 Aug, 2017 1 commit
Max Gurtovoy authored
Due to the various page sizes in the system (IOMMU/device/kernel), we set the fabrics controller page size to 4k and the block layer boundaries accordingly. On architectures that use a different kernel page size we'll have a mismatch with the MR page size that may cause a mapping error. Update the MR page size to correspond to the core ctrl settings.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
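The gist, sketched as a wrapper around the call that changes (wrapper name is illustrative):

```c
#include <linux/sizes.h>	/* SZ_4K */
#include <rdma/ib_verbs.h>

/* Map the MR with the controller's 4k page size instead of the kernel
 * PAGE_SIZE, so e.g. a 64k-page kernel no longer produces mismatched
 * mappings. */
static int map_mr_ctrl_pages_sketch(struct ib_mr *mr,
				    struct scatterlist *sgl, int count)
{
	return ib_map_mr_sg(mr, sgl, count, NULL, SZ_4K);
}
```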
-
- 28 Aug, 2017 14 commits
Max Gurtovoy authored
This patch slightly improves performance (mainly for small block sizes).
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
Make nvme_rdma_configure_admin_queue generic in preparation for moving it to common code.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
No need to queue an extra work to indirect controller removal, just call the ctrl remove routine.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
This should pair with nvme_rdma_stop_queue. While it is not a complete inverse, it still pairs up pretty well because in fabrics we don't have a disconnect capsule (yet), so we simply tear down the transport association.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
Give it a name symmetric to nvme_rdma_free_queue. Also pass in the ctrl sqsize+1 and not the opts queue_size, and suppress a superfluous failure message.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
If we move the queues from the LIVE state, we might as well stop them (drain for rdma). Do it after we stop the request queues, to prevent a stray request from sneaking into .queue_rq after we stop the queue; see the sketch below.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
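A hedged sketch of the ordering (helper names as introduced around this series; treat them as assumptions):

```c
static void teardown_io_queues_sketch(struct nvme_rdma_ctrl *ctrl)
{
	/* quiesce blk-mq first, so no stray request enters ->queue_rq */
	nvme_stop_queues(&ctrl->ctrl);
	/* only then stop the rdma queues (stop = drain for rdma) */
	nvme_rdma_stop_io_queues(ctrl);
}
```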
-
Sagi Grimberg authored
Make the handling symmetrical with the admin queue.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
No need to open-code it.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
We're not supposed to do that.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
Mimic the pci driver, as a controller disable might be more lightweight than a shutdown.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
We always pair tagset allocation with an rdma device reference, and the two paths share some code, so centralize the allocation with an argument indicating whether it's an admin or I/O tagset.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
Will be used when we centralize control flows.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
We will call it from other places, so avoid having to forward declare it. Also move it next to nvme_rdma_destroy_admin_queue.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Johannes Thumshirn authored
NVME_RDMA_MAX_SEGMENT_SIZE is not used anywhere, zap it.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 18 Aug, 2017 2 commits
Sagi Grimberg authored
Now that it's not needed, we can simply not assign it.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
-
Bart Van Assche authored
Since blk_mq_ops.reinit_request is only called from inside blk_mq_reinit_tagset(), make this function pointer an argument of blk_mq_reinit_tagset() instead of a member of struct blk_mq_ops. This patch does not change any functionality but makes blk_mq_reinit_tagset() calls easier to read and to analyze.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: James Smart <james.smart@broadcom.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
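A sketched caller (the ctrl tagset and nvme_rdma_reinit_request stand in for any user of the interface):

```c
/* The callback now travels with the call instead of living in
 * struct blk_mq_ops. */
static int reinit_tags_sketch(struct nvme_rdma_ctrl *ctrl)
{
	/* was: blk_mq_reinit_tagset(&ctrl->tag_set); with the callback
	 * fetched internally from set->ops->reinit_request */
	return blk_mq_reinit_tagset(&ctrl->tag_set,
				    nvme_rdma_reinit_request);
}
```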
-
- 08 Aug, 2017 1 commit
Sagi Grimberg authored
Use the generic block layer affinity mapping helper. Also, limit nr_hw_queues to the rdma device number of irq vectors, as we don't really need more.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
-
- 06 Jul, 2017 4 commits
Sagi Grimberg authored
When our RDMA queue-pair is torn down under a high load of I/O traffic, we have no way of knowing whether the memory region was actually registered by the reg_mr work request, as its completion flushes with an error (the hw might or might not have done it). So in order to not deal with all this uncertainty, we simply recycle the MR in reinit_request.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
-
Sagi Grimberg authored
Usually, before we tear down the controller we want to:
1. complete/cancel any inflight ctrl works
2. remove ctrl namespaces (only for removal though; resets shouldn't remove any namespaces)
but we do not want to destroy the controller device, as we might use it for logging during the teardown stage. This patch adds nvme_start_ctrl(), which queues the inflight controller works (aen, ns scan, queue start and keep-alive if kato is set), and nvme_stop_ctrl(), which cancels those works; namespace removal is left to the callers to handle. Move nvme_uninit_ctrl after we are done with the controller device. A condensed sketch of the two helpers follows this entry.
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
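Condensed sketch of the two helpers as described (close to, but not necessarily verbatim, the upstream bodies):

```c
void nvme_start_ctrl(struct nvme_ctrl *ctrl)
{
	if (ctrl->kato)
		nvme_start_keep_alive(ctrl);

	if (ctrl->queue_count > 1) {
		nvme_queue_scan(ctrl);
		nvme_queue_async_events(ctrl);
		nvme_start_queues(ctrl);
	}
}

void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
{
	nvme_stop_keep_alive(ctrl);
	flush_work(&ctrl->async_event_work);
	flush_work(&ctrl->scan_work);
	/* namespace removal is intentionally left to the caller */
}
```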
-
Sagi Grimberg authored
Unlike blk_mq_stop_hw_queues and blk_mq_start_stopped_hw_queues, quiescing/unquiescing respects the submission path rcu grace period. Also make sure to kick the requeue list when appropriate.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
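The substitution, sketched at a typical call site (the wrapper names are illustrative; ns->queue stands for any namespace request queue):

```c
static void quiesce_ns_sketch(struct nvme_ns *ns)
{
	/* was: blk_mq_stop_hw_queues(ns->queue); */
	blk_mq_quiesce_queue(ns->queue);	/* waits out the submission
						 * path rcu grace period */
}

static void unquiesce_ns_sketch(struct nvme_ns *ns)
{
	/* was: blk_mq_start_stopped_hw_queues(ns->queue, true); */
	blk_mq_unquiesce_queue(ns->queue);
	blk_mq_kick_requeue_list(ns->queue);	/* kick requeued requests */
}
```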
-
Marta Rybczynska authored
This patch improves the way the RDMA IB signalling is done by using atomic operations for the signalling variable. This avoids race conditions on sig_count. The signalling interval changes slightly and is now the largest power of two not larger than queue depth / 2. ilog() usage idea by Bart Van Assche.
Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
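A sketch of the interval logic as described (field layout is illustrative): signal one in every `limit` sends, where `limit` is the largest power of two not larger than queue depth / 2.

```c
#include <linux/atomic.h>
#include <linux/log2.h>

static inline bool sig_limit_sketch(atomic_t *sig_count, int queue_size)
{
	int limit = 1 << ilog2(queue_size / 2);	/* e.g. 128 -> 64 */

	/* atomic_inc_return replaces the old racy read-modify-write on
	 * sig_count; true once per 'limit' sends, at which point the
	 * caller posts the send with IB_SEND_SIGNALED */
	return (atomic_inc_return(sig_count) & (limit - 1)) == 0;
}
```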
-
- 04 Jul, 2017 1 commit
Sagi Grimberg authored
We might have more/less queues once we reconnect/reset, for example due to cpus going online/offline or controller constraints.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
-
- 02 Jul, 2017 2 commits
Sagi Grimberg authored
All transports use either a private cache of the controller cap or an on-stack copy; move it to the generic struct nvme_ctrl. In the future it will also be maintained by the core.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
-
Sagi Grimberg authored
All transports use the queue_count in exactly the same way, so move it to the generic struct nvme_ctrl. In the future it will also be maintained by the core.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
-
- 28 Jun, 2017 2 commits
Christoph Hellwig authored
NVMe 1.2.1 or later requires controllers to provide a subsystem NQN in the Identify controller data structures. Use this NQN for the subsysnqn sysfs attribute by storing it in the nvme_ctrl structure after verifying it. For older controllers we generate a "fake" NQN per the non-normative text in the NVMe 1.3 spec.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
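A hedged sketch of the fallback (condensed; the helper name, exact field widths and buffer handling are simplified assumptions, not the upstream body):

```c
static void init_subnqn_sketch(struct nvme_ctrl *ctrl,
			       struct nvme_id_ctrl *id)
{
	size_t nqnlen = strnlen(id->subnqn, NVMF_NQN_SIZE);

	if (nqnlen > 0 && nqnlen < NVMF_NQN_SIZE) {
		strcpy(ctrl->subnqn, id->subnqn);	/* verified NQN */
		return;
	}

	/* pre-1.2.1 controller: build a "fake" NQN from vid/ssvid plus
	 * the serial and model strings, per the NVMe 1.3 spec text */
	snprintf(ctrl->subnqn, NVMF_NQN_SIZE,
		 "nqn.2014.08.org.nvmexpress:%04x%04x%-20.20s%-40.40s",
		 le16_to_cpu(id->vid), le16_to_cpu(id->ssvid),
		 id->sn, id->mn);
}
```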
-
Sagi Grimberg authored
No need to differentiate fabrics from pci/loop; also lower it to 32, as we don't really need 256 inflight admin commands.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 15 Jun, 2017 4 commits
Christoph Hellwig authored
This moves the nvme_reset function from the PCIe driver to common code, renaming it to nvme_reset_ctrl in the process. Additionally a new helper, nvme_reset_ctrl_sync, is added for the case where we want to wait for the reset. To facilitate that, the reset_work work structure is moved to the common nvme_ctrl structure and the ->reset_ctrl method is removed. For now the drivers initialize the reset_work with their own callback, but longer term we should move to callouts for specific parts of the reset process and move even more code to the core.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
-
Christoph Hellwig authored
Now that we get the tagset passed, we can have a single implementation for the I/O and admin queues.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Dan Carpenter authored
We accidentally return ERR_PTR(0), which is NULL. The caller isn't explicitly checking for that, but I couldn't immediately spot whether this would lead to a NULL dereference. Anyway, we can fix it by adding an error code easily enough.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
It is not a user option but rather a variable controller attribute.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-