- 27 Aug, 2021 2 commits
-
-
Christoph Hellwig authored
Support for cryptoloop has been officially marked broken and deprecated in favor of dm-crypt (which supports the same broken algorithms if needed) in Linux 2.6.4 (released in March 2004), and support for it has been entirely removed from losetup in util-linux 2.23 (released in April 2013). Add a warning and a deprecation schedule. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210827163250.255325-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Dan Carpenter authored
blk_mq_alloc_disk() returns error pointers, not NULL, so correct the check. Fixes: 262d431f ("pd: use blk_mq_alloc_disk and blk_cleanup_disk") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20210827100023.GB9449@kili Signed-off-by: Jens Axboe <axboe@kernel.dk>
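The difference between a NULL check and an error-pointer check can be sketched in user space as follows. This is only an illustration: the ERR_PTR(), PTR_ERR() and IS_ERR() helpers below are simplified stand-ins for the kernel macros of the same names, and demo_alloc_disk() and demo_probe() are hypothetical names, not the pd driver's code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified user-space stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
    /* Error pointers occupy the top MAX_ERRNO addresses. */
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct demo_disk { int id; };

/* Hypothetical allocator: on failure it returns an error pointer, never NULL. */
static struct demo_disk *demo_alloc_disk(int should_fail)
{
    static struct demo_disk disk = { .id = 1 };

    return should_fail ? ERR_PTR(-ENOMEM) : &disk;
}

/* Returns 0 on success or a negative errno. */
static int demo_probe(int should_fail)
{
    struct demo_disk *disk = demo_alloc_disk(should_fail);

    if (IS_ERR(disk))               /* a "!disk" test would not catch this */
        return (int)PTR_ERR(disk);
    return 0;
}
```

With these definitions, demo_probe(1) returns -ENOMEM, whereas a plain NULL test would treat the error pointer as a valid disk.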
-
- 26 Aug, 2021 1 commit
-
-
Jens Axboe authored
This reverts commit fb926032. Zhen reports that this commit slows down mq-deadline on a 128 thread box, going from 258K IOPS to 170-180K. My testing shows that Optane gen2 IOPS goes from 2.3M IOPS to 1.2M IOPS on a 64 thread box. Looking in detail at the code, the main culprit here is needing to sum percpu counters in the dispatch hot path, leading to very high CPU utilization there. To make matters worse, the code currently needs to sum 2 percpu counters, and it does so in the most naive way of iterating possible CPUs _twice_. Since we're close to release, revert this commit and we can re-do it with regular per-priority counters instead for the 5.15 kernel. Link: https://lore.kernel.org/linux-block/20210826144039.2143-1-thunder.leizhen@huawei.com/ Reported-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 24 Aug, 2021 1 commit
-
-
Bart Van Assche authored
The block layer may call the I/O scheduler .finish_request() callback without having called the .insert_requests() callback. Make sure that the mq-deadline I/O statistics are correct if the block layer inserts an I/O request that bypasses the I/O scheduler. This patch prevents lower priority I/O from being delayed longer than necessary for mixed I/O priority workloads. Cc: Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Reported-by: Niklas Cassel <Niklas.Cassel@wdc.com> Fixes: 08a9ad8b ("block/mq-deadline: Add cgroup support") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210824170520.1659173-1-bvanassche@acm.org Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> Tested-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 18 Aug, 2021 1 commit
-
-
Ming Lei authored
is_flush_rq() is called from bt_iter()/bt_tags_iter(), and runs the following check:

	hctx->fq->flush_rq == req

but the passed hctx from bt_iter()/bt_tags_iter() may be NULL because:

1) memory re-order in blk_mq_rq_ctx_init():

	rq->mq_hctx = data->hctx;
	...
	refcount_set(&rq->ref, 1);

OR

2) tag re-use and ->rqs[] isn't updated with new request.

Fix the issue by re-writing is_flush_rq() as:

	return rq->end_io == flush_end_io;

which turns out simpler to follow and immune to data race since we have ordered WRITE rq->end_io and refcount_set(&rq->ref, 1).

Fixes: 2e315dc0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter") Cc: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de> Cc: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210818010925.607383-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 17 Aug, 2021 2 commits
-
-
Ming Lei authored
For fixing use-after-free during iterating over requests, we grabbed request's refcount before calling ->fn in commit 2e315dc0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter"). Turns out this way may cause kernel panic when iterating over one flush request:

1) old flush request's tag is just released, and this tag is reused by one new request, but ->rqs[] isn't updated yet

2) the flush request can be re-used for submitting one new flush command, so blk_rq_init() is called at the same time

3) meantime blk_mq_queue_tag_busy_iter() is called, and old flush request is retrieved from ->rqs[tag]; when blk_mq_put_rq_ref() is called, flush_rq->end_io may not be updated yet, so NULL pointer dereference is triggered in blk_mq_put_rq_ref().

Fix the issue by calling refcount_set(&flush_rq->ref, 1) after flush_rq->end_io is set. So far the only other caller of blk_rq_init() is scsi_ioctl_reset() in which the request doesn't enter block IO stack and the request reference count isn't used, so the change is safe.

Fixes: 2e315dc0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter") Reported-by: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de> Tested-by: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/20210811142624.618598-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
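The ordering this fix relies on (write end_io first, make the reference count nonzero last) can be illustrated with C11 atomics. This is a user-space analogue under stated assumptions, not the kernel code: the release store and acquire load stand in for whatever barriers the kernel pairs with refcount_set() and its reference-taking helper, and every demo_* name is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_rq {
    void (*end_io)(struct demo_rq *rq);
    atomic_int ref;                 /* 0 means "not yet published" */
};

static void demo_flush_end_io(struct demo_rq *rq) { (void)rq; }

/* Writer: set end_io first, then publish the request by making ref nonzero. */
static void demo_init_flush_rq(struct demo_rq *rq)
{
    rq->end_io = demo_flush_end_io;
    atomic_store_explicit(&rq->ref, 1, memory_order_release);
}

/* Reader: take a reference only if the request has been published. */
static bool demo_get_ref(struct demo_rq *rq)
{
    int old = atomic_load_explicit(&rq->ref, memory_order_acquire);

    while (old != 0) {
        if (atomic_compare_exchange_weak_explicit(&rq->ref, &old, old + 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return true;            /* end_io is guaranteed visible here */
    }
    return false;
}

/* Same shape as the rewritten is_flush_rq() check quoted in the log. */
static bool demo_is_flush_rq(const struct demo_rq *rq)
{
    return rq->end_io == demo_flush_end_io;
}
```

A reader that succeeds in demo_get_ref() can no longer observe a stale or NULL end_io, which is the property the commit message describes.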
-
Ming Lei authored
Inside blk_mq_queue_tag_busy_iter() we already grab the request's refcount before calling ->fn(), so there is no need to grab it one more time in blk_mq_check_expired(). While at it, remove the extra request expiry check in blk_mq_check_expired(). Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/20210811155202.629575-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 13 Aug, 2021 2 commits
-
-
Xie Yongji authored
There is a race between iterating over requests in nbd_clear_que() and completing requests in recv_work(), which can lead to double completion of a request. To fix it, flush the recv worker before iterating over the requests and don't abort the completed request while iterating. Fixes: 96d97e17 ("nbd: clear_sock on netlink disconnect") Reported-by: Jiang Yadong <jiangyadong@bytedance.com> Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210813151330.96-1-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
We run a test that deletes and recovers devices frequently (two devices on the same host), and we found that 'active_queues' is super big after a period of time. If device a and device b share a tag set, and a is deleted, then blk_mq_exit_queue() will clear BLK_MQ_F_TAG_QUEUE_SHARED because there is only one queue that is using the tag set. However, if b is still active, the active_queues of b might never be cleared even if b is deleted. Thus clear active_queues before BLK_MQ_F_TAG_QUEUE_SHARED is cleared. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210731062130.1533893-1-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 11 Aug, 2021 1 commit
-
-
Tejun Heo authored
This reverts commit 08a9ad8b ("block/mq-deadline: Add cgroup support") and a follow-up commit c06bc5a3 ("block/mq-deadline: Remove a WARN_ON_ONCE() call"). The added cgroup support has the following issues: * It breaks cgroup interface file format rule by adding custom elements to a nested key-value file. * It registers mq-deadline as a cgroup-aware policy even though all it's doing is collecting per-cgroup stats. Even if we need these stats, this isn't the right way to add them. * It hasn't been reviewed from cgroup side. Cc: Bart Van Assche <bvanassche@acm.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 10 Aug, 2021 1 commit
-
-
Ming Lei authored
blkcg->lock depends on q->queue_lock which may depend on another driver lock required in irq context, one example is dm-thin:

	Chain exists of:
	  &pool->lock#3 --> &q->queue_lock --> &blkcg->lock

	 Possible interrupt unsafe locking scenario:

	       CPU0                    CPU1
	       ----                    ----
	  lock(&blkcg->lock);
	                               local_irq_disable();
	                               lock(&pool->lock#3);
	                               lock(&q->queue_lock);
	  <Interrupt>
	    lock(&pool->lock#3);

Fix the issue by using spin_lock_irq(&blkcg->lock) in ioc_weight_write().

Cc: Tejun Heo <tj@kernel.org> Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Link: https://lore.kernel.org/linux-block/CA+QYu4rzz6079ighEanS3Qq_Dmnczcf45ZoJoHKVLVATTo1e4Q@mail.gmail.com/T/#u Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210803070608.1766400-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 06 Aug, 2021 1 commit
-
-
Vincent Fu authored
The kyber ioscheduler calls trace_block_rq_insert() *after* the request is added to the queue but the documentation for trace_block_rq_insert() says that the call should be made *before* the request is added to the queue. Move the tracepoint for the kyber ioscheduler so that it is consistent with the documentation. Signed-off-by: Vincent Fu <vincent.fu@samsung.com> Link: https://lore.kernel.org/r/20210804194913.10497-1-vincent.fu@samsung.com Reviewed by: Adam Manzanares <a.manzanares@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 05 Aug, 2021 2 commits
-
-
Bart Van Assche authored
Fix the following kernel-doc warning that appears when building with W=1: block/partitions/ldm.c:31: warning: expecting prototype for ldm(). Prototype was for ldm_debug() instead Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210805173447.3249906-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
If the queue is dying while iolatency_set_limit() is in progress, blk_get_queue() won't increment the refcount of the queue. However, blk_put_queue() will still decrement the refcount later, which will cause the refcount to be unbalanced. Thus error out in such a case to fix the problem. Fixes: 8c772a9b ("blk-iolatency: fix IO hang due to negative inflight counter") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210805124645.543797-1-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
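The imbalance and its fix can be shown with a small user-space model. It is a sketch only: demo_get_queue() and demo_put_queue() are hypothetical stand-ins for blk_get_queue() and blk_put_queue(), and returning -ENODEV on failure is an assumption made for the example.

```c
#include <errno.h>
#include <stdbool.h>

struct demo_queue {
    int refcount;
    bool dying;
};

/* Stand-in for blk_get_queue(): takes a reference unless the queue is dying. */
static bool demo_get_queue(struct demo_queue *q)
{
    if (q->dying)
        return false;
    q->refcount++;
    return true;
}

/* Stand-in for blk_put_queue(): always drops one reference. */
static void demo_put_queue(struct demo_queue *q)
{
    q->refcount--;
}

/* Fixed pattern: error out when no reference was taken, so none is dropped. */
static int demo_set_limit(struct demo_queue *q)
{
    if (!demo_get_queue(q))
        return -ENODEV;             /* assumed error code for illustration */

    /* ... update limits while holding the reference ... */

    demo_put_queue(q);
    return 0;
}
```

In this model the reference count is unchanged after demo_set_limit() whether or not the queue is dying; calling demo_put_queue() unconditionally would instead leave it one too low for a dying queue.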
-
- 04 Aug, 2021 3 commits
-
-
https://git.kernel.org/pub/scm/linux/kernel/git/song/md
Jens Axboe authored
* 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md/raid10: properly indicate failure when ending a failed write request
-
Christoph Hellwig authored
dma_map_bvec already takes bv_offset into account. Fixes: 9b2a2bbbb4d0 ("block: Add n64 cart driver") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Stefan Haberland authored
When new configuration data is obtained after a path event it is stored in the per path array. The old data needs to be freed. The first valid configuration data is also referenced in the device private structure to identify the device. When the old per path configuration data was freed the device still pointed to the already freed data, leading to a use after free. Fix by also replacing the device configuration data with the newly obtained one before the old data gets freed. Fixes: 46018121 ("s390/dasd: Store path configuration data during path handling") Cc: stable@vger.kernel.org # 5.11+ Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com> Link: https://lore.kernel.org/r/20210804151800.4031761-2-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 28 Jul, 2021 2 commits
-
-
Christoph Hellwig authored
blkdev_get_no_open acquires a reference to the block_device through the block device inode and then tries to acquire a device model reference to the gendisk. But at this point the disk might already be freed (although the race is free). Fix this by only freeing the gendisk from the whole device bdev's ->free_inode callback as well. Fixes: 22ae8ce8 ("block: simplify bdev/disk lookup in blkdev_get") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210722075402.983367-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Tejun Heo authored
iocg_wake_fn() open-codes wait_queue_entry removal and wakeup because it wants the wq_entry to be always removed whether it ended up waking the task or not. finish_wait() tests whether wq_entry needs removal without grabbing the wait_queue lock and expects the waker to use list_del_init_careful() after all waking operations are complete, which iocg_wake_fn() didn't do. The operation order was wrong and the regular list_del_init() was used. The result is that if a waiter wakes up racing the waker, it can pop the wq_entry off the stack while the waker is still looking at it, which can lead to a backtrace like the following.

	[7312084.588951] general protection fault, probably for non-canonical address 0x586bf4005b2b88: 0000 [#1] SMP
	...
	[7312084.647079] RIP: 0010:queued_spin_lock_slowpath+0x171/0x1b0
	...
	[7312084.858314] Call Trace:
	[7312084.863548] _raw_spin_lock_irqsave+0x22/0x30
	[7312084.872605] try_to_wake_up+0x4c/0x4f0
	[7312084.880444] iocg_wake_fn+0x71/0x80
	[7312084.887763] __wake_up_common+0x71/0x140
	[7312084.895951] iocg_kick_waitq+0xe8/0x2b0
	[7312084.903964] ioc_rqos_throttle+0x275/0x650
	[7312084.922423] __rq_qos_throttle+0x20/0x30
	[7312084.930608] blk_mq_make_request+0x120/0x650
	[7312084.939490] generic_make_request+0xca/0x310
	[7312084.957600] submit_bio+0x173/0x200
	[7312084.981806] swap_readpage+0x15c/0x240
	[7312084.989646] read_swap_cache_async+0x58/0x60
	[7312084.998527] swap_cluster_readahead+0x201/0x320
	[7312085.023432] swapin_readahead+0x2df/0x450
	[7312085.040672] do_swap_page+0x52f/0x820
	[7312085.058259] handle_mm_fault+0xa16/0x1420
	[7312085.066620] do_page_fault+0x2c6/0x5c0
	[7312085.074459] page_fault+0x2f/0x40

Fix it by switching to list_del_init_careful() and putting it at the end.

Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Rik van Riel <riel@surriel.com> Fixes: 7caa4715 ("blkcg: implement blk-iocost") Cc: stable@vger.kernel.org # v5.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 27 Jul, 2021 1 commit
-
-
John Garry authored
If the blk_mq_sched_alloc_tags() -> blk_mq_alloc_rqs() call fails, then we call blk_mq_sched_free_tags() -> blk_mq_free_rqs(). It is incorrect to do so, as any rqs would have already been freed in the blk_mq_alloc_rqs() call. Fix by calling blk_mq_free_rq_map() only directly. Fixes: 6917ff0b ("blk-mq-sched: refactor scheduler initialization") Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/1627378373-148090-1-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
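The cleanup rule described here (a failed allocator has already undone its own work, so the caller frees only what it allocated itself) can be sketched in plain C. All demo_* names are hypothetical, and the fail_at parameter exists only to simulate an allocation failure.

```c
#include <errno.h>
#include <stdlib.h>

struct demo_tags {
    void **rqs;     /* the "map": one slot per request */
    int nr;
};

static struct demo_tags *demo_alloc_rq_map(int nr)
{
    struct demo_tags *t = malloc(sizeof(*t));

    if (!t)
        return NULL;
    t->rqs = calloc((size_t)nr, sizeof(*t->rqs));
    if (!t->rqs) {
        free(t);
        return NULL;
    }
    t->nr = nr;
    return t;
}

static void demo_free_rq_map(struct demo_tags *t)
{
    free(t->rqs);
    free(t);
}

/* Frees all requests; meant for tearing down a fully set up tag set. */
static void demo_free_rqs(struct demo_tags *t)
{
    for (int i = 0; i < t->nr; i++) {
        free(t->rqs[i]);
        t->rqs[i] = NULL;
    }
}

/* On failure, releases every request allocated so far before returning. */
static int demo_alloc_rqs(struct demo_tags *t, int fail_at)
{
    for (int i = 0; i < t->nr; i++) {
        t->rqs[i] = (i == fail_at) ? NULL : malloc(64);
        if (!t->rqs[i]) {
            while (--i >= 0) {
                free(t->rqs[i]);
                t->rqs[i] = NULL;
            }
            return -ENOMEM;
        }
    }
    return 0;
}

/* Caller: after a failed demo_alloc_rqs(), free only the map. */
static struct demo_tags *demo_sched_alloc_tags(int nr, int fail_at, int *err)
{
    struct demo_tags *t = demo_alloc_rq_map(nr);

    if (!t) {
        *err = -ENOMEM;
        return NULL;
    }
    *err = demo_alloc_rqs(t, fail_at);
    if (*err) {
        demo_free_rq_map(t);    /* requests were already released above */
        return NULL;
    }
    return t;
}
```

Passing a fail_at of -1 makes every allocation attempt succeed; the returned tag set is then released with demo_free_rqs() followed by demo_free_rq_map().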
-
- 23 Jul, 2021 2 commits
-
-
Wei Shuyu authored
Similar to [1], this patch fixes the same bug in raid10. Also clean up the comments. [1] commit 2417b986 ("md/raid1: properly indicate failure when ending a failed write request") Cc: stable@vger.kernel.org Fixes: 7cee6d4e ("md/raid10: end bio when the device faulty") Signed-off-by: Wei Shuyu <wsy@dogben.com> Acked-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
-
Tetsuo Handa authored
Commit 6cc8e743 ("loop: scale loop device by introducing per device lock") re-opened a race window for NULL pointer dereference at loop_validate_file() that commit 310ca162 ("block/loop: Use global lock for ioctl() operation.") had closed. Although we need to guarantee that other loop devices will not change during traversal, we can't take remote "struct loop_device"->lo_mutex inside loop_validate_file() in order to avoid AB-BA deadlock. Therefore, introduce a global lock dedicated for loop_validate_file() which is conditionally taken before local "struct loop_device"->lo_mutex is taken. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Fixes: 6cc8e743 ("loop: scale loop device by introducing per device lock") Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 22 Jul, 2021 1 commit
-
-
git://git.infradead.org/nvme
Jens Axboe authored
Pull NVMe fixes from Christoph: "nvme fixes for Linux 5.14: - tracing fix (Keith Busch) - fix multipath head refcounting (Hannes Reinecke) - Write Zeroes vs PI fix (me) - drop a bogus WARN_ON (Zhihao Cheng)" * tag 'nvme-5.14-2021-07-22' of git://git.infradead.org/nvme: nvme: set the PRACT bit when using Write Zeroes with T10 PI nvme: fix nvme_setup_command metadata trace event nvme: fix refcounting imbalance when all paths are down nvme-pci: don't WARN_ON in nvme_reset_work if ctrl.state is not RESETTING
-
- 21 Jul, 2021 4 commits
-
-
Christoph Hellwig authored
When using Write Zeroes on a namespace that has protection information enabled, the behavior without the PRACT bit is counter-intuitive and will generally lead to validation failures when reading the written blocks. Fix this by always setting the PRACT bit that generates matching PI data on the fly. Fixes: 6e02318e ("nvme: add support for the Write Zeroes command") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
-
Keith Busch authored
The metadata address is set after the trace event, so the trace is not capturing anything useful. Rather than logging the memory address, it's useful to know if the command carries a metadata payload, so change the trace event to log that true/false state instead. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Hannes Reinecke authored
When the last path to a ns_head drops, the current code removes the ns_head from the subsystem list, but will only delete the disk itself if the last reference to the ns_head drops. This causes a refcounting imbalance, e.g. when applications have a reference to the disk, as then they'll never get notified that the disk is in fact dead. This patch moves the 'del_gendisk' call into nvme_mpath_check_last_path(), ensuring that the disk can be properly removed and applications get the appropriate notifications. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Zhihao Cheng authored
The following process:

	nvme_probe
	  nvme_reset_ctrl
	    nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING)
	    queue_work(nvme_reset_wq, &ctrl->reset_work)

	-------------->	nvme_remove
			  nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING)

	worker_thread
	  process_one_work
	    nvme_reset_work
	    WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING)

will trigger the WARN_ON in nvme_reset_work():

	[ 127.534298] WARNING: CPU: 0 PID: 139 at drivers/nvme/host/pci.c:2594
	[ 127.536161] CPU: 0 PID: 139 Comm: kworker/u8:7 Not tainted 5.13.0
	[ 127.552518] Call Trace:
	[ 127.552840] ? kvm_sched_clock_read+0x25/0x40
	[ 127.553936] ? native_send_call_func_single_ipi+0x1c/0x30
	[ 127.555117] ? send_call_function_single_ipi+0x9b/0x130
	[ 127.556263] ? __smp_call_single_queue+0x48/0x60
	[ 127.557278] ? ttwu_queue_wakelist+0xfa/0x1c0
	[ 127.558231] ? try_to_wake_up+0x265/0x9d0
	[ 127.559120] ? ext4_end_io_rsv_work+0x160/0x290
	[ 127.560118] process_one_work+0x28c/0x640
	[ 127.561002] worker_thread+0x39a/0x700
	[ 127.561833] ? rescuer_thread+0x580/0x580
	[ 127.562714] kthread+0x18c/0x1e0
	[ 127.563444] ? set_kthread_struct+0x70/0x70
	[ 127.564347] ret_from_fork+0x1f/0x30

The preceding problem can be easily reproduced by executing the following script (based on the blktests suite):

	test() {
	  pdev="$(_get_pci_dev_from_blkdev)"
	  sysfs="/sys/bus/pci/devices/${pdev}"
	  for ((i = 0; i < 10; i++)); do
	    echo 1 > "$sysfs/remove"
	    echo 1 > /sys/bus/pci/rescan
	  done
	}

Since the device ctrl could be changed to a non-RESETTING state by repeating probe/remove in userspace (which is a normal situation), we can replace the stack-dumping WARN_ON with a warning message.

Fixes: 82b057ca ("nvme-pci: fix multiple ctrl removal scheduling") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
-
- 17 Jul, 2021 1 commit
-
-
Oleksandr Natalenko authored
After mq-deadline learned to deal with cgroups, the BLKCG_MAX_POLS value became too small for all the elevators to be registered properly. The following issue is seen:

```
calling bfq_init+0x0/0x8b @ 1
blkcg_policy_register: BLKCG_MAX_POLS too small
initcall bfq_init+0x0/0x8b returned -28 after 507 usecs
```

which renders BFQ non-functional. Increase BLKCG_MAX_POLS to allow enough space for everyone.

Fixes: 08a9ad8b ("block/mq-deadline: Add cgroup support") Link: https://lore.kernel.org/lkml/8988303.mDXGIdCtx8@natalenko.name/ Signed-off-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210717123328.945810-1-oleksandr@natalenko.name Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 15 Jul, 2021 4 commits
-
-
Christoph Hellwig authored
xen-blkfront has a weird protocol where close message from the remote side can be delayed, and where hot removals are treated somewhat differently from regular removals, all leading to potential NULL pointer removals, and a del_gendisk from the block device release method, which will deadlock. Fix this by just performing normal hot removals even when the device is opened like all other Linux block drivers. Fixes: c76f48eb ("block: take bd_mutex around delete_partitions in del_gendisk") Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20210715141711.1257293-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
git://git.infradead.org/nvme
Jens Axboe authored
Pull NVMe fixes from Christoph: "nvme fixes for Linux 5.14 - fix various races in nvme-pci when shutting down just after probing (Casey Chen) - fix a net_device leak in nvme-tcp (Prabhakar Kushwaha)" * tag 'nvme-5.14-2021-07-15' of git://git.infradead.org/nvme: nvme-pci: do not call nvme_dev_remove_admin from nvme_remove nvme-pci: fix multiple races in nvme_setup_io_queues nvme-tcp: use __dev_get_by_name instead dev_get_by_name for OPT_HOST_IFACE
-
Wang Qing authored
We must release the queue before freeing the tagset. Fixes: 4af5f2e0 ("nbd: use blk_mq_alloc_disk and blk_cleanup_disk") Reported-and-tested-by: syzbot+9ca43ff47167c0ee3466@syzkaller.appspotmail.com Signed-off-by: Wang Qing <wangqing@vivo.com> Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210706040016.1360412-1-guoqing.jiang@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Guoqing Jiang authored
We must release the queue before freeing the tagset. Fixes: 262d431f ("pd: use blk_mq_alloc_disk and blk_cleanup_disk") Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210706010734.1356066-1-guoqing.jiang@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 13 Jul, 2021 3 commits
-
-
Casey Chen authored
nvme_dev_remove_admin could free dev->admin_q and the admin_tagset while they are being accessed by nvme_dev_disable(), which can be called by nvme_reset_work via nvme_remove_dead_ctrl. Commit cb4bfda6 ("nvme-pci: fix hot removal during error handling") intended to avoid requests being stuck on a removed controller by killing the admin queue. But the later fix c8e9e9b7 ("nvme-pci: unquiesce admin queue on shutdown"), together with nvme_dev_disable(dev, true) right before nvme_dev_remove_admin() could help dispatch requests and fail them early, so we don't need nvme_dev_remove_admin() any more. Fixes: cb4bfda6 ("nvme-pci: fix hot removal during error handling") Signed-off-by: Casey Chen <cachen@purestorage.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Casey Chen authored
Below two paths could overlap each other if we power off a drive quickly after powering it on. There are multiple races in nvme_setup_io_queues() because of shutdown_lock missing and improper use of NVMEQ_ENABLED bit.

	nvme_reset_work()                                nvme_remove()
	  nvme_setup_io_queues()                           nvme_dev_disable()
	  ...                                              ...
	A1  clear NVMEQ_ENABLED bit for admin queue          lock
	    retry:                                       B1  nvme_suspend_io_queues()
	A2    pci_free_irq() admin queue                 B2  nvme_suspend_queue() admin queue
	A3    pci_free_irq_vectors()                         nvme_pci_disable()
	A4    nvme_setup_irqs();                         B3    pci_free_irq_vectors()
	      ...                                            unlock
	A5    queue_request_irq() for admin queue
	      set NVMEQ_ENABLED bit
	      ...
	      nvme_create_io_queues()
	A6      result = queue_request_irq();
	        set NVMEQ_ENABLED bit
	      ...
	      fail to allocate enough IO queues:
	A7      nvme_suspend_io_queues()
	        goto retry

1) If B3 runs in between A1 and A2, it will crash if irqaction haven't been freed by A2. B2 is supposed to free admin queue IRQ but it simply can't fulfill the job as A1 has cleared NVMEQ_ENABLED bit.
Fix: combine A1 A2 so IRQ get freed as soon as the NVMEQ_ENABLED bit gets cleared.

2) After solved #1, A2 could race with B3 if A2 is freeing IRQ while B3 is checking irqaction. A3 also could race with B2 if B2 is freeing IRQ while A3 is checking irqaction.
Fix: A2 and A3 take lock for mutual exclusion.

3) A3 could race with B3 since they could run free_msi_irqs() in parallel.
Fix: A3 takes lock for mutual exclusion.

4) A4 could fail to allocate all needed IRQ vectors if A3 and A4 are interrupted by B3.
Fix: A4 takes lock for mutual exclusion.

5) If A5/A6 happened after B2/B1, B3 will crash since irqaction is not NULL. They are just allocated by A5/A6.
Fix: Lock queue_request_irq() and setting of NVMEQ_ENABLED bit.

6) A7 could get chance to pci_free_irq() for certain IO queue while B3 is checking irqaction.
Fix: A7 takes lock.

7) nvme_dev->online_queues need to be protected by shutdown_lock. Since it is not atomic, both paths could modify it using its own copy.
Co-developed-by: Yuanyuan Zhong <yzhong@purestorage.com> Signed-off-by: Casey Chen <cachen@purestorage.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Prabhakar Kushwaha authored
dev_get_by_name() finds a network device by name but it also increases the reference count. If a nvme-tcp queue is present and the network device driver is removed before nvme_tcp, we will face the following continuous log: "kernel:unregister_netdevice: waiting for <eth> to become free. Usage count = 2" and rmmod hangs. A similar case arises during reboot/shutdown with a nvme-tcp queue present, and neither ever completes. To fix this, use __dev_get_by_name() which finds a network device by name without increasing any reference counter. Fixes: 3ede8f72 ("nvme-tcp: allow selecting the network interface for connections") Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com> Signed-off-by: Shai Malin <smalin@marvell.com> Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> [hch: remove the ->ndev member entirely] Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 07 Jul, 2021 3 commits
-
-
Yu Kuai authored
We run a test that creates millions of cgroups and blkgs, and then triggers blkg_destroy_all(). blkg_destroy_all() will hold the spin lock for a long time in such a situation. Thus release the lock when a batch of blkgs are destroyed. blkcg_activate_policy() and blkcg_deactivate_policy() might have the same problem, however, as they are basically only called from module init/exit paths, let's leave them alone for now. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210707015649.1929797-1-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
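The batching idea (drop the lock after a fixed number of destroyed entries so other lock users can make progress) can be sketched with a pthread mutex. This is a user-space illustration only: the batch size of 64 is an assumption chosen for the example, and the demo_* names and the lock_acquisitions counter are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

#define DEMO_BATCH 64   /* assumed batch size, for illustration only */

struct demo_node {
    struct demo_node *next;
};

struct demo_list {
    pthread_mutex_t lock;
    struct demo_node *head;
    int lock_acquisitions;      /* instrumentation for the example */
};

/* Unlinks every node, releasing the lock after each full batch. */
static int demo_destroy_all(struct demo_list *l)
{
    int destroyed = 0;
    int in_batch = 0;

    pthread_mutex_lock(&l->lock);
    l->lock_acquisitions++;
    while (l->head) {
        struct demo_node *n = l->head;

        l->head = n->next;
        n->next = NULL;
        destroyed++;

        if (++in_batch == DEMO_BATCH && l->head) {
            in_batch = 0;
            /* Let waiters in; the kernel would also reschedule here. */
            pthread_mutex_unlock(&l->lock);
            pthread_mutex_lock(&l->lock);
            l->lock_acquisitions++;
        }
    }
    pthread_mutex_unlock(&l->lock);
    return destroyed;
}
```

Because the list head is re-read after every re-acquisition, entries added or removed by others while the lock was dropped are handled naturally.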
-
Chunguang Xu authored
On the IO submission path, blk_account_io_start() may be interrupted by a system interrupt. When the interrupt returns, the value of part->stamp may have been updated by other cores, so the time value collected before the interrupt may be less than part->stamp. When this happens, we should do nothing, which keeps io_ticks more accurate. For kernels older than 5.0, this may cause io_ticks to become smaller, which in turn may cause abnormal ioutil values. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/1625521646-1069-1-git-send-email-brookxu.cn@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
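The "do nothing when the sampled time is not newer than the stamp" rule can be sketched as below. This is a single-threaded simplification with hypothetical demo_* names: it omits the atomic update that concurrent callers would need, and demo_time_after() is a local wrap-safe comparison written for the example.

```c
#include <stdbool.h>

struct demo_part {
    unsigned long stamp;        /* last time io_ticks was updated */
    unsigned long io_ticks;
};

/* Wrap-safe "a is later than b" comparison on tick counters. */
static bool demo_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Accounts time only when "now" is strictly newer than the stored stamp. */
static void demo_update_io_ticks(struct demo_part *p, unsigned long now, bool end)
{
    unsigned long stamp = p->stamp;

    if (demo_time_after(now, stamp)) {
        p->stamp = now;
        p->io_ticks += end ? now - stamp : 1;
    }
    /* A stale "now" (sampled before another core advanced stamp) is ignored. */
}
```

Without the ordering check, a stale sample would move the stamp backwards and distort later accounting.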
-
git://git.infradead.org/nvme
Jens Axboe authored
Pull single NVMe fix from Christoph. * 'nvme-5.14' of git://git.infradead.org/nvme: nvme-tcp: can't set sk_user_data without write_lock
-
- 05 Jul, 2021 1 commit
-
-
Maurizio Lombardi authored
The sk_user_data pointer is supposed to be modified only while holding the write_lock "sk_callback_lock", otherwise we could race with other threads and crash the kernel. We can't take the write_lock in nvmet_tcp_state_change() because it would cause a deadlock, but the release_work queue will set the pointer to NULL later so we can simply remove the assignment. Fixes: b5332a9f ("nvmet-tcp: fix incorrect locking in state_change sk callback") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 02 Jul, 2021 1 commit
-
-
Tetsuo Handa authored
Commit 0384264e ("block: pass a gendisk to bdev_disk_changed") changed to pass lo->lo_disk instead of lo->lo_device. Fixes: 0384264e ("block: pass a gendisk to bdev_disk_changed") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: https://lore.kernel.org/r/20210702152714.7978-1-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Jens Axboe <axboe@kernel.dk>
-