Commits · 461d971215dfb55bcd5f7d040b2b222592040f95 · Kirill Smelkov / linux

27 Aug, 2021 2 commits

Jens Axboe authored Aug 27, 2021

Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.15/drivers

Pull MD changes from Song.

* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  raid1: ensure write behind bio has less than BIO_MAX_VECS sectors
  md/raid10: Remove unnecessary rcu_dereference in raid10_handle_discard

461d9712

raid1: ensure write behind bio has less than BIO_MAX_VECS sectors · 6607cd31

Guoqing Jiang authored Aug 24, 2021

We can't split write behind bio with more than BIO_MAX_VECS sectors,
otherwise the below call trace was triggered because we could allocate
oversized write behind bio later.

[ 8.097936] bvec_alloc+0x90/0xc0
[ 8.098934] bio_alloc_bioset+0x1b3/0x260
[ 8.099959] raid1_make_request+0x9ce/0xc50 [raid1]
[ 8.100988] ? __bio_clone_fast+0xa8/0xe0
[ 8.102008] md_handle_request+0x158/0x1d0 [md_mod]
[ 8.103050] md_submit_bio+0xcd/0x110 [md_mod]
[ 8.104084] submit_bio_noacct+0x139/0x530
[ 8.105127] submit_bio+0x78/0x1d0
[ 8.106163] ext4_io_submit+0x48/0x60 [ext4]
[ 8.107242] ext4_writepages+0x652/0x1170 [ext4]
[ 8.108300] ? do_writepages+0x41/0x100
[ 8.109338] ? __ext4_mark_inode_dirty+0x240/0x240 [ext4]
[ 8.110406] do_writepages+0x41/0x100
[ 8.111450] __filemap_fdatawrite_range+0xc5/0x100
[ 8.112513] file_write_and_wait_range+0x61/0xb0
[ 8.113564] ext4_sync_file+0x73/0x370 [ext4]
[ 8.114607] __x64_sys_fsync+0x33/0x60
[ 8.115635] do_syscall_64+0x33/0x40
[ 8.116670] entry_SYSCALL_64_after_hwframe+0x44/0xae

Thanks for the comment from Christoph.

[1]. https://bugs.archlinux.org/task/70992

Cc: stable@vger.kernel.org # v5.12+
Reported-by: Jens Stutte <jens@chianterastutte.eu>
Tested-by: Jens Stutte <jens@chianterastutte.eu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>

6607cd31

26 Aug, 2021 1 commit

md/raid10: Remove unnecessary rcu_dereference in raid10_handle_discard · 46d4703b

Xiao Ni authored Aug 18, 2021

We are seeing the following warning in raid10_handle_discard.
[  695.110751] =============================
[  695.131439] WARNING: suspicious RCU usage
[  695.151389] 4.18.0-319.el8.x86_64+debug #1 Not tainted
[  695.174413] -----------------------------
[  695.192603] drivers/md/raid10.c:1776 suspicious
rcu_dereference_check() usage!
[  695.225107] other info that might help us debug this:
[  695.260940] rcu_scheduler_active = 2, debug_locks = 1
[  695.290157] no locks held by mkfs.xfs/10186.

In the first loop of function raid10_handle_discard. It already
determines which disk need to handle discard request and add the
rdev reference count rdev->nr_pending. So the conf->mirrors will
not change until all bios come back from underlayer disks. It
doesn't need to use rcu_dereference to get rdev.

Cc: stable@vger.kernel.org
Fixes: d30588b2 ('md/raid10: improve raid10 discard request')
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <songliubraving@fb.com>

46d4703b

25 Aug, 2021 6 commits

nbd: remove nbd->destroy_complete · 7ee656c3

Christoph Hellwig authored Aug 25, 2021

The nbd->destroy_complete pointer is not really needed. For creating
a device without a specific index we now simplify skip devices marked
NBD_DESTROY_ON_DISCONNECT as there is not much point to reuse them.
For device creation with a specific index there is no real need to
treat the case of a requested but not finished disconnect different
than any other device that is being shutdown, i.e. we can just return
an error, as a slightly different race window would anyway.

Fixes: 6e4df4c6 ("nbd: reduce the nbd_index_mutex scope")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot+2c98885bcd769f56b6d6@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210825163108.50713-7-hch@lst.deReviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7ee656c3

nbd: only return usable devices from nbd_find_unused · 438cd318

Christoph Hellwig authored Aug 25, 2021

Device marked as NBD_DESTROY_ON_DISCONNECT can and should be skipped
given that they won't survive the disconnect. So skip them and try
to grab a reference directly and just continue if the the devices
is being torn down or created and thus has a zero refcount.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210825163108.50713-6-hch@lst.deReviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

438cd318

nbd: set nbd->index before releasing nbd_index_mutex · b190300d

Tetsuo Handa authored Aug 25, 2021

Set nbd->index before releasing nbd_index_mutex, as populate_nbd_status()
might access nbd->index as soon as nbd_index_mutex is released.

Fixes: 6e4df4c6 ("nbd: reduce the nbd_index_mutex scope")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210825163108.50713-5-hch@lst.deReviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b190300d

nbd: prevent IDR lookups from finding partially initialized devices · 75b7f62a

Tetsuo Handa authored Aug 25, 2021

Previously nbd_index_mutex was held during whole add/remove/lookup
operations in order to guarantee that partially initialized devices are
not reachable via idr_find() or idr_for_each(). But now that partially
initialized devices become reachable as soon as idr_alloc() succeeds,
we need to skip partially initialized devices. Since it seems that
all functions use refcount_inc_not_zero(&nbd->refs) in order to skip
destroying devices, update nbd->refs from zero to non-zero as the last
step of device initialization in order to also skip partially initialized
devices.

Fixes: 6e4df4c6 ("nbd: reduce the nbd_index_mutex scope")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
[hch: split from a larger patch, added comments]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210825163108.50713-4-hch@lst.deReviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

75b7f62a

nbd: reset NBD to NULL when restarting in nbd_genl_connect · 409e0ff1

Christoph Hellwig authored Aug 25, 2021

When nbd_genl_connect restarts to wait for a disconnecting device, nbd
needs to be reset to NULL.  Do that by facoring out a helper to find
an unused device.

Fixes: 6177b56c ("nbd: refactor device search and allocation in nbd_genl_connect")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Reported-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210825163108.50713-3-hch@lst.deReviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

409e0ff1

nbd: add missing locking to the nbd_dev_add error path · 93f63bc4

Tetsuo Handa authored Aug 25, 2021

idr_remove needs external synchronization.

Fixes: 6e4df4c6 ("nbd: reduce the nbd_index_mutex scope")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210825163108.50713-2-hch@lst.deReviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

93f63bc4

18 Aug, 2021 1 commit

Merge tag 'nvme-5.15-2021-08-18' of git://git.infradead.org/nvme into for-5.15/drivers · ca27f5b5

Jens Axboe authored Aug 18, 2021

Pull NVMe updates from Christoph:

"nvme updates for Linux 5.15.

 - suspend improvements for devices with an HMB (Keith Busch)
 - handle double completions more gacefull (Sagi Grimberg)
 - cleanup the selects for the nvme core code a bit (Sagi Grimberg)
 - don't update queue count when failing to set io queues (Ruozhu Li)
 - various nvmet connect fixes (Amit Engel)
 - cleanup lightnvm leftovers (Keith Busch, me)
 - small cleanups (Colin Ian King, Hou Pu)
 - add tracing for the Set Features command (Hou Pu)
 - CMB sysfs cleanups (Keith Busch)
 - add a mutex_destroy call (Keith Busch)"

* tag 'nvme-5.15-2021-08-18' of git://git.infradead.org/nvme: (21 commits)
  nvme: remove the unused NVME_NS_* enum
  nvme: remove nvm_ndev from ns
  nvme: Have NVME_FABRICS select NVME_CORE instead of transport drivers
  nvmet: check that host sqsize does not exceed ctrl MQES
  nvmet: avoid duplicate qid in connect cmd
  nvmet: pass back cntlid on successful completion
  nvme-rdma: don't update queue count when failing to set io queues
  nvme-tcp: don't update queue count when failing to set io queues
  nvme-tcp: pair send_mutex init with destroy
  nvme: allow user toggling hmb usage
  nvme-pci: disable hmb on idle suspend
  nvmet: remove redundant assignments of variable status
  nvmet: add set feature tracing support
  nvme: add set feature tracing support
  nvme-fabrics: remove superfluous nvmf_host_put in nvmf_parse_options
  nvme-pci: cmb sysfs: one file, one value
  nvme-pci: use attribute group for cmb sysfs
  nvme: code command_id with a genctr for use-after-free validation
  nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu data
  nvme-pci: limit maximum queue depth to 4095
  ...

ca27f5b5

17 Aug, 2021 1 commit

nvme: remove the unused NVME_NS_* enum · 9891668e

Christoph Hellwig authored Aug 16, 2021

These values are unused now that the lightnvm support is gone.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>

9891668e

16 Aug, 2021 21 commits

nvme: remove nvm_ndev from ns · 77979058

Keith Busch authored Aug 16, 2021

Now that the lightnvm driver is removed, we don't need a pointer to it's
now non-existent struct.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

77979058

nvme: Have NVME_FABRICS select NVME_CORE instead of transport drivers · 0866200e

Sagi Grimberg authored May 25, 2021

Transport drivers need both core and fabrics modules, instead of
selecting both, have the selection transitive such that NVME_FABRICS
selects NVME_CORE and transport drivers select NVME_FABRICS.
Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

0866200e

block: nbd: add sanity check for first_minor · b1a81163

Pavel Skripkin authored Aug 12, 2021

Syzbot hit WARNING in internal_create_group(). The problem was in
too big disk->first_minor.

disk->first_minor is initialized by value, which comes from userspace
and there wasn't any sanity checks about value correctness. It can cause
duplicate creation of sysfs files/links, because disk->first_minor will
be passed to MKDEV() which causes truncation to byte. Since maximum
minor value is 0xff, let's check if first_minor is correct minor number.

NOTE: the root case of the reported warning was in wrong error handling
in register_disk(), but we can avoid passing knowingly wrong values to
sysfs API, because sysfs error messages can confuse users. For example:
user passed 1048576 as index, but sysfs complains about duplicate
creation of /dev/block/43:0. It's not obvious how 1048576 becomes 0.
Log and reproducer for above example can be found on syzkaller bug
report page.

Link: https://syzkaller.appspot.com/bug?id=03c2ae9146416edf811958d5fd7acfab75b143d1
Fixes: b0d9111a ("nbd: use an idr to keep track of nbd devices")
Reported-by: syzbot+9937dc42271cd87d4b98@syzkaller.appspotmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b1a81163

nvmet: check that host sqsize does not exceed ctrl MQES · e19e9f47

Amit Engel authored Aug 05, 2021

Check that host sqsize is not greater-than Maximum Queue Entries
Supported (MQES) value supported by the controller.
Signed-off-by: Amit Engel <amit.engel@dell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

e19e9f47

nvmet: avoid duplicate qid in connect cmd · b71df126

Amit Engel authored Aug 08, 2021

According to the NVMe specification, if the host sends a Connect command
specifying a queue id which has already been created, a status value of
NVME_SC_CMD_SEQ_ERROR is returned.
Signed-off-by: Amit Engel <amit.engel@dell.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

b71df126

nvmet: pass back cntlid on successful completion · e804d5ab

Amit Engel authored Aug 08, 2021

According to the NVMe specification, the response dword 0 value of the
Connect command is based on status code: return cntlid for successful
compeltion return IPO and IATTR for connect invalid parameters.  Fix
a missing error information for a zero sized queue, and return the
cntlid also for I/O queue Connect commands.
Signed-off-by: Amit Engel <amit.engel@dell.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

e804d5ab

nvme-rdma: don't update queue count when failing to set io queues · 85032874

Ruozhu Li authored Jul 28, 2021

We update ctrl->queue_count and schedule another reconnect when io queue
count is zero.But we will never try to create any io queue in next reco-
nnection, because ctrl->queue_count already set to zero.We will end up
having an admin-only session in Live state, which is exactly what we try
to avoid in the original patch.
Update ctrl->queue_count after queue_count zero checking to fix it.
Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

85032874

nvme-tcp: don't update queue count when failing to set io queues · 664227fd

Ruozhu Li authored Aug 07, 2021

We update ctrl->queue_count and schedule another reconnect when io queue
count is zero.But we will never try to create any io queue in next reco-
nnection, because ctrl->queue_count already set to zero.We will end up
having an admin-only session in Live state, which is exactly what we try
to avoid in the original patch.
Update ctrl->queue_count after queue_count zero checking to fix it.
Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

664227fd

nvme-tcp: pair send_mutex init with destroy · d48f92cd

Keith Busch authored Aug 06, 2021

Each mutex_init() should have a corresponding mutex_destroy().
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

d48f92cd

nvme: allow user toggling hmb usage · a5df5e79

Keith Busch authored Jul 27, 2021

The NVMe host memory buffer may consume a non-negligable amount of
memory. Controllers are required to function without the host memory
buffer enabled, but with possibly degraded performance. Export a sysfs
property to toggle this feature on a per-device granularity so users may
choose to reclaim memory at the expense of storage performance.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

a5df5e79

nvme-pci: disable hmb on idle suspend · e5ad96f3

Keith Busch authored Jul 27, 2021

An idle suspend may or may not disable host memory access from devices
placed in low power mode. Either way, it should always be safe to
disable the host memory buffer prior to entering the low power mode, and
this should also always be faster than a full device shutdown.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

e5ad96f3

nvmet: remove redundant assignments of variable status · ad0e9a80

Colin Ian King authored Jul 06, 2021

There are two occurrances where variable status is being assigned a
value that is never read and it is being re-assigned a new value
almost immediately afterwards on an error exit path. The assignments
are redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

ad0e9a80

nvmet: add set feature tracing support · 8d84f9de

Hou Pu authored Jul 05, 2021

A nvme connect command produces following trace from the target side.

Before:
kworker/0:1H-56 [000] .... 9012.155139: nvmet_req_init: nvmet1: qid=0, cmdid=16, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, cdw10=07 00 00 00 07 00 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00)
kworker/0:1H-56 [000] .... 9012.872272: nvmet_req_init: nvmet1: qid=0, cmdid=13, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, cdw10=0b 00 00 00 00 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00)

cmdline:/sys/kernel/debug/tracing# cat trace | grep feature
kworker/0:1H-56 [000] .... 203.493914: nvmet_req_init: nvmet1: qid=0, cmdid=29, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, fid=0x7, sv=0x0, cdw11=0x70007)
kworker/0:1H-56 [000] .... 204.197079: nvmet_req_init: nvmet1: qid=0, cmdid=29, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, fid=0xb, sv=0x0, cdw11=0x900)

Using ',' to separate different field like others in
nvmet_trace_admin_get_features.
Signed-off-by: Hou Pu <houpu.main@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

8d84f9de

nvme: add set feature tracing support · a7b5e8d8

Hou Pu authored Jul 05, 2021

A nvme connect command produces following trace.

Before:
/sys/kernel/debug/tracing# cat trace | grep feature
    kworker/5:1H-98      [005] ....  3221.294844: nvme_setup_cmd: nvme0: qid=0, cmdid=25, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features cdw10=07 00 00 00 07 00 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00)
    kworker/4:1H-124     [004] ....  3222.009186: nvme_setup_cmd: nvme0: qid=0, cmdid=17, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features cdw10=0b 00 00 00 00 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00)

After:
/sys/kernel/debug/tracing# cat trace | grep feature
    kworker/0:1H-253     [000] ....   196.060509: nvme_setup_cmd: nvme0: qid=0, cmdid=29, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features fid=0x7, sv=0x0, cdw11=0x70007)
    kworker/0:1H-253     [000] ....   196.763947: nvme_setup_cmd: nvme0: qid=0, cmdid=29, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features fid=0xb, sv=0x0, cdw11=0x900)

Using ',' to separate different field like others in
nvmet_trace_admin_get_features.
Signed-off-by: Hou Pu <houpu.main@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

a7b5e8d8

nvme-fabrics: remove superfluous nvmf_host_put in nvmf_parse_options · e23439e9

Hou Pu authored Jul 09, 2021

Opts->host is NULL there. It is checked just before. So remove
nvmf_host_put. It is introduced by commit 59a2f3f0 ("nvme: fix
potential memory leak in option parsing").
Signed-off-by: Hou Pu <houpu.main@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

e23439e9

nvme-pci: cmb sysfs: one file, one value · 1751e97a

Keith Busch authored Jul 16, 2021

An attribute should only be exporting one value as recommended in
Documentation/filesystems/sysfs.rst. Implement CMB attributes this way.
The old attribute will remain for backward compatibility.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

1751e97a

nvme-pci: use attribute group for cmb sysfs · 0521905e

Keith Busch authored Jul 14, 2021

Appending sysfs files to the controller kobject is a bit clunky and
becomes a maintenance problem as more attributes are added. The
attribute group infrastructure handles this better, so use that.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

0521905e

nvme: code command_id with a genctr for use-after-free validation · e7006de6

Sagi Grimberg authored Jun 16, 2021

We cannot detect a (perhaps buggy) controller that is sending us
a completion for a request that was already completed (for example
sending a completion twice), this phenomenon was seen in the wild
a few times.

So to protect against this, we use the upper 4 msbits of the nvme sqe
command_id to use as a 4-bit generation counter and verify it matches
the existing request generation that is incrementing on every execution.

The 16-bit command_id structure now is constructed by:
| xxxx | xxxxxxxxxxxx |
  gen    request tag

This means that we are giving up some possible queue depth as 12 bits
allow for a maximum queue depth of 4095 instead of 65536, however we
never create such long queues anyways so no real harm done.
Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>

e7006de6

nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu data · 3b01a9d0

Sagi Grimberg authored Jun 16, 2021

We already validate it when receiving the c2hdata pdu header
and this is not changing so this is a redundant check.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>

3b01a9d0

nvme-pci: limit maximum queue depth to 4095 · 27453b45

Sagi Grimberg authored Jun 16, 2021

We are going to use the upper 4-bits of the command_id for a generation
counter, so enforce the new queue depth upper limit. As we enforce
both min and max queue depth, use param_set_uint_minmax istead of
open coding it.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>

27453b45

params: lift param_set_uint_minmax to common code · 2a14c9ae

Sagi Grimberg authored Jun 16, 2021

It is a useful helper hence move it to common code so others can enjoy
it.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

2a14c9ae

14 Aug, 2021 1 commit

remove the lightnvm subsystem · 9ea9b9c4

Christoph Hellwig authored Aug 12, 2021

Lightnvm supports the OCSSD 1.x and 2.0 specs which were early attempts
to produce Open Channel SSDs and never made it into the NVMe spec
proper. They have since been superceeded by NVMe enhancements such
as ZNS support. Remove the support per the deprecation schedule.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210812132308.38486-1-hch@lst.deReviewed-by: Matias Bjørling <mb@lightnvm.io>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9ea9b9c4

13 Aug, 2021 7 commits

nbd: reduce the nbd_index_mutex scope · 6e4df4c6

Christoph Hellwig authored Aug 11, 2021

nbd_index_mutex is currently held over add_disk and inside ->open, which
leads to lock order reversals.  Refactor the device creation code path
so that nbd_dev_add is called without nbd_index_mutex lock held and
only takes it for the IDR insertation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210811124428.2368491-7-hch@lst.de
[axboe: fix whitespace]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6e4df4c6

nbd: refactor device search and allocation in nbd_genl_connect · 6177b56c

Christoph Hellwig authored Aug 11, 2021

Use idr_for_each_entry instead of the awkward callback to find an
existing device for the index == -1 case, and de-duplicate the device
allocation if no existing device was found.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210811124428.2368491-6-hch@lst.deReviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6177b56c

nbd: return the allocated nbd_device from nbd_dev_add · 7bdc00cf

Christoph Hellwig authored Aug 11, 2021

Return the device we just allocated instead of doing an extra search for
it in the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210811124428.2368491-5-hch@lst.deReviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7bdc00cf

nbd: remove nbd_del_disk · 327b501b

Christoph Hellwig authored Aug 11, 2021

Fold nbd_del_disk and remove the pointless NULL check on ->disk given
that it is always set for a successfully allocated nbd_device structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210811124428.2368491-4-hch@lst.deReviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

327b501b

nbd: refactor device removal · 3f74e064

Christoph Hellwig authored Aug 11, 2021

Share common code for the synchronous and workqueue based device removal,
and remove the pointless use of refcount_dec_and_mutex_lock.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210811124428.2368491-3-hch@lst.deReviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3f74e064

nbd: do del_gendisk() asynchronously for NBD_DESTROY_ON_DISCONNECT · 68c9417b

Hou Tao authored Aug 11, 2021

Now open_mutex is used to synchronize partition operations (e.g,
blk_drop_partitions() and blkdev_reread_part()), however it makes
nbd driver broken, because nbd may call del_gendisk() in nbd_release()
or nbd_genl_disconnect() if NBD_CFLAG_DESTROY_ON_DISCONNECT is enabled,
and deadlock occurs, as shown below:

// AB-BA dead-lock
nbd_genl_disconnect            blkdev_open
  nbd_disconnect_and_put
                                 lock bd_mutex
  // last ref
  nbd_put
    lock nbd_index_mutex
      del_gendisk
                                   nbd_open
                                     try lock nbd_index_mutex
        try lock bd_mutex

 or

// AA dead-lock
nbd_release
  lock bd_mutex
    nbd_put
      try lock bd_mutex

Instead of fixing block layer (e.g, introduce another lock), fixing
the nbd driver to call del_gendisk() in a kworker when
NBD_DESTROY_ON_DISCONNECT is enabled. When NBD_DESTROY_ON_DISCONNECT
is disabled, nbd device will always be destroy through module removal,
and there is no risky of deadlock.

To ensure the reuse of nbd index succeeds, moving the calling of
idr_remove() after del_gendisk(), so if the reused index is not found
in nbd_index_idr, the old disk must have been deleted. And reusing
the existing destroy_complete mechanism to ensure nbd_genl_connect()
will wait for the completion of del_gendisk().

Also adding a new workqueue for nbd removal, so nbd_cleanup()
can ensure all removals complete before exits.

Reported-by: syzbot+0fe7752e52337864d29b@syzkaller.appspotmail.com
Fixes: c76f48eb ("block: take bd_mutex around delete_partitions in del_gendisk")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210811124428.2368491-2-hch@lst.deReviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

68c9417b

nbd: add the check to prevent overflow in __nbd_ioctl() · fad7cd33

Baokun Li authored Aug 04, 2021

If user specify a large enough value of NBD blocks option, it may trigger
signed integer overflow which may lead to nbd->config->bytesize becomes a
large or small value, zero in particular.

UBSAN: Undefined behaviour in drivers/block/nbd.c:325:31
signed integer overflow:
1024 * 4611686155866341414 cannot be represented in type 'long long int'
[...]
Call trace:
[...]
 handle_overflow+0x188/0x1dc lib/ubsan.c:192
 __ubsan_handle_mul_overflow+0x34/0x44 lib/ubsan.c:213
 nbd_size_set drivers/block/nbd.c:325 [inline]
 __nbd_ioctl drivers/block/nbd.c:1342 [inline]
 nbd_ioctl+0x998/0xa10 drivers/block/nbd.c:1395
 __blkdev_driver_ioctl block/ioctl.c:311 [inline]
[...]

Although it is not a big deal, still silence the UBSAN by limit
the input value.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210804021212.990223-1-libaokun1@huawei.com
[axboe: dropped unlikely()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fad7cd33