Commits · b4f664252f51e119e9403ef84b6e9ff36d119510 · Kirill Smelkov / linux

14 Jan, 2021 5 commits

Merge tag 'nvme-5.11-2021-01-14' of git://git.infradead.org/nvme into block-5.11 · b4f66425

Jens Axboe authored Jan 14, 2021

Pull NVMe fixes from Christoph:

"nvme fixes for 5.11:

 - don't initialize hwmon for discover controllers (Sagi Grimberg)
 - fix iov_iter handling in nvme-tcp (Sagi Grimberg)
 - fix a preempt warning in nvme-tcp (Sagi Grimberg)
 - fix a possible NULL pointer dereference in nvme (Israel Rukshin)"

* tag 'nvme-5.11-2021-01-14' of git://git.infradead.org/nvme:
  nvme: don't intialize hwmon for discovery controllers
  nvme-tcp: fix possible data corruption with bio merges
  nvme-tcp: Fix warning with CONFIG_DEBUG_PREEMPT
  nvmet-rdma: Fix NULL deref when setting pi_enable and traddr INADDR_ANY

b4f66425

nvme: don't intialize hwmon for discovery controllers · 5ab25a32

Sagi Grimberg authored Jan 13, 2021

Discovery controllers usually don't support smart log page command.
So when we connect to the discovery controller we see this warning:
nvme nvme0: Failed to read smart log (error 24577)
nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 192.168.123.1:8009
nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"

Introduce a new helper to understand if the controller is a discovery
controller and use this helper to skip nvme_init_hwmon (also use it in
other places that we check if the controller is a discovery controller).

Fixes: 400b6a7b ("nvme: Add hardware monitoring support")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

5ab25a32

nvme-tcp: fix possible data corruption with bio merges · ca1ff67d

Sagi Grimberg authored Jan 13, 2021

When a bio merges, we can get a request that spans multiple
bios, and the overall request payload size is the sum of
all bios. When we calculate how much we need to send
from the existing bio (and bvec), we did not take into
account the iov_iter byte count cap.

Since multipage bvecs support, bvecs can split in the middle
which means that when we account for the last bvec send we
should also take the iov_iter byte count cap as it might be
lower than the last bvec size.
Reported-by: Hao Wang <pkuwangh@gmail.com>
Fixes: 3f2304f8 ("nvme-tcp: add NVMe over TCP host driver")
Tested-by: Hao Wang <pkuwangh@gmail.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

ca1ff67d

nvme-tcp: Fix warning with CONFIG_DEBUG_PREEMPT · ada83177

Sagi Grimberg authored Jan 13, 2021

We shouldn't call smp_processor_id() in a preemptible
context, but this is advisory at best, so instead
call __smp_processor_id().

Fixes: db5ad6b7 ("nvme-tcp: try to send request in queue_rq context")
Reported-by: Or Gerlitz <gerlitz.or@gmail.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

ada83177

nvmet-rdma: Fix NULL deref when setting pi_enable and traddr INADDR_ANY · 7a846656

Israel Rukshin authored Jan 10, 2021

When setting port traddr to INADDR_ANY, the listening cm_id->device
is NULL. The associate IB device is known only when a connect request
event arrives, so checking T10-PI device capability should be done
at this stage.

Fixes: b09160c3 ("nvmet-rdma: add metadata/T10-PI support")
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

7a846656

09 Jan, 2021 5 commits

bcache: set bcache device into read-only mode for BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET · 5342fd42

Coly Li authored Jan 04, 2021

If BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET is set in incompat feature
set, it means the cache device is created with obsoleted layout with
obso_bucket_site_hi. Now bcache does not support this feature bit, a new
BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE incompat feature bit is added
for a better layout to support large bucket size.

For the legacy compatibility purpose, if a cache device created with
obsoleted BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET feature bit, all bcache
devices attached to this cache set should be set to read-only. Then the
dirty data can be written back to backing device before re-create the
cache device with BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE feature bit
by the latest bcache-tools.

This patch checks BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET feature bit
when running a cache set and attach a bcache device to the cache set. If
this bit is set,
- When run a cache set, print an error kernel message to indicate all
  following attached bcache device will be read-only.
- When attach a bcache device, print an error kernel message to indicate
  the attached bcache device will be read-only, and ask users to update
  to latest bcache-tools.

Such change is only for cache device whose bucket size >= 32MB, this is
for the zoned SSD and almost nobody uses such large bucket size at this
moment. If you don't explicit set a large bucket size for a zoned SSD,
such change is totally transparent to your bcache device.

Fixes: ffa47032 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5342fd42

bcache: introduce BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE for large bucket · b16671e8

Coly Li authored Jan 04, 2021

When large bucket feature was added, BCH_FEATURE_INCOMPAT_LARGE_BUCKET
was introduced into the incompat feature set. It used bucket_size_hi
(which was added at the tail of struct cache_sb_disk) to extend current
16bit bucket size to 32bit with existing bucket_size in struct
cache_sb_disk.

This is not a good idea, there are two obvious problems,
- Bucket size is always value power of 2, if store log2(bucket size) in
  existing bucket_size of struct cache_sb_disk, it is unnecessary to add
  bucket_size_hi.
- Macro csum_set() assumes d[SB_JOURNAL_BUCKETS] is the last member in
  struct cache_sb_disk, bucket_size_hi was added after d[] which makes
  csum_set calculate an unexpected super block checksum.

To fix the above problems, this patch introduces a new incompat feature
bit BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE, when this bit is set, it
means bucket_size in struct cache_sb_disk stores the order of power-of-2
bucket size value. When user specifies a bucket size larger than 32768
sectors, BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE will be set to
incompat feature set, and bucket_size stores log2(bucket size) more
than store the real bucket size value.

The obsoleted BCH_FEATURE_INCOMPAT_LARGE_BUCKET won't be used anymore,
it is renamed to BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET and still only
recognized by kernel driver for legacy compatible purpose. The previous
bucket_size_hi is renmaed to obso_bucket_size_hi in struct cache_sb_disk
and not used in bcache-tools anymore.

For cache device created with BCH_FEATURE_INCOMPAT_LARGE_BUCKET feature,
bcache-tools and kernel driver still recognize the feature string and
display it as "obso_large_bucket".

With this change, the unnecessary extra space extend of bcache on-disk
super block can be avoided, and csum_set() may generate expected check
sum as well.

Fixes: ffa47032 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b16671e8

bcache: check unsupported feature sets for bcache register · 1dfc0686

Coly Li authored Jan 04, 2021

This patch adds the check for features which is incompatible for
current supported feature sets.

Now if the bcache device created by bcache-tools has features that
current kernel doesn't support, read_super() will fail with error
messoage. E.g. if an unsupported incompatible feature detected,
bcache register will fail with dmesg "bcache: register_bcache() error :
Unsupported incompatible feature found".

Fixes: d721a43f ("bcache: increase super block version for cache device and backing device")
Fixes: ffa47032 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1dfc0686

bcache: fix typo from SUUP to SUPP in features.h · f7b4943d

Coly Li authored Jan 04, 2021

This patch fixes the following typos,
from BCH_FEATURE_COMPAT_SUUP to BCH_FEATURE_COMPAT_SUPP
from BCH_FEATURE_INCOMPAT_SUUP to BCH_FEATURE_INCOMPAT_SUPP
from BCH_FEATURE_INCOMPAT_SUUP to BCH_FEATURE_RO_COMPAT_SUPP

Fixes: d721a43f ("bcache: increase super block version for cache device and backing device")
Fixes: ffa47032 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f7b4943d

bcache: set pdev_set_uuid before scond loop iteration · e8092707

Yi Li authored Jan 04, 2021

There is no need to reassign pdev_set_uuid in the second loop iteration,
so move it to the place before second loop.
Signed-off-by: Yi Li <yili@winhong.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e8092707

08 Jan, 2021 7 commits

blk-mq-debugfs: Add decode for BLK_MQ_F_TAG_HCTX_SHARED · 02f938e9

John Garry authored Jan 08, 2021

Showing the hctx flags for when BLK_MQ_F_TAG_HCTX_SHARED is set gives
something like:

root@debian:/home/john# more /sys/kernel/debug/block/sda/hctx0/flags
alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED|3

Add the decoding for that flag.

Fixes: 32bc15af ("blk-mq: Facilitate a shared sbitmap per tagset")
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

02f938e9

block/rnbd-clt: avoid module unload race with close confirmation · 3a21777c

Jack Wang authored Jan 08, 2021

We had kernel panic, it is caused by unload module and last
close confirmation.

call trace:
[1196029.743127]  free_sess+0x15/0x50 [rtrs_client]
[1196029.743128]  rtrs_clt_close+0x4c/0x70 [rtrs_client]
[1196029.743129]  ? rnbd_clt_unmap_device+0x1b0/0x1b0 [rnbd_client]
[1196029.743130]  close_rtrs+0x25/0x50 [rnbd_client]
[1196029.743131]  rnbd_client_exit+0x93/0xb99 [rnbd_client]
[1196029.743132]  __x64_sys_delete_module+0x190/0x260

And in the crashdump confirmation kworker is also running.
PID: 6943   TASK: ffff9e2ac8098000  CPU: 4   COMMAND: "kworker/4:2"
 #0 [ffffb206cf337c30] __schedule at ffffffff9f93f891
 #1 [ffffb206cf337cc8] schedule at ffffffff9f93fe98
 #2 [ffffb206cf337cd0] schedule_timeout at ffffffff9f943938
 #3 [ffffb206cf337d50] wait_for_completion at ffffffff9f9410a7
 #4 [ffffb206cf337da0] __flush_work at ffffffff9f08ce0e
 #5 [ffffb206cf337e20] rtrs_clt_close_conns at ffffffffc0d5f668 [rtrs_client]
 #6 [ffffb206cf337e48] rtrs_clt_close at ffffffffc0d5f801 [rtrs_client]
 #7 [ffffb206cf337e68] close_rtrs at ffffffffc0d26255 [rnbd_client]
 #8 [ffffb206cf337e78] free_sess at ffffffffc0d262ad [rnbd_client]
 #9 [ffffb206cf337e88] rnbd_clt_put_dev at ffffffffc0d266a7 [rnbd_client]

The problem is both code path try to close same session, which lead to
panic.

To fix it, just skip the sess if the refcount already drop to 0.

Fixes: f7a7a5c2 ("block/rnbd: client: main functionality")
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3a21777c

block/rnbd: Adding name to the Contributors List · ef8048dd

Swapnil Ingle authored Jan 08, 2021

Adding name to the Contributors List
Signed-off-by: Swapnil Ingle <ingleswapnil@gmail.com>
Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Acked-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ef8048dd

block/rnbd-clt: Fix sg table use after free · 80f99093

Guoqing Jiang authored Jan 08, 2021

Since dynamically allocate sglist is used for rnbd_iu, we can't free sg
table after send_usr_msg since the callback function (cqe.done) could
still access the sglist.

Otherwise KASAN reports UAF issue:

[ 4856.600257] BUG: KASAN: use-after-free in dma_direct_unmap_sg+0x53/0x290
[ 4856.600772] Read of size 4 at addr ffff888206af3a98 by task swapper/1/0

[ 4856.601729] CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Tainted: G        W         5.10.0-pserver #5.10.0-1+feature+linux+next+20201214.1025+0910d71
[ 4856.601748] Hardware name: Supermicro Super Server/X11DDW-L, BIOS 3.3 02/21/2020
[ 4856.601766] Call Trace:
[ 4856.601785]  <IRQ>
[ 4856.601822]  dump_stack+0x99/0xcb
[ 4856.601856]  ? dma_direct_unmap_sg+0x53/0x290
[ 4856.601888]  print_address_description.constprop.7+0x1e/0x230
[ 4856.601913]  ? freeze_kernel_threads+0x73/0x73
[ 4856.601965]  ? mark_held_locks+0x29/0xa0
[ 4856.602019]  ? dma_direct_unmap_sg+0x53/0x290
[ 4856.602039]  ? dma_direct_unmap_sg+0x53/0x290
[ 4856.602079]  kasan_report.cold.9+0x37/0x7c
[ 4856.602188]  ? mlx5_ib_post_recv+0x430/0x520 [mlx5_ib]
[ 4856.602209]  ? dma_direct_unmap_sg+0x53/0x290
[ 4856.602256]  dma_direct_unmap_sg+0x53/0x290
[ 4856.602366]  complete_rdma_req+0x188/0x4b0 [rtrs_client]
[ 4856.602451]  ? rtrs_clt_close+0x80/0x80 [rtrs_client]
[ 4856.602535]  ? mlx5_ib_poll_cq+0x48b/0x16e0 [mlx5_ib]
[ 4856.602589]  ? radix_tree_insert+0x3a0/0x3a0
[ 4856.602610]  ? do_raw_spin_lock+0x119/0x1d0
[ 4856.602647]  ? rwlock_bug.part.1+0x60/0x60
[ 4856.602740]  rtrs_clt_rdma_done+0x3f7/0x670 [rtrs_client]
[ 4856.602804]  ? rtrs_clt_rdma_cm_handler+0xda0/0xda0 [rtrs_client]
[ 4856.602857]  ? check_flags.part.31+0x6c/0x1f0
[ 4856.602927]  ? rcu_read_lock_sched_held+0xaf/0xe0
[ 4856.602963]  ? rcu_read_lock_bh_held+0xc0/0xc0
[ 4856.603137]  __ib_process_cq+0x10a/0x350 [ib_core]
[ 4856.603309]  ib_poll_handler+0x41/0x1c0 [ib_core]
[ 4856.603358]  irq_poll_softirq+0xe6/0x280
[ 4856.603392]  ? lockdep_hardirqs_on_prepare+0x111/0x210
[ 4856.603446]  __do_softirq+0x10d/0x646
[ 4856.603540]  asm_call_irq_on_stack+0x12/0x20
[ 4856.603563]  </IRQ>

[ 4856.605096] Allocated by task 8914:
[ 4856.605510]  kasan_save_stack+0x19/0x40
[ 4856.605532]  __kasan_kmalloc.constprop.7+0xc1/0xd0
[ 4856.605552]  __kmalloc+0x155/0x320
[ 4856.605574]  __sg_alloc_table+0x155/0x1c0
[ 4856.605594]  sg_alloc_table+0x1f/0x50
[ 4856.605620]  send_msg_sess_info+0x119/0x2e0 [rnbd_client]
[ 4856.605646]  remap_devs+0x71/0x210 [rnbd_client]
[ 4856.605676]  init_sess+0xad8/0xe10 [rtrs_client]
[ 4856.605706]  rtrs_clt_reconnect_work+0xd6/0x170 [rtrs_client]
[ 4856.605728]  process_one_work+0x521/0xa90
[ 4856.605748]  worker_thread+0x65/0x5b0
[ 4856.605769]  kthread+0x1f2/0x210
[ 4856.605789]  ret_from_fork+0x22/0x30

[ 4856.606159] Freed by task 8914:
[ 4856.606559]  kasan_save_stack+0x19/0x40
[ 4856.606580]  kasan_set_track+0x1c/0x30
[ 4856.606601]  kasan_set_free_info+0x1b/0x30
[ 4856.606622]  __kasan_slab_free+0x108/0x150
[ 4856.606642]  slab_free_freelist_hook+0x64/0x190
[ 4856.606661]  kfree+0xe2/0x650
[ 4856.606681]  __sg_free_table+0xa4/0x100
[ 4856.606707]  send_msg_sess_info+0x1d6/0x2e0 [rnbd_client]
[ 4856.606733]  remap_devs+0x71/0x210 [rnbd_client]
[ 4856.606763]  init_sess+0xad8/0xe10 [rtrs_client]
[ 4856.606792]  rtrs_clt_reconnect_work+0xd6/0x170 [rtrs_client]
[ 4856.606813]  process_one_work+0x521/0xa90
[ 4856.606833]  worker_thread+0x65/0x5b0
[ 4856.606853]  kthread+0x1f2/0x210
[ 4856.606872]  ret_from_fork+0x22/0x30

The solution is to free iu's sgtable after the iu is not used anymore.
And also move sg_alloc_table into rnbd_get_iu accordingly.

Fixes: 5a1328d0 ("block/rnbd-clt: Dynamically allocate sglist for rnbd_iu")
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

80f99093

block/rnbd-srv: Fix use after free in rnbd_srv_sess_dev_force_close · 1a84e7c6

Jack Wang authored Jan 08, 2021

KASAN detect following BUG:
[  778.215311] ==================================================================
[  778.216696] BUG: KASAN: use-after-free in rnbd_srv_sess_dev_force_close+0x38/0x60 [rnbd_server]
[  778.219037] Read of size 8 at addr ffff88b1d6516c28 by task tee/8842

[  778.220500] CPU: 37 PID: 8842 Comm: tee Kdump: loaded Not tainted 5.10.0-pserver #5.10.0-1+feature+linux+next+20201214.1025+0910d71
[  778.220529] Hardware name: Supermicro Super Server/X11DDW-L, BIOS 3.3 02/21/2020
[  778.220555] Call Trace:
[  778.220609]  dump_stack+0x99/0xcb
[  778.220667]  ? rnbd_srv_sess_dev_force_close+0x38/0x60 [rnbd_server]
[  778.220715]  print_address_description.constprop.7+0x1e/0x230
[  778.220750]  ? freeze_kernel_threads+0x73/0x73
[  778.220896]  ? rnbd_srv_sess_dev_force_close+0x38/0x60 [rnbd_server]
[  778.220932]  ? rnbd_srv_sess_dev_force_close+0x38/0x60 [rnbd_server]
[  778.220994]  kasan_report.cold.9+0x37/0x7c
[  778.221066]  ? kobject_put+0x80/0x270
[  778.221102]  ? rnbd_srv_sess_dev_force_close+0x38/0x60 [rnbd_server]
[  778.221184]  rnbd_srv_sess_dev_force_close+0x38/0x60 [rnbd_server]
[  778.221240]  rnbd_srv_dev_session_force_close_store+0x6a/0xc0 [rnbd_server]
[  778.221304]  ? sysfs_file_ops+0x90/0x90
[  778.221353]  kernfs_fop_write+0x141/0x240
[  778.221451]  vfs_write+0x142/0x4d0
[  778.221553]  ksys_write+0xc0/0x160
[  778.221602]  ? __ia32_sys_read+0x50/0x50
[  778.221684]  ? lockdep_hardirqs_on_prepare+0x13d/0x210
[  778.221718]  ? syscall_enter_from_user_mode+0x1c/0x50
[  778.221821]  do_syscall_64+0x33/0x40
[  778.221862]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  778.221896] RIP: 0033:0x7f4affdd9504
[  778.221928] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 48 8d 05 f9 61 0d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 49 89 d4 55 48 89 f5 53
[  778.221956] RSP: 002b:00007fffebb36b28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  778.222011] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f4affdd9504
[  778.222038] RDX: 0000000000000002 RSI: 00007fffebb36c50 RDI: 0000000000000003
[  778.222066] RBP: 00007fffebb36c50 R08: 0000556a151aa600 R09: 00007f4affeb1540
[  778.222094] R10: fffffffffffffc19 R11: 0000000000000246 R12: 0000556a151aa520
[  778.222121] R13: 0000000000000002 R14: 00007f4affea6760 R15: 0000000000000002

[  778.222764] Allocated by task 3212:
[  778.223285]  kasan_save_stack+0x19/0x40
[  778.223316]  __kasan_kmalloc.constprop.7+0xc1/0xd0
[  778.223347]  kmem_cache_alloc_trace+0x186/0x350
[  778.223382]  rnbd_srv_rdma_ev+0xf16/0x1690 [rnbd_server]
[  778.223422]  process_io_req+0x4d1/0x670 [rtrs_server]
[  778.223573]  __ib_process_cq+0x10a/0x350 [ib_core]
[  778.223709]  ib_cq_poll_work+0x31/0xb0 [ib_core]
[  778.223743]  process_one_work+0x521/0xa90
[  778.223773]  worker_thread+0x65/0x5b0
[  778.223802]  kthread+0x1f2/0x210
[  778.223833]  ret_from_fork+0x22/0x30

[  778.224296] Freed by task 8842:
[  778.224800]  kasan_save_stack+0x19/0x40
[  778.224829]  kasan_set_track+0x1c/0x30
[  778.224860]  kasan_set_free_info+0x1b/0x30
[  778.224889]  __kasan_slab_free+0x108/0x150
[  778.224919]  slab_free_freelist_hook+0x64/0x190
[  778.224947]  kfree+0xe2/0x650
[  778.224982]  rnbd_destroy_sess_dev+0x2fa/0x3b0 [rnbd_server]
[  778.225011]  kobject_put+0xda/0x270
[  778.225046]  rnbd_srv_sess_dev_force_close+0x30/0x60 [rnbd_server]
[  778.225081]  rnbd_srv_dev_session_force_close_store+0x6a/0xc0 [rnbd_server]
[  778.225111]  kernfs_fop_write+0x141/0x240
[  778.225140]  vfs_write+0x142/0x4d0
[  778.225169]  ksys_write+0xc0/0x160
[  778.225198]  do_syscall_64+0x33/0x40
[  778.225227]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[  778.226506] The buggy address belongs to the object at ffff88b1d6516c00
                which belongs to the cache kmalloc-512 of size 512
[  778.227464] The buggy address is located 40 bytes inside of
                512-byte region [ffff88b1d6516c00, ffff88b1d6516e00)

The problem is in the sess_dev release function we call
rnbd_destroy_sess_dev, and could free the sess_dev already, but we still
set the keep_id in rnbd_srv_sess_dev_force_close, which lead to use
after free.

To fix it, move the keep_id before the sysfs removal, and cache the
rnbd_srv_session for lock accessing,

Fixes: 78699805 ("block/rnbd-srv: close a mapped device from server side.")
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1a84e7c6

block/rnbd: Select SG_POOL for RNBD_CLIENT · 74acfa99

Jack Wang authored Jan 08, 2021

lkp reboot following build error:
 drivers/block/rnbd/rnbd-clt.c: In function 'rnbd_softirq_done_fn':
>> drivers/block/rnbd/rnbd-clt.c:387:2: error: implicit declaration of function 'sg_free_table_chained' [-Werror=implicit-function-declaration]
     387 |  sg_free_table_chained(&iu->sgt, RNBD_INLINE_SG_CNT);
         |  ^~~~~~~~~~~~~~~~~~~~~

The reason is CONFIG_SG_POOL is not enabled in the config, to
avoid such failure, select SG_POOL in Kconfig for RNBD_CLIENT.

Fixes: 5a1328d0 ("block/rnbd-clt: Dynamically allocate sglist for rnbd_iu")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

74acfa99

block: pre-initialize struct block_device in bdev_alloc_inode · 2d2f6f1b

Christoph Hellwig authored Jan 07, 2021

bdev_evict_inode and bdev_free_inode are also called for the root inode
of bdevfs, for which bdev_alloc is never called.  Move the zeroing o
f struct block_device and the initialization of the bd_bdi field into
bdev_alloc_inode to make sure they are initialized for the root inode
as well.

Fixes: e6cb5382 ("block: initialize struct block_device in bdev_alloc")
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2d2f6f1b

07 Jan, 2021 2 commits

Merge tag 'nvme-5.11-2021-01-07' of git://git.infradead.org/nvme into block-5.11 · 04b1ecb6

Jens Axboe authored Jan 07, 2021

Pull NVMe updates from Christoph:

"nvme updates for 5.11:

 - fix a race in the nvme-tcp send code (Sagi Grimberg)
 - fix a list corruption in an nvme-rdma error path (Israel Rukshin)
 - avoid a possible double fetch in nvme-pci (Lalithambika Krishnakumar)
 - add the susystem NQN quirk for a Samsung driver (Gopal Tiwari)
 - fix two compiler warnings in nvme-fcloop (James Smart)
 - don't call sleeping functions from irq context in nvme-fc (James Smart)
 - remove an unused argument (Max Gurtovoy)
 - remove unused exports (Minwoo Im)"

* tag 'nvme-5.11-2021-01-07' of git://git.infradead.org/nvme:
  nvme: remove the unused status argument from nvme_trace_bio_complete
  nvmet-rdma: Fix list_del corruption on queue establishment failure
  nvme: unexport functions with no external caller
  nvme: avoid possible double fetch in handling CQE
  nvme-tcp: Fix possible race of io_work and direct send
  nvme-pci: mark Samsung PM1725a as IGNORE_DEV_SUBNQN
  nvme-fcloop: Fix sscanf type and list_first_entry_or_null warnings
  nvme-fc: avoid calling _nvme_fc_abort_outstanding_ios from interrupt context

04b1ecb6

fs: Fix freeze_bdev()/thaw_bdev() accounting of bd_fsfreeze_sb · 04a6a536

Satya Tangirala authored Dec 24, 2020

freeze/thaw_bdev() currently use bdev->bd_fsfreeze_count to infer
whether or not bdev->bd_fsfreeze_sb is valid (it's valid iff
bd_fsfreeze_count is non-zero). thaw_bdev() doesn't nullify
bd_fsfreeze_sb.

But this means a freeze_bdev() call followed by a thaw_bdev() call can
leave bd_fsfreeze_sb with a non-null value, while bd_fsfreeze_count is
zero. If freeze_bdev() is called again, and this time
get_active_super() returns NULL (e.g. because the FS is unmounted),
we'll end up with bd_fsfreeze_count > 0, but bd_fsfreeze_sb is
*untouched* - it stays the same (now garbage) value. A subsequent
thaw_bdev() will decide that the bd_fsfreeze_sb value is legitimate
(since bd_fsfreeze_count > 0), and attempt to use it.

Fix this by always setting bd_fsfreeze_sb to NULL when
bd_fsfreeze_count is successfully decremented to 0 in thaw_sb().
Alternatively, we could set bd_fsfreeze_sb to whatever
get_active_super() returns in freeze_bdev() whenever bd_fsfreeze_count
is successfully incremented to 1 from 0 (which can be achieved cleanly
by moving the line currently setting bd_fsfreeze_sb to immediately
after the "sync:" label, but it might be a little too subtle/easily
overlooked in future).

This fixes the currently panicking xfstests generic/085.

Fixes: 040f04bd ("fs: simplify freeze_bdev/thaw_bdev")
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

04a6a536

06 Jan, 2021 8 commits

nvme: remove the unused status argument from nvme_trace_bio_complete · 2b59787a

Max Gurtovoy authored Jan 05, 2021

The only used argument in this function is the "req".
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

2b59787a

nvmet-rdma: Fix list_del corruption on queue establishment failure · 9ceb7863

Israel Rukshin authored Jan 05, 2021

When a queue is in NVMET_RDMA_Q_CONNECTING state, it may has some
requests at rsp_wait_list. In case a disconnect occurs at this
state, no one will empty this list and will return the requests to
free_rsps list. Normally nvmet_rdma_queue_established() free those
requests after moving the queue to NVMET_RDMA_Q_LIVE state, but in
this case __nvmet_rdma_queue_disconnect() is called before. The
crash happens at nvmet_rdma_free_rsps() when calling
list_del(&rsp->free_list), because the request exists only at
the wait list. To fix the issue, simply clear rsp_wait_list when
destroying the queue.
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

9ceb7863

nvme: unexport functions with no external caller · 9b66fc02

Minwoo Im authored Dec 30, 2020

There are no callers for nvme_reset_ctrl_sync() and
nvme_alloc_request_qid() so that we keep the symbols exported.

Unexport those functions, mark them static and update the header file
respectively.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

9b66fc02

nvme: avoid possible double fetch in handling CQE · 62df8016

Lalithambika Krishnakumar authored Dec 23, 2020

While handling the completion queue, keep a local copy of the command id
from the DMA-accessible completion entry. This silences a time-of-check
to time-of-use (TOCTOU) warning from KF/x[1], with respect to a
Thunderclap[2] vulnerability analysis. The double-read impact appears
benign.

There may be a theoretical window for @command_id to be used as an
adversary-controlled array-index-value for mounting a speculative
execution attack, but that mitigation is saved for a potential follow-on.
A man-in-the-middle attack on the data payload is out of scope for this
analysis and is hopefully mitigated by filesystem integrity mechanisms.

[1] https://github.com/intel/kernel-fuzzer-for-xen-project
[2] http://thunderclap.io/thunderclap-paper-ndss2019.pdfSigned-off-by: Lalithambika Krishna Kumar <lalithambika.krishnakumar@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

62df8016

nvme-tcp: Fix possible race of io_work and direct send · 5c11f7d9

Sagi Grimberg authored Dec 21, 2020

We may send a request (with or without its data) from two paths:

  1. From our I/O context nvme_tcp_io_work which is triggered from:
    - queue_rq
    - r2t reception
    - socket data_ready and write_space callbacks
  2. Directly from queue_rq if the send_list is empty (because we want to
     save the context switch associated with scheduling our io_work).

However, given that now we have the send_mutex, we may run into a race
condition where none of these contexts will send the pending payload to
the controller. Both io_work send path and queue_rq send path
opportunistically attempt to acquire the send_mutex however queue_rq only
attempts to send a single request, and if io_work context fails to
acquire the send_mutex it will complete without rescheduling itself.

The race can trigger with the following sequence:

  1. queue_rq sends request (no incapsule data) and blocks
  2. RX path receives r2t - prepares data PDU to send, adds h2cdata PDU
     to the send_list and schedules io_work
  3. io_work triggers and cannot acquire the send_mutex - because of (1),
     ends without self rescheduling
  4. queue_rq completes the send, and completes

==> no context will send the h2cdata - timeout.

Fix this by having queue_rq sending as much as it can from the send_list
such that if it still has any left, its because the socket buffer is
full and the socket write_space callback will trigger, thus guaranteeing
that a context will be scheduled to send the h2cdata PDU.

Fixes: db5ad6b7 ("nvme-tcp: try to send request in queue_rq context")
Reported-by: Potnuri Bharat Teja <bharat@chelsio.com>
Reported-by: Samuel Jones <sjones@kalrayinc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

5c11f7d9

nvme-pci: mark Samsung PM1725a as IGNORE_DEV_SUBNQN · 7ee5c78c

Gopal Tiwari authored Dec 04, 2020

A system with more than one of these SSDs will only have one usable.
Hence the kernel fails to detect nvme devices due to duplicate cntlids.

[    6.274554] nvme nvme1: Duplicate cntlid 33 with nvme0, rejecting
[    6.274566] nvme nvme1: Removing after probe failure status: -22

Adding the NVME_QUIRK_IGNORE_DEV_SUBNQN quirk to resolves the issue.
Signed-off-by: Gopal Tiwari <gtiwari@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

7ee5c78c

nvme-fcloop: Fix sscanf type and list_first_entry_or_null warnings · 2b54996b

James Smart authored Dec 07, 2020

Kernel robot had the following warnings:

>> fcloop.c:1506:6: warning: %x in format string (no. 1) requires
>> 'unsigned int *' but the argument type is 'signed int *'.
>> [invalidScanfArgType_int]
>>    if (sscanf(buf, "%x:%d:%d", &opcode, &starting, &amount) != 3)
>>        ^

Resolve by changing opcode from and int to an unsigned int

and

>>  fcloop.c:1632:32: warning: Uninitialized variable: lport [uninitvar]
>>     ret = __wait_localport_unreg(lport);
>>                                  ^

>>  fcloop.c:1615:28: warning: Uninitialized variable: nport [uninitvar]
>>     ret = __remoteport_unreg(nport, rport);
>>                              ^

These aren't actual issues as the values are assigned prior to use.
It appears the tool doesn't understand list_first_entry_or_null().
Regardless, quiet the tool by initializing the pointers to NULL at
declaration.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

2b54996b

nvme-fc: avoid calling _nvme_fc_abort_outstanding_ios from interrupt context · 19fce047

James Smart authored Dec 01, 2020

Recent patches changed calling sequences. nvme_fc_abort_outstanding_ios
used to be called from a timeout or work context. Now it is being called
in an io completion context, which can be an interrupt handler.
Unfortunately, the abort outstanding ios routine attempts to stop nvme
queues and nested routines that may try to sleep, which is in conflict
with the interrupt handler.

Correct replacing the direct call with a work element scheduling, and the
abort outstanding ios routine will be called in the work element.

Fixes: 95ced8a2 ("nvme-fc: eliminate terminate_io use by nvme_fc_error_recovery")
Signed-off-by: James Smart <james.smart@broadcom.com>
Reported-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>

19fce047

05 Jan, 2021 3 commits

block: fix use-after-free in disk_part_iter_next · aebf5db9

Ming Lei authored Dec 21, 2020

Make sure that bdgrab() is done on the 'block_device' instance before
referring to it for avoiding use-after-free.

Cc: <stable@vger.kernel.org>
Reported-by: syzbot+825f0f9657d4e528046e@syzkaller.appspotmail.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

aebf5db9

bfq: Fix computation of shallow depth · 6d4d2735

Jan Kara authored Dec 10, 2020

BFQ computes number of tags it allows to be allocated for each request type
based on tag bitmap. However it uses 1 << bitmap.shift as number of
available tags which is wrong. 'shift' is just an internal bitmap value
containing logarithm of how many bits bitmap uses in each bitmap word.
Thus number of tags allowed for some request types can be far to low.
Use proper bitmap.depth which has the number of tags instead.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6d4d2735

blk-iocost: fix NULL iocg deref from racing against initialization · d16baa3f

Tejun Heo authored Jan 05, 2021

When initializing iocost for a queue, its rqos should be registered before
the blkcg policy is activated to allow policy data initiailization to lookup
the associated ioc. This unfortunately means that the rqos methods can be
called on bios before iocgs are attached to all existing blkgs.

While the race is theoretically possible on ioc_rqos_throttle(), it mostly
happened in ioc_rqos_merge() due to the difference in how they lookup ioc.
The former determines it from the passed in @rqos and then bails before
dereferencing iocg if the looked up ioc is disabled, which most likely is
the case if initialization is still in progress. The latter looked up ioc by
dereferencing the possibly NULL iocg making it a lot more prone to actually
triggering the bug.

* Make ioc_rqos_merge() use the same method as ioc_rqos_throttle() to look
  up ioc for consistency.

* Make ioc_rqos_throttle() and ioc_rqos_merge() test for NULL iocg before
  dereferencing it.

* Explain the danger of NULL iocgs in blk_iocost_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jonathan Lemon <bsd@fb.com>
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d16baa3f

03 Jan, 2021 2 commits

lightnvm: select CONFIG_CRC32 · 19cd3403

Arnd Bergmann authored Jan 03, 2021

Without CRC32 support, this fails to link:

arm-linux-gnueabi-ld: drivers/lightnvm/pblk-init.o: in function `pblk_init':
pblk-init.c:(.text+0x2654): undefined reference to `crc32_le'
arm-linux-gnueabi-ld: drivers/lightnvm/pblk-init.o: in function `pblk_exit':
pblk-init.c:(.text+0x2a7c): undefined reference to `crc32_le'

Fixes: a4bd217b ("lightnvm: physical block device (pblk) target")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

19cd3403

block: rsxx: select CONFIG_CRC32 · 36a106a4

Arnd Bergmann authored Jan 03, 2021

Without crc32, the driver fails to link:

arm-linux-gnueabi-ld: drivers/block/rsxx/config.o: in function `rsxx_load_config':
config.c:(.text+0x124): undefined reference to `crc32_le'

Fixes: 8722ff8c ("block: IBM RamSan 70/80 device driver")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

36a106a4

29 Dec, 2020 2 commits

block: add debugfs stanza for QUEUE_FLAG_NOWAIT · dc304326

Andres Freund authored Dec 28, 2020

This was missed in 021a2446. Leads to the numeric value of
QUEUE_FLAG_NOWAIT (i.e. 29) showing up in
/sys/kernel/debug/block/*/state.

Fixes: 021a2446
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

dc304326

fs: block_dev.c: fix kernel-doc warnings from struct block_device changes · 875b2376

Randy Dunlap authored Dec 28, 2020

Fix new kernel-doc warnings in fs/block_dev.c:

../fs/block_dev.c:1066: warning: Excess function parameter 'whole' description in 'bd_abort_claiming'
../fs/block_dev.c:1837: warning: Function parameter or member 'dev' not described in 'lookup_bdev'

Fixes: 4e7b5671 ("block: remove i_bdev")
Fixes: 37c3fc9a ("block: simplify the block device claiming interface")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-fsdevel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

875b2376

27 Dec, 2020 6 commits

Linux 5.11-rc1 · 5c8fe583
Linus Torvalds authored Dec 27, 2020

5c8fe583

proc mountinfo: make splice available again · 14e3e989

Linus Torvalds authored Dec 27, 2020

Since commit 36e2c742 ("fs: don't allow splice read/write without
explicit ops") we've required that file operation structures explicitly
enable splice support, rather than falling back to the default handlers.

Most /proc files use the indirect 'struct proc_ops' to describe their
file operations, and were fixed up to support splice earlier in commits
40be821d..b24c30c6, but the mountinfo files interact with the
VFS directly using their own 'struct file_operations' and got missed as
a result.

This adds the necessary support for splice to work for /proc/*/mountinfo
and friends.
Reported-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Reported-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=209971
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

14e3e989

Merge tag 'ntb-5.11' of git://github.com/jonmason/ntb · 52cd5f9c

Linus Torvalds authored Dec 27, 2020

Pull NTB fixes from Jon Mason:
 "Bug fix for IDT NTB and Intel NTB LTR management support"

* tag 'ntb-5.11' of git://github.com/jonmason/ntb:
  ntb: intel: add Intel NTB LTR vendor support for gen4 NTB
  ntb: idt: fix error check in ntb_hw_idt.c

52cd5f9c

Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · 33c148a4

Linus Torvalds authored Dec 27, 2020

Pull crypto fixes from Herbert Xu:
 "Fix a number of autobuild failures due to missing Kconfig
  dependencies"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: qat - add CRYPTO_AES to Kconfig dependencies
  crypto: keembay - Add dependency on HAS_IOMEM
  crypto: keembay - CRYPTO_DEV_KEEMBAY_OCS_AES_SM4 should depend on ARCH_KEEMBAY

33c148a4

Merge tag 'objtool-urgent-2020-12-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · cce622ab

Linus Torvalds authored Dec 27, 2020

Pull objtool fix from Ingo Molnar:
 "Fix a segfault that occurs when built with Clang"

* tag 'objtool-urgent-2020-12-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Fix seg fault with Clang non-section symbols

cce622ab

Merge tag 'locking-urgent-2020-12-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 6be5f582

Linus Torvalds authored Dec 27, 2020

Pull locking fixes from Ingo Molnar:
 "Misc fixes/updates:

   - Fix static keys usage in module __init sections

   - Add separate MAINTAINERS entry for static branches/calls

   - Fix lockdep splat with CONFIG_PREEMPTIRQ_EVENTS=y tracing"

* tag 'locking-urgent-2020-12-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  softirq: Avoid bad tracing / lockdep interaction
  jump_label/static_call: Add MAINTAINERS
  jump_label: Fix usage in module __init

6be5f582