Commits · 9b84c629c90374498ab5825dede74a06ea1c775b · Kirill Smelkov / linux

29 Oct, 2021 1 commit

blk-mq-debugfs: Show active requests per queue for shared tags · 9b84c629

John Garry authored Oct 29, 2021

Currently we show the hctx.active value for the per-hctx "active" file.

However this is not maintained for shared tags, and we instead keep a
record of the number active requests per request queue - see commit
f1b49fdc ("blk-mq: Record active_queues_shared_sbitmap per tag_set for
when using shared sbitmap).

Change for the case of shared tags to show the active requests per request
queue by using __blk_mq_active_requests() helper.
Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1635496823-33515-1-git-send-email-john.garry@huawei.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

9b84c629

28 Oct, 2021 1 commit

block: improve readability of blk_mq_end_request_batch() · 02f7eab0

Jens Axboe authored Oct 28, 2021

It's faster and easier to read if we tolerate cur_hctx being NULL in
the "when to flush" condition. Rename last_hctx to cur_hctx while at it,
as it better describes the role of that variable.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

02f7eab0

27 Oct, 2021 13 commits

virtio-blk: Use blk_validate_block_size() to validate block size · 57a13a5b

Xie Yongji authored Oct 26, 2021

The block layer can't support a block size larger than
page size yet. And a block size that's too small or
not a power of two won't work either. If a misconfigured
device presents an invalid block size in configuration space,
it will result in the kernel crash something like below:

[  506.154324] BUG: kernel NULL pointer dereference, address: 0000000000000008
[  506.160416] RIP: 0010:create_empty_buffers+0x24/0x100
[  506.174302] Call Trace:
[  506.174651]  create_page_buffers+0x4d/0x60
[  506.175207]  block_read_full_page+0x50/0x380
[  506.175798]  ? __mod_lruvec_page_state+0x60/0xa0
[  506.176412]  ? __add_to_page_cache_locked+0x1b2/0x390
[  506.177085]  ? blkdev_direct_IO+0x4a0/0x4a0
[  506.177644]  ? scan_shadow_nodes+0x30/0x30
[  506.178206]  ? lru_cache_add+0x42/0x60
[  506.178716]  do_read_cache_page+0x695/0x740
[  506.179278]  ? read_part_sector+0xe0/0xe0
[  506.179821]  read_part_sector+0x36/0xe0
[  506.180337]  adfspart_check_ICS+0x32/0x320
[  506.180890]  ? snprintf+0x45/0x70
[  506.181350]  ? read_part_sector+0xe0/0xe0
[  506.181906]  bdev_disk_changed+0x229/0x5c0
[  506.182483]  blkdev_get_whole+0x6d/0x90
[  506.183013]  blkdev_get_by_dev+0x122/0x2d0
[  506.183562]  device_add_disk+0x39e/0x3c0
[  506.184472]  virtblk_probe+0x3f8/0x79b [virtio_blk]
[  506.185461]  virtio_dev_probe+0x15e/0x1d0 [virtio]

So let's use a block layer helper to validate the block size.
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20211026144015.188-5-xieyongji@bytedance.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

57a13a5b

loop: Use blk_validate_block_size() to validate block size · af3c570f

Xie Yongji authored Oct 26, 2021

Remove loop_validate_block_size() and use the block layer helper
to validate block size.
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Link: https://lore.kernel.org/r/20211026144015.188-4-xieyongji@bytedance.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

af3c570f

nbd: Use blk_validate_block_size() to validate block size · c4318d6c

Xie Yongji authored Oct 26, 2021

Use the block layer helper to validate block size instead
of open coding it.
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Link: https://lore.kernel.org/r/20211026144015.188-3-xieyongji@bytedance.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

c4318d6c

block: Add a helper to validate the block size · 570b1cac

Xie Yongji authored Oct 26, 2021

There are some duplicated codes to validate the block
size in block drivers. This limitation actually comes
from block layer, so this patch tries to add a new block
layer helper for that.
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Link: https://lore.kernel.org/r/20211026144015.188-2-xieyongji@bytedance.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

570b1cac

block: re-flow blk_mq_rq_ctx_init() · c7b84d42

Jens Axboe authored Oct 19, 2021

Now that we have flags passed in, we can do a final re-arrange of the
flow of blk_mq_rq_ctx_init() so we're always writing request in the
order in which it is laid out.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20211019153300.623322-5-axboe@kernel.dkSigned-off-by: Jens Axboe <axboe@kernel.dk>

c7b84d42

block: prefetch request to be initialized · 92aff191

Jens Axboe authored Oct 19, 2021

Now we have the tags available in __blk_mq_alloc_requests_batch(), we
can start fetching the first request cacheline before calling into the
request initialization.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20211019153300.623322-4-axboe@kernel.dkSigned-off-by: Jens Axboe <axboe@kernel.dk>

92aff191

block: pass in blk_mq_tags to blk_mq_rq_ctx_init() · fe6134f6

Jens Axboe authored Oct 19, 2021

Instead of getting this from data for every invocation of request
initialization, pass it in as an argument instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20211019153300.623322-3-axboe@kernel.dkSigned-off-by: Jens Axboe <axboe@kernel.dk>

fe6134f6

block: add rq_flags to struct blk_mq_alloc_data · 56f8da64

Jens Axboe authored Oct 19, 2021

There's a hole here we can use, and it's faster to set this earlier
rather than need to check q->elevator multiple times.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20211019153300.623322-2-axboe@kernel.dkSigned-off-by: Jens Axboe <axboe@kernel.dk>

56f8da64

block: add async version of bio_set_polled · 842e39b0

Pavel Begunkov authored Oct 27, 2021

If we know that a iocb is async we can optimise bio_set_polled() a bit,
add a new helper bio_set_polled_async().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8fa137885164a5d05fadcff4c3521da8d5a83d00.1635337135.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

842e39b0

block: kill DIO_MULTI_BIO · e71aa913

Pavel Begunkov authored Oct 27, 2021

Now __blkdev_direct_IO() serves only multi-bio I/O, thus remove
not used anymore single bio refcounting optimisations.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/88eb488aae9ed4852a30f3a7132f296f56e43b80.1635337135.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

e71aa913

block: kill unused polling bits in __blkdev_direct_IO() · 25d207dc

Pavel Begunkov authored Oct 27, 2021

With addition of __blkdev_direct_IO_async(), __blkdev_direct_IO() now
serves only multio-bio I/O, which we don't poll. Now we can remove
anything related to I/O polling from it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b8c597a6b7ee612df394853bfd24726aee5b898e.1635337135.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

25d207dc

block: avoid extra iter advance with async iocb · 1bb6b810

Pavel Begunkov authored Oct 27, 2021

Nobody cares about iov iterators state if we return -EIOCBQUEUED, so as
the we now have __blkdev_direct_IO_async(), which gets pages only once,
we can skip expensive iov_iter_advance(). It's around 1-2% of all CPU
spent.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a6158edfbfa2ae3bc24aed29a72f035df18fad2f.1635337135.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

1bb6b810

block: Add independent access ranges support · a2247f19

Damien Le Moal authored Oct 27, 2021

The Concurrent Positioning Ranges VPD page (for SCSI) and data log page
(for ATA) contain parameters describing the set of contiguous LBAs that
can be served independently by a single LUN multi-actuator hard-disk.
Similarly, a logically defined block device composed of multiple disks
can in some cases execute requests directed at different sector ranges
in parallel. A dm-linear device aggregating 2 block devices together is
an example.

This patch implements support for exposing a block device independent
access ranges to the user through sysfs to allow optimizing device
accesses to increase performance.

To describe the set of independent sector ranges of a device (actuators
of a multi-actuator HDDs or table entries of a dm-linear device),
The type struct blk_independent_access_ranges is introduced. This
structure describes the sector ranges using an array of
struct blk_independent_access_range structures. This range structure
defines the start sector and number of sectors of the access range.
The ranges in the array cannot overlap and must contain all sectors
within the device capacity.

The function disk_set_independent_access_ranges() allows a device
driver to signal to the block layer that a device has multiple
independent access ranges.  In this case, a struct
blk_independent_access_ranges is attached to the device request queue
by the function disk_set_independent_access_ranges(). The function
disk_alloc_independent_access_ranges() is provided for drivers to
allocate this structure.

struct blk_independent_access_ranges contains kobjects (struct kobject)
to expose to the user through sysfs the set of independent access ranges
supported by a device. When the device is initialized, sysfs
registration of the ranges information is done from blk_register_queue()
using the block layer internal function
disk_register_independent_access_ranges(). If a driver calls
disk_set_independent_access_ranges() for a registered queue, e.g. when a
device is revalidated, disk_set_independent_access_ranges() will execute
disk_register_independent_access_ranges() to update the sysfs attribute
files.  The sysfs file structure created starts from the
independent_access_ranges sub-directory and contains the start sector
and number of sectors of each range, with the information for each range
grouped in numbered sub-directories.

E.g. for a dual actuator HDD, the user sees:

$ tree /sys/block/sdk/queue/independent_access_ranges/
/sys/block/sdk/queue/independent_access_ranges/
|-- 0
|   |-- nr_sectors
|   `-- sector
`-- 1
    |-- nr_sectors
    `-- sector

For a regular device with a single access range, the
independent_access_ranges sysfs directory does not exist.

Device revalidation may lead to changes to this structure and to the
attribute values. When manipulated, the queue sysfs_lock and
sysfs_dir_lock mutexes are held for atomicity, similarly to how the
blk-mq and elevator sysfs queue sub-directories are protected.

The code related to the management of independent access ranges is
added in the new file block/blk-ia-ranges.c.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20211027022223.183838-2-damien.lemoal@wdc.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

a2247f19

26 Oct, 2021 1 commit

blk-mq: don't issue request directly in case that current is to be blocked · ff155223

Ming Lei authored Oct 26, 2021

When flushing plug list in case that current will be blocked, we can't
issue request directly because ->queue_rq() may sleep, otherwise scheduler
may complain.

Fixes: dc5fc361 ("block: attempt direct issue of plug list")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211026082257.2889890-1-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

ff155223

25 Oct, 2021 4 commits

sbitmap: silence data race warning · 9f8b93a7

Jens Axboe authored Oct 25, 2021

KCSAN complaints about the sbitmap hint update:

==================================================================
BUG: KCSAN: data-race in sbitmap_queue_clear / sbitmap_queue_clear

write to 0xffffe8ffffd145b8 of 4 bytes by interrupt on cpu 1:
 sbitmap_queue_clear+0xca/0xf0 lib/sbitmap.c:606
 blk_mq_put_tag+0x82/0x90
 __blk_mq_free_request+0x114/0x180 block/blk-mq.c:507
 blk_mq_free_request+0x2c8/0x340 block/blk-mq.c:541
 __blk_mq_end_request+0x214/0x230 block/blk-mq.c:565
 blk_mq_end_request+0x37/0x50 block/blk-mq.c:574
 lo_complete_rq+0xca/0x170 drivers/block/loop.c:541
 blk_complete_reqs block/blk-mq.c:584 [inline]
 blk_done_softirq+0x69/0x90 block/blk-mq.c:589
 __do_softirq+0x12c/0x26e kernel/softirq.c:558
 run_ksoftirqd+0x13/0x20 kernel/softirq.c:920
 smpboot_thread_fn+0x22f/0x330 kernel/smpboot.c:164
 kthread+0x262/0x280 kernel/kthread.c:319
 ret_from_fork+0x1f/0x30

write to 0xffffe8ffffd145b8 of 4 bytes by interrupt on cpu 0:
 sbitmap_queue_clear+0xca/0xf0 lib/sbitmap.c:606
 blk_mq_put_tag+0x82/0x90
 __blk_mq_free_request+0x114/0x180 block/blk-mq.c:507
 blk_mq_free_request+0x2c8/0x340 block/blk-mq.c:541
 __blk_mq_end_request+0x214/0x230 block/blk-mq.c:565
 blk_mq_end_request+0x37/0x50 block/blk-mq.c:574
 lo_complete_rq+0xca/0x170 drivers/block/loop.c:541
 blk_complete_reqs block/blk-mq.c:584 [inline]
 blk_done_softirq+0x69/0x90 block/blk-mq.c:589
 __do_softirq+0x12c/0x26e kernel/softirq.c:558
 run_ksoftirqd+0x13/0x20 kernel/softirq.c:920
 smpboot_thread_fn+0x22f/0x330 kernel/smpboot.c:164
 kthread+0x262/0x280 kernel/kthread.c:319
 ret_from_fork+0x1f/0x30

value changed: 0x00000035 -> 0x00000044

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 10 Comm: ksoftirqd/0 Not tainted 5.15.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
==================================================================

which is a data race, but not an important one. This is just updating the
percpu alloc hint, and the reader of that hint doesn't ever require it to
be valid.

Just annotate it with data_race() to silence this one.

Reported-by: syzbot+4f8bfd804b4a1f95b8f6@syzkaller.appspotmail.com
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9f8b93a7

blk-cgroup: synchronize blkg creation against policy deactivation · 0c9d338c

Yu Kuai authored Oct 20, 2021

Our test reports a null pointer dereference:

[  168.534653] ==================================================================
[  168.535614] Disabling lock debugging due to kernel taint
[  168.536346] BUG: kernel NULL pointer dereference, address: 0000000000000008
[  168.537274] #PF: supervisor read access in kernel mode
[  168.537964] #PF: error_code(0x0000) - not-present page
[  168.538667] PGD 0 P4D 0
[  168.539025] Oops: 0000 [#1] PREEMPT SMP KASAN
[  168.539656] CPU: 13 PID: 759 Comm: bash Tainted: G    B             5.15.0-rc2-next-202100
[  168.540954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_0738364
[  168.542736] RIP: 0010:bfq_pd_init+0x88/0x1e0
[  168.543318] Code: 98 00 00 00 e8 c9 e4 5b ff 4c 8b 65 00 49 8d 7c 24 08 e8 bb e4 5b ff 4d0
[  168.545803] RSP: 0018:ffff88817095f9c0 EFLAGS: 00010002
[  168.546497] RAX: 0000000000000001 RBX: ffff888101a1c000 RCX: 0000000000000000
[  168.547438] RDX: 0000000000000003 RSI: 0000000000000002 RDI: ffff888106553428
[  168.548402] RBP: ffff888106553400 R08: ffffffff961bcaf4 R09: 0000000000000001
[  168.549365] R10: ffffffffa2e16c27 R11: fffffbfff45c2d84 R12: 0000000000000000
[  168.550291] R13: ffff888101a1c098 R14: ffff88810c7a08c8 R15: ffffffffa55541a0
[  168.551221] FS:  00007fac75227700(0000) GS:ffff88839ba80000(0000) knlGS:0000000000000000
[  168.552278] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  168.553040] CR2: 0000000000000008 CR3: 0000000165ce7000 CR4: 00000000000006e0
[  168.554000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  168.554929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  168.555888] Call Trace:
[  168.556221]  <TASK>
[  168.556510]  blkg_create+0x1c0/0x8c0
[  168.556989]  blkg_conf_prep+0x574/0x650
[  168.557502]  ? stack_trace_save+0x99/0xd0
[  168.558033]  ? blkcg_conf_open_bdev+0x1b0/0x1b0
[  168.558629]  tg_set_conf.constprop.0+0xb9/0x280
[  168.559231]  ? kasan_set_track+0x29/0x40
[  168.559758]  ? kasan_set_free_info+0x30/0x60
[  168.560344]  ? tg_set_limit+0xae0/0xae0
[  168.560853]  ? do_sys_openat2+0x33b/0x640
[  168.561383]  ? do_sys_open+0xa2/0x100
[  168.561877]  ? __x64_sys_open+0x4e/0x60
[  168.562383]  ? __kasan_check_write+0x20/0x30
[  168.562951]  ? copyin+0x48/0x70
[  168.563390]  ? _copy_from_iter+0x234/0x9e0
[  168.563948]  tg_set_conf_u64+0x17/0x20
[  168.564467]  cgroup_file_write+0x1ad/0x380
[  168.565014]  ? cgroup_file_poll+0x80/0x80
[  168.565568]  ? __mutex_lock_slowpath+0x30/0x30
[  168.566165]  ? pgd_free+0x100/0x160
[  168.566649]  kernfs_fop_write_iter+0x21d/0x340
[  168.567246]  ? cgroup_file_poll+0x80/0x80
[  168.567796]  new_sync_write+0x29f/0x3c0
[  168.568314]  ? new_sync_read+0x410/0x410
[  168.568840]  ? __handle_mm_fault+0x1c97/0x2d80
[  168.569425]  ? copy_page_range+0x2b10/0x2b10
[  168.570007]  ? _raw_read_lock_bh+0xa0/0xa0
[  168.570622]  vfs_write+0x46e/0x630
[  168.571091]  ksys_write+0xcd/0x1e0
[  168.571563]  ? __x64_sys_read+0x60/0x60
[  168.572081]  ? __kasan_check_write+0x20/0x30
[  168.572659]  ? do_user_addr_fault+0x446/0xff0
[  168.573264]  __x64_sys_write+0x46/0x60
[  168.573774]  do_syscall_64+0x35/0x80
[  168.574264]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  168.574960] RIP: 0033:0x7fac74915130
[  168.575456] Code: 73 01 c3 48 8b 0d 58 ed 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 444
[  168.577969] RSP: 002b:00007ffc3080e288 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  168.578986] RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007fac74915130
[  168.579937] RDX: 0000000000000009 RSI: 000056007669f080 RDI: 0000000000000001
[  168.580884] RBP: 000056007669f080 R08: 000000000000000a R09: 00007fac75227700
[  168.581841] R10: 000056007655c8f0 R11: 0000000000000246 R12: 0000000000000009
[  168.582796] R13: 0000000000000001 R14: 00007fac74be55e0 R15: 00007fac74be08c0
[  168.583757]  </TASK>
[  168.584063] Modules linked in:
[  168.584494] CR2: 0000000000000008
[  168.584964] ---[ end trace 2475611ad0f77a1a ]---

This is because blkg_alloc() is called from blkg_conf_prep() without
holding 'q->queue_lock', and elevator is exited before blkg_create():

thread 1                            thread 2
blkg_conf_prep
 spin_lock_irq(&q->queue_lock);
 blkg_lookup_check -> return NULL
 spin_unlock_irq(&q->queue_lock);

 blkg_alloc
  blkcg_policy_enabled -> true
  pd = ->pd_alloc_fn
  blkg->pd[i] = pd
                                   blk_mq_exit_sched
                                    bfq_exit_queue
                                     blkcg_deactivate_policy
                                      spin_lock_irq(&q->queue_lock);
                                      __clear_bit(pol->plid, q->blkcg_pols);
                                      spin_unlock_irq(&q->queue_lock);
                                    q->elevator = NULL;
  spin_lock_irq(&q->queue_lock);
   blkg_create
    if (blkg->pd[i])
     ->pd_init_fn -> q->elevator is NULL
  spin_unlock_irq(&q->queue_lock);

Because blkcg_deactivate_policy() requires queue to be frozen, we can
grab q_usage_counter to synchoronize blkg_conf_prep() against
blkcg_deactivate_policy().

Fixes: e21b7a0b ("block, bfq: add full hierarchical scheduling and cgroups support")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20211020014036.2141723-1-yukuai3@huawei.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

0c9d338c

block: refactor bio_iov_bvec_set() · fa5fa8ec

Pavel Begunkov authored Oct 23, 2021

Combine bio_iov_bvec_set() and bio_iov_bvec_set_append() and let the
caller to do iov_iter_advance(). Also get rid of __bio_iov_bvec_set(),
which was duplicated in the final binary, and replace a weird
iov_iter_truncate() of a temporal iter copy with min() better reflecting
the intention.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/bcf1ac36fce769a514e19475f3623cd86a1d8b72.1635006010.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

fa5fa8ec

block: add single bio async direct IO helper · 54a88eb8

Pavel Begunkov authored Oct 23, 2021

As with __blkdev_direct_IO_simple(), we can implement direct IO more
efficiently if there is only one bio. Add __blkdev_direct_IO_async() and
blkdev_bio_end_io_async(). This patch brings me from 4.45-4.5 MIOPS with
nullblk to 4.7+.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f0ae4109b7a6934adede490f84d188d53b97051b.1635006010.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

54a88eb8

23 Oct, 2021 1 commit

sched: make task_struct->plug always defined · 599593a8

Jens Axboe authored Oct 22, 2021

If CONFIG_BLOCK isn't set, then it's an empty struct anyway. Just make
it generally available, so we don't break the compile:

kernel/sched/core.c: In function ‘sched_submit_work’:
kernel/sched/core.c:6346:35: error: ‘struct task_struct’ has no member named ‘plug’
 6346 |                 blk_flush_plug(tsk->plug, true);
      |                                   ^~
kernel/sched/core.c: In function ‘io_schedule_prepare’:
kernel/sched/core.c:8357:20: error: ‘struct task_struct’ has no member named ‘plug’
 8357 |         if (current->plug)
      |                    ^~
kernel/sched/core.c:8358:39: error: ‘struct task_struct’ has no member named ‘plug’
 8358 |                 blk_flush_plug(current->plug, true);
      |                                       ^~
Reported-by: Nathan Chancellor <nathan@kernel.org>
Fixes: 008f75a2 ("block: cleanup the flush plug helpers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

599593a8

22 Oct, 2021 2 commits

blk-mq-sched: Don't reference queue tagset in blk_mq_sched_tags_teardown() · 8bdf7b3f

John Garry authored Oct 22, 2021

We should not reference the queue tagset in blk_mq_sched_tags_teardown()
(see function comment) for the blk-mq flags, so use the passed flags
instead.

This solves a use-after-free, similarly fixed earlier (and since broken
again) in commit f0c1c4d2 ("blk-mq: fix use-after-free in
blk_mq_exit_sched").
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Fixes: e155b0c2 ("blk-mq: Use shared tags for shared sbitmap support")
Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1634890340-15432-1-git-send-email-john.garry@huawei.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

8bdf7b3f

block: fix req_bio_endio append error handling · 297db731

Pavel Begunkov authored Oct 22, 2021

Shinichiro Kawasaki reports that there is a bug in a recent
req_bio_endio() patch causing problems with zonefs. As Shinichiro
suggested, inverse the condition in zone append path to resemble how it
was before: fail when it's not fully completed.

Fixes: 478eb72b ("block: optimise req_bio_endio()")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/344ea4e334aace9148b41af5f2426da38c8aa65a.1634914228.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

297db731

21 Oct, 2021 15 commits

blk-crypto: update inline encryption documentation · 8e9f666a

Eric Biggers authored Oct 18, 2021

Rework most of inline-encryption.rst to be easier to follow, to correct
some information, to add some important details and remove some
unimportant details, and to take into account the renaming from
blk_keyslot_manager to blk_crypto_profile.
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20211018180453.40441-5-ebiggers@kernel.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>

8e9f666a

blk-crypto: rename blk_keyslot_manager to blk_crypto_profile · cb77cb5a

Eric Biggers authored Oct 18, 2021

blk_keyslot_manager is misnamed because it doesn't necessarily manage
keyslots.  It actually does several different things:

  - Contains the crypto capabilities of the device.

  - Provides functions to control the inline encryption hardware.
    Originally these were just for programming/evicting keyslots;
    however, new functionality (hardware-wrapped keys) will require new
    functions here which are unrelated to keyslots.  Moreover,
    device-mapper devices already (ab)use "keyslot_evict" to pass key
    eviction requests to their underlying devices even though
    device-mapper devices don't have any keyslots themselves (so it
    really should be "evict_key", not "keyslot_evict").

  - Sometimes (but not always!) it manages keyslots.  Originally it
    always did, but device-mapper devices don't have keyslots
    themselves, so they use a "passthrough keyslot manager" which
    doesn't actually manage keyslots.  This hack works, but the
    terminology is unnatural.  Also, some hardware doesn't have keyslots
    and thus also uses a "passthrough keyslot manager" (support for such
    hardware is yet to be upstreamed, but it will happen eventually).

Let's stop having keyslot managers which don't actually manage keyslots.
Instead, rename blk_keyslot_manager to blk_crypto_profile.

This is a fairly big change, since for consistency it also has to update
keyslot manager-related function names, variable names, and comments --
not just the actual struct name.  However it's still a fairly
straightforward change, as it doesn't change any actual functionality.

Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20211018180453.40441-4-ebiggers@kernel.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>

cb77cb5a

blk-crypto: rename keyslot-manager files to blk-crypto-profile · 1e8d44bd

Eric Biggers authored Oct 18, 2021

In preparation for renaming struct blk_keyslot_manager to struct
blk_crypto_profile, rename the keyslot-manager.h and keyslot-manager.c
source files.  Renaming these files separately before making a lot of
changes to their contents makes it easier for git to understand that
they were renamed.

Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20211018180453.40441-3-ebiggers@kernel.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>

1e8d44bd

blk-crypto-fallback: properly prefix function and struct names · eebcafae

Eric Biggers authored Oct 18, 2021

For clarity, avoid using just the "blk_crypto_" prefix for functions and
structs that are specific to blk-crypto-fallback.  Instead, use
"blk_crypto_fallback_".  Some places already did this, but others
didn't.

This is also a prerequisite for using "struct blk_crypto_keyslot" to
mean a generic blk-crypto keyslot (which is what it sounds like).
Rename the fallback one to "struct blk_crypto_fallback_keyslot".

No change in behavior.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20211018180453.40441-2-ebiggers@kernel.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>

eebcafae

nbd: Use invalidate_disk() helper on disconnect · 435c2acb

Xie Yongji authored Sep 22, 2021

When a nbd device encounters a writeback error, that error will
get propagated to the bd_inode's wb_err field. Then if this nbd
device's backend is disconnected and another is attached, we will
get back the previous writeback error on fsync, which is unexpected.

To fix it, let's use invalidate_disk() helper to invalidate the
disk on disconnect instead of just setting disk's capacity to zero.
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210922123711.187-5-xieyongji@bytedance.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

435c2acb

loop: Remove the unnecessary bdev checks and unused bdev variable · 19f553db

Xie Yongji authored Sep 22, 2021

The lo->lo_device can't be null if the lo->lo_backing_file is set.
So let's remove the unnecessary bdev checks and the entire bdev
variable in __loop_clr_fd() since the lo->lo_backing_file is already
checked before.
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210922123711.187-4-xieyongji@bytedance.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

19f553db

loop: Use invalidate_disk() helper to invalidate gendisk · e515be8f

Xie Yongji authored Sep 22, 2021

Use invalidate_disk() helper to simplify the code for gendisk
invalidation.
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210922123711.187-3-xieyongji@bytedance.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

e515be8f

block: Add invalidate_disk() helper to invalidate the gendisk · f059a1d2

Xie Yongji authored Sep 22, 2021

To hide internal implementation and simplify some driver code,
this adds a helper to invalidate the gendisk. It will clean the
gendisk's associated buffer/page caches and reset its internal
states.
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210922123711.187-2-xieyongji@bytedance.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

f059a1d2

block: kill extra rcu lock/unlock in queue enter · e94f6852

Pavel Begunkov authored Oct 21, 2021

blk_try_enter_queue() already takes rcu_read_lock/unlock, so we can
avoid the second pair in percpu_ref_tryget_live(), use a newly added
percpu_ref_tryget_live_rcu().

As rcu_read_lock/unlock imply barrier()s, it's pretty noticeable,
especially for for !CONFIG_PREEMPT_RCU (default for some distributions),
where __rcu_read_lock/unlock() are not inlined.

3.20% io_uring [kernel.vmlinux] [k] __rcu_read_unlock
3.05% io_uring [kernel.vmlinux] [k] __rcu_read_lock

2.52% io_uring [kernel.vmlinux] [k] __rcu_read_unlock
2.28% io_uring [kernel.vmlinux] [k] __rcu_read_lock
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6b11c67ea495ed9d44f067622d852de4a510ce65.1634822969.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

e94f6852

percpu_ref: percpu_ref_tryget_live() version holding RCU · 3b13c168

Pavel Begunkov authored Oct 21, 2021

Add percpu_ref_tryget_live_rcu(), which is a version of
percpu_ref_tryget_live() but the user is responsible for enclosing it in
a RCU read lock section.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Link: https://lore.kernel.org/r/3066500d7a6eb3e03f10adf98b87fdb3b1c49db8.1634822969.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

3b13c168

block: convert fops.c magic constants to SHIFT_SECTOR · 6549a874

Pavel Begunkov authored Oct 20, 2021

Don't use shifting by a magic number 9 but replace with a more
descriptive SHIFT_SECTOR.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/068782b9f7e97569fb59a99529b23bb17ea4c5e2.1634755800.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

6549a874

block: clean up blk_mq_submit_bio() merging · 179ae84f

Pavel Begunkov authored Oct 20, 2021

Combine blk_mq_sched_bio_merge() and blk_attempt_plug_merge() under a
common if, so we don't check it twice.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/daedc90d4029a5d1d73344771632b1faca3aaf81.1634755800.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

179ae84f

block: optimise boundary blkdev_read_iter's checks · 6450fe1f

Pavel Begunkov authored Oct 20, 2021

Combine pos and len checks and mark unlikely. Also, don't reexpand if
it's not truncated.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fff34e613aeaae1ad12977dc4592cb1a1f5d3190.1634755800.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

6450fe1f

fs: bdev: fix conflicting comment from lookup_bdev · 057178cf

Jackie Liu authored Oct 21, 2021

We switched to directly use dev_t to get block device, lookup changed the
meaning of use, now we fix this conflicting comment.

Fixes: 4e7b5671 ("block: remove i_bdev")
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211021071344.1600362-1-liu.yun@linux.devSigned-off-by: Jens Axboe <axboe@kernel.dk>

057178cf

blk-mq: Fix blk_mq_tagset_busy_iter() for shared tags · 0994c64e

John Garry authored Oct 18, 2021

Since it is now possible for a tagset to share a single set of tags, the
iter function should not re-iter the tags for the count of #hw queues in
that case. Rather it should just iter once.

Fixes: e155b0c2 ("blk-mq: Use shared tags for shared sbitmap support")
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Link: https://lore.kernel.org/r/1634550083-202815-1-git-send-email-john.garry@huawei.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

0994c64e

20 Oct, 2021 2 commits

block: cleanup the flush plug helpers · 008f75a2

Christoph Hellwig authored Oct 20, 2021

Consolidate the various helpers into a single blk_flush_plug helper that
takes a plk_plug and the from_scheduler bool and switch all callsites to
call it directly. Checks that the plug is non-NULL must be performed by
the caller, something that most already do anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211020144119.142582-5-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

008f75a2

block: optimise blk_flush_plug_list · b600455d

Pavel Begunkov authored Oct 20, 2021

Don't call flush_plug_callbacks if there are no plug callbacks.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211020144119.142582-4-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>

b600455d