Commits · 3e08773c3841e9db7a520908cc2b136a77d275ff · Kirill Smelkov / linux

18 Oct, 2021 40 commits

block: switch polling to be bio based · 3e08773c

Christoph Hellwig authored Oct 12, 2021

Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.

Polling for the bio itself leads to a few advantages:

 - the cookie construction can be made entirely private in blk-mq.c
 - the caller does not need to remember the request_queue and cookie
   separately and thus sidesteps their lifetime issues
 - keeping the device and the cookie inside the bio makes it trivial to
   support polling for bios remapped by stacking drivers
 - a lot of code to propagate the cookie back up the submission path can
   be removed entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3e08773c
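
As a rough illustration of why keeping the device and cookie inside the bio helps, here is a minimal userspace sketch. All names (toy_queue, toy_bio, toy_submit, toy_complete, toy_bio_poll) are hypothetical and are not the kernel's interfaces; the point is only that the caller holds a single object and never handles the queue or cookie itself.

```c
#include <stdbool.h>

#define TOY_SLOTS 8

/* Hypothetical stand-in for a device queue with per-cookie completion state. */
struct toy_queue {
    bool done[TOY_SLOTS];
    unsigned int next;
};

/* Hypothetical stand-in for a bio: it remembers where it was submitted. */
struct toy_bio {
    struct toy_queue *q;   /* device the I/O finally went to */
    unsigned int cookie;   /* private to the submission layer */
};

/* Submission records queue and cookie in the bio instead of returning them. */
void toy_submit(struct toy_bio *bio, struct toy_queue *q)
{
    bio->q = q;
    bio->cookie = q->next++ % TOY_SLOTS;
    q->done[bio->cookie] = false;
}

/* Completion side: mark the slot for this bio as finished. */
void toy_complete(struct toy_bio *bio)
{
    bio->q->done[bio->cookie] = true;
}

/* The caller polls with nothing but the bio. */
bool toy_bio_poll(const struct toy_bio *bio)
{
    return bio->q->done[bio->cookie];
}
```

If a stacking layer redirects the bio to a different toy_queue before calling toy_submit(), the poller still works unchanged, because it only ever looks at what the bio recorded.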

block: define 'struct bvec_iter' as packed · 19416123

Ming Lei authored Oct 12, 2021

'struct bvec_iter' is embedded into 'struct bio'. Define it as packed
so that we get an extra 4 bytes for other uses without expanding
bio.

'struct bvec_iter' is often allocated on the stack, so making it packed
doesn't affect performance. I also ran io_uring on both nvme and
null_blk and did not observe any performance effect from this change.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

19416123
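
The effect of packing can be sketched in userspace. The structs below are illustrative stand-ins (one 64-bit field followed by three 32-bit fields), not copies of the kernel definitions, and the packed attribute is the GCC/Clang spelling. On a typical 64-bit ABI the unpacked struct is padded to a multiple of 8 bytes, while the packed one is exactly 20 bytes, so a following 32-bit member can use the space that was padding.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative iterator layout; field names are assumptions for this sketch. */
struct iter_plain {
    uint64_t sector;
    uint32_t size;
    uint32_t idx;
    uint32_t done;
};

/* Same fields, but packed: no tail padding and alignment of 1. */
struct iter_packed {
    uint64_t sector;
    uint32_t size;
    uint32_t idx;
    uint32_t done;
} __attribute__((packed));

/* Hypothetical containers: an iterator followed by a 32-bit member. */
struct holder_plain {
    struct iter_plain it;
    uint32_t extra;
    void *p;
};

struct holder_packed {
    struct iter_packed it;
    uint32_t extra;   /* lands directly after the 20-byte iterator */
    void *p;
};

/* Bytes recovered per iterator by packing (ABI dependent, never negative). */
size_t iter_bytes_saved(void)
{
    return sizeof(struct iter_plain) - sizeof(struct iter_packed);
}
```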

block: use SLAB_TYPESAFE_BY_RCU for the bio slab · 1a7e76e4

Christoph Hellwig authored Oct 12, 2021

This flag ensures that the pages will not be reused for non-bio
allocations before the end of an RCU grace period. With that we can
safely use an RCU lookup for bio polling as long as we are fine with
occasionally polling the wrong device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1a7e76e4

block: rename REQ_HIPRI to REQ_POLLED · 6ce913fe

Christoph Hellwig authored Oct 12, 2021

Unlike the RWF_HIPRI userspace ABI which is intentionally kept vague,
the bio flag is specific to the polling implementation, so rename and
document it properly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6ce913fe

io_uring: don't sleep when polling for I/O · d729cf9a

Christoph Hellwig authored Oct 12, 2021

There is no point in sleeping for the expected I/O completion timeout
in the io_uring async polling model as we never poll for a specific
I/O.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d729cf9a

block: replace the spin argument to blk_iopoll with a flags argument · ef99b2d3

Christoph Hellwig authored Oct 12, 2021

Switch the boolean spin argument to blk_poll to a set of flags
instead.  This allows controlling polling behavior in a more
fine-grained way.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-10-hch@lst.de
[axboe: adapt to changed io_uring iopoll]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ef99b2d3
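
A minimal sketch of the bool-to-flags pattern, using a toy device rather than the block layer. The flag names (POLL_ONESHOT, POLL_NOSLEEP) and all toy_* identifiers are hypothetical and chosen for this example only. Where a boolean could only say "spin or don't", a flags word lets each behavior be switched independently and leaves room for more bits later.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this sketch. */
enum poll_flags {
    POLL_ONESHOT = 1u << 0,  /* make a single pass instead of spinning */
    POLL_NOSLEEP = 1u << 1,  /* skip the sleep step before polling */
};

/* Toy device that reports completion on the ready_after-th check. */
struct toy_dev {
    int polls;
    int sleeps;
    int ready_after;
};

static bool toy_check(struct toy_dev *d)
{
    return ++d->polls >= d->ready_after;
}

/* Returns 1 if a completion was found, 0 otherwise. */
int toy_poll(struct toy_dev *d, unsigned int flags)
{
    if (!(flags & POLL_NOSLEEP))
        d->sleeps++;  /* stand-in for sleeping before polling */

    do {
        if (toy_check(d))
            return 1;
    } while (!(flags & POLL_ONESHOT));

    return 0;
}
```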

blk-mq: remove blk_qc_t_valid · 28a1ae6b

Christoph Hellwig authored Oct 12, 2021

Move the trivial check into the only caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

28a1ae6b

blk-mq: remove blk_qc_t_to_tag and blk_qc_t_is_internal · efbabbe1

Christoph Hellwig authored Oct 12, 2021

Merge both functions into their only caller to keep the blk-mq tag to
blk_qc_t mapping as private as possible in blk-mq.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

efbabbe1

blk-mq: factor out a "classic" poll helper · c6699d6f

Christoph Hellwig authored Oct 12, 2021

Factor the code to do the classic full metal polling out of blk_poll into
a separate blk_mq_poll_classic helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c6699d6f

blk-mq: factor out a blk_qc_to_hctx helper · f70299f0

Christoph Hellwig authored Oct 12, 2021

Add a helper to get the hctx from a request_queue and cookie, and fold
the blk_qc_t_to_queue_num helper into it as no other callers are left.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f70299f0

io_uring: fix a layering violation in io_iopoll_req_issued · 30da1b45

Christoph Hellwig authored Oct 12, 2021

syscall-level code can't just poke into the details of the poll cookie,
which is private information of the block layer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012111226.760968-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

30da1b45

iomap: don't try to poll multi-bio I/Os in __iomap_dio_rw · f79d4749

Christoph Hellwig authored Oct 12, 2021

If an iocb is split into multiple bios we can't poll for both. So don't
bother to even try to poll in that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f79d4749

block: don't try to poll multi-bio I/Os in __blkdev_direct_IO · 71fc3f5e

Christoph Hellwig authored Oct 12, 2021

If an iocb is split into multiple bios we can't poll for both. So don't
even bother to try to poll in that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012111226.760968-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

71fc3f5e

direct-io: remove blk_poll support · 94c2ed58

Christoph Hellwig authored Oct 12, 2021

The polling support in the legacy direct-io support is a little crufty.
It already doesn't support the asynchronous polling needed for io_uring
polling, and is hard to adapt to upcoming changes in the polling
interfaces. Given that all the major file systems already use the iomap
direct I/O code, just drop the polling support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

94c2ed58

block: only check previous entry for plug merge attempt · d38a9c04

Jens Axboe authored Oct 14, 2021

Currently we scan the entire plug list, which is potentially very
expensive. In an IOPS bound workload, we can drive about 5.6M IOPS with
merging enabled, and profiling shows that the plug merge check is by far
the most expensive thing we're doing:

  Overhead  Command   Shared Object     Symbol
  +   20.89%  io_uring  [kernel.vmlinux]  [k] blk_attempt_plug_merge
  +    4.98%  io_uring  [kernel.vmlinux]  [k] io_submit_sqes
  +    4.78%  io_uring  [kernel.vmlinux]  [k] blkdev_direct_IO
  +    4.61%  io_uring  [kernel.vmlinux]  [k] blk_mq_submit_bio

Instead of browsing the whole list, just check the previously inserted
entry. That is enough for a naive merge check and will catch most cases,
and for devices that need full merging, the IO scheduler attached to
such devices will do that anyway. The plug merge is meant to be an
inexpensive check to avoid getting a request, but if we repeatedly
scan the list for every single insert, it is very much not a cheap
check.

With this patch, the workload instead runs at ~7.0M IOPS, providing
a 25% improvement. Disabling merging entirely yields another 5%
improvement.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d38a9c04
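
The idea of checking only the previously inserted entry can be sketched with a toy plug list. Everything here (toy_req, toy_plug, plug_add, the fixed capacity of 32) is a hypothetical simplification, and only back merges are modeled. The cost per insert is constant regardless of how many entries are plugged, at the price of missing merges with older entries.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_PLUG_MAX 32

/* Toy request: a contiguous range [start, start + len). */
struct toy_req {
    unsigned long start;
    unsigned long len;
};

struct toy_plug {
    struct toy_req reqs[TOY_PLUG_MAX];
    size_t count;
};

/* Back merge: the new I/O begins exactly where the request ends. */
static bool try_back_merge(struct toy_req *rq, unsigned long start,
                           unsigned long len)
{
    if (rq->start + rq->len != start)
        return false;
    rq->len += len;
    return true;
}

/*
 * Add an I/O to the plug. Only the most recently inserted request is
 * considered for merging; older entries are never scanned.
 * Returns true if the I/O was merged, false if it was appended
 * (or dropped because the toy plug is full).
 */
bool plug_add(struct toy_plug *p, unsigned long start, unsigned long len)
{
    if (p->count > 0 &&
        try_back_merge(&p->reqs[p->count - 1], start, len))
        return true;

    if (p->count < TOY_PLUG_MAX) {
        p->reqs[p->count].start = start;
        p->reqs[p->count].len = len;
        p->count++;
    }
    return false;
}
```

In this sketch an I/O that would merge with an older entry is simply appended, which mirrors the trade-off described above: the cheap check catches sequential submissions, and anything needing full merging is left to an I/O scheduler.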

block: move CONFIG_BLOCK guard to top Makefile · 4c928904

Masahiro Yamada authored Sep 27, 2021

Every object under block/ depends on CONFIG_BLOCK.

Move the guard to the top Makefile since there is no point in
descending into block/ if CONFIG_BLOCK=n.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210927140000.866249-5-masahiroy@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4c928904

block: move menu "Partition type" to block/partitions/Kconfig · b8b98a62

Masahiro Yamada authored Sep 27, 2021

Move the menu to the relevant place.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210927140000.866249-4-masahiroy@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b8b98a62

block: simplify Kconfig files · c50fca55

Masahiro Yamada authored Sep 27, 2021

Everything under block/ depends on BLOCK. BLOCK_HOLDER_DEPRECATED is
selected from drivers/md/Kconfig, which is entirely dependent on BLOCK.

Extend the 'if BLOCK' ... 'endif' so it covers the whole block/Kconfig.

Also, clean up the definition of BLOCK_COMPAT and BLK_MQ_PCI because
COMPAT and PCI are boolean.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210927140000.866249-3-masahiroy@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c50fca55

block: remove redundant =y from BLK_CGROUP dependency · df252bde

Masahiro Yamada authored Sep 27, 2021

CONFIG_BLK_CGROUP is a boolean option, that is, its value is 'y' or 'n'.
The comparison to 'y' is redundant.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210927140000.866249-2-masahiroy@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

df252bde

block: improve batched tag allocation · 349302da

Jens Axboe authored Oct 09, 2021

Add a blk_mq_get_tags() helper, which uses the new sbitmap API for
allocating a batch of tags all at once. This both simplifies the block
code for batched allocation, and it is also more efficient than just
doing repeated calls into __sbitmap_queue_get().

This reduces the sbitmap overhead in peak runs from ~3% to ~1% and
yields a performance increase from 6.6M IOPS to 6.8M IOPS for a single
CPU core.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

349302da

sbitmap: add __sbitmap_queue_get_batch() · 9672b0d4

Jens Axboe authored Oct 09, 2021

The block layer tag allocation batching still calls into sbitmap to get
each tag, but we can improve on that. Add __sbitmap_queue_get_batch(),
which returns a mask of tags all at once, along with an offset for
those tags.

An example return would be 0xff, where bits 0..7 are set, with
tag_offset == 128. The valid tags in this case would be 128..135.

A batch is specific to an individual sbitmap_map, hence it cannot be
larger than that. The requested number of tags is automatically reduced
to the max that can be satisfied with a single map.

On failure, 0 is returned. The caller should fall back to single tag
allocation at that point.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9672b0d4
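
The mask-plus-offset return value described above can be decoded as in the following sketch. tags_from_mask is a hypothetical helper written for illustration, not part of the sbitmap API: bit i set in the mask means tag tag_offset + i was allocated.

```c
#include <stddef.h>

/*
 * Expand a (mask, tag_offset) pair into individual tag numbers.
 * Writes at most max entries to tags[] and returns how many were written.
 * A mask of 0 (the failure case) yields no tags.
 */
size_t tags_from_mask(unsigned long mask, unsigned int tag_offset,
                      unsigned int *tags, size_t max)
{
    size_t n = 0;
    unsigned int bit;

    for (bit = 0; mask != 0 && n < max; bit++, mask >>= 1) {
        if (mask & 1UL)
            tags[n++] = tag_offset + bit;
    }
    return n;
}
```

With the example from the message, a mask of 0xff and tag_offset == 128 produce eight tags, 128 through 135.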

blk-mq: optimise *end_request non-stat path · 8971a3b7

Pavel Begunkov authored Oct 13, 2021

We already have a blk_mq_need_time_stamp() check in
__blk_mq_end_request() to get a timestamp. Hide all the statistics
accounting under it. This cuts some cycles for requests that don't need
stats, and is free otherwise.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/e0f2ea812e93a8adcd07101212e7d7e70ca304e7.1634115360.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8971a3b7

block: mark bio_truncate static · 4f7ab09a

Christoph Hellwig authored Oct 12, 2021

bio_truncate is only used in bio.c, so mark it static.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4f7ab09a

block: move bio_get_{first,last}_bvec out of bio.h · ff18d77b

Christoph Hellwig authored Oct 12, 2021

bio_get_first_bvec and bio_get_last_bvec are only used in blk-merge.c,
so move them there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ff18d77b

block: mark __bio_try_merge_page static · 9774b391

Christoph Hellwig authored Oct 12, 2021

Mark __bio_try_merge_page static and move it up a bit to avoid the need
for a forward declaration.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9774b391

block: move bio_full out of bio.h · 9a6083be

Christoph Hellwig authored Oct 12, 2021

bio_full is only used in bio.c, so move it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9a6083be

block: fold bio_cur_bytes into blk_rq_cur_bytes · b6559d8f

Christoph Hellwig authored Oct 12, 2021

Fold bio_cur_bytes into the only caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b6559d8f

block: move bio_mergeable out of bio.h · 8addffd6

Christoph Hellwig authored Oct 12, 2021

bio_mergeable is only needed by I/O schedulers, so move it to
blk-mq-sched.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8addffd6

block: don't include <linux/ioprio.h> in <linux/bio.h> · 11d9cab1

Christoph Hellwig authored Oct 12, 2021

bio.h doesn't need any of the definitions from ioprio.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

11d9cab1

block: remove BIO_BUG_ON · 9e8c0d0d

Christoph Hellwig authored Oct 12, 2021

BIO_DEBUG is always defined, so just switch the two instances to use
BUG_ON directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9e8c0d0d

blk-mq: inline hot part of __blk_mq_sched_restart · e9ea1596

Pavel Begunkov authored Oct 09, 2021

Extract a fast check out of __blk_mq_sched_restart() and inline it for
performance reasons.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/894abaa0998e5999f2fe18f271e5efdfc2c32bd2.1633781740.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e9ea1596
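
The general shape of this kind of change is an inline wrapper that performs a cheap test and only calls the out-of-line function when there is work to do. The sketch below uses hypothetical toy_* names and a plain bool in place of the real hardware-context state; it illustrates the pattern, not the kernel code.

```c
#include <stdbool.h>

/* Toy hardware context; the fields are assumptions for this sketch. */
struct toy_hctx {
    bool restart_pending;
    int restarts;
};

/* Slow path: kept out of line, only reached when a restart is pending. */
void toy_restart_slow(struct toy_hctx *h)
{
    h->restart_pending = false;
    h->restarts++;
}

/*
 * Fast path: an inline flag test. In the common case where nothing is
 * pending, the caller pays for one branch and no function call.
 */
static inline void toy_restart(struct toy_hctx *h)
{
    if (h->restart_pending)
        toy_restart_slow(h);
}
```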

block: inline hot paths of blk_account_io_*() · be6bfe36

Pavel Begunkov authored Oct 09, 2021

Extract hot paths of __blk_account_io_start() and
__blk_account_io_done() into inline functions, so we don't always pay
for function calls.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b0662a636bd4cc7b4f84c9d0a41efa46a688ef13.1633781740.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

be6bfe36

block: merge block_ioctl into blkdev_ioctl · 8a709512

Christoph Hellwig authored Oct 12, 2021

Simplify the ioctl path and match the code structure on the compat side.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104450.659013-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8a709512

block: move the *blkdev_ioctl declarations out of blkdev.h · 84b8514b

Christoph Hellwig authored Oct 12, 2021

These are only used inside of block/.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104450.659013-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

84b8514b

block: unexport blkdev_ioctl · fea349b0

Christoph Hellwig authored Oct 12, 2021

With the raw driver gone, there is no modular user left.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104450.659013-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fea349b0

block: don't dereference request after flush insertion · 4a60f360

Jens Axboe authored Oct 16, 2021

We could have a race here, where the request gets freed before we call
into blk_mq_run_hw_queue(). If this happens, we cannot rely on the state
of the request.

Grab the hardware context before inserting the flush.

Fixes: 0f38d766 ("blk-mq: cleanup blk_mq_submit_bio")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4a60f360

blk-mq: cleanup blk_mq_submit_bio · 0f38d766

Christoph Hellwig authored Oct 12, 2021

Move the blk_mq_alloc_data stack allocation only into the branch
that actually needs it, and use rq->mq_hctx instead of data.hctx
to refer to the hctx.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104045.658051-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0f38d766

blk-mq: cleanup and rename __blk_mq_alloc_request · b90cfaed

Christoph Hellwig authored Oct 12, 2021

The newly added loop for the cached requests in __blk_mq_alloc_request
is a little too convoluted for my taste, so unwind it a bit. Also
rename the function to __blk_mq_alloc_requests now that it can allocate
more than a single request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104045.658051-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b90cfaed

block: pre-allocate requests if plug is started and is a batch · 47c122e3

Jens Axboe authored Oct 06, 2021

The caller typically has a good (or even exact) idea of how many requests
it needs to submit. We can make the request/tag allocation a lot more
efficient if we just allocate N requests/tags upfront when we queue the
first bio from the batch.

Provide a new plug start helper that allows the caller to specify how many
IOs are expected. This sets plug->nr_ios, and we can use that for smarter
request allocation. The plug provides a holding spot for requests, and
request allocation will check it before calling into the normal request
allocation path.

When blk_finish_plug() is called, check if there are unused requests and
free them. This should not happen in normal operations. The exception is
if we get merging, then we may be left with requests that need freeing
when done.

This raises the per-core performance on my setup from ~5.8M to ~6.1M
IOPS.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

47c122e3
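
The flow described above (the caller announces the expected number of IOs, the first allocation grabs that many requests in one trip, later allocations are served from the plug, and leftovers are freed at the end) can be sketched with a toy plug. All names and the integer "request ids" below are hypothetical simplifications, not the block layer's data structures.

```c
#include <stddef.h>

#define TOY_MAX_CACHED 32

/* Toy plug: a holding spot for preallocated request ids. */
struct toy_plug {
    unsigned short nr_ios;          /* expected number of IOs in the batch */
    int cached[TOY_MAX_CACHED];     /* preallocated, not yet used */
    size_t nr_cached;
    unsigned int allocator_trips;   /* how often the "real" allocator ran */
    int next_id;                    /* stand-in for allocator state */
};

/* Stand-in for the normal allocation path: hands out n ids in one trip. */
static void toy_alloc_batch(struct toy_plug *p, size_t n)
{
    size_t i;

    p->allocator_trips++;
    for (i = 0; i < n && p->nr_cached < TOY_MAX_CACHED; i++)
        p->cached[p->nr_cached++] = p->next_id++;
}

/* Start a plug and record how many IOs the caller expects to submit. */
void toy_start_plug_nr_ios(struct toy_plug *p, unsigned short nr_ios)
{
    p->nr_ios = nr_ios;
    p->nr_cached = 0;
    p->allocator_trips = 0;
    p->next_id = 0;
}

/* Check the plug first; only fall back to the allocator when it is empty. */
int toy_get_request(struct toy_plug *p)
{
    if (p->nr_cached == 0) {
        toy_alloc_batch(p, p->nr_ios ? p->nr_ios : 1);
        p->nr_ios = 1;  /* any later refill fetches a single request */
    }
    return p->cached[--p->nr_cached];
}

/* Finish the plug; returns how many unused requests had to be freed. */
size_t toy_finish_plug(struct toy_plug *p)
{
    size_t unused = p->nr_cached;

    p->nr_cached = 0;
    return unused;
}
```

In this sketch, leftovers at finish time arise when the caller submits fewer requests than announced, which corresponds to the merging case mentioned above.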

block: bump max plugged deferred size from 16 to 32 · ba0ffdd8

Jens Axboe authored Oct 06, 2021

Particularly for NVMe with efficient deferred submission for many
requests, there are nice benefits to be seen by bumping the default max
plug count from 16 to 32. This is especially true for virtualized setups,
where the submit part is more expensive, but it can be noticed even on
native hardware.

Reduce the multiple queue factor from 4 to 2, since we're changing the
default size.

While changing it, move the defines into the block layer private header.
These aren't values that anyone outside of the block layer uses, or
should use.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ba0ffdd8