- 18 Oct, 2021 40 commits
-
-
Christoph Hellwig authored
Replace the blk_poll interface that requires the caller to keep a queue and cookie from the submissions with polling based on the bio. Polling for the bio itself leads to a few advantages: - the cookie construction can made entirely private in blk-mq.c - the caller does not need to remember the request_queue and cookie separately and thus sidesteps their lifetime issues - keeping the device and the cookie inside the bio allows to trivially support polling BIOs remapping by stacking drivers - a lot of code to propagate the cookie back up the submission path can be removed entirely. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
'struct bvec_iter' is embedded into 'struct bio', define it as packed so that we can get one extra 4bytes for other uses without expanding bio. 'struct bvec_iter' is often allocated on stack, so making it packed doesn't affect performance. Also I have run io_uring on both nvme/null_blk, and not observe performance effect in this way. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-14-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
This flags ensures that the pages will not be reused for non-bio allocations before the end of an RCU grace period. With that we can safely use a RCU lookup for bio polling as long as we are fine with occasionally polling the wrong device. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-13-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Unlike the RWF_HIPRI userspace ABI which is intentionally kept vague, the bio flag is specific to the polling implementation, so rename and document it properly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-12-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
There is no point in sleeping for the expected I/O completion timeout in the io_uring async polling model as we never poll for a specific I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-11-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Switch the boolean spin argument to blk_poll to passing a set of flags instead. This will allow to control polling behavior in a more fine grained way. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-10-hch@lst.de [axboe: adapt to changed io_uring iopoll] Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Move the trivial check into the only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-9-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Merge both functions into their only caller to keep the blk-mq tag to blk_qc_t mapping as private as possible in blk-mq.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-8-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Factor the code to do the classic full metal polling out of blk_poll into a separate blk_mq_poll_classic helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-7-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Add a helper to get the hctx from a request_queue and cookie, and fold the blk_qc_t_to_queue_num helper into it as no other callers are left. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-6-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
syscall-level code can't just poke into the details of the poll cookie, which is private information of the block layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012111226.760968-5-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
If an iocb is split into multiple bios we can't poll for both. So don't bother to even try to poll in that case. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-4-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
If an iocb is split into multiple bios we can't poll for both. So don't even bother to try to poll in that case. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012111226.760968-3-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
The polling support in the legacy direct-io support is a little crufty. It already doesn't support the asynchronous polling needed for io_uring polling, and is hard to adopt to upcoming changes in the polling interfaces. Given that all the major file systems already use the iomap direct I/O code, just drop the polling support. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-2-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Currently we scan the entire plug list, which is potentially very expensive. In an IOPS bound workload, we can drive about 5.6M IOPS with merging enabled, and profiling shows that the plug merge check is the (by far) most expensive thing we're doing: Overhead Command Shared Object Symbol + 20.89% io_uring [kernel.vmlinux] [k] blk_attempt_plug_merge + 4.98% io_uring [kernel.vmlinux] [k] io_submit_sqes + 4.78% io_uring [kernel.vmlinux] [k] blkdev_direct_IO + 4.61% io_uring [kernel.vmlinux] [k] blk_mq_submit_bio Instead of browsing the whole list, just check the previously inserted entry. That is enough for a naive merge check and will catch most cases, and for devices that need full merging, the IO scheduler attached to such devices will do that anyway. The plug merge is meant to be an inexpensive check to avoid getting a request, but if we repeatedly scan the list for every single insert, it is very much not a cheap check. With this patch, the workload instead runs at ~7.0M IOPS, providing a 25% improvement. Disabling merging entirely yields another 5% improvement. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Masahiro Yamada authored
Every object under block/ depends on CONFIG_BLOCK. Move the guard to the top Makefile since there is no point to descend into block/ if CONFIG_BLOCK=n. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210927140000.866249-5-masahiroy@kernel.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Masahiro Yamada authored
Move the menu to the relevant place. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210927140000.866249-4-masahiroy@kernel.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Masahiro Yamada authored
Everything under block/ depends on BLOCK. BLOCK_HOLDER_DEPRECATED is selected from drivers/md/Kconfig, which is entirely dependent on BLOCK. Extend the 'if BLOCK' ... 'endif' so it covers the whole block/Kconfig. Also, clean up the definition of BLOCK_COMPAT and BLK_MQ_PCI because COMPAT and PCI are boolean. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210927140000.866249-3-masahiroy@kernel.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Masahiro Yamada authored
CONFIG_BLK_CGROUP is a boolean option, that is, its value is 'y' or 'n'. The comparison to 'y' is redundant. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210927140000.866249-2-masahiroy@kernel.orgSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Add a blk_mq_get_tags() helper, which uses the new sbitmap API for allocating a batch of tags all at once. This both simplifies the block code for batched allocation, and it is also more efficient than just doing repeated calls into __sbitmap_queue_get(). This reduces the sbitmap overhead in peak runs from ~3% to ~1% and yields a performanc increase from 6.6M IOPS to 6.8M IOPS for a single CPU core. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
The block layer tag allocation batching still calls into sbitmap to get each tag, but we can improve on that. Add __sbitmap_queue_get_batch(), which returns a mask of tags all at once, along with an offset for those tags. An example return would be 0xff, where bits 0..7 are set, with tag_offset == 128. The valid tags in this case would be 128..135. A batch is specific to an individual sbitmap_map, hence it cannot be larger than that. The requested number of tags is automatically reduced to the max that can be satisfied with a single map. On failure, 0 is returned. Caller should fall back to single tag allocation at that point/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
We already have a blk_mq_need_time_stamp() check in __blk_mq_end_request() to get a timestamp, hide all the statistics accounting under it. It cuts some cycles for requests that don't need stats, and is free otherwise. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/e0f2ea812e93a8adcd07101212e7d7e70ca304e7.1634115360.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
bio_truncate is only used in bio.c, so mark it static. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012161804.991559-9-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
bio_get_first_bvec and bio_get_last_bvec are only used in blk-merge.c, so move them there. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012161804.991559-8-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Mark __bio_try_merge_page static and move it up a bit to avoid the need for a forward declaration. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012161804.991559-7-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
bio_full is only used in bio.c, so move it there. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012161804.991559-6-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Fold bio_cur_bytes into the only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012161804.991559-5-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
bio_mergeable is only needed by I/O schedulers, so move it to blk-mq-sched.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012161804.991559-4-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
bio.h doesn't need any of the definitions from ioprio.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012161804.991559-3-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
BIO_DEBUG is always defined, so just switch the two instances to use BUG_ON directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012161804.991559-2-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Extract a fast check out of __block_mq_sched_restart() and inline it for performance reasons. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/894abaa0998e5999f2fe18f271e5efdfc2c32bd2.1633781740.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Extract hot paths of __blk_account_io_start() and __blk_account_io_done() into inline functions, so we don't always pay for function calls. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b0662a636bd4cc7b4f84c9d0a41efa46a688ef13.1633781740.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Simplify the ioctl path and match the code structure on the compat side. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012104450.659013-4-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
These are only used inside of block/. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012104450.659013-3-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
With the raw driver gone, there is no modular user left. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012104450.659013-2-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
We could have a race here, where the request gets freed before we call into blk_mq_run_hw_queue(). If this happens, we cannot rely on the state of the request. Grab the hardware context before inserting the flush. Fixes: 0f38d766 ("blk-mq: cleanup blk_mq_submit_bio") Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Move the blk_mq_alloc_data stack allocation only into the branch that actually needs it, and use rq->mq_hctx instead of data.hctx to refer to the hctx. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012104045.658051-3-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
The newly added loop for the cached requests in __blk_mq_alloc_request is a little too convoluted for my taste, so unwind it a bit. Also rename the function to __blk_mq_alloc_requests now that it can allocate more than a single request. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012104045.658051-2-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
The caller typically has a good (or even exact) idea of how many requests it needs to submit. We can make the request/tag allocation a lot more efficient if we just allocate N requests/tags upfront when we queue the first bio from the batch. Provide a new plug start helper that allows the caller to specify how many IOs are expected. This sets plug->nr_ios, and we can use that for smarter request allocation. The plug provides a holding spot for requests, and request allocation will check it before calling into the normal request allocation path. The blk_finish_plug() is called, check if there are unused requests and free them. This should not happen in normal operations. The exception is if we get merging, then we may be left with requests that need freeing when done. This raises the per-core performance on my setup from ~5.8M to ~6.1M IOPS. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Particularly for NVMe with efficient deferred submission for many requests, there are nice benefits to be seen by bumping the default max plug count from 16 to 32. This is especially true for virtualized setups, where the submit part is more expensive. But can be noticed even on native hardware. Reduce the multiple queue factor from 4 to 2, since we're changing the default size. While changing it, move the defines into the block layer private header. These aren't values that anyone outside of the block layer uses, or should use. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-