Commits · 88e41cf928a6e1a0eb5a9492e2d091ec6193cce4 · Kirill Smelkov / linux

11 Apr, 2021 40 commits

io_uring: add multishot mode for IORING_OP_POLL_ADD · 88e41cf9

Jens Axboe authored Feb 22, 2021

The default io_uring poll mode is one-shot, where once the event triggers,
the poll command is completed and won't trigger any further events. If
we're doing repeated polling on the same file or socket, then it can be
more efficient to do multishot, where we keep triggering whenever the
event becomes true.

This deviates from the usual norm of having one CQE per SQE submitted. Add
a CQE flag, IORING_CQE_F_MORE, which tells the application to expect
further completion events from the submitted SQE. Right now the only user
of this is POLL_ADD in multishot mode.

Since sqe->poll_events is using the space that we normally use for adding
flags to commands, use sqe->len for the flag space for POLL_ADD. Multishot
mode is selected by setting IORING_POLL_ADD_MULTI in sqe->len. An
application should expect more CQEs for the specified SQE if the CQE is
flagged with IORING_CQE_F_MORE. In multishot mode, only cancelation or an
error will terminate the poll request, in which case the flag will be
cleared.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
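
A minimal userspace-style sketch of the consumer side described above,
under stated assumptions: the demo_* structs are hypothetical,
trimmed-down stand-ins for the real SQE/CQE layouts, and the DEMO_*
constants are local stand-ins for IORING_POLL_ADD_MULTI and
IORING_CQE_F_MORE, which real code should take from <linux/io_uring.h>.
It only illustrates the rule "keep expecting CQEs while the MORE flag is
set".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Local stand-ins for illustration only. Real code should use
 * IORING_POLL_ADD_MULTI and IORING_CQE_F_MORE from <linux/io_uring.h>.
 */
#define DEMO_POLL_ADD_MULTI	(1U << 0)	/* goes into sqe->len */
#define DEMO_CQE_F_MORE		(1U << 1)	/* set in cqe->flags while armed */

/* Hypothetical, trimmed-down SQE and CQE; not the real layouts. */
struct demo_sqe {
	uint16_t poll_events;
	uint32_t len;
};

struct demo_cqe {
	uint64_t user_data;
	int32_t res;
	uint32_t flags;
};

/* Request multishot mode: the flag lives in len, not in poll_events. */
static void demo_prep_poll_multishot(struct demo_sqe *sqe, uint16_t events)
{
	sqe->poll_events = events;
	sqe->len = DEMO_POLL_ADD_MULTI;
}

/* True while further CQEs are expected for the originating SQE. */
static bool demo_poll_still_armed(const struct demo_cqe *cqe)
{
	return (cqe->flags & DEMO_CQE_F_MORE) != 0;
}

/*
 * Consume CQEs of one multishot poll request. Returns how many were
 * consumed; *terminated is set once a CQE without the MORE flag is seen.
 */
static size_t demo_consume(const struct demo_cqe *cqes, size_t n,
			   bool *terminated)
{
	size_t i;

	*terminated = false;
	for (i = 0; i < n; i++) {
		if (!demo_poll_still_armed(&cqes[i])) {
			*terminated = true;
			return i + 1;
		}
	}
	return n;
}
```

An application would re-arm the poll (submit a new SQE) only after
demo_consume() reports termination.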

io_uring: include cflags in completion trace event · 7471e1af

Jens Axboe authored Feb 22, 2021

We should be including the completion flags for better introspection on
exactly what completion event was logged.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: allocate memory for overflowed CQEs · 6c2450ae

Pavel Begunkov authored Feb 23, 2021

Instead of using a request itself for overflowed CQE stashing, allocate a
separate entry. The disadvantage is that the allocation may fail and it
will be accounted as lost (see rings->cq_overflow), so we lose reliability
in case of memory pressure if the application is driving the CQ ring into
overflow. However, it opens the way for multiple CQEs per SQE and
even for generating SQE-less CQEs.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: use GFP_ATOMIC | __GFP_ACCOUNT]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
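
The trade-off above can be sketched in plain C. This is an illustration
under assumptions, not the kernel code: the demo_* types are
hypothetical, the ring is reduced to a counter, and the allocator is
passed in as a function pointer so that allocation failure can be
simulated.

```c
#include <stdint.h>
#include <stdlib.h>

/* Separately allocated entry holding one overflowed completion. */
struct demo_overflow_cqe {
	struct demo_overflow_cqe *next;
	uint64_t user_data;
	int32_t res;
};

struct demo_ring {
	unsigned int cq_used;		/* entries currently in the CQ ring */
	unsigned int cq_entries;	/* CQ ring capacity */
	unsigned int cq_overflow;	/* completions that were lost */
	struct demo_overflow_cqe *overflow_head;
	struct demo_overflow_cqe *overflow_tail;
};

/* Allocator hook; an assumption of this sketch to make failure testable. */
typedef void *(*demo_alloc_fn)(size_t size);

static void *demo_failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

/* Returns 1 if posted to the ring, 0 if stashed, -1 if lost. */
static int demo_post_cqe(struct demo_ring *r, uint64_t user_data,
			 int32_t res, demo_alloc_fn alloc)
{
	struct demo_overflow_cqe *ocqe;

	if (r->cq_used < r->cq_entries) {
		r->cq_used++;	/* a real ring would fill in a slot here */
		return 1;
	}

	ocqe = alloc(sizeof(*ocqe));
	if (!ocqe) {
		r->cq_overflow++;	/* allocation failed: account as lost */
		return -1;
	}

	ocqe->next = NULL;
	ocqe->user_data = user_data;
	ocqe->res = res;
	if (r->overflow_tail)
		r->overflow_tail->next = ocqe;
	else
		r->overflow_head = ocqe;
	r->overflow_tail = ocqe;
	return 0;
}
```

Because the stashed entry no longer borrows a request, nothing in this
shape ties a completion to exactly one submission.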

io_uring: mask in error/nval/hangup consistently for poll · 464dca61

Jens Axboe authored Mar 19, 2021

Instead of masking these in as part of regular POLL_ADD prep, do it in
io_init_poll_iocb(), and include NVAL as that's generally unmaskable,
and RDHUP alongside the HUP that is already set.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
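
A small sketch of the masking idea, using the userspace POLL* constants
from <poll.h> rather than the kernel's internal ones. The helper name
and DEMO_POLL_UNMASK are hypothetical, and the POLLRDHUP fallback value
is an assumption used only when the header does not expose it.

```c
#include <poll.h>

#ifndef POLLRDHUP
#define POLLRDHUP 0x2000	/* assumption: fallback for this sketch only */
#endif

/* Events that are always reported, whether or not the caller asked. */
#define DEMO_POLL_UNMASK (POLLERR | POLLHUP | POLLNVAL | POLLRDHUP)

/* Apply the unmaskable events in one place, at poll iocb init time. */
static unsigned int demo_init_poll_events(unsigned int requested)
{
	return requested | DEMO_POLL_UNMASK;
}
```

Doing this in a single init helper means every path that arms a poll
gets the same set, instead of only the POLL_ADD prep path.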

io_uring: optimise rw complete error handling · 9532b99b

Pavel Begunkov authored Mar 22, 2021

Expect read/write to succeed and create a hot path for this case, in
particular hide all error handling with resubmission under a single
check with the desired result.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: hide iter revert in resubmit_prep · ab454438

Pavel Begunkov authored Mar 22, 2021

Move the iov_iter_revert() call that resets the iterator in the
-EIOCBQUEUED case into io_resubmit_prep(), so we don't do a heavy revert
in the hot path. This also saves a couple of checks.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: don't alter iopoll reissue fail ret code · 8c130827

Pavel Begunkov authored Mar 22, 2021

When reissue_prep failed in io_complete_rw_iopoll(), we change return
code to -EIO to prevent io_iopoll_complete() from doing resubmission.
Mark requests with a new flag (i.e. REQ_F_DONT_REISSUE) instead and
retain the original return value.

It also removes io_rw_reissue() from io_iopoll_complete() that will be
used later.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
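
The idea of marking the request instead of rewriting its result can be
sketched as follows. All names are hypothetical stand-ins (the flag
mirrors REQ_F_DONT_REISSUE in spirit only), and the control flow is an
assumption for illustration rather than the actual kernel logic.

```c
#include <errno.h>
#include <stdbool.h>

#define DEMO_REQ_F_DONT_REISSUE	(1U << 0)

struct demo_rw_req {
	unsigned int flags;
	int result;
};

/*
 * Completion side: if preparing a resubmission failed, record that in
 * the flags and keep the original result untouched.
 */
static void demo_complete_rw_iopoll(struct demo_rw_req *req, int res,
				    bool reissue_prep_ok)
{
	if (res == -EAGAIN && !reissue_prep_ok)
		req->flags |= DEMO_REQ_F_DONT_REISSUE;
	req->result = res;
}

/* Reaping side: resubmit only if the result asks for it and it is allowed. */
static bool demo_should_reissue(const struct demo_rw_req *req)
{
	return req->result == -EAGAIN &&
	       !(req->flags & DEMO_REQ_F_DONT_REISSUE);
}
```

The caller still sees -EAGAIN in the result rather than a substituted
-EIO.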

io_uring: optimise kiocb_end_write for !ISREG · 1c98679d

Pavel Begunkov authored Mar 22, 2021

file_end_write() is only for regular files, so the function does a couple
of dereferences to get the inode and check for it. However, we already
have REQ_F_ISREG at hand, so just use it and inline file_end_write().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: kill unused REQ_F_NO_FILE_TABLE · 59d70013

Pavel Begunkov authored Mar 22, 2021

current->files is always valid now, even for io-wq threads, so kill the
no longer used REQ_F_NO_FILE_TABLE.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: don't init req->work fully in advance · e1d675df

Pavel Begunkov authored Mar 22, 2021

req->work is mostly unused unless the request is punted, and
io_init_req() is too hot for fully initialising it. Fortunately, we can
skip initialising work.next, as it's controlled by io-wq, and can avoid
touching work.flags by moving everything related into
io_prep_async_work(). The only field left is req->work.creds, but nothing
can be done about it, so keep maintaining it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io-wq: refactor *_get_acct() · 8418f22a

Pavel Begunkov authored Mar 22, 2021

Extract a helper for io_work_get_acct() and io_wqe_get_acct() to avoid
duplication.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: remove tctx->sqpoll · 05356d86

Pavel Begunkov authored Mar 22, 2021

struct io_uring_task::sqpoll is not used anymore, so kill it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: don't do extra EXITING cancellations · 68207680

Pavel Begunkov authored Mar 22, 2021

io_match_task() matches all requests of a task with PF_EXITING set, even
though those may be valid requests. It was necessary for SQPOLL
cancellation, but SQPOLL now kills all requests before exiting via
io_uring_cancel_sqpoll(), so it's no longer needed.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: don't clear REQ_F_LINK_TIMEOUT · d4729fbd

Pavel Begunkov authored Mar 22, 2021

REQ_F_LINK_TIMEOUT is a hint to look for linked timeouts to cancel, and
we leave it set even when the timeout has already fired. Hence don't
bother clearing it in io_kill_linked_timeout(); that is safe, and the
function is called only once.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: optimise io_req_task_work_add() · c15b79de

Pavel Begunkov authored Mar 19, 2021

Inline io_task_work_add() into io_req_task_work_add(). They both work
with a request, so keeping them separate doesn't make things much
clearer, but merging allows optimising it. Apart from small wins like not
reading req->ctx or not calculating @notify in the hot path, i.e. with
tctx->task_state set, it avoids doing wake_up_process() for every single
add, doing it only after task_work_add() has actually been done.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: abolish old io_put_file() · e1d767f0

Pavel Begunkov authored Mar 19, 2021

io_put_file() doesn't do a good job at generating good code. Inline it,
so we can check REQ_F_FIXED_FILE first, prioritising the FIXED_FILE case
over requests without files, and saving a memory load in that case.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: optimise io_dismantle_req() fast path · 094bae49

Pavel Begunkov authored Mar 19, 2021

Reshuffle io_dismantle_req() checks to put most of slow path stuff under
a single if.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: inline io_clean_op()'s fast path · 68fb8979

Pavel Begunkov authored Mar 19, 2021

Inline io_clean_op(), leaving __io_clean_op() but renaming it. This will
be used in following patches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: remove __io_req_task_cancel() · 2593553a

Pavel Begunkov authored Mar 19, 2021

Both io_req_complete_failed() and __io_req_task_cancel() do the same
thing: set the failure flag, put both req refs and emit a CQE. The former
is a bit more advanced as it puts the req back into the req cache, so
make it take over from __io_req_task_cancel() and remove the latter.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add helper flushing locked_free_list · dac7a098

Pavel Begunkov authored Mar 19, 2021

Add a new helper io_flush_cached_locked_reqs() that splices
locked_free_list to free_list, and does it properly, with all the
required synchronisation and invariant re-initialisation.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
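
What such a helper has to get right can be shown with a simplified
sketch. This is not the kernel implementation: it uses a hypothetical
singly linked list instead of the kernel's list primitives, and the
locking is only indicated in a comment.

```c
#include <stddef.h>

struct demo_node {
	struct demo_node *next;
};

struct demo_req_cache {
	struct demo_node *free_list;		/* used by the submitter only */
	struct demo_node *locked_free_list;	/* filled under a lock */
	unsigned int locked_free_nr;
};

/*
 * Move everything from locked_free_list to free_list and reset the
 * source list and its counter, so both invariants hold afterwards.
 * Assumption: the caller holds whatever lock protects the locked list.
 */
static void demo_flush_cached_locked_reqs(struct demo_req_cache *c)
{
	struct demo_node *n = c->locked_free_list;

	while (n) {
		struct demo_node *next = n->next;

		n->next = c->free_list;
		c->free_list = n;
		n = next;
	}
	c->locked_free_list = NULL;
	c->locked_free_nr = 0;
}
```

Keeping the splice and the counter reset in one function is what stops
callers from doing one without the other.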

io_uring: refactor io_free_req_deferred() · a05432fb

Pavel Begunkov authored Mar 19, 2021

We don't care about ret value in io_free_req_deferred(), make the code a
bit more concise.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: inline io_put_req and friends · 0d85035a

Pavel Begunkov authored Mar 19, 2021

One big omission is that io_put_req() hasn't been marked inline, and at
least gcc 9 doesn't inline it, not to mention that it's really hot and an
extra function call is intolerable, especially when it doesn't put the
final ref.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: refactor rsrc refnode allocation · 8dd03afe

Pavel Begunkov authored Mar 19, 2021

There are two problems:
1) We always allocate refnodes in advance and free them if they haven't
been used. It's expensive: it takes two allocations, one of which is
percpu. And not actually using them may be pretty common.

2) The current API, with allocating a refnode and then setting some of
the fields, is error prone; we don't ever want to have a file node
running a fixed buffer callback...

Solve both with a pre-init/get API. Pre-init just leaves the node for
later if not used, and for get (i.e. io_rsrc_refnode_get()), you need to
explicitly pass all arguments setting callbacks etc., so it's more
resilient.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
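
The shape of a pre-init/get API can be sketched like this. The names,
fields and return values are hypothetical and simplified (a single
calloc() instead of the kernel's allocations); the point is only that
an unused node stays cached, and that get requires every field up front.

```c
#include <errno.h>
#include <stdlib.h>

typedef void (*demo_rsrc_put_fn)(void *rsrc);

struct demo_refnode {
	void *data;
	demo_rsrc_put_fn put;
};

struct demo_rsrc_ctx {
	struct demo_refnode *backup;	/* pre-initialised spare node */
};

/* Example callback, present only so the sketch has something to pass. */
static void demo_put_noop(void *rsrc)
{
	(void)rsrc;
}

/* Pre-init: allocate a spare only if none is cached from last time. */
static int demo_refnode_prealloc(struct demo_rsrc_ctx *ctx)
{
	if (ctx->backup)
		return 0;
	ctx->backup = calloc(1, sizeof(*ctx->backup));
	return ctx->backup ? 0 : -ENOMEM;
}

/* Get: hand out the spare, with all fields set explicitly by the caller. */
static struct demo_refnode *demo_refnode_get(struct demo_rsrc_ctx *ctx,
					     void *data, demo_rsrc_put_fn put)
{
	struct demo_refnode *node = ctx->backup;

	if (!node)
		return NULL;
	ctx->backup = NULL;
	node->data = data;
	node->put = put;
	return node;
}
```

If the caller bails out between the two steps, nothing is freed; the
spare simply stays in ctx->backup for the next attempt.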

io_uring: refactor io_flush_cached_reqs() · dd78f492

Pavel Begunkov authored Mar 19, 2021

Emphasize that the return value of io_flush_cached_reqs() depends on the
number of requests in the cache. It looks nicer and might help tools
avoid false-negative analyses.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: optimise success case of __io_queue_sqe · 1840038e

Pavel Begunkov authored Mar 19, 2021

Move the case of a successfully issued request to the front by doing
that check first. It's not much of a difference, it just generates
slightly better code for me.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: inline __io_queue_linked_timeout() · de968c18

Pavel Begunkov authored Mar 19, 2021

Inline __io_queue_linked_timeout(); we don't need it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: keep io_req_free_batch() call locality · 96670657

Pavel Begunkov authored Mar 19, 2021

Don't do a function call (io_dismantle_req()) in the middle; place it
near the other function calls, as it may otherwise lead to excessive
register spilling.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: optimise tctx node checks/alloc · cf27f3b1

Pavel Begunkov authored Mar 19, 2021

First of all, we need to set tctx->sqpoll only when we add a new entry
into ->xa, so move it out of the hot path. Also extract a hot path for
io_uring_add_task_file() as an inline helper.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: optimise io_uring_enter() · 33f993da

Pavel Begunkov authored Mar 19, 2021

Add unlikely annotations, because my compiler pretty much mispredicts
every first check, and apart from jumping around in the fast path, it
also generates extra instructions, like setting the ret value in advance.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
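
For readers unfamiliar with such annotations, here is a userspace sketch
of the pattern. It assumes a GCC or Clang compatible compiler that
provides __builtin_expect(); the demo_* macros, the flag mask and the
function are hypothetical and only resemble an enter-style entry point.

```c
#include <errno.h>

/* Assumption: __builtin_expect() is available (GCC/Clang). */
#define demo_likely(x)		__builtin_expect(!!(x), 1)
#define demo_unlikely(x)	__builtin_expect(!!(x), 0)

#define DEMO_VALID_FLAGS	0x7u

/*
 * Argument checks almost never fail, so mark them unlikely and let the
 * compiler lay out the success path as straight-line code.
 */
static int demo_enter(unsigned int flags, int fd, unsigned int to_submit)
{
	if (demo_unlikely(flags & ~DEMO_VALID_FLAGS))
		return -EINVAL;
	if (demo_unlikely(fd < 0))
		return -EBADF;

	return (int)to_submit;
}
```

The annotations do not change behaviour, only the code layout the
compiler is encouraged to pick.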

io_uring: don't take ctx refs in task_work handler · 493f3b15

Pavel Begunkov authored Mar 19, 2021

__tctx_task_work() guarantees that ctx won't be killed while running
task_works, so we can remove now unnecessary ctx pinning for internally
armed polling.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: transform ret == 0 for poll cancelation completions · 45ab03b1

Jens Axboe authored Feb 23, 2021

We can set canceled == true and complete out-of-line. Ensure that we
catch that and correctly return -ECANCELED if the poll operation got
canceled.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
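
The transformation named in the subject line amounts to something like
the following. The helper is hypothetical and is meant as a reading aid
rather than the code from the patch.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * A result of 0 on a poll request that was canceled is reported as
 * -ECANCELED; any other result is passed through unchanged.
 */
static int demo_poll_result(int ret, bool canceled)
{
	if (ret == 0 && canceled)
		return -ECANCELED;
	return ret;
}
```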

io_uring: correct comment on poll vs iopoll · b9b0e0d3

Jens Axboe authored Feb 23, 2021

The correct function is io_iopoll_complete(), which deals with completions
of IOPOLL requests, not io_poll_complete().
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: cache async and regular file state for fixed files · 7b29f92d

Jens Axboe authored Mar 12, 2021

We have to dig quite deep to check whether or not a file supports a
fast-path nonblock attempt. For fixed files, we can do this lookup once
and cache the state instead.

This adds two new bits to track whether we support async read/write
attempts, and lines up the REQ_F_ISREG bit with those two. The file slot
re-uses the last 3 (or 2, for 32-bit) bits of the file pointer to cache
that state, and then we mask it in when we go and use a fixed file.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
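
Caching state in the low bits of an aligned pointer can be sketched as
below. The DEMO_FFS_* names and struct demo_file are hypothetical; the
sketch assumes a C11 compiler and forces 8-byte alignment on the struct
so that three low bits of its address are always zero.

```c
#include <stdint.h>

/* Hypothetical state bits cached alongside the pointer. */
#define DEMO_FFS_ASYNC_READ	((uintptr_t)0x1)
#define DEMO_FFS_ASYNC_WRITE	((uintptr_t)0x2)
#define DEMO_FFS_ISREG		((uintptr_t)0x4)
#define DEMO_FFS_MASK		((uintptr_t)0x7)

/* 8-byte alignment guarantees the three low address bits are free. */
struct demo_file {
	_Alignas(8) int fd;
};

/* Store the pointer and its cached flags in a single word. */
static uintptr_t demo_slot_pack(struct demo_file *file, uintptr_t flags)
{
	return (uintptr_t)file | (flags & DEMO_FFS_MASK);
}

/* Recover the pointer by masking the flag bits off. */
static struct demo_file *demo_slot_file(uintptr_t slot)
{
	return (struct demo_file *)(slot & ~DEMO_FFS_MASK);
}

/* Recover the cached flags without touching the file at all. */
static uintptr_t demo_slot_flags(uintptr_t slot)
{
	return slot & DEMO_FFS_MASK;
}
```

The lookup cost is paid once when the slot is filled; every later use is
a mask operation on a word that is already being loaded.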

io_uring: don't check for io_uring_fops for fixed files · d44f554e

Jens Axboe authored Mar 12, 2021

We don't allow them at registration time, so limit the check for needing
inflight tracking in io_file_get() to the non-fixed path.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: simplify io_sqd_update_thread_idle() · c9dca27d

Pavel Begunkov authored Mar 10, 2021

Use a more comprehensible max() instead of hand-coding it with ifs in
io_sqd_update_thread_idle().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
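
In userspace terms the simplification looks roughly like this. The
function and its array-based interface are hypothetical (the kernel
walks a list of contexts and uses its own max() macro); only the
"fold with max" shape is the point.

```c
#include <stddef.h>

static unsigned int demo_max_uint(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/* The shared idle timeout is the largest one requested by any ring. */
static unsigned int demo_update_thread_idle(const unsigned int *idle,
					    size_t nr)
{
	unsigned int result = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		result = demo_max_uint(result, idle[i]);
	return result;
}
```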

io_uring: switch to atomic_t for io_kiocb reference count · abc54d63

Jens Axboe authored Feb 24, 2021

io_uring manipulates references twice for each request, and hence is very
sensitive to performance of the reference count. This commit borrows a
trick from:

commit f958d7b5
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Thu Apr 11 10:06:20 2019 -0700

    mm: make page ref count overflow check tighter and more explicit

and switches to atomic_t for references, while still retaining overflow
and underflow checks.

This is good for a 2-3% increase in peak IOPS on a single core. Before:

IOPS=2970879, IOS/call=31/31, inflight=128 (128)
IOPS=2952597, IOS/call=31/31, inflight=128 (128)
IOPS=2943904, IOS/call=31/31, inflight=128 (128)
IOPS=2930006, IOS/call=31/31, inflight=96 (96)

and after:

IOPS=3054354, IOS/call=31/31, inflight=128 (128)
IOPS=3059038, IOS/call=31/31, inflight=128 (128)
IOPS=3060320, IOS/call=31/31, inflight=128 (128)
IOPS=3068256, IOS/call=31/31, inflight=96 (96)
Signed-off-by: Jens Axboe <axboe@kernel.dk>
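
A userspace sketch of a plain atomic counter that keeps a cheap sanity
check, using C11 <stdatomic.h> in place of the kernel's atomic_t. The
demo_* names are hypothetical, and refusing the operation when the check
trips is a choice made for this sketch. The check relies on unsigned
wraparound: it is true exactly when the signed count lies in [-127, 0].

```c
#include <stdatomic.h>
#include <stdbool.h>

struct demo_kiocb {
	atomic_int refs;
};

/* True if the count is zero or has wrapped to just below zero. */
static bool demo_ref_zero_or_close_to_overflow(struct demo_kiocb *req)
{
	return (unsigned int)atomic_load(&req->refs) + 127u <= 127u;
}

/* Take a reference; refuse if the counter is already in a bad state. */
static bool demo_req_ref_get(struct demo_kiocb *req)
{
	if (demo_ref_zero_or_close_to_overflow(req))
		return false;
	atomic_fetch_add(&req->refs, 1);
	return true;
}

/* Drop a reference; returns true when this was the last one. */
static bool demo_req_ref_put_and_test(struct demo_kiocb *req)
{
	if (demo_ref_zero_or_close_to_overflow(req))
		return false;
	return atomic_fetch_sub(&req->refs, 1) == 1;
}
```

The hot path is one comparison plus one atomic add or sub, with no
compare-and-swap loop.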

io_uring: wrap io_kiocb reference count manipulation in helpers · de9b4cca

Jens Axboe authored Feb 24, 2021

No functional changes in this patch, just in preparation for handling the
references a bit more efficiently.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: simplify io_resubmit_prep() · 179ae0d1

Pavel Begunkov authored Feb 28, 2021

If not for async_data NULL check, io_resubmit_prep() is already an rw
specific version of io_req_prep_async(), but slower because 1) it always
goes through io_import_iovec() even if following io_setup_async_rw() the
result 2) instead of initialising iovec/iter in-place it does it
on-stack and then copies with io_setup_async_rw().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: merge defer_prep() and prep_async() · b7e298d2

Pavel Begunkov authored Feb 28, 2021

Merge the two functions and rename in favour of the second one, as it
conveys the meaning better.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: rethink def->needs_async_data · 26f0505a

Pavel Begunkov authored Feb 28, 2021

needs_async_data controls allocation of async_data, and is used in two
cases: 1) when async setup requires it (by io_req_prep_async() or the
handlers themselves), and 2) when an op always needs additional space to
operate, like timeouts do.

Opcode preps already don't bother about the second case and do the
allocation unconditionally, so restrict needs_async_data to the first
case only and rename it to needs_async_setup.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: update for IOPOLL fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
