Commits · 711be0312df4d350fb5bf1671c132cccae5aaf9a · Kirill Smelkov / linux · GitLab

21 Jan, 2020 40 commits

io_uring: optimise use of ctx->drain_next · 711be031

Pavel Begunkov authored Jan 17, 2020

Move setting ctx->drain_next to the only place where it can be set: when
non-head requests are added to a link. Likewise, check it only where it
matters, for the head of a link or a non-linked request.

No functional changes here. This removes some code from the common path
and also removes the REQ_F_DRAIN_LINK flag, as it is no longer needed.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: add support for probing opcodes · 66f4af93

Jens Axboe authored Jan 16, 2020

The application currently has no way of knowing if a given opcode is
supported or not without having to try and issue one and see if we get
-EINVAL or not. And even this approach is fraught with peril, as maybe
we're getting -EINVAL due to some fields being missing, or maybe it's
just not that easy to issue that particular command without doing some
other leg work in terms of setup first.

This adds IORING_REGISTER_PROBE, which fills in a structure with info
on what is supported or not. This will work even with sparse opcode
fields, which may happen in the future or even today if someone
backports specific features to older kernels.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
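
The probing idea described above can be sketched in plain userspace C. This is only a model under stated assumptions: every `sketch_` name, the flag value, and the support table are hypothetical and are not the kernel's actual probe layout. It shows one entry per opcode with a "supported" bit, so holes in the opcode space are reported instead of guessed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_OP_SUPPORTED (1U << 0) /* hypothetical "supported" flag bit */
#define SKETCH_NR_OPS 8               /* hypothetical number of opcodes */

struct sketch_probe_op {
    uint8_t op;     /* opcode number */
    uint16_t flags; /* SKETCH_OP_SUPPORTED if the opcode is usable */
};

struct sketch_probe {
    uint8_t last_op; /* highest opcode number this model knows about */
    uint8_t ops_len; /* number of ops[] entries that were filled in */
    struct sketch_probe_op ops[SKETCH_NR_OPS];
};

/* Pretend support table with holes: opcodes 3 and 5 are not available. */
static const bool sketch_has_op[SKETCH_NR_OPS] = {
    true, true, true, false, true, false, true, true,
};

/* Fill in at most nr_args entries, one per opcode, sparse or not. */
static void sketch_fill_probe(struct sketch_probe *p, size_t nr_args)
{
    size_t n = nr_args < SKETCH_NR_OPS ? nr_args : SKETCH_NR_OPS;

    assert(p != NULL);
    p->last_op = SKETCH_NR_OPS - 1;
    p->ops_len = (uint8_t)n;
    for (size_t i = 0; i < n; i++) {
        p->ops[i].op = (uint8_t)i;
        p->ops[i].flags = sketch_has_op[i] ? SKETCH_OP_SUPPORTED : 0;
    }
}

/* Application side: check one opcode without issuing a request. */
static bool sketch_op_supported(const struct sketch_probe *p, uint8_t op)
{
    return op < p->ops_len && (p->ops[op].flags & SKETCH_OP_SUPPORTED);
}
```

An application would fill the structure once at startup and then consult it per opcode, rather than issuing a request and interpreting -EINVAL.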


io_uring: account fixed file references correctly in batch · 10fef4be

Jens Axboe authored Jan 09, 2020

We can't assume that the whole batch has fixed files in it. If it's a
mix, or none at all, then we can end up doing a ref put that either
messes up accounting, or causes an oops if we have no fixed files at
all.

Also ensure we free requests properly between inflight accounted and
normal requests.

Fixes: 82c721577011 ("io_uring: extend batch freeing to cover more cases")
Reported-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: add opcode to issue trace event · 354420f7

Jens Axboe authored Jan 08, 2020

For some test apps at least, user_data is just zeroes. So it's not a
good way to tell what the command actually is. Add the opcode to the
issue trace point.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: add support for IORING_OP_OPENAT2 · cebdb986

Jens Axboe authored Jan 08, 2020

Add support for the new openat2(2) system call. It's trivial to do, as
we can have openat(2) just be wrapped around it.
Suggested-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
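
A rough sketch of how an openat-style call can be expressed in terms of an openat2-style argument block. The struct and function below are hypothetical mirrors written for illustration, and the rule that the mode is dropped unless O_CREAT is set is a simplification of what a real conversion has to handle.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>

/* Hypothetical mirror of an openat2-style "how" argument block. */
struct sketch_open_how {
    uint64_t flags;
    uint64_t mode;
    uint64_t resolve;
};

/* Translate classic openat() arguments into the extended form. */
static struct sketch_open_how sketch_build_open_how(int flags, unsigned int mode)
{
    struct sketch_open_how how = {
        .flags = (uint64_t)(unsigned int)flags,
        /* Simplification: a mode only makes sense when creating a file. */
        .mode = (flags & O_CREAT) ? (mode & 07777u) : 0,
        /* openat() has no path-resolution restrictions to pass along. */
        .resolve = 0,
    };

    return how;
}
```

With a helper like this, both opcodes can share one code path that only ever sees the extended structure.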


io_uring: remove 'fname' from io_open structure · f8748881

Jens Axboe authored Jan 08, 2020

We only use it internally in the prep functions for both statx and
openat, so we don't need it to be persistent across the request.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: add 'struct open_how' to the openat request context · c12cedf2

Jens Axboe authored Jan 08, 2020

We'll need this for openat2(2) support. Remove flags and mode from
the existing io_open struct.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: enable option to only trigger eventfd for async completions · f2842ab5

Jens Axboe authored Jan 08, 2020

If an application is using eventfd notifications with poll to know when
new SQEs can be issued, it's expecting the following read/writes to
complete inline. And with that, it knows that there are events available,
and doesn't want spurious wakeups on the eventfd for those requests.

This adds IORING_REGISTER_EVENTFD_ASYNC, which works just like
IORING_REGISTER_EVENTFD, except it only triggers notifications for events
that happen from async completions (IRQ, or io-wq worker completions).
Any completions inline from the submission itself will not trigger
notifications.
Suggested-by: Mark Papadakis <markuspapadakis@icloud.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
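
The notification rule described above can be modelled as a small predicate. This is a sketch, not the kernel's code: the `sketch_ctx` structure and the `completed_async` parameter are assumptions standing in for "an eventfd is registered", "it was registered with the async-only variant", and "this completion came from IRQ or an io-wq worker".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_ctx {
    bool has_eventfd;   /* an eventfd has been registered */
    bool eventfd_async; /* registered with the async-only variant */
};

/* Decide whether a completion should signal the eventfd. */
static bool sketch_should_signal(const struct sketch_ctx *ctx, bool completed_async)
{
    assert(ctx != NULL);
    if (!ctx->has_eventfd)
        return false;       /* nothing to signal */
    if (!ctx->eventfd_async)
        return true;        /* plain registration: every completion signals */
    return completed_async; /* async-only: inline completions stay silent */
}
```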


io_uring: change io_ring_ctx bool fields into bit fields · 69b3e546

Jens Axboe authored Jan 08, 2020

In preparation for adding another one, which would make us spill into
another long (and hence bump the size of the ctx), change them to
bit fields.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
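
To illustrate the size argument, here is a small comparison of the two layouts. The field names are illustrative assumptions, and the exact sizes depend on the ABI; on common ABIs five bools take five bytes while five one-bit fields share a single unsigned int.

```c
#include <assert.h>
#include <stdbool.h>

/* One byte (typically) per flag. */
struct sketch_flags_bool {
    bool compat;
    bool account_mem;
    bool cq_overflow_flushed;
    bool drain_next;
    bool eventfd_async; /* the extra flag that would grow the struct */
};

/* One bit per flag, packed into shared storage. */
struct sketch_flags_bits {
    unsigned int compat : 1;
    unsigned int account_mem : 1;
    unsigned int cq_overflow_flushed : 1;
    unsigned int drain_next : 1;
    unsigned int eventfd_async : 1;
};

/* All five bit fields fit in the storage of one unsigned int. */
static_assert(sizeof(struct sketch_flags_bits) <= sizeof(unsigned int),
              "bit fields should share one storage unit");
```

Further flags can be added to the bit-field version without changing its size until the storage unit is full.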


io_uring: file set registration should use interruptible waits · c150368b

Jens Axboe authored Jan 08, 2020

If an application attempts to register a set with unbounded requests
pending, we can be stuck here forever if they don't complete. We can
make this wait interruptible, and just abort if we get signaled.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: Remove unnecessary null check · 96fd84d8

YueHaibing authored Jan 07, 2020

A NULL check before kfree() is redundant, so remove it.
This was detected by coccinelle.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: add support for send(2) and recv(2) · fddaface

Jens Axboe authored Jan 04, 2020

This adds IORING_OP_SEND for send(2) support, and IORING_OP_RECV for
recv(2) support.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: remove extra io_wq_current_is_worker() · 2550878f

Pavel Begunkov authored Dec 30, 2019

io_wq workers use io_issue_sqe() to forward sqes and never
io_queue_sqe(). Remove the extra check for io_wq_current_is_worker().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: optimise commit_sqring() for common case · caf582c6

Pavel Begunkov authored Dec 30, 2019

It should be pretty rare to not submit anything when there is
something in the ring. No need to keep heuristics for this case.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: optimise head checks in io_get_sqring() · ee7d46d9

Pavel Begunkov authored Dec 30, 2019

A user may ask to submit more than there is in the ring, and then
io_uring will submit as much as it can. However, in the last iteration
it will allocate an io_kiocb and immediately free it. It could do
better and adjust @to_submit to what is in the ring.

And since the ring's head is already checked here, there is no need to
do it in the loop, spamming smp_load_acquire() barriers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
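
A userspace sketch of the idea, using C11 atomics in place of the kernel's smp_load_acquire(). The ring structure and all `sketch_` names are assumptions made for illustration. The tail is read once with acquire semantics, the request count is clamped to what is actually pending, and the loop then runs without any further barrier or wasted iteration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_sq {
    _Atomic unsigned int tail; /* advanced by the producer */
    unsigned int cached_head;  /* consumer-private head index */
    unsigned int entries;      /* ring size */
};

static unsigned int sketch_min(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* One acquire load; unsigned subtraction copes with index wraparound. */
static unsigned int sketch_sq_pending(struct sketch_sq *sq)
{
    return atomic_load_explicit(&sq->tail, memory_order_acquire) - sq->cached_head;
}

/* Clamp the caller's request to the ring size and to what is pending. */
static unsigned int sketch_clamp_to_submit(struct sketch_sq *sq, unsigned int to_submit)
{
    assert(sq != NULL);
    to_submit = sketch_min(to_submit, sq->entries);
    return sketch_min(to_submit, sketch_sq_pending(sq));
}

/* Consume exactly the clamped count; no barrier inside the loop. */
static unsigned int sketch_submit(struct sketch_sq *sq, unsigned int to_submit)
{
    unsigned int nr = sketch_clamp_to_submit(sq, to_submit);

    for (unsigned int i = 0; i < nr; i++)
        sq->cached_head++; /* stands in for fetching and issuing one entry */
    return nr;
}
```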


io_uring: clamp to_submit in io_submit_sqes() · 9ef4f124

Pavel Begunkov authored Dec 30, 2019

Make io_submit_sqes() clamp @to_submit itself. This removes duplicated
code and prepares for the following changes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: add support for IORING_SETUP_CLAMP · 8110c1a6

Jens Axboe authored Dec 28, 2019

Some applications like to start small in terms of ring size, and then
ramp up as needed. This is a bit tricky to do currently, since we don't
advertise the max ring size.

This adds IORING_SETUP_CLAMP. If set, and the values for SQ or CQ ring
size exceed what we support, then clamp them at the max values instead
of returning -EINVAL. Since we return the chosen ring sizes after setup,
no further changes are needed on the application side. io_uring already
changes the ring sizes if the application doesn't ask for power-of-two
sizes, for example.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
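
The clamping behaviour can be sketched as follows. The limit of 32768 entries and the `sketch_` names are assumptions for the purpose of the example, not values taken from this commit. Without the clamp option an oversized request fails with -EINVAL; with it, the request is capped and then rounded up to a power of two like any other size.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_ENTRIES 32768u /* assumed upper limit for this sketch */

/* Round up to the next power of two (n is at most SKETCH_MAX_ENTRIES). */
static unsigned int sketch_roundup_pow2(unsigned int n)
{
    unsigned int p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* Returns 0 and stores the chosen size, or -EINVAL on a bad request. */
static int sketch_ring_size(unsigned int requested, bool clamp, unsigned int *chosen)
{
    assert(chosen != NULL);
    if (requested == 0)
        return -EINVAL;
    if (requested > SKETCH_MAX_ENTRIES) {
        if (!clamp)
            return -EINVAL;
        requested = SKETCH_MAX_ENTRIES;
    }
    *chosen = sketch_roundup_pow2(requested);
    return 0;
}
```

Because the chosen size is reported back, a caller that asks for too much with the clamp option set simply reads the result to learn what it got.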


io_uring: extend batch freeing to cover more cases · c6ca97b3

Jens Axboe authored Dec 28, 2019

Currently we only batch free if fixed files are used, no links, no aux
data, etc. This extends the batch freeing to only exclude the linked
case and fallback case, and makes io_free_req_many() handle the other
cases just fine.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: wrap multi-req freeing in struct req_batch · 8237e045

Jens Axboe authored Dec 28, 2019

This cleans up the code a bit, and it allows us to build on top of the
multi-req freeing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: batch getting pcpu references · 2b85edfc

Pavel Begunkov authored Dec 28, 2019

percpu_ref_tryget() has its own overhead. Instead of getting a reference
for each request, grab a bunch once per io_submit_sqes().

~5% throughput boost for a "submit and wait 128 nops" benchmark.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

__io_req_free_empty() -> __io_req_do_free()
Signed-off-by: Jens Axboe <axboe@kernel.dk>
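
A simplified model of batched reference taking. The plain counter below is an assumption standing in for a percpu refcount, and the `sketch_` helpers are hypothetical. All references for a submission call are grabbed up front in one operation, and the ones that end up unused are handed back in one more.

```c
#include <assert.h>
#include <stdbool.h>

/* Plain counter standing in for a percpu reference count. */
struct sketch_ref {
    unsigned long count;
    bool dying; /* once set, no new references may be taken */
};

static bool sketch_tryget_many(struct sketch_ref *ref, unsigned long nr)
{
    if (ref->dying)
        return false;
    ref->count += nr;
    return true;
}

static void sketch_put_many(struct sketch_ref *ref, unsigned long nr)
{
    assert(ref->count >= nr);
    ref->count -= nr;
}

/* Returns the number of requests submitted, or -1 if no refs could be taken. */
static int sketch_submit_batch(struct sketch_ref *ref, unsigned int to_submit,
                               unsigned int available)
{
    unsigned int submitted;

    /* One reference operation for the whole batch. */
    if (!sketch_tryget_many(ref, to_submit))
        return -1;

    /* Each submitted request keeps its reference until it completes. */
    submitted = to_submit < available ? to_submit : available;

    /* Return the references that no request ended up using. */
    if (submitted != to_submit)
        sketch_put_many(ref, to_submit - submitted);
    return (int)submitted;
}
```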


pcpu_ref: add percpu_ref_tryget_many() · 4e5ef023

Pavel Begunkov authored Dec 28, 2019

Add percpu_ref_tryget_many(), which works the same way as
percpu_ref_tryget(), but grabs the specified number of refs.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: add IORING_OP_MADVISE · c1ca757b

Jens Axboe authored Dec 25, 2019

This adds support for doing madvise(2) through io_uring. We assume that
any operation can block, and hence punt everything async. This could be
improved, but it is hard to make bulletproof. The async punt ensures it's
safe.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


mm: make do_madvise() available internally · db08ca25

Jens Axboe authored Dec 25, 2019

This is in preparation for enabling this functionality through io_uring.
Add a helper that is just exporting what sys_madvise() does, and have the
system call use it.

No functional changes in this patch.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: add IORING_OP_FADVISE · 4840e418

Jens Axboe authored Dec 25, 2019

This adds support for doing fadvise through io_uring. We assume that
WILLNEED doesn't block, but that DONTNEED may block.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: allow use of offset == -1 to mean file position · ba04291e

Jens Axboe authored Dec 25, 2019

This behaves like preadv2/pwritev2 with offset == -1, it'll use (and
update) the current file position. This obviously comes with the caveat
that if the application has multiple read/writes in flight, then the
end result will not be as expected. This is similar to threads sharing
a file descriptor and doing IO using the current file position.

Since this feature isn't easily detectable by doing a read or write,
add a feature flag, IORING_FEAT_RW_CUR_POS, to allow applications to
detect presence of this feature.
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
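
The offset convention can be modelled with an in-memory "file". Everything here is a sketch with hypothetical names: an offset of -1 means "read at the current position and advance it", while any other non-negative offset is positional and leaves the position untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* In-memory stand-in for an open file with a current position. */
struct sketch_file {
    const char *data;
    size_t size;
    size_t pos;
};

/* Read up to len bytes; offset == -1 uses and updates the file position. */
static size_t sketch_read(struct sketch_file *f, void *buf, size_t len, long long offset)
{
    bool use_pos = (offset == -1);
    size_t start, n;

    assert(f != NULL && buf != NULL);
    if (offset < -1)
        return 0; /* sketch only: other negative offsets read nothing */
    start = use_pos ? f->pos : (size_t)offset;
    if (start >= f->size)
        return 0;
    n = f->size - start < len ? f->size - start : len;
    memcpy(buf, f->data + start, n);
    if (use_pos)
        f->pos = start + n; /* only the -1 case moves the position */
    return n;
}
```

Two reads in flight with offset -1 would both depend on the shared position, which is the ordering caveat the commit message warns about.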


io_uring: add non-vectored read/write commands · 3a6820f2

Jens Axboe authored Dec 22, 2019

For use cases that don't already naturally have an iovec, it's easier
(or more convenient) to just use a buffer address + length. This is
particularly true if the use case is from languages that want to create
a memory safe abstraction on top of io_uring, and where introducing
the need for the iovec may impose an ownership issue. For those cases,
they currently need an indirection buffer, which means allocating data
just for this purpose.

Add basic read/write that don't require the iovec.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
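
As an analogy using the ordinary POSIX calls rather than io_uring itself, the difference between the two styles looks like this. The wrapper names are hypothetical. The vectored form needs a separate iovec object that must stay valid for the call, while the plain form only needs the buffer address and length.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Vectored style: the caller has to build an iovec describing the buffer. */
static ssize_t sketch_read_vectored(int fd, void *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    assert(buf != NULL);
    return readv(fd, &iov, 1);
}

/* Plain style: buffer address and length are passed directly. */
static ssize_t sketch_read_plain(int fd, void *buf, size_t len)
{
    assert(buf != NULL);
    return read(fd, buf, len);
}
```

For an asynchronous interface the extra object matters more, since it has to outlive submission, which is where the ownership concern for memory-safe wrappers comes from.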


io_uring: improve poll completion performance · e94f141b

Jens Axboe authored Dec 19, 2019

For busy IORING_OP_POLL_ADD workloads, we can have enough contention
on the completion lock that we fail the inline completion path quite
often as we fail the trylock on that lock. Add a list for deferred
completions that we can use in that case. This helps reduce the number
of async offloads we have to do, as if we get multiple completions in
a row, we'll piggy back on to the poll_llist instead of having to queue
our own offload.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
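
The piggy-backing scheme can be sketched with a lock-free singly linked list built on C11 atomics. This is a model under assumptions, not the kernel's llist code, and all `sketch_` names are hypothetical. The key property is that the push reports whether the list was empty: only the first deferred completion has to queue an offload, and later ones ride along with it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_node {
    struct sketch_node *next;
    int result; /* completion result carried by this entry */
};

struct sketch_llist {
    _Atomic(struct sketch_node *) first;
};

static void sketch_llist_init(struct sketch_llist *l)
{
    atomic_init(&l->first, (struct sketch_node *)NULL);
}

/* Push a node; returns true if the list was empty before the push. */
static bool sketch_llist_add(struct sketch_llist *l, struct sketch_node *n)
{
    struct sketch_node *old = atomic_load_explicit(&l->first, memory_order_relaxed);

    assert(n != NULL);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(&l->first, &old, n,
                                                    memory_order_release,
                                                    memory_order_relaxed));
    return old == NULL;
}

/* Called when the completion lock could not be taken inline.
 * Returns true if the caller must queue the offload itself. */
static bool sketch_defer_completion(struct sketch_llist *l, struct sketch_node *n)
{
    return sketch_llist_add(l, n);
}

/* Offload side: detach the whole list at once and count what was deferred. */
static unsigned int sketch_flush(struct sketch_llist *l)
{
    struct sketch_node *n = atomic_exchange_explicit(&l->first,
                                                     (struct sketch_node *)NULL,
                                                     memory_order_acquire);
    unsigned int count = 0;

    for (; n != NULL; n = n->next)
        count++;
    return count;
}
```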


io_uring: split overflow state into SQ and CQ side · ad3eb2c8

Jens Axboe authored Dec 18, 2019

We currently check ->cq_overflow_list from both SQ and CQ context, which
causes some bouncing of that cache line. Add separate bits of state for
this instead, so that the SQ side can check using its own state, and
likewise for the CQ side.

This adds ->sq_check_overflow with the SQ state, and ->cq_check_overflow
with the CQ state. If we hit an overflow condition, both of these bits
are set. Likewise for overflow flush clear, we clear both bits. For the
fast path of just checking if there's an overflow condition on either
the SQ or CQ side, we can use our own private bit for this.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: add lookup table for various opcode needs · d3656344

Jens Axboe authored Dec 18, 2019

We currently have various switch statements that check if an opcode needs
a file, mm, etc. These are hard to keep in sync as opcodes are added. Add
a struct io_op_def that holds all of this information, so we have just
one spot to update when opcodes are added.

This also enables us to NOT allocate req->io if a deferred command
doesn't need it, and corrects some mistakes we had in terms of what
commands need mm context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
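
A minimal sketch of the table-driven approach. The opcode list, the field names, and the per-opcode values below are illustrative assumptions, not the contents of the real table. Each opcode's requirements live in one array entry, so adding an opcode means adding one initializer instead of touching several switch statements.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, shortened opcode list. */
enum sketch_opcode {
    SKETCH_OP_NOP,
    SKETCH_OP_READV,
    SKETCH_OP_SENDMSG,
    SKETCH_OP_TIMEOUT,
    SKETCH_OP_LAST,
};

/* Per-opcode requirements, kept in one place. */
struct sketch_op_def {
    unsigned int needs_file : 1; /* request operates on a file */
    unsigned int needs_mm : 1;   /* request touches user memory */
    unsigned int async_ctx : 1;  /* request needs extra async state */
};

static const struct sketch_op_def sketch_op_defs[SKETCH_OP_LAST] = {
    [SKETCH_OP_NOP]     = { 0 },
    [SKETCH_OP_READV]   = { .needs_file = 1, .needs_mm = 1, .async_ctx = 1 },
    [SKETCH_OP_SENDMSG] = { .needs_file = 1, .needs_mm = 1, .async_ctx = 1 },
    [SKETCH_OP_TIMEOUT] = { .async_ctx = 1 },
};

static_assert(sizeof(sketch_op_defs) / sizeof(sketch_op_defs[0]) == SKETCH_OP_LAST,
              "one table entry per opcode");

/* A single lookup replaces a per-property switch statement. */
static bool sketch_op_needs_file(unsigned int op)
{
    return op < SKETCH_OP_LAST && sketch_op_defs[op].needs_file;
}

/* Deferred requests only need extra state if the table says so. */
static bool sketch_op_needs_async_ctx(unsigned int op)
{
    return op < SKETCH_OP_LAST && sketch_op_defs[op].async_ctx;
}
```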


io_uring: remove two unnecessary function declarations · add7b6b8

Jens Axboe authored Dec 18, 2019

__io_free_req() and io_double_put_req() aren't used before they are
defined, so we can kill these two forwards.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: move *queue_link_head() from common path · 32fe525b

Pavel Begunkov authored Dec 17, 2019

Move io_queue_link_head() to links handling code in io_submit_sqe(),
so it wouldn't need extra checks and would have better data locality.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: rename prev to head · 9d76377f

Pavel Begunkov authored Dec 17, 2019

Calling "prev" a head of a link is a bit misleading. Rename it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: add IOSQE_ASYNC · ce35a47a

Jens Axboe authored Dec 17, 2019

io_uring defaults to always doing inline submissions, if at all
possible. But for larger copies, even if the data is fully cached, that
can take a long time. Add an IOSQE_ASYNC flag that the application can
set on the SQE - if set, it'll ensure that we always go async for those
kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we
get the concurrency we desire for this case.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io-wq: support concurrent non-blocking work · 895e2ca0

Jens Axboe authored Dec 17, 2019

io-wq assumes that work will complete fast (and not block), so it
doesn't create a new worker when work is enqueued, if we already have
at least one worker running. This is done on the assumption that if work
is running, then it will complete fast.

Add an option to force io-wq to fork a new worker for work queued. This
is signaled by setting IO_WQ_WORK_CONCURRENT on the work item. For that
case, io-wq will create a new worker, even though workers are already
running.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
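
The enqueue decision described above reduces to a small predicate. The flag value and function name are hypothetical, and this sketch ignores worker limits and other details a real work queue has to consider.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_WORK_CONCURRENT (1u << 0) /* hypothetical flag value */

static_assert(SKETCH_WORK_CONCURRENT != 0, "flag must occupy a bit");

/* Decide at enqueue time whether another worker should be brought up. */
static bool sketch_needs_new_worker(unsigned int nr_running, unsigned int work_flags)
{
    /* Concurrent work always gets a worker, even if others are running. */
    if (work_flags & SKETCH_WORK_CONCURRENT)
        return true;
    /* Otherwise assume a running worker will get to this item soon. */
    return nr_running == 0;
}
```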


io_uring: add support for IORING_OP_STATX · eddc7ef5

Jens Axboe authored Dec 13, 2019

This provides support for async statx(2) through io_uring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


fs: make two stat prep helpers available · 3934e36f

Jens Axboe authored Dec 14, 2019

To implement an async stat, we need to provide the flags mapping and
the statx user copy. Make them available internally, through
fs/internal.h.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: avoid ring quiesce for fixed file set unregister and update · 05f3fb3c

Jens Axboe authored Dec 09, 2019

We currently fully quiesce the ring before an unregister or update of
the fixed fileset. This is very expensive, and we can be a bit smarter
about this.

Add a percpu refcount for the file tables as a whole. Grab a percpu ref
when we use a registered file, and put it on completion. This is cheap
to do. Upon removal of a file from a set, switch the ref count to atomic
mode. When we hit zero ref on the completion side, then we know we can
drop the previously registered files. When the old files have been
dropped, switch the ref back to percpu mode for normal operation.

Since there's a period between doing the update and the kernel being
done with it, add an IORING_OP_FILES_UPDATE opcode that can perform the
same action. The application knows the update has completed when it gets
the CQE for it. Between doing the update and receiving this completion,
the application must continue to use the unregistered fd if submitting
IO on this particular file.

This takes the runtime of test/file-register from liburing from 14s to
about 0.7s.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: add support for IORING_OP_CLOSE · b5dba59e

Jens Axboe authored Dec 11, 2019

This works just like close(2), unsurprisingly. We remove the file
descriptor and post the completion inline, then offload the actual
(potential) last file put to async context.

Mark the async part of this work as uncancellable, as we really must
guarantee that the latter part of the close is run.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io-wq: add support for uncancellable work · 0c9d5ccd

Jens Axboe authored Dec 11, 2019

Not all work can be cancelled; for some of it we may need to guarantee
that it runs to completion. Allow the caller to set IO_WQ_WORK_NO_CANCEL
on work that must not be cancelled. Note that the caller's work function
must also check for IO_WQ_WORK_NO_CANCEL on work that is marked
IO_WQ_WORK_CANCEL.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
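
The check a work function has to make can be sketched like this. The flag values and names are hypothetical stand-ins for the two flags named above. A work item is only treated as cancelled when the cancel bit is set and the no-cancel bit is not.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_WORK_CANCEL    (1u << 0) /* hypothetical flag values */
#define SKETCH_WORK_NO_CANCEL (1u << 1)

static_assert((SKETCH_WORK_CANCEL & SKETCH_WORK_NO_CANCEL) == 0,
              "flags must be distinct bits");

/* True only if the work was cancelled and is allowed to be skipped. */
static bool sketch_work_should_skip(unsigned int flags)
{
    return (flags & (SKETCH_WORK_CANCEL | SKETCH_WORK_NO_CANCEL)) ==
           SKETCH_WORK_CANCEL;
}
```

A work item carrying both bits still runs, which is what guarantees the final part of an operation such as a close is never dropped.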


fs: move filp_close() outside of __close_fd_get_file() · 6e802a4b

Jens Axboe authored Dec 11, 2019

There is just one caller of this, so just call filp_close() there manually.
This is important to allow async close/removal of the fd.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
